Most people may never need to do something as rash as massively parallel reverse DNS lookups, but if you do, you may end up like me (DNS is one of those black-art technologies).

I do a lot of reverse lookups (millions and millions and millions a day; don't ask) and like everything, you gotta start somewhere, so I started with a single name server.

The first problem with having a single name server is that it tends to get put under load (duh). This problem is magnified if the name server you're querying is the primary name server for your organization :-). On top of the load, you physically can't query the server fast enough: every request has to go out over the network, wait for the name server to answer, and come back. In my case, I do hundreds of thousands of queries in parallel every second. This quickly bogs down both my scripts and the name server itself.
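For a rough idea of what "queries in parallel" can look like, here is a minimal Python sketch. It is not the script described in this post; the function names, the worker count, and the choice of a thread pool are all assumptions made for illustration. It uses the standard library's socket.gethostbyaddr, which asks whatever resolver the machine is configured to use.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if it does not resolve."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        # No PTR record, or the address could not be looked up.
        return None


def resolve_all(ips, lookup=reverse_lookup, workers=50):
    """Resolve many IPs concurrently; returns {ip: hostname or None}.

    `lookup` is injectable (an assumption of this sketch) so the function
    can be exercised without touching a real name server.
    """
    ips = list(ips)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each IP with its result.
        return dict(zip(ips, pool.map(lookup, ips)))
```

Every worker thread spends most of its time waiting on the network round trip, which is exactly why a single remote server becomes the bottleneck as the worker count climbs.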

So the single server idea is out. Next up was to consider 2 servers.

I modified the setup to add a semi-caching secondary name server. It queried the primary name server whenever it couldn't resolve something itself, and cached some of the answers. I then pointed my scripts at this secondary server.

This certainly made things better, but I still had further to go. Having a secondary server whose only purpose in life is to talk to me took the load off the primary NS, and the networking folk were happy again. Occasionally the secondary server would need to query the primary, but overall things were better.

I still had that network latency problem though.

Which brings me to solution #3: a local caching name server.

Up to this point, the solution I've put in place is this:

  • local caching name server on the machine
  • caching server talks to secondary server if it can't resolve
  • dedicated secondary name server
  • secondary name server talks to primary server if it can't resolve
When my scripts run, they almost always find a hit on the local NS. There's a performance hit if you restart the local name server, though, because the cache is flushed. This isn't a terribly big deal because it will be rebuilt almost instantly when the scripts run again.

Having a local cache is also advantageous because I rarely go out over the network anymore. There's a fairly regular trickle of traffic to the secondary NS, but that is to be expected since, like I said, I resolve a lot of IPs.
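The fall-through behaviour of the three tiers can be modelled with a toy Python class. This is only a simulation under simplifying assumptions: the NameServer class and its attributes are hypothetical, there are no TTLs or cache limits, and nothing here talks to BIND. It just shows why a warm local cache keeps traffic off the upstream servers and what a restart costs.

```python
class NameServer:
    """Toy resolver: answers from its own data or cache, else asks upstream."""

    def __init__(self, name, upstream=None, records=None):
        self.name = name
        self.upstream = upstream
        self.records = records or {}   # data this server holds itself (primary only)
        self.cache = {}                # answers learned from upstream
        self.upstream_queries = 0      # how often we had to go over the "network"

    def resolve(self, ip):
        if ip in self.records:
            return self.records[ip]
        if ip in self.cache:
            return self.cache[ip]
        if self.upstream is None:
            return None
        self.upstream_queries += 1
        answer = self.upstream.resolve(ip)
        self.cache[ip] = answer        # negative answers (None) are cached too
        return answer

    def restart(self):
        """Restarting the server flushes its cache."""
        self.cache.clear()


primary = NameServer("primary", records={"192.0.2.10": "host10.example."})
secondary = NameServer("secondary", upstream=primary)
local = NameServer("local", upstream=secondary)

local.resolve("192.0.2.10")   # falls through local -> secondary -> primary
local.resolve("192.0.2.10")   # answered from the local cache, no upstream query
local.restart()
local.resolve("192.0.2.10")   # local asks the secondary again; secondary still has it cached
```

After the restart only the local tier has to refill; the secondary answers from its own cache, so the primary is never bothered again for that address.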

On my local NS, I think it's important to keep the cache big. BIND by default makes it practically unlimited. There is, however, one parameter in BIND that I needed to change. With all this recursive resolution going on, I needed to raise the number of allowed recursive clients (the recursive-clients option). BIND sets this to 1000 by default. I would think you suffer a performance hit if you boost this, but whoopity-do. I bumped it to 100,000 and waited to see what would happen.

BIND kept ticking away no problem and, more importantly, I stopped getting errors like this in my syslog:

    client 127.0.0.1#49254: no more recursive clients: quota reached

I'm no BIND expert, but my guess here is that if the quota was reached, the attempt to resolve would be flat-out rejected (bad for my script's intentions).
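One way to check whether the quota is still being hit is to count those messages in the log. The sketch below is a hypothetical helper, not something from my actual setup. Its pattern is built only from the single log line quoted above, so other BIND versions that format the message differently would need the regular expression adjusted.

```python
import re
from collections import Counter

# Pattern derived from the one syslog line quoted in this post (an assumption:
# other BIND versions may word or prefix the message differently).
QUOTA_RE = re.compile(
    r"client (?P<addr>\S+)#(?P<port>\d+): no more recursive clients: quota reached"
)


def count_quota_rejections(lines):
    """Count 'quota reached' messages per client address."""
    counts = Counter()
    for line in lines:
        match = QUOTA_RE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts
```

Feeding it an open log file (any iterable of lines works) gives a per-client tally; a tally that stays empty after raising the limit suggests the new value is high enough for the workload.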

So a multi-level name server installation helps. Tweaking BIND parameters also helps, and should I need to work through IP lists more quickly in the future, I should be able to divvy up the workload across a number of servers so that code on each can read IP lists and get to resolving.
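Divvying up the workload could be as simple as dealing the IP list out round-robin. The sketch below is one possible approach, not something I've deployed; the shard function and its parameter names are made up for illustration.

```python
def shard(ips, n_workers):
    """Split ips into n_workers round-robin lists of near-equal size."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    ips = list(ips)
    # Slice i takes every n_workers-th item starting at position i.
    return [ips[i::n_workers] for i in range(n_workers)]
```

Each resulting list could be written to its own file and handed to a different machine, each with its own local caching name server.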