DNS queries slow down when they hit the Authority Section

Ealric · 02-08-2008, 07:43 PM

Hello, all.

We're having an annoyingly odd problem at one of our datacenters where DNS queries slow down horribly when they hit the Authority Section (see the partial dig output below). The output up to that section appears as quickly as normal, but at that point the query time consistently jumps to the 2000+ ms mark, with every first, non-cached attempt landing within 5-9 ms of the others.

There's more to this problem's background, but for now I was wondering if anyone's ever seen anything like this, and whether there might be some good guesses as to how/why something like this could happen. The service was working well at the start of the week, but went bad a couple of days ago and hasn't improved much since. On another machine we have here in this office (different location) the initial response time is 18 ms (both of these examples are from before the server caches the result).

It's driving us crazy, but short of just saying "the internet connection must be slow", which we believe it is not, we're having a hard time figuring it out.

The command syntax we're using is "dig @<dnsserver> +search +qr <host>".

;; AUTHORITY SECTION:
. 10800 IN SOA A.ROOT-SERVERS.NET. NSTLD.VERISIGN-GRS.COM. 2008020800 1800 900 604800 86400

;; Query time: 2039 msec
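
For comparing first, non-cached attempts across servers, the "Query time" line can be pulled out of saved dig output. This is only a minimal sketch: the function names and the 1000 ms threshold are assumptions made for illustration, not anything from the setup described here.

```python
import re

# Matches the timing line dig prints after each response,
# e.g. ";; Query time: 2039 msec".
QUERY_TIME_RE = re.compile(r"^;; Query time: (\d+) msec", re.MULTILINE)


def query_times_ms(dig_output):
    """Return every query time (in ms) found in captured dig output."""
    return [int(ms) for ms in QUERY_TIME_RE.findall(dig_output)]


def slow_queries(dig_output, threshold_ms=1000):
    """Return the query times above threshold_ms (an arbitrary cutoff)."""
    return [ms for ms in query_times_ms(dig_output) if ms > threshold_ms]
```

Feeding it the saved output of several runs against each server should make it easy to see whether the slow results cluster tightly around one value, as the 2000+ ms ones do here.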

I understand that this is very light on config details, but I'd have to hack out all the hostnames/IPs and I think my eyes are bleeding from working on this for the last 2 days.

Any help would be greatly appreciated.

JimBass · 02-08-2008, 11:19 PM

There have been high-level routing issues between some of the major carriers in the past 48 or so hours. Earlier today, I tried looking at a website, http://bernardcornwell.net. The request timed out. I started doing some digging, and found that whois would show me the 2 authoritative DNS servers for that site, ns.media3.net and ns2.media3.net. I couldn't dig addresses for either of those servers. Getting suspicious, I ssh'd to another machine at a remote location. I tried to dig bernardcornwell.net again, and it came right up. I was also able to get a non-cached answer by querying ns.media3.net directly. I copied the IP address of ns.media3.net onto my machine, which couldn't resolve it at all. I tried to dig bernardcornwell.net at the IP of the server, and again it failed. When I tried to traceroute to the IP, it failed. And this wasn't like a B.S. ISP screwing up; these were 2 of the major carriers, alter.net and qwest.net. The first traceroute below is from the machine that had no issues, and the second is from the machine that couldn't reach it.

Code:

4  0.so-3-1-0.XT1.NYC9.ALTER.NET (152.63.10.37)  47.809 ms  44.220 ms  35.288 ms
5  0.so-6-1-0.CL1.BOS1.ALTER.NET (152.63.19.173)  38.913 ms  33.755 ms  33.505 ms
6  POS6-0.GW8.BOS1.ALTER.NET (152.63.25.121)  37.222 ms  34.754 ms  33.766 ms
7  BuyersUnit3d.customer.alter.net (208.222.13.82)  43.612 ms  35.617 ms  38.694 ms
8  ns.media3.net (208.249.122.250)  36.428 ms  35.711 ms  35.502 ms

Failure

Code:

3  x403b3139.ip.e-nt.net (64.59.49.57)  73.982 ms  92.789 ms  81.441 ms
4  x4034ddc1.ip.e-nt.net (64.52.221.193)  79.846 ms  64.873 ms  70.058 ms
5  xd84b5ac3.ip.e-nt.net (216.75.90.195)  79.222 ms  56.946 ms  66.540 ms
6  xd84b5a0a.ip.e-nt.net (216.75.90.10)  93.747 ms  66.229 ms  59.634 ms
7  x403b04c7.ip.e-nt.net (64.59.4.199)  89.676 ms  69.988 ms  71.182 ms
8  so-2-0-0.ar1.NYC1.gblx.net (204.246.205.65)  71.953 ms  71.917 ms  74.761 ms
9  so0-0-0-2488M.ar3.JFK1.gblx.net (67.17.108.113)  78.603 ms  75.767 ms  73.096 ms
10  qwest-1.ar3.JFK1.gblx.net (208.50.13.170)  104.540 ms  93.823 ms  79.081 ms
11  jfk-core-01.inet.qwest.net (205.171.30.13)  91.246 ms  70.236 ms  67.464 ms
12  bos-core-02.inet.qwest.net (205.171.8.17)  80.571 ms  58.766 ms  56.458 ms
13  bos-edge-02.inet.qwest.net (205.171.28.30)  45.346 ms  37.671 ms  25.707 ms
14  * * *
15  * * bos-core2.media3.net (67.130.100.218)  41.094 ms !A
16  * * bos-core2.media3.net (67.130.100.218)  85.822 ms !A
17  * * *
18  * * *
19  * * *
20  * * *
21  bos-core2.media3.net (67.130.100.218)  93.684 ms !A * *
22  * * *
23  *

So the problems you're having may well have nothing to do with your particular server; it is possible that the carriers are having issues passing traffic between themselves. As a test, try the same thing I did. Get the address of something that is timing out or being slow, and run a traceroute from a location that reaches it, and one from a location that doesn't or is very slow. See if, like mine, it gets lost in space. I suspect that is the case, as there really isn't much that would cause some queries to be slow and others fast.
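
When comparing a good and a bad traceroute side by side, a small script can point at the first hop where every probe went unanswered. This is a rough sketch that assumes output in the same layout as the two traces above; the function names are made up for illustration.

```python
import re

# A hop line starts with the hop number, followed by the probe results.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
# Round-trip times are printed as "<number> ms".
RTT_RE = re.compile(r"(\d+(?:\.\d+)?) ms")


def parse_hops(traceroute_output):
    """Return (hop_number, [rtts in ms], lost_probe_count) per hop line."""
    hops = []
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # skip headers and blank lines
        rest = match.group(2)
        rtts = [float(value) for value in RTT_RE.findall(rest)]
        lost = rest.split().count("*")  # each "*" is an unanswered probe
        hops.append((int(match.group(1)), rtts, lost))
    return hops


def first_silent_hop(traceroute_output):
    """Return the first hop where no probe was answered, or None."""
    for hop, rtts, lost in parse_hops(traceroute_output):
        if not rtts and lost > 0:
            return hop
    return None
```

On the failing trace above this returns 14, the hop right after bos-edge-02.inet.qwest.net; on the working trace it returns None. A silent hop is not proof of a fault by itself, since some routers simply don't answer probes, so it is only a hint about where to look.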

Peace,
JimBass

Ealric · 02-09-2008, 07:20 PM

Thanks a lot for the info. I'm going to try your suggestions when I'm back in the office. One thing though: we don't have this problem on the DNS servers we have in NY and CA, only on the two that we have in MA. Given that, would you still have the same opinion?

JimBass · 02-09-2008, 08:19 PM

Yes. Both of my traceroutes were executed from the isle of Manhattan, part (the main part) of NYC. One went out on Verizon DSL, and the other on a cat5 connection to InfoHighway. The problem doesn't lie with your server or where it is on the globe; it lies with who your upstream provider is, and where they exchange data with other upstream providers. The traceroute tool on Linux or tracert on winblows will tell you specifically where the problem is.

Peace,
JimBass

Ealric · 02-11-2008, 05:50 PM

Well, traceroutes didn't get me too far b/c one site can do them and the other times out on the responses. Methinks this "might" be b/c of differing firewall rules per site (as a first-guess). Unfortunately, like many other people out there, I'm troubleshooting a problem for the same people who don't give much information on the network/security topology for me to make better progress in this effort.

Anyhow, this same issue is exhibiting another odd characteristic. Normally, as I understand it, DNS-related utils don't refer to nsswitch.conf and the subsequent files down the chain, but more "OS-level" apps do. We're finding that a program which uses the glibc gethostbyaddr() call is actually (well, apparently) skipping the search order in nsswitch.conf (hosts file first) and going to DNS. The servers we're looking up are certainly in the hosts file, so there shouldn't be a need for DNS to be consulted. (The nsswitch.conf entry is "hosts: files dns nisplus".)
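
To double-check the lookup order a box is actually configured with, the hosts line can be read straight out of nsswitch.conf. A minimal sketch, with a made-up function name, that just reports the sources in order:

```python
import re


def hosts_lookup_order(nsswitch_text):
    """Return the sources on the 'hosts:' line, in the order listed."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line.startswith("hosts:"):
            continue
        rest = line[len("hosts:"):]
        # Remove bracketed action items such as [NOTFOUND=return];
        # they are not lookup sources.
        rest = re.sub(r"\[[^\]]*\]", " ", rest)
        return rest.split()
    return []  # no hosts line found
```

With the entry quoted above, it returns ['files', 'dns', 'nisplus'], i.e. the hosts file should be consulted before DNS. As far as I know, Python's socket.gethostbyaddr() goes through the C library resolver as well, so timing a call to it on an address that is in the hosts file could serve as a rough cross-check; treat that as an assumption to verify rather than a given.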

What we've found as a workaround is to comment out the 127.0.0.1 reference in resolv.conf and let the service refer to its parent servers only. The "problem" we're suspecting with the authoritative servers is something we're investigating as part of this overall effort.
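
For checking several machines for the same setup, the nameserver entries in resolv.conf can be listed and any loopback ones flagged. This is a sketch only; the function names are hypothetical and the 192.0.2.53 address in the usage note is a documentation placeholder, not one of the real servers.

```python
import ipaddress


def nameservers(resolv_text):
    """Return the addresses on active 'nameserver' lines, in file order."""
    servers = []
    for raw in resolv_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # commented-out or blank line
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def loopback_nameservers(resolv_text):
    """Return the nameserver entries that point at the local machine."""
    found = []
    for server in nameservers(resolv_text):
        try:
            if ipaddress.ip_address(server).is_loopback:
                found.append(server)
        except ValueError:
            pass  # not a plain IP address; leave it alone
    return found
```

Given a file containing "nameserver 127.0.0.1" followed by "nameserver 192.0.2.53", loopback_nameservers() returns ['127.0.0.1']; once the first line is commented out it returns an empty list, which matches the workaround described above.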

Has anyone ever, ever heard of anything like this? We're pretty stumped, but also suspecting it's one very small, very stupid thing.

Thanks for the help so far, and any more would be appreciated again.