We're having an annoyingly odd problem at one of our datacenters where DNS queries slow down horribly when they reach the Authority Section (see the partial dig output below). The output up to that section comes back as quickly as normal, but at that point the query time consistently climbs past the 2000 ms mark, with every first, non-cached request landing within 5-9 ms of the others.
There's more to this problem's background, but for now I was wondering if anyone has ever seen anything like this, and whether there might be some good guesses as to how or why it could happen. The service was working well at the start of the week, but went bad a couple of days ago and hasn't improved much since. On another machine we have in this office (a different location), the initial response time is 18 ms (both of these examples are from before the server caches the result).
It's driving us crazy, but short of just saying "the internet connection must be slow", which we believe it is not, we're having a hard time figuring it out.
The command syntax we're using is "dig @<dnsserver> +search +qr <host>".
;; AUTHORITY SECTION:
. 10800 IN SOA A.ROOT-SERVERS.NET. NSTLD.VERISIGN-GRS.COM. 2008020800 1800 900 604800 86400
;; Query time: 2039 msec
I understand that this is very light on config details, but I'd have to strip out all the hostnames and IPs, and I think my eyes are bleeding from working on this for the last 2 days.
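For anyone who wants to log these timings across several servers rather than eyeball them, here is a minimal Python sketch that pulls the number out of the ";; Query time:" line shown in the dig output above. The function names and the 1000 ms cutoff are my own assumptions for illustration, not anything dig defines.

```python
import re
from typing import Optional

# Matches the summary line dig prints, e.g. ";; Query time: 2039 msec".
QUERY_TIME_RE = re.compile(r"^;; Query time: (\d+) msec", re.MULTILINE)


def query_time_ms(dig_output: str) -> Optional[int]:
    """Return the query time in milliseconds, or None if the line is absent."""
    match = QUERY_TIME_RE.search(dig_output)
    return int(match.group(1)) if match else None


def is_slow(dig_output: str, threshold_ms: int = 1000) -> bool:
    """True when the reported time exceeds threshold_ms (an arbitrary cutoff)."""
    elapsed = query_time_ms(dig_output)
    return elapsed is not None and elapsed > threshold_ms


if __name__ == "__main__":
    # Sample text copied from the post; a real run would capture dig's stdout.
    sample = (
        ";; AUTHORITY SECTION:\n"
        ". 10800 IN SOA A.ROOT-SERVERS.NET. NSTLD.VERISIGN-GRS.COM. "
        "2008020800 1800 900 604800 86400\n"
        ";; Query time: 2039 msec\n"
    )
    print(query_time_ms(sample), is_slow(sample))
```

Feeding it the output of the same dig command against each DNS server would make it easier to see whether only certain sites cross the cutoff.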
There have been high-level routing issues between some of the major carriers in the past 48 or so hours. Earlier today, I tried looking at a website, http://bernardcornwell.net. The request timed out. I started doing some digging, and found that whois would show me the 2 authoritative DNS servers for that site, ns.media3.net and ns2.media3.net. I couldn't dig addresses for either of those servers. Getting suspicious, I ssh'd to another machine at a remote location. I tried to dig bernardcornwell.net again, and it came right up. I was also able to get a non-cached answer by querying ns.media3.net directly. I copied the IP address of ns.media3.net over to my machine, which couldn't resolve the name at all. I tried to dig bernardcornwell.net at the IP of the server, and again it failed. When I tried to traceroute to the IP, it failed. And this wasn't some B.S. ISP screwing up; these were 2 of the major carriers, alter.net and qwest.net. The first traceroute below is from the machine that had no issues, and the second is from the machine that couldn't reach it.
Success
Code:
4 0.so-3-1-0.XT1.NYC9.ALTER.NET (152.63.10.37) 47.809 ms 44.220 ms 35.288 ms
5 0.so-6-1-0.CL1.BOS1.ALTER.NET (152.63.19.173) 38.913 ms 33.755 ms 33.505 ms
6 POS6-0.GW8.BOS1.ALTER.NET (152.63.25.121) 37.222 ms 34.754 ms 33.766 ms
7 BuyersUnit3d.customer.alter.net (208.222.13.82) 43.612 ms 35.617 ms 38.694 ms
8 ns.media3.net (208.249.122.250) 36.428 ms 35.711 ms 35.502 ms
Failure
Code:
3 x403b3139.ip.e-nt.net (64.59.49.57) 73.982 ms 92.789 ms 81.441 ms
4 x4034ddc1.ip.e-nt.net (64.52.221.193) 79.846 ms 64.873 ms 70.058 ms
5 xd84b5ac3.ip.e-nt.net (216.75.90.195) 79.222 ms 56.946 ms 66.540 ms
6 xd84b5a0a.ip.e-nt.net (216.75.90.10) 93.747 ms 66.229 ms 59.634 ms
7 x403b04c7.ip.e-nt.net (64.59.4.199) 89.676 ms 69.988 ms 71.182 ms
8 so-2-0-0.ar1.NYC1.gblx.net (204.246.205.65) 71.953 ms 71.917 ms 74.761 ms
9 so0-0-0-2488M.ar3.JFK1.gblx.net (67.17.108.113) 78.603 ms 75.767 ms 73.096 ms
10 qwest-1.ar3.JFK1.gblx.net (208.50.13.170) 104.540 ms 93.823 ms 79.081 ms
11 jfk-core-01.inet.qwest.net (205.171.30.13) 91.246 ms 70.236 ms 67.464 ms
12 bos-core-02.inet.qwest.net (205.171.8.17) 80.571 ms 58.766 ms 56.458 ms
13 bos-edge-02.inet.qwest.net (205.171.28.30) 45.346 ms 37.671 ms 25.707 ms
14 * * *
15 * * bos-core2.media3.net (67.130.100.218) 41.094 ms !A
16 * * bos-core2.media3.net (67.130.100.218) 85.822 ms !A
17 * * *
18 * * *
19 * * *
20 * * *
21 bos-core2.media3.net (67.130.100.218) 93.684 ms !A * *
22 * * *
23 *
So the problems you're having may well have nothing to do with your particular server; the carriers may be having trouble passing traffic between themselves. As a test, try the same thing I did. Get the address of something that is timing out or being slow, and run a traceroute from a location that reaches it and from one that doesn't or is very slow. See if, like mine, it gets lost in space. I suspect that is the case, as there really isn't much else that would cause some queries to be slow and others fast.
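If you end up comparing a lot of traces, the check can be roughed out in Python. This is only a sketch under the assumption that the output looks like the two traces above (hop number first, "*" for an unanswered probe, and "!"-prefixed annotations such as the "!A" on hops 15 and 16); the function names are made up for the example.

```python
import re
from typing import List, Optional, Tuple

# A hop line starts with the hop number, followed by the probe results.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S.*)$")


def parse_hops(text: str) -> List[Tuple[int, bool, bool]]:
    """Return (hop, answered, flagged) for every hop line in the output.

    answered: at least one probe got a reply (not every field is '*').
    flagged:  traceroute printed a '!'-prefixed annotation on that hop.
    """
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m is None:
            continue  # skips labels such as "Code:" or "Failure"
        fields = m.group(2).split()
        answered = any(f != "*" for f in fields)
        flagged = any(f.startswith("!") for f in fields)
        hops.append((int(m.group(1)), answered, flagged))
    return hops


def first_silent_hop(text: str) -> Optional[int]:
    """Hop number of the first hop where every probe timed out, else None."""
    for hop, answered, _ in parse_hops(text):
        if not answered:
            return hop
    return None


def flagged_hops(text: str) -> List[int]:
    """Hop numbers that carry a '!'-style annotation."""
    return [hop for hop, _, flagged in parse_hops(text) if flagged]


if __name__ == "__main__":
    failing = (
        "13 bos-edge-02.inet.qwest.net (205.171.28.30) 45.346 ms 37.671 ms 25.707 ms\n"
        "14 * * *\n"
        "15 * * bos-core2.media3.net (67.130.100.218) 41.094 ms !A\n"
    )
    print(first_silent_hop(failing), flagged_hops(failing))
```

Run against the failing trace, the first silent hop is the one right after the last qwest.net router, which is where I'd start asking questions. Keep in mind that a silent hop by itself isn't proof of a fault, since some routers simply don't answer probes; what matters is whether the trace ever reaches the destination.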
Thanks a lot for the info. I'm going to try your suggestions when I'm back in the office. One thing, though: we don't have this problem on the DNS servers we have in NY and CA, only on the two we have in MA. Given that, would you still have the same opinion?
Yes. Both of my traceroutes were executed from the isle of Manhattan, part (the main part) of NYC. One went out over Verizon DSL, and the other over a Cat5 connection to InfoHighway. The problem doesn't lie with your server or where it is on the globe; it lies with who your upstream provider is, and where they exchange data with other upstream providers. The traceroute tool on Linux, or tracert on Windows, will show you where along the path the problem is.
Well, traceroutes didn't get me too far because one site can run them and the other times out on the responses. Methinks this "might" be because of differing firewall rules per site (as a first guess). Unfortunately, like many other people out there, I'm troubleshooting a problem for people who don't give me enough information about the network/security topology to make better progress.
Anyhow, this same issue is exhibiting another odd characteristic. I understand that DNS-specific utilities normally don't consult nsswitch.conf and the files it points to, but more "OS-level" apps do. We're finding that a program which uses the glibc gethostbyaddr() call is actually (well, apparently) skipping the search order in nsswitch.conf, which puts the hosts file first, and going straight to DNS. The servers we're looking up are certainly in the hosts file, so there shouldn't be any need for DNS to be consulted. (The nsswitch.conf entry is "hosts: files dns nisplus".)
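To double-check what order the box is really configured with, a small Python sketch like the following reads the "hosts:" line and reports whether "files" comes before "dns". The helper names are hypothetical, and it only inspects the config text; it does not prove what glibc does at run time.

```python
import re
from typing import List


def hosts_lookup_order(nsswitch_text: str) -> List[str]:
    """Return the sources on the 'hosts:' line, in order.

    Comments and bracketed action items such as [NOTFOUND=return] are dropped.
    """
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            rest = line[len("hosts:"):]
            rest = re.sub(r"\[[^\]]*\]", " ", rest)  # strip [action] items
            return rest.split()
    return []


def files_before_dns(order: List[str]) -> bool:
    """True when 'files' is listed and comes before 'dns' (or 'dns' is absent)."""
    if "files" not in order:
        return False
    return "dns" not in order or order.index("files") < order.index("dns")


if __name__ == "__main__":
    try:
        with open("/etc/nsswitch.conf") as fh:
            order = hosts_lookup_order(fh.read())
    except OSError:
        # Fall back to the line quoted in the post if the file is unreadable.
        order = hosts_lookup_order("hosts: files dns nisplus")
    print(order, files_before_dns(order))
```

If this reports the expected order and the program still goes to DNS first, the config file itself is probably not the culprit.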
What we've found as a workaround is to comment out the 127.0.0.1 reference in resolv.conf and let the service refer to its parent servers only. The "problem" we suspect with the authoritative servers is something we're investigating as part of this overall effort.
Has anyone ever, ever heard of anything like this? We're pretty stumped, but also suspect it's one very small, very stupid thing.
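To confirm the workaround is in place on each box, here is a rough Python sketch that lists the active nameserver lines in resolv.conf and flags a loopback entry. The function names are made up, and the 192.0.2.x addresses in the sample are documentation placeholders, not our real servers.

```python
from typing import List


def nameservers(resolv_text: str) -> List[str]:
    """Return the addresses on active 'nameserver' lines, in file order."""
    servers = []
    for raw in resolv_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or commented-out entry
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def uses_local_resolver(servers: List[str]) -> bool:
    """True when any listed nameserver is a loopback address."""
    return any(s.startswith("127.") or s == "::1" for s in servers)


if __name__ == "__main__":
    # Hypothetical file contents with the 127.0.0.1 line commented out.
    sample = (
        "search example.com\n"
        "#nameserver 127.0.0.1\n"
        "nameserver 192.0.2.10\n"
        "nameserver 192.0.2.11\n"
    )
    found = nameservers(sample)
    print(found, uses_local_resolver(found))
```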
Thanks for the help so far, and any more would be appreciated again.