LinuxQuestions.org
Latest LQ Deal: Latest LQ Deals
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Software
User Name
Password
Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.

Notices


Reply
  Search this Thread
Old 10-06-2016, 12:48 PM   #1
wastingtime
Member
 
Registered: Sep 2004
Posts: 86

Rep: Reputation: 17
Bind / named slow response


I setup bind 9.10.4-P3 as a caching DNS server. It is very slow to respond.

There seem to be two problems.

When a query result is not in the cache it takes a very long time to query other servers, often returning SRVRFAIL. The overall response time is over two seconds

Even when the result already is cached it somtimes takes a long time for it to response.

The machine it runs on is a quad-core and is not busy at all.

Below is rndc stats report. I think it shows decent response times, so I don't understand why the actual response time is so different.
Quote:
+++ Statistics Dump +++ (1475775635)
++ Incoming Requests ++
57260 QUERY
++ Incoming Queries ++
56715 A
4 SOA
105 PTR
412 AAAA
6 SRV
6 DS
12 DNSKEY
++ Outgoing Queries ++
[View: default]
148468 A
1159 NS
281 CNAME
9 SOA
803 AAAA
23 SRV
9579 DS
1305 DNSKEY
5617 DLV
[View: _bind]
++ Name Server Statistics ++
57260 IPv4 requests received
4269 requests with EDNS(0) received
6 TCP requests received
46850 responses sent
6 truncated responses sent
4266 responses with EDNS(0) sent
26350 queries resulted in successful answer
105 queries resulted in authoritative answer
26901 queries resulted in non authoritative answer
141 queries resulted in nxrrset
19844 queries resulted in SERVFAIL
515 queries resulted in NXDOMAIN
45326 queries caused recursion
10410 duplicate queries received
57254 UDP queries received
6 TCP queries received
++ Zone Maintenance Statistics ++
++ Resolver Statistics ++
[Common]
[View: default]
167244 IPv4 queries sent
47814 IPv4 responses received
3036 NXDOMAIN received
13 FORMERR received
13 EDNS(0) query failures
9019 truncated responses received
6 lame delegations received
111587 query retries
120013 query timeouts
12476 IPv4 NS address fetches
3156 IPv4 NS address fetch failed
26262 DNSSEC validation attempted
18358 DNSSEC validation succeeded
7675 DNSSEC NX validation succeeded
185 DNSSEC validation failed
37561 queries with RTT 10-100ms
10183 queries with RTT 100-500ms
57 queries with RTT 500-800ms
10 queries with RTT 800-1600ms
2 queries with RTT > 1600ms
31 bucket size
 
Old 10-06-2016, 01:25 PM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora 34
Posts: 3,666

Rep: Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059
Quote:
803 AAAA
I suspect you are set up to try IPv6, timeout, then try IPv4.
 
Old 10-07-2016, 04:23 AM   #3
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,815

Rep: Reputation: 816Reputation: 816Reputation: 816Reputation: 816Reputation: 816Reputation: 816Reputation: 816
I think the root cause is a bad DNS peer (that returns SERVFAIL) or a network/link problem (truncated responses, query timeouts).
Please run ping to the peers (i.e. the configured forwarders) and watch the errors.
 
Old 10-07-2016, 10:30 AM   #4
wastingtime
Member
 
Registered: Sep 2004
Posts: 86

Original Poster
Rep: Reputation: 17
Thanks smallpond, I re-configured the server by adding OPTIONS="-4" to /etc/sysconfig/named and filter-aaaa-on-v4 yes; to the options section in /etc/named.conf. It already was not listening for ipv6 since that line was commented out (# listen-on-v6 port 53 { ::1; }. I restarted the server and will see if it works better.

Thanks MadeInGermany, this server does not forwards requests, rather it resolves them by working its way from the root servers.

One more data tidbit. With the new configuration and an empty cache, notice how running dig from another machine (connected on the same network, it takes dig 7 seconds to return the result, even though it reports the query took 1.7 seconds.

Quote:
$ time dig cnn.com

; <<>> DiG 9.10.4-P1 <<>> +noadflag cnn.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49638
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;cnn.com. IN A

;; ANSWER SECTION:
cnn.com. 300 IN A 157.166.226.26
cnn.com. 300 IN A 157.166.226.25

;; Query time: 1776 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Fri Oct 07 07:52:22 PDT 2016
;; MSG SIZE rcvd: 68


real 0m7.061s
user 0m0.006s
sys 0m0.008s
I was running tcpdump on the machine performing the query to see the actual request response times to the server
Quote:
07:52:15.798770 IP 10.0.0.16.47953 > 10.0.0.2.53: 49638+ [1au] A? cnn.com. (36)

07:52:20.796882 IP 10.0.0.16.47953 > 10.0.0.2.53: 49638+ [1au] A? cnn.com. (36)
E..@.J..@..Q
...

07:52:22.573496 IP 10.0.0.2.53 > 10.0.0.16.47953: 49638 2/0/1 A 157.166.226.26, A 157.166.226.25 (68)
E..`y...@...
...
....5.Q.L...............cnn.com..............,...............,........)........
As you can see it did take 7 seconds for the response to arrive (the requesting machine asked a second time after waiting 5 seconds for a response to the first request)

On the server side, looking in named log file, you can see the first request arrives immediately after it was sent from the client machine, nothing happens for 5 seconds, the second request arrives (client machine retries), named detects it is a duplicate.

Quote:
07-Oct-2016 07:52:15.797 client 10.0.0.16#47953: UDP request
07-Oct-2016 07:52:15.797 client 10.0.0.16#47953: request is not signed
07-Oct-2016 07:52:15.797 client 10.0.0.16#47953: recursion available
07-Oct-2016 07:52:15.797 client 10.0.0.16#47953: query
07-Oct-2016 07:52:15.797 client 10.0.0.16#47953 (cnn.com): query (cache) 'cnn.com/A/IN' approved
07-Oct-2016 07:52:15.797 client 10.0.0.16#47953 (cnn.com): replace
07-Oct-2016 07:52:15.797 clientmgr @0x7fd0d9a53458: get client
07-Oct-2016 07:52:15.797 clientmgr @0x7fd0d9a53458: recycle
07-Oct-2016 07:52:15.797 fetch: cnn.com/A
07-Oct-2016 07:52:15.797 client @0x7fd0cc12e370: udprecv

07-Oct-2016 07:52:20.795 client 10.0.0.16#47953: UDP request
07-Oct-2016 07:52:20.795 client 10.0.0.16#47953: request is not signed
07-Oct-2016 07:52:20.795 client 10.0.0.16#47953: recursion available
07-Oct-2016 07:52:20.795 client 10.0.0.16#47953: query
07-Oct-2016 07:52:20.795 client 10.0.0.16#47953 (cnn.com): query (cache) 'cnn.com/A/IN' approved
07-Oct-2016 07:52:20.795 client 10.0.0.16#47953 (cnn.com): replace
07-Oct-2016 07:52:20.795 clientmgr @0x7fd0d9a53458: get client
07-Oct-2016 07:52:20.795 clientmgr @0x7fd0d9a53458: recycle
07-Oct-2016 07:52:20.795 fetch: cnn.com/A
07-Oct-2016 07:52:20.795 client 10.0.0.16#47953 (cnn.com): next
07-Oct-2016 07:52:20.795 client @0x7fd0c8000cd0: udprecv
07-Oct-2016 07:52:20.795 client 10.0.0.16#47953 (cnn.com): request failed: duplicate query
07-Oct-2016 07:52:20.795 client 10.0.0.16#47953 (cnn.com): endrequest
Then suddenly named wakes up smells the coffee and goes to work

Quote:
07-Oct-2016 07:52:22.403 fetch: cnn.com/DS
07-Oct-2016 07:52:22.515 validating cnn.com/DS: starting
07-Oct-2016 07:52:22.515 validating cnn.com/DS: attempting negative response validation
...
07-Oct-2016 07:52:22.571 validating cnn.com/A: marking as answer (dlvfetched (3))
07-Oct-2016 07:52:22.571 client 10.0.0.16#47953 (cnn.com): send
07-Oct-2016 07:52:22.571 client 10.0.0.16#47953 (cnn.com): sendto
07-Oct-2016 07:52:22.571 client 10.0.0.16#47953 (cnn.com): senddone
07-Oct-2016 07:52:22.571 client 10.0.0.16#47953 (cnn.com): next
07-Oct-2016 07:52:22.571 client 10.0.0.16#47953 (cnn.com): endrequest
As you can see from the moment named went to work, it took 168 milliseconds for it to return a result. However, named reported the query time as 1.7 seconds, that is, from the time of the second request. (That's a named bug imho.) But what was it doing for the previous almost 1.5 seconds from the second request and almost 7 seconds from the first request?
 
Old 10-08-2016, 12:02 AM   #5
wastingtime
Member
 
Registered: Sep 2004
Posts: 86

Original Poster
Rep: Reputation: 17
smallpond, I made the changes, no observable improvement. I found a workaround which helps some: I enabled prefetch. That keeps cached entries up to date, helps with frequent requests.

Still, when named needs to query an external DNS server, it always takes 6-7 seconds. Interestingly, if I query that same server directly, for example using dig, it returns the answer quite quickly. For example, from named log when looking for pop.googlemail.com

Quote:
07-Oct-2016 21:46:08.851 client 10.0.0.16#58146: UDP request
07-Oct-2016 21:46:08.851 client 10.0.0.16#58146: using view '_default'
07-Oct-2016 21:46:08.851 client 10.0.0.16#58146: request is not signed
07-Oct-2016 21:46:08.851 client 10.0.0.16#58146: recursion available
07-Oct-2016 21:46:08.851 client 10.0.0.16#58146: query
07-Oct-2016 21:46:08.851 client 10.0.0.16#58146 (pop.googlemail.com): query (cache) 'pop.googlemail.com/A/IN' approved
07-Oct-2016 21:46:08.851 client 10.0.0.16#58146 (pop.googlemail.com): replace
07-Oct-2016 21:46:08.851 clientmgr @0x7f5e073f0458: get client
07-Oct-2016 21:46:08.851 clientmgr @0x7f5e073f0458: recycle
07-Oct-2016 21:46:08.851 fetch: googlemail-pop.l.google.com/A
07-Oct-2016 21:46:08.851 client @0x7f5df812e370: udprecv
07-Oct-2016 21:46:08.851 expiring v4 for name 0x7f5dff2e4600
07-Oct-2016 21:46:08.851 dns_adb_createfind: found A for name ns1.google.com (0x7f5dff2e4600) in db
07-Oct-2016 21:46:08.851 expiring v4 for name 0x7f5dff2e44d0
07-Oct-2016 21:46:08.851 dns_adb_createfind: found A for name ns2.google.com (0x7f5dff2e44d0) in db
07-Oct-2016 21:46:08.851 expiring v4 for name 0x7f5dff2e8ac0
07-Oct-2016 21:46:08.851 dns_adb_createfind: found A for name ns3.google.com (0x7f5dff2e8ac0) in db
07-Oct-2016 21:46:08.851 expiring v4 for name 0x7f5dff2e8990
07-Oct-2016 21:46:08.851 dns_adb_createfind: found A for name ns4.google.com (0x7f5dff2e8990) in db
07-Oct-2016 21:46:09.651 dns_adb_destroyfind on find 0x7f5dff2bbb50
07-Oct-2016 21:46:09.651 dns_adb_destroyfind on find 0x7f5dff2c3f10
07-Oct-2016 21:46:09.651 dns_adb_destroyfind on find 0x7f5dff2bb1f0
07-Oct-2016 21:46:09.651 dns_adb_destroyfind on find 0x7f5dff2cf5b0
07-Oct-2016 21:46:10.452 dns_adb_destroyfind on find 0x7f5dff2bbb50
07-Oct-2016 21:46:10.452 dns_adb_destroyfind on find 0x7f5dff2c3f10
07-Oct-2016 21:46:10.452 dns_adb_destroyfind on find 0x7f5dff2bb1f0
07-Oct-2016 21:46:10.452 dns_adb_destroyfind on find 0x7f5dff2cf5b0
07-Oct-2016 21:46:12.052 dns_adb_destroyfind on find 0x7f5dff2c3f10
07-Oct-2016 21:46:12.052 dns_adb_destroyfind on find 0x7f5dff2bb1f0
07-Oct-2016 21:46:12.053 dns_adb_destroyfind on find 0x7f5dff2bbb50
07-Oct-2016 21:46:12.053 dns_adb_destroyfind on find 0x7f5dff2cf5b0
07-Oct-2016 21:46:15.253 dns_adb_destroyfind on find 0x7f5dff2bb1f0
07-Oct-2016 21:46:15.253 dns_adb_destroyfind on find 0x7f5dff2bbb50
07-Oct-2016 21:46:15.253 dns_adb_destroyfind on find 0x7f5dff2c3f10
07-Oct-2016 21:46:15.253 dns_adb_destroyfind on find 0x7f5dff2cf5b0
07-Oct-2016 21:46:15.344 received packet from 216.239.34.10#53 (no opt):
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14831
;; flags: qr aa; QUESTION: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;googlemail-pop.l.google.com. IN A

;; ANSWER SECTION:
googlemail-pop.l.google.com. 300 IN A 74.125.28.16
As you can see named spent over 6 seconds doing something, logging "dns_adb_destroyfind" before it bothered to query google's name server.

Querying the same name server directly

Quote:
$ dig pop.googlemail.com @216.239.34.10

; <<>> DiG 9.10.4-P1 <<>> +noadflag pop.googlemail.com @216.239.34.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36706
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;pop.googlemail.com. IN A

;; ANSWER SECTION:
pop.googlemail.com. 600 IN CNAME googlemail-pop.l.google.com.
googlemail-pop.l.google.com. 300 IN A 173.194.203.16

;; Query time: 457 msec
;; SERVER: 216.239.34.10#53(216.239.34.10)
;; WHEN: Fri Oct 07 21:52:53 PDT 2016
;; MSG SIZE rcvd: 90
 
Old 10-08-2016, 06:47 AM   #6
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora 34
Posts: 3,666

Rep: Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059Reputation: 1059
It could be a firewall and source port randomization.
 
Old 10-08-2016, 06:47 PM   #7
wastingtime
Member
 
Registered: Sep 2004
Posts: 86

Original Poster
Rep: Reputation: 17
I ran wireshark on the dns machine looking for communication between it, the client machine and the outside world
Below see the request comming from 10.0.0.16 (client) to 10.0.0.2 (dns server) The server immediately queries the resolver for cnn.com and receives a response. It then queries another resolve and recieves the same response, etc. This goes on for a while (seems like named bug?) until the client times out and retires the request, at which point named sends back SRVFAIL (another bug?)
Quote:

No. Time Source Destination Protocol Length Info
429 2016-10-08 12:28:11.884293 10.0.0.16 10.0.0.2 DNS 80 Standard query 0xc7c8 A cnn.com OPT
430 2016-10-08 12:28:11.884768 10.0.0.2 205.251.196.62 DNS 80 Standard query 0xeb71 A cnn.com OPT
431 2016-10-08 12:28:11.928284 205.251.196.62 10.0.0.2 DNS 248 Standard query response 0xeb71 A cnn.com A 157.166.226.26 A 157.166.226.25 NS ns-1086.awsdns-07.org NS ns-1630.awsdns-11.co.uk NS ns-47.awsdns-05.com NS ns-576.awsdns-08.net OPT
432 2016-10-08 12:28:12.685049 10.0.0.2 205.251.198.94 DNS 80 Standard query 0x9947 A cnn.com OPT
433 2016-10-08 12:28:12.748541 205.251.198.94 10.0.0.2 DNS 248 Standard query response 0x9947 A cnn.com A 157.166.226.26 A 157.166.226.25 NS ns-1086.awsdns-07.org NS ns-1630.awsdns-11.co.uk NS ns-47.awsdns-05.com NS ns-576.awsdns-08.net OPT
436 2016-10-08 12:28:13.485345 10.0.0.2 205.251.192.47 DNS 80 Standard query 0xb364 A cnn.com OPT
437 2016-10-08 12:28:13.568526 205.251.192.47 10.0.0.2 DNS 248 Standard query response 0xb364 A cnn.com A 157.166.226.26 A 157.166.226.25 NS ns-1086.awsdns-07.org NS ns-1630.awsdns-11.co.uk NS ns-47.awsdns-05.com NS ns-576.awsdns-08.net OPT
440 2016-10-08 12:28:15.085678 10.0.0.2 205.251.194.64 DNS 80 Standard query 0x2801 A cnn.com OPT
441 2016-10-08 12:28:15.168584 205.251.194.64 10.0.0.2 DNS 248 Standard query response 0x2801 A cnn.com A 157.166.226.25 A 157.166.226.26 NS ns-1086.awsdns-07.org NS ns-1630.awsdns-11.co.uk NS ns-47.awsdns-05.com NS ns-576.awsdns-08.net OPT
444 2016-10-08 12:28:18.286007 10.0.0.2 205.251.198.94 DNS 80 Standard query 0x9567 A cnn.com OPT
445 2016-10-08 12:28:18.348275 205.251.198.94 10.0.0.2 DNS 248 Standard query response 0x9567 A cnn.com A 157.166.226.26 A 157.166.226.25 NS ns-1086.awsdns-07.org NS ns-1630.awsdns-11.co.uk NS ns-47.awsdns-05.com NS ns-576.awsdns-08.net OPT
454 2016-10-08 12:28:21.881474 10.0.0.16 10.0.0.2 DNS 80 Standard query 0xc7c8 A cnn.com OPT
455 2016-10-08 12:28:21.884999 10.0.0.2 10.0.0.16 DNS 80 Standard query response 0xc7c8 Server failure A cnn.com OPT
 
Old 10-16-2016, 01:51 PM   #8
wastingtime
Member
 
Registered: Sep 2004
Posts: 86

Original Poster
Rep: Reputation: 17
This seemed to have solved the problem

http://serverfault.com/questions/722...recursion-slow

I do not run the OS in a VM, however, the default setting was to offload checksum calculations

Before
Quote:
# ethtool -k eth0 | grep on
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
Disable offloading
Quote:
# ethtool -K eth0 tx off rx off
# /etc/init.d/network restart
After
Quote:
# ethtool -k eth0 | grep on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
 
Old 10-16-2016, 05:53 PM   #9
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,815

Rep: Reputation: 816Reputation: 816Reputation: 816Reputation: 816Reputation: 816Reputation: 816Reputation: 816
Thanks for sharing this!
If your NIC produces wrong checksums, you can as well go fo a NIC firmware update.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Help with named - error sending response: host unreachable rootaccess Linux - General 3 03-04-2014 04:42 PM
(bind) named: couldn't open pid file '/var/run/named/named.pid' - any help? samengr Linux - Server 6 04-01-2009 06:22 AM
BIND/named Startup Error - named.root:1: '}' expected near ';' acutchin Linux - Server 4 11-10-2008 09:43 AM
X server slow response when connected to slow Wifi bluesmanu Linux - Software 0 04-09-2008 06:56 AM
BIND 9: Slow response from root servers. Local is ok. Apollo77 Linux - Networking 3 01-14-2005 10:22 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - Software

All times are GMT -5. The time now is 01:38 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration