Network Slowdown: difficult to diagnose
Something definitely disrupted my network over Memorial Day weekend May 28-30.
Upon arriving back at the office, I noticed that my usual tar backups and the rsync between the primary server and backup server had not completed.

My devices are as follows:
1. slackware 13.0 router/firewall box
2. slackware 13.1 primary server with samba
3. gentoo backup/fail-over server (rsyncs hourly with primary server and performs nightly backups)
4. copier/scanner that transfers scanned docs directly to primary server
5. 40-50 windows clients

During the past week, I've experienced the dreaded, nagging "network slowdown" and have not been able to discern the cause.

Here's the list of symptoms:
Copier/scanner used to complete transfer of scanned documents to the samba server in less than a second; now it takes a solid 10-12 seconds.
Windows clients get intermittent slowness when trying to open different directories and sub-folders on the samba server.
When I ssh into the slackware router/firewall box from an external location over the internet, there is no pause or delay.
When I ssh from inside the LAN to either the routerbox, primary server, or backup server, I get the same 10-12 second delay upon login -- the connection is immediate, but after I provide my user name there is now a 10-12 second delay before I am prompted for the user password.

The weird thing is everything works, only slower:
I've pinged every box/server/client from every other box/server/client for hours with no packet loss.
Users can access all their directories and documents on the samba server.
Accounting people are using the data sets on the samba server.
Everyone has consistent internet access.
The rsync between primary and backup server works, as well as the nightly backup.

I've cycled power on the 48 port network switch, the routerbox, primary server, backup server, scanner, and all client boxes.
I've run hdparm on routerbox, primary server, and backup server, and they all report normal speeds for older SATA 1 raid drives:
Timing cached reads: 1842 MB in 2.00 seconds = 920.90 MB/sec
Timing buffered disk reads: 190 MB in 3.00 seconds = 63.30 MB/sec

I've run "top" for hours on each box and the cpu loads are typically 2%-3%, with only occasional spikes up to 30% on the samba server. There's almost no use of swap.

I'm not a tcpdump expert, but I've been logging data traffic on the routerbox and don't see any obvious culprit, such as a piece of spyware crushing all the available bandwidth like a denial of service attack. The office tends toward the permissive side regarding internet use, so I see a decent amount of data traffic from constantly updating browser apps from The Weather Channel, Facebook, Marketwatch, etc.

I've also run iptraf to observe packets and bytes per interface and per LAN client and I'm not seeing any obvious abuse of bandwidth.
-----------------------------------------
What is the next course of action? Do I need to take a class in Wireshark? I've hit the limit of what I know to look for.
Thank you for reading such a long post; any guidance greatly appreciated. |
@ Reply
Hi there,
As you said Quote:
You said that you have got a 48 port switch. Is every system connected to this switch? As of now the image of your network that appears in my mind is as follows: Code:
Also, it would be great if you could let us know whether the above diagram matches your infrastructure or whether there is a difference, because if other switches are involved we have to look at them as well. Is this happening on every system within the LAN or only on a few of them? Did you configure VLANs on the switch? If yes, how are they configured, and is this happening on a particular VLAN or on all of them? |
Quote:
1. Yes, every system is connected to the switch.
2. Yes, network image is exactly right.
3. Yes, this is happening for each and every system within the LAN.
4. VLAN is not configured on the switch, and no other switches exist on the network. |
@ Reply
Now I am getting a better picture. Try the following things:
1. As you said you have around 40-50 Windows clients, configure a shared folder on one of the clients and try to access it from another Windows machine. This will show us whether the problem occurs only when accessing data on the servers or regardless of that.
2. If you have an 8 port switch for testing (do this only if your 48 port switch is not acting as the DHCP server), put that switch between the servers and 2-3 clients and see whether anything changes, because as of now this looks like a hardware issue rather than a software one.
3. Install wireshark on the client/server, perform a simple ping request, and see at what time the client sends the ping request and at what time the server sees it. This will tell us the exact difference in time.
I hope this helps. |
Quote:
Quote:
Quote:
I need to do some reading on how to use wireshark too. T3RM1, thank you very much for the guidance. Will report back once I complete the testing mentioned above. |
@ Reply
You're welcome.
You can get information about wireshark from here: http://www.wireshark.org/ , you can also download it from the same link. Edit: Forgot to mention that for linux you can go with ethereal: http://www.ethereal.com/download.html |
Quote:
|
@ Reply
Great.
Now we know that the issue is not with the client systems communicating among themselves. As we already know that the server response is normal when accessed externally, this implies that the issue is not with the server either. To narrow the issue down further we can try the following things:
1. Ping from the server to different clients and keep track of the reply time in ms.
2. Ping between clients to compare the latency.
3. If possible, connect the server to a different port on the switch and see the difference. This will rule a problematic port in or out.
4. If the above steps do not give us any clue, then a LAN trace will be the only option to go deeper. |
T3RM1,
Thanks for hanging in with me. I couldn't stay late tonight to try the 8-port switch experiment; some personal demands to take care of. I did try an experiment earlier in the day today -- I ran tcpdump on routerbox LAN_nic and Primary/Samba_Server_nic while I scanned a document. I was able to date-sync the two boxes to within a second of each other and will try to piece together the segment-to-segment response from Scanner to LAN_nic to Samba_Server_nic. Wouldn't it figure, though; I saw a noticeable difference in the scanner notification today -- it was much shorter -- more like 3-5 secs. today as opposed to 10-12 secs. the day before. This is real "ghost in the machine" stuff. |
Quote:
Since tcpdump and TShark (command-line wireshark) both use pcap, I thought I would stick with tcpdump for now. Here's the first timestamp of the document scanner hitting the routerbox LAN_nic:
Code:
2011-06-06 09:34:32.934962 IP (tos 0x0, ttl 64, id 15075, offset 0, flags [none], proto UDP (17), le
The first timestamp and beginning transaction on the SERVER_nic:
Code:
2011-06-06 09:34:34.284477 IP (tos 0x0, ttl 64, id 15075, offset 0, flags [none], proto UDP (17), l$
From this output, it appears there is at most a 2 second delay between them. Of course, this is frustrating because it seems to undermine what I am seeing with my own eyes and what other users on the network are reporting to me. I'll try to perform the "switch test" after 6 pm today. |
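As a rough cross-check of the "at most a 2 second delay" estimate, the two tcpdump timestamps quoted above can be subtracted directly. This is only a sketch: the function name is made up for illustration, and since the two clocks were only synced to within about a second, the result cannot be told apart from clock error.

```python
from datetime import datetime

# tcpdump timestamp layout used in the post: date, time, microseconds
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def capture_offset(first_seen: str, second_seen: str) -> float:
    """Return seconds elapsed between two capture timestamps."""
    t1 = datetime.strptime(first_seen, STAMP_FORMAT)
    t2 = datetime.strptime(second_seen, STAMP_FORMAT)
    return (t2 - t1).total_seconds()


# Values copied from the routerbox LAN_nic and SERVER_nic captures above.
ROUTER_STAMP = "2011-06-06 09:34:32.934962"
SERVER_STAMP = "2011-06-06 09:34:34.284477"

print(f"{capture_offset(ROUTER_STAMP, SERVER_STAMP):.6f} s")
```

The difference works out to about 1.35 seconds, which is close to the stated clock-sync tolerance, so this measurement alone neither confirms nor rules out a network delay.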
I've experienced general network slowdowns in the past. One time it turned out that one of the DNS servers (the primary) wasn't running properly, so every lookup took several seconds to fail over to the secondary. Once the issue was corrected, bam, all network services started to work again.
|
Quote:
Thanks for your response. I'm doing some research to find out about using tcpdump to discover dns problems. |
@ Reply
To check for a dns server problem you can also use nslookup and see from which server you get the authoritative answer. If there is something wrong with the primary dns server, you will get the authoritative answer from the secondary dns. You can use the following commands:
nslookup - to get into nslookup
set debug - this will show you the query you are performing and the response you are getting, and will also display how much time it took to resolve the query.
You can use "server xxx.xxx.xxx.xxx" or "server dns_name" (without quotes) at the nslookup prompt to change the dns server used for the query. |
Quote:
Testing from several different clients obtains the same answer (the ISP's primary dns) with no delay. sidenote: still cannot complete the substitute router test due to some co-workers staying late to finish a project. |
Quote:
Wireshark on client and tcpdump on server. Shown below is the last request/reply in a 500 count series. No packet loss.
Clock Sync Issue: unable to sync clocks between client/server; at any given time there was an observable difference of 1-3 seconds.

Client Ping to Server:

Logged on Client nic:
10:34:33.291392 IP 10.10.10.185 > 10.10.10.199: ICMP echo request, id 512, seq 34818, length 40
10:34:33.291398 IP 10.10.10.199 > 10.10.10.185: ICMP echo reply, id 512, seq 34818, length 40

Logged on Server nic:
10.10.10.185(Client) 10.10.10.199(Server) ICMP 74 Echo (ping) request id=0x0200, seq=34818/648, ttl=128
Arrival Time: Jun 8, 2011 10:34:30.823649000 Eastern Daylight Time
10.10.10.199(Server) 10.10.10.185(Client) ICMP 74 Echo (ping) reply id=0x0200, seq=34818/648, ttl=64
Arrival Time: Jun 8, 2011 10:34:30.823797000 Eastern Daylight Time |
@ Reply
Hi Sum1,
As you said there is a time difference of 1-3 seconds between the server and the client, may I know where the server gets its time from? Is the server taking time from an NTP source or from its local clock? Also, how are the clients configured to receive time? Are they configured to use the Slackware server as their NTP source?
The packet capture you pasted is limited and I will not be able to analyze or suggest much from it. I can also see that you took a packet capture of a ping request, which may well look normal. What we are trying to figure out is whether, when a client requests data, the packet arrives late or the server takes time to respond because it has to recursively search for the requested data in the directory structure.
Did you perform the switch test? Also, if you can paste the full output of a packet capture (not just ICMP, but a data request from client to server), that will be more useful for diagnosis. |
Hi T3RM1,
I had a feeling this output would not be helpful. I will set up an ntp server on the router/firewall box since it is the dhcp server for the LAN and provides dns to clients. Have not completed the 8-port switch test yet. I apologize for this dragging on so long -- it's a long story but I wear many hats at my job. :-) I'll complete the ntp-sync'ed data collection and switch test. |
@ Reply
Hi Sum1,
I can understand, I am in the same boat as you are :-) |
Quote:
I hope you're still out there. :-)
I finally got a chance to kick out all the users and pull some wires last night. I put the scanner/copier, 1 windows client, Server 1, Back-up Server, and Routerbox on the 8-port switch. I was surprised to find that the slow performance was exactly the same, so this rules out a faulty 48-port switch. However, the bad news is that I'm still hunting the problem.
- - - - - - -
I did some tcpdump logging after configuring the routerbox to act as an ntp server. I had a client box, server 1, back-up server, and routerbox all looking synced to within a blink of an eye. In the following logs the tcpdump snaplen was set to either 3000 or 5000 bytes; I can't remember which --- I hope this provides the right amount of logged data.
Clientbox pinging Routerbox LAN nic:
Code:
2011-06-10 12:03:41.375677 IP 10.10.10.193 > 10.10.10.1: ICMP echo request, id 23559, seq 1, length 64
Code:
2011-06-10 12:03:41.481708 IP 10.10.10.193 > 10.10.10.1: ICMP echo request, id 23559, seq 1, length 64
Code:
2011-06-10 12:03:48.719680 IP 10.10.10.193 > 10.10.10.186: ICMP echo request, id 23815, seq 1, length 64
Code:
2011-06-10 12:03:48.826310 IP 10.10.10.193 > 10.10.10.186: ICMP echo request, id 23815, seq 1, length 64
Code:
2011-06-10 12:03:57.455216 IP 10.10.10.193 > 10.10.10.199: ICMP echo request, id 24071, seq 1, length 64
Code:
2011-06-10 12:03:57.563778 IP 10.10.10.193 > 10.10.10.199: ICMP echo request, id 24071, seq 1, length 64 |
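The capture lines above come in pairs that share the same source, destination, ICMP id, and seq, so each pair is presumably the same echo request logged at two capture points. The sketch below matches the lines on those fields and reports the time between the first and last sighting of each request. The regular expression and function name are illustrative and cover only the line layout quoted in the post.

```python
import re
from datetime import datetime

# Matches the tcpdump ICMP echo-request lines quoted in the post.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) IP (\S+) > (\S+): "
    r"ICMP echo request, id (\d+), seq (\d+)"
)


def sighting_offsets_ms(lines):
    """Return {(src, dst, id, seq): ms between first and last sighting}."""
    sightings = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that is not an echo request
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        key = (match.group(2), match.group(3),
               int(match.group(4)), int(match.group(5)))
        sightings.setdefault(key, []).append(stamp)
    return {
        key: round((max(stamps) - min(stamps)).total_seconds() * 1000, 3)
        for key, stamps in sightings.items()
        if len(stamps) > 1  # a single sighting has nothing to compare with
    }


CAPTURED = [
    "2011-06-10 12:03:41.375677 IP 10.10.10.193 > 10.10.10.1: ICMP echo request, id 23559, seq 1, length 64",
    "2011-06-10 12:03:41.481708 IP 10.10.10.193 > 10.10.10.1: ICMP echo request, id 23559, seq 1, length 64",
    "2011-06-10 12:03:48.719680 IP 10.10.10.193 > 10.10.10.186: ICMP echo request, id 23815, seq 1, length 64",
    "2011-06-10 12:03:48.826310 IP 10.10.10.193 > 10.10.10.186: ICMP echo request, id 23815, seq 1, length 64",
    "2011-06-10 12:03:57.455216 IP 10.10.10.193 > 10.10.10.199: ICMP echo request, id 24071, seq 1, length 64",
    "2011-06-10 12:03:57.563778 IP 10.10.10.193 > 10.10.10.199: ICMP echo request, id 24071, seq 1, length 64",
]

for (src, dst, icmp_id, seq), ms in sighting_offsets_ms(CAPTURED).items():
    print(f"{src} -> {dst} id={icmp_id} seq={seq}: {ms} ms")
```

For the three pairs above the offsets are roughly 106.0, 106.6, and 108.6 ms. An offset that stays nearly constant across different destinations may point to a small leftover clock difference between the capture hosts rather than to a variable network delay, though the excerpt is too short to say for certain.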
@ Reply
Hi Sum1,
Yes, I am still here :-)
Can you please let me know the IPs of the following:
1. Backup server
2. Client machine (from where you have performed the test)
3. Server1
4. Router box
From the trace it appears that server1 is responding properly. One thing I have observed is that there are ARP requests which are taking a longer time. Do you have the DHCP lease set low, say 1 day or 2 days? Also, what are the following IPs:
2011-06-10 12:04:05.927198 IP 10.10.10.137.137 > 10.10.10.255.137: UDP, length 50
2011-06-10 12:04:06.676694 IP 10.10.10.137.137 > 10.10.10.255.137: UDP, length 50
Please let me know the IPs of the above mentioned boxes so it will be easier to identify the response time. |
- - - - - - - - - - - -
Backup server: .186
Clientbox: .193
Server 1: .199
Routerbox: .1 (LAN nic eth1)
- - - - - - - - - - - - -
DHCP lease settings from /etc/dhcpd.conf:
default-lease-time 720;
max-lease-time 86400;
- - - - - - - - - - - - -
10.10.10.137 has got to be one of the windows client machines on the LAN, since all the network printers and scanners have static ip's ending in even numbers with a zero, such as .120 or .150.
And I believe 10.10.10.255 is the broadcast address that all devices default to during outbound tcp/ip requests, as defined by the dhcp server. From /etc/dhcpd.conf:
option subnet-mask 255.255.255.0;
option broadcast-address 10.10.10.255;
option routers 10.10.10.1;
- - - - - - - - - - - - - |
@ Reply
Hi Sum1,
I hope you are having a nice weekend. From the trace I do not see any packet drop or delay. I am assuming that the trace was taken with the 8 port switch. Also, I think you used static IPs during the trace, which will rule out an issue with DHCP.
The only thing I can think of after this trace is the NIC setting, i.e. auto-negotiation or full duplex. Sometimes a network runs slowly when the switch or router works at full duplex while the server's NIC is set to half duplex or auto-negotiation mode. Either of those can leave the link running at half duplex speed.
Also, if you get a chance, I would suggest taking a packet trace in the live environment, which will show us some data exchange between client and server and not only ICMP requests. |
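One way to check the negotiated link settings suggested above, sketched here under the assumption that the Linux boxes expose the usual `speed` and `duplex` attributes under `/sys/class/net/<interface>/`. The interface names are guesses (only eth1 is named in the thread) and the helper function is hypothetical.

```python
from pathlib import Path


def link_settings(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read negotiated speed (Mb/s) and duplex for one interface from sysfs."""
    base = Path(sysfs_root) / iface
    settings = {}
    for attribute in ("speed", "duplex"):
        try:
            settings[attribute] = (base / attribute).read_text().strip()
        except OSError:
            # Missing interface, link down, or attribute not provided.
            settings[attribute] = "unknown"
    return settings


# Interface names are assumptions; adjust to the NICs actually present.
for name in ("eth0", "eth1"):
    print(name, link_settings(name))
```

A server NIC reporting `half` while the switch port is forced to full duplex would match the mismatch described above. `ethtool <interface>` reports the same values, along with whether auto-negotiation is enabled.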
Quote:
All is well except for this little mystery. <grin> I hope you had good weather and a bit of rest and relaxation this weekend. The scan was taken using the 48-port switch, so it's good news that I don't need to buy another one. The Routerbox, Server 1, and Backup Server, are set with static ip addresses. I'll check into the nic modes on the R-box, Serv1, and Bserv. And I'll post up some typical traffic on R-box LAN nic and Serv1 nic. Thanks again for giving your patience and guidance; it's very helpful. Be well. |
Here's a clip of traffic on the Server 1 nic; it's difficult to post a wider timeframe due to the 30,000 character forum limit:
Code:
2011-06-20 14:08:12.762113 IP 10.10.10.199.445 > 10.10.10.109.4678: Flags [P.], seq 979411:979774, ack 165907, win 65535, length 363
SMB PACKET: SMBreadX (REPLY) |
@ Reply
Hi Sum1,
Indeed it is a mystery. I do not see any delay in the trace, and I am not sure where the delay is coming from. This is interesting and head-scratching at the same time. Have a look at the NIC settings; I don't think that will be the cause, since it should show up in the trace, but we can still check.
So far we have ruled out the following possibilities:
1. Hardware issue.
2. Problem with primary DNS.
3. Problem with switch port.
4. Problem with server1, as it works fine when you ssh in over the WAN.
Current situation:
1. LAN still slow, problem with backups.
I hope I am summarizing it correctly :-) I am doing this so that we can look at this one post and take it forward from here. Give the following steps a try:
1. Try to ssh to server1 using the IP address instead of the hostname (just to be sure that there is nothing wrong with DNS).
2. I have a feeling that the workstations' requests are being processed normally and the issue is only between server1 and the backup server, and with the printers.
3. Are you using cups for printing? If yes, try printing by installing the printer as a local printer on a workstation, so that we can be sure it is not a cups issue.
4. For the backup server, is the trouble with the daily, weekly, or monthly backups? Knowing that lets us investigate accordingly. |
Quote:
Long time, no see. I hope all is well on your side.
Well, I may have a dns problem after all. I finally got back around to taking a look at the problem and found that when I put the configuration parameter "useDNS = no" into the sshd_config file on the routerbox, primary serv, and backup serv, all the authentication slowdown suddenly disappeared. Ssh works perfectly now.
That said, users on the network still report intermittent pauses, slowdowns, and "Not Responding" messages when traversing directories and shares on the samba server (primary server).
Going back in memory to the start of the problem, I recall upgrading the primary server from Slackware 13.1 to 13.37 as soon as it was released. This was a full month before the network problems were reported or discovered. I'm wondering if there is some dns issue with samba version 3.5.6, the current version in Slack 13.37. I've asked about this on the samba support list and I'm hoping to obtain some guidance on this point. Even though samba is acting as a standalone server (a workgroup server) and not a true primary domain server, I'm wondering if bind is necessary, or some other configuration unique to 3.5.6.
Please let me know what you think if you see this. Be well and best regards. |
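With UseDNS enabled, sshd looks up the host name for the connecting client's IP address, so a slow or failing reverse lookup on the LAN would be consistent with the login delay described above. The sketch below times such a lookup through the system resolver. The function name is illustrative, and the `resolver` parameter exists only so the lookup can be swapped out for testing.

```python
import socket
import time


def timed_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, seconds spent) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        hostname = None
    return hostname, time.monotonic() - start
```

Running `timed_reverse_lookup("10.10.10.193")` on one of the servers (that address being the Clientbox from the earlier post) would show whether resolving a LAN client takes close to the 10-12 seconds seen at the ssh prompt. A fast result would point away from reverse DNS.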
Follow-up:
/var/log/samba/nmbd.log --
[2011/09/19 13:13:07.959554, 0] nmbd/nmbd_browsesync.c:350(find_domain_master_name_query_fail)
find_domain_master_name_query_fail:
Unable to find the Domain Master Browser name MW<1b> for the workgroup MW.
Unable to sync browse lists in this workgroup.
[2011/09/19 13:28:07.204633, 0] nmbd/nmbd_browsesync.c:350(find_domain_master_name_query_fail)

root@a1:/var/log/samba# smbclient -N -L a1 --
Anonymous login successful
Domain=[MW] OS=[Unix] Server=[Samba 3.5.6]

Sharename Type Comment
--------- ---- -------
Ac Disk
Ma Disk
Ca Disk
Ne Disk
Ol Disk
Ka Disk
Mz Disk
Fa Disk
Sc Disk
IPC$ IPC IPC Service (A1 Server)
Anonymous login successful
Domain=[MW] OS=[Unix] Server=[Samba 3.5.6]

Server Comment
--------- -------
A1 A1 Server

Workgroup Master
--------- -------
MW A1

/etc/samba/smb.conf --
#======================= Global Settings =====================================
[global]
netbios name = a1
workgroup = mw
server string = A1 Server
security = user
hosts allow = 192.168.1. 127.0.0.
hosts deny = 0.0.0.0/0
log file = /var/log/samba.%m
max log size = 500
passdb backend = tdbsam
encrypt passwords = Yes
local master = yes
os level = 99
time server = yes
preferred master = yes
wins support = yes
wide links = no
#============================ Share Definitions ==============================
[Ac]
writable = yes
read only = no
guest ok = yes
public = yes
oplocks = true
level2 oplocks = true
path = /abc/def
create mask = 0777
directory mask = 0777
security mask = 0777
directory security mask = 0777
[Ma]
writable = yes
read only = no
guest ok = yes
public = yes
oplocks = true
level2 oplocks = true
path = /abc/ghi
create mask = 0777
directory mask = 0777
security mask = 0777
directory security mask = 0777

The settings on all other shares are precisely the same. |
I am now using log level 3 in samba and the nmbd process shows the following just about every 15 minutes:
Code:
[2011/09/23 11:39:19.617707, 0] nmbd/nmbd_browsesync.c:350(find_domain_master_name_query_fail)
Code:
[2011/09/23 12:35:40.645624, 0] lib/util_sock.c:1432(get_peer_addr_internal) |
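To check the "just about every 15 minutes" observation, the gaps between successive domain master query failures can be measured from the log header lines. This is a sketch that assumes the header layout shown in the posted nmbd log; the function name is made up, and the sample uses the two failure headers from the 2011/09/19 excerpt.

```python
import re
from datetime import datetime

# Samba log header, e.g. "[2011/09/19 13:13:07.959554, 0] file.c:350(function)"
STAMP_RE = re.compile(r"\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}),\s*\d+\]")


def failure_gaps(log_text, marker="find_domain_master_name_query_fail"):
    """Return seconds between successive header lines that mention marker."""
    stamps = []
    for line in log_text.splitlines():
        if marker not in line:
            continue
        match = STAMP_RE.search(line)
        if match:  # message body lines repeat the marker but carry no stamp
            stamps.append(
                datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f")
            )
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


SAMPLE = """\
[2011/09/19 13:13:07.959554, 0] nmbd/nmbd_browsesync.c:350(find_domain_master_name_query_fail)
find_domain_master_name_query_fail:
Unable to find the Domain Master Browser name MW<1b> for the workgroup MW.
[2011/09/19 13:28:07.204633, 0] nmbd/nmbd_browsesync.c:350(find_domain_master_name_query_fail)
"""

for gap in failure_gaps(SAMPLE):
    print(f"{gap / 60:.2f} minutes")
```

For the two sample headers the gap is about 899 seconds, just under 15 minutes, which agrees with the interval reported above.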
Solved.
Not a hardware (nic or switch) issue.
Not a dns issue.
Not an ntp issue.
Not a firewall/router tcp/ip or iptables issue.

I needed to declare a samba domain master, add a few parameters, and drop some others.

/etc/samba/smb.conf:
#======================= Global Settings =====================================
[global]
netbios name = a1
workgroup = mw
server string = A1 Server
security = user
hosts allow = 10.10.10. 127.
hosts deny = 0.0.0.0/0
log file = /var/log/samba.%m
max log size = 500
passdb backend = tdbsam
encrypt passwords = Yes
domain master = yes (added)
local master = yes
## os level = 99 (commented out)
smb ports = 139 (added)
## time server = yes (commented out)
preferred master = yes
wins support = yes
name resolve order = wins host bcast lmhosts (added)
wide links = no
log level = 3
- - - - - - - - - - - - - - - -
Success:
killall nmbd
killall smbd
then started samba - /etc/rc.d/rc.samba
- - - - - -
/var/log/samba.nmbd read:
[2011/09/27 14:13:37.248333, 3] nmbd/nmbd_sendannounce.c:207(send_host_announcement)
send_host_announcement: type 819a03 for host A1 on subnet xxx.xxx.xxx.xxx for workgroup MW
[2011/09/27 14:13:37.248435, 0] nmbd/nmbd_become_dmb.c:337(become_domain_master_browser_wins)
become_domain_master_browser_wins:
Attempting to become domain master browser on workgroup MW, subnet UNICAST_SUBNET.
[2011/09/27 14:13:37.248523, 0] nmbd/nmbd_become_dmb.c:351(become_domain_master_browser_wins)
become_domain_master_browser_wins: querying WINS server from IP xxx.xxx.xxx.xxx for domain master browser name MW<1b> on workgroup MW
and then a flood of incoming client requests to process multihomed winserver name query
and then
add_name_to_subnet: Added netbios name A1 to subnet
and then
check_for_master_browser_fail: Forcing election on workgroup MW
and then
check_elections: >>> Starting election for workgroup MW on subnet
and then
Samba server A1 is now a domain master browser for workgroup MW on subnet
and then
Samba name server A1 is now a local master browser for workgroup MW on subnet
and finally
We are both a domain and a local master browser for workgroup MW. Do not announce to ourselves.
- - - - - - - - - - - - -
No more domain master query failures.
No more slowness and "not responding" delays when traversing folders and directories on the network.
I received some very kind help on the samba user mailing list, and am very grateful for it. So glad this issue is resolved. |
Quote:
After a few days, the same network slowdown came back. User authentication to the samba server was functioning fine, but opening shares and sub-directories was often slow and/or completely unresponsive for up to 1 minute.
I changed the "smb ports =" parameter to 139 and 445.
I reconfirmed access delays from client boxes to server using tcpdump.

I now believe the problem has been solved, based on user reports. What worked was making changes to the dhcp server on my routerbox and to the hosts file on the samba server. Since my samba server is not a true domain server, I am not running a bind/named/dns server on my LAN, only a dhcp server.

I added "host" entries in the routerbox dhcpd.conf so that every client machine is assigned a static ip address based on its NIC MAC address. Example:
host kevin_box {
hardware ethernet 00:0B:A5:1F:B3:88;
fixed-address 10.10.10.60;
}
I then added the following entry in the samba server's /etc/hosts file:
10.10.10.60 kevin_box
I added these entries for every network client box and document scanner. The document scanners now complete file transfers almost instantly, and I have observed the users moving quickly and smoothly across share sub-directories. |
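With 40-50 clients, typing a matching dhcpd.conf host entry and /etc/hosts line for each machine by hand is error-prone. The sketch below generates both from one inventory list. The function names are made up for illustration, and the only inventory row shown is the kevin_box example from the post; the remaining machines would have to be filled in from the real network.

```python
def dhcpd_host_entry(name: str, mac: str, ip: str) -> str:
    """Format one dhcpd.conf host declaration that pins ip to mac."""
    return f"host {name} {{ hardware ethernet {mac}; fixed-address {ip}; }}"


def hosts_line(name: str, ip: str) -> str:
    """Format the matching /etc/hosts line for the samba server."""
    return f"{ip} {name}"


# (name, MAC, IP) rows; only the example given in the post is listed here.
INVENTORY = [
    ("kevin_box", "00:0B:A5:1F:B3:88", "10.10.10.60"),
]

print("# for the routerbox dhcpd.conf")
for name, mac, ip in INVENTORY:
    print(dhcpd_host_entry(name, mac, ip))

print("# for /etc/hosts on the samba server")
for name, mac, ip in INVENTORY:
    print(hosts_line(name, ip))
```

Generating both files from the same list keeps each name and address pair identical in both places, which is the property the fix above relies on.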