Network Slowdown: difficult to diagnose
Something definitely disrupted my network over Memorial Day weekend May 28-30.
Upon arriving back at the office, I noticed that my usual tar backups and the rsync between primary server and backup server had not completed. My devices are as follows:
1. slackware 13.0 router/firewall box
2. slackware 13.1 primary server with samba
3. gentoo backup/fail-over server (rsyncs hourly with primary server and performs nightly backups)
4. copier/scanner that transfers scanned docs directly to primary server
5. 40-50 windows clients
During the past week, I've experienced the dreaded, nagging "network slowdown" and have not been able to discern the cause. Here's the list of symptoms:
Copier/scanner used to complete transfer of scanned documents to the samba server in less than a second; now it takes a solid 10-12 seconds.
Windows clients get intermittent slowness when trying to open different directories and sub-folders on the samba server.
When I ssh into the slackware router/firewall box from an external location over the internet, there is no pause or delay.
When I ssh from inside the LAN to either the routerbox, primary server, or backup server, I get the same 10-12 second delay upon login -- the connection is immediate, but after I provide my user name there is now a 10-12 second delay before I am prompted for the password.
The weird thing is everything works, only slower:
I've pinged every box/server/client from every other box/server/client for hours with no packet loss.
Users can access all their directories and documents on the samba server.
Accounting people are using the data sets on the samba server.
Everyone has consistent internet access.
The rsync between primary and backup server works, as well as the nightly backup.
I've cycled power on the 48 port network switch, and the routerbox, primary server, backup server, scanner, and all client boxes.
I've run hdparm on routerbox, primary server, and backup server, and they all report normal speeds for older SATA 1 raid drives:
Timing cached reads: 1842 MB in 2.00 seconds = 920.90 MB/sec
Timing buffered disk reads: 190 MB in 3.00 seconds = 63.30 MB/sec
I've run "top" for hours on each box and the cpu loads are typically 2%-3%, with only occasional spikes up to 30% on the samba server. There's almost no use of swap.
I'm not a tcpdump expert, but I've been logging data traffic on the routerbox and don't see any obvious culprit like a piece of spyware that is crushing all the available bandwidth like a denial of service attack. The office tends toward the permissive side regarding internet use, so I see a decent amount of data traffic from constantly updating browser apps from The Weather Channel, Facebook, Marketwatch, etc.
I've also run iptraf to observe packets and bytes per interface and per LAN client and I'm not seeing any obvious abuse of bandwidth.
-----------------------------------------
What is the next course of action? Do I need to take a class in Wireshark? I've hit the limit of what I know to look for. Thank you for reading such a long post; any guidance greatly appreciated.
@ Reply
Hi there,
As you said Quote:
You said that you have got a 48 port switch. Is every system connected to this switch? As of now the image of your network that appears in my mind is as follows: Code:
Also, it would be great if you could let us know whether the above diagram matches your infrastructure or whether there is a difference, because if other switches are involved we have to look at them as well. Is this happening for each and every system within the LAN or only for a few of them? Did you configure VLANs on the switch? If yes, how are they configured, and is this happening with a particular VLAN or with all of them?
Quote:
1. Yes, every system is connected to the switch.
2. Yes, network image is exactly right.
3. Yes, this is happening for each and every system within the LAN.
4. VLAN is not configured on the switch, and no other switches exist on the network.
@ Reply
Now I am getting a better picture. Try the following things:
1. As you said you have around 40-50 Windows clients, configure a shared folder on one of the clients and try to access it from another Windows machine. This will show us whether the problem occurs only when accessing data on the servers or regardless of that.
2. If you have an 8-port switch for testing (do this only if your 48-port switch is not acting as DHCP), put that switch between the servers and 2-3 clients and see if you notice any difference, because as of now it looks like a hardware issue rather than a software one.
3. Install Wireshark on the client/server, send a simple ping request, and see at what time the client sends the ping request and at what time the server sees it. This will tell us the exact difference in time.
I hope this helps.
Quote:
Quote:
Quote:
I need to do some reading on how to use Wireshark too. T3RM1, thank you very much for the guidance. Will report back once I complete the testing mentioned above.
@ Reply
You're welcome.
You can get information about Wireshark here: http://www.wireshark.org/ ; you can also download it from the same link.
Edit: Forgot to mention that for Linux you can go with Ethereal: http://www.ethereal.com/download.html
Quote:
@ Reply
Great.
Now we know that the issue is not with the client systems communicating among themselves. As we already know that the server response is normal when accessed externally, this implies that the issue is not with the server either. To further narrow down the issue we can try the following things:
1. Ping from the server to different clients and keep track of the reply time in ms.
2. Ping between clients to compare the latency.
3. If possible, connect the server to a different port on the switch and see the difference. This will rule out a problematic port, if any.
4. If the above steps do not give us any clue, then a LAN trace will be the only option to go deeper.
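For keeping track of the reply times in steps 1 and 2, something along these lines might save some eyeballing. It is only a minimal sketch: it assumes you have saved the ping output to text first, and that the replies carry the usual "time=0.148 ms" (Linux) or "time<1ms" (Windows) field. The function names are made up for illustration, and the sample lines in the script are hypothetical, not taken from this network.

```python
import re
import statistics

# Matches the per-reply field printed by common ping tools,
# e.g. "time=0.148 ms" (Linux) or "time<1ms" / "time=2ms" (Windows).
_RTT = re.compile(r"time[=<]\s*([\d.]+)\s*ms")

def rtts_ms(ping_output):
    """Return every per-reply time (in ms) found in captured ping output."""
    return [float(m.group(1)) for m in _RTT.finditer(ping_output)]

def summarize(ping_output):
    """Count the replies and report min/avg/max reply time in ms."""
    values = rtts_ms(ping_output)
    if not values:
        return {"replies": 0}
    return {
        "replies": len(values),
        "min_ms": min(values),
        "avg_ms": statistics.mean(values),
        "max_ms": max(values),
    }

if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = (
        "64 bytes from 10.10.10.199: icmp_seq=1 ttl=64 time=0.148 ms\n"
        "64 bytes from 10.10.10.199: icmp_seq=2 ttl=64 time=0.152 ms\n"
        "Reply from 10.10.10.199: bytes=32 time<1ms TTL=64\n"
    )
    print(summarize(sample))
```

Running the same summary on server-to-client and client-to-client output gives numbers that can be compared side by side. Note that "time<1ms" is counted as 1 ms, so Windows figures are an upper bound.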
T3RM1,
Thanks for hanging in with me. I couldn't stay late tonight to try the 8-port switch experiment; some personal demands to take care of. I did try an experiment earlier in the day today -- I ran tcpdump on routerbox LAN_nic and Primary/Samba_Server_nic while I scanned a document. I was able to date-sync the two boxes to within a second of each other and will try to piece together the segment-to-segment response from Scanner to LAN_nic to Samba_Server_nic. Wouldn't it figure, though; I saw a noticeable difference in the scanner notification today -- it was much shorter -- more like 3-5 secs. today as opposed to 10-12 secs. the day before. This is real "ghost in the machine" stuff.
Quote:
Since tcpdump and TShark (command-line Wireshark) both use pcap reporting, I thought I would stick with tcpdump for now. Here's the first timestamp of the document scanner hitting the routerbox LAN_nic:
Code:
2011-06-06 09:34:32.934962 IP (tos 0x0, ttl 64, id 15075, offset 0, flags [none], proto UDP (17), le
The first timestamp and beginning transaction on the SERVER_nic:
Code:
2011-06-06 09:34:34.284477 IP (tos 0x0, ttl 64, id 15075, offset 0, flags [none], proto UDP (17), l$
From this output, it appears there is at most a 2 second delay between them. Of course, this is frustrating because it seems to undermine what I am seeing with my own eyes and what other users on the network are reporting to me. I'll try to perform the "switch test" after 6 pm today.
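To put a number on the gap between those two captures, the two timestamps can simply be subtracted. A minimal sketch, assuming the timestamp format shown in the tcpdump lines above; the function name is made up for illustration:

```python
from datetime import datetime

# Timestamp layout as it appears at the start of the tcpdump lines above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

def capture_gap_seconds(first, second):
    """Seconds elapsed between two tcpdump-style timestamps."""
    t1 = datetime.strptime(first, FMT)
    t2 = datetime.strptime(second, FMT)
    return (t2 - t1).total_seconds()

if __name__ == "__main__":
    lan_nic = "2011-06-06 09:34:32.934962"     # routerbox LAN_nic, from the post
    server_nic = "2011-06-06 09:34:34.284477"  # SERVER_nic, from the post
    print(f"{capture_gap_seconds(lan_nic, server_nic):.6f} s")
```

That works out to about 1.35 seconds. Since the two boxes were only date-synced to within a second of each other, up to roughly a second of that figure could be clock offset rather than real delay.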
I've experienced general network slowdowns in the past. One time it turned out that one of the DNS servers (the primary) wasn't running properly, so it was taking several seconds to fail over to the secondary. Once the issue was corrected, bam, all network services started to work again.
Quote:
Thanks for your response. I'm doing some research to find out about using tcpdump to discover dns problems.
@ Reply
To check for a DNS server problem you can also use nslookup and see which server gives you the authoritative answer; if something is wrong with the primary DNS server, you will get the authoritative answer from the secondary DNS. You can use the following commands:
nslookup - to get into nslookup
set debug - this will show you the query you are performing and the response you are getting; it will also display how much time it took to resolve the query.
You can use "server xxx.xxx.xxx.xxx" or "server dns_name" (without quotes) at the nslookup prompt to change the DNS server used for the query.
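If you want a stopwatch on name resolution from the Linux boxes as well, a few lines of Python can time a lookup. This is only a rough sketch and not a replacement for nslookup: it goes through whatever resolver the system is configured with (so it cannot pick a specific server the way "server xxx.xxx.xxx.xxx" does), the function name is made up, and the host names in the loop are just examples.

```python
import socket
import time

def time_lookup(name, resolver=socket.getaddrinfo):
    """Return (seconds_taken, error_or_None) for one name lookup.

    The resolver argument exists only so a stand-in can be passed for testing;
    by default the system resolver is used via socket.getaddrinfo.
    """
    start = time.perf_counter()
    try:
        resolver(name, None)
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return time.perf_counter() - start, error

if __name__ == "__main__":
    # Example names only; substitute hosts that matter on your LAN.
    for host in ("localhost", "www.wireshark.org"):
        elapsed, error = time_lookup(host)
        status = "ok" if error is None else f"failed: {error}"
        print(f"{host}: {elapsed:.3f} s ({status})")
```

A lookup that consistently takes several seconds before answering (or failing) would fit the fail-over behaviour described above; lookups that return in milliseconds would point away from DNS.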
Quote:
Testing from several different clients obtains the same answer (the ISP's primary dns) with no delay.
Sidenote: still cannot complete the substitute router test due to some co-workers staying late to finish a project.
Quote:
Wireshark on client and tcpdump on server. Shown below is the last request/reply in a 500 count series. No packet loss.
Clock Sync Issue: unable to sync clocks between client/server; at any given time there was an observable difference of 1-3 seconds.
Client Ping to Server:
Logged on Client nic:
10:34:33.291392 IP 10.10.10.185 > 10.10.10.199: ICMP echo request, id 512, seq 34818, length 40
10:34:33.291398 IP 10.10.10.199 > 10.10.10.185: ICMP echo reply, id 512, seq 34818, length 40
Logged on Server nic:
10.10.10.185(Client) 10.10.10.199(Server) ICMP 74 Echo (ping) request id=0x0200, seq=34818/648, ttl=128
Arrival Time: Jun 8, 2011 10:34:30.823649000 Eastern Daylight Time
10.10.10.199(Server) 10.10.10.185(Client) ICMP 74 Echo (ping) reply id=0x0200, seq=34818/648, ttl=64
Arrival Time: Jun 8, 2011 10:34:30.823797000 Eastern Daylight Time
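Because the two clocks could not be synced, comparing a timestamp from one capture against the other is not very meaningful. What can be compared safely is the request-to-reply interval inside each capture, since both stamps of a pair come from the same clock. A minimal sketch using the four time-of-day values posted above; the helper names are made up, and the Wireshark nanosecond digits are trimmed to microseconds:

```python
from datetime import datetime, timedelta

def _parse(stamp):
    """Parse HH:MM:SS.fraction, keeping at most microsecond precision."""
    whole, _, frac = stamp.partition(".")
    frac = (frac or "0")[:6]  # strptime's %f accepts at most 6 digits
    return datetime.strptime(f"{whole}.{frac}", "%H:%M:%S.%f")

def reply_delay_us(request_stamp, reply_stamp):
    """Whole microseconds between request and reply within ONE capture."""
    return (_parse(reply_stamp) - _parse(request_stamp)) // timedelta(microseconds=1)

if __name__ == "__main__":
    # tcpdump-format pair from the post
    print(reply_delay_us("10:34:33.291392", "10:34:33.291398"), "us")
    # Wireshark-format pair from the post
    print(reply_delay_us("10:34:30.823649000", "10:34:30.823797000"), "us")
```

The tcpdump-format pair is 6 microseconds apart and the Wireshark-format pair is 148 microseconds apart. Neither figure depends on the 1-3 second clock difference, and both are far below anything a user would notice.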