LinuxQuestions.org - [SOLVED] Network Slowdown: difficult to diagnose

Something definitely disrupted my network over Memorial Day weekend May 28-30.
Upon arriving back to the office, I noticed that my usual tar backups and rsync
between primary server and backup server were not completed.

My devices are as follows:

1. slackware 13.0 router/firewall box
2. slackware 13.1 primary server with samba
3. gentoo backup/fail-over server (rsyncs hourly with
primary server and performs nightly backups)
4. copier/scanner that transfers scanned docs. directly to primary server.
5. 40-50 windows clients

During the past week, I've experienced the dreaded, nagging, "network slowdown" and have
not been able to discern the cause.

Here's the list of symptoms:

Copier/scanner used to complete transfer of scanned documents to
the samba server in less than a second;
now it takes a solid 10-12 seconds.

Windows clients get intermittent slowness when trying to open different directories and sub-folders
on the samba server.

When I ssh into the slackware router/firewall box from an external location over the internet,
there is no pause or delay.
When I ssh from inside the LAN to either the routerbox, primary server,
or backup server I get the same 10-12 second delay upon login -- the connection is immediate,
but after I provide my user name there is now a 10-12 second delay
before I am prompted for the user password.

The weird thing is everything works; only slower:

I've pinged every box/server/client from every other box/server/client for hours with no packet loss.
Users can access all their directories and documents on the samba server.
Accounting people are using the data sets on the samba server.
Everyone has consistent internet access.
The rsync between primary and backup server works, as well as nightly backup.

I've cycled power on the 48 port network switch, and the routerbox, primary server,
backup server, scanner, and all client boxes.

I've run hdparm on routerbox, primary server, and backup server, and they all report
normal speeds for older SATA 1 raid drives:

Timing cached reads: 1842 MB in 2.00 seconds = 920.90 MB/sec
Timing buffered disk reads: 190 MB in 3.00 seconds = 63.30 MB/sec

I've run "top" for hours on each box and the cpu loads are typically
2%-3%, with only occasional spikes up to 30% on the samba server.
There's almost no use of swap.

I'm not a tcpdump expert, but I've been logging data traffic on the
routerbox and don't see any obvious culprit, such as
a piece of spyware that is consuming all the available bandwidth
the way a denial of service attack would.

The office tends toward the permissive side regarding internet use,
so I see a decent amount of data traffic from constantly updating
browser apps from The Weather Channel, Facebook, Marketwatch, etc.

I've also run iptraf to observe packets and bytes per interface and per LAN client and I'm not seeing any obvious abuse of bandwidth.

-----------------------------------------
What is the next course of action?
Do I need to take a class in Wireshark?
I've hit the limit of what I know to look for.

Thank you for reading such a long post; any guidance
greatly appreciated.

Hi there,

As you said

Quote:

When I ssh into the slackware router/firewall box from an external location over the internet, there is no pause or delay.

So the problem appears to be within the LAN. This also shows that the server is responding properly: if something were wrong with the server, it would take a long time to process the request over the Internet as well.

You said that you have got a 48 port switch. Is every system connected to this switch? As of now the image of your network that appears in my mind is as follows:

Code:



                    -----------
                  |  Internet  |
                    -----------
                        |
                        |
          -------------------------------              --------------
        |slackware 13.0 router/firewall |------------|    switch    |
          -------------------------------            /--------------\
                                          ___________/        |      \      --------------
                                        /                    |        \-----| Copier/Scan  |
                                        /                    / \              --------------
                                      /                    /  \
                                      /                    /    \
                                    /        -------------      ---------------------------
                                    /        | Gentoo Back |    | Primary Server with Samba |
                                  /          -------------      ---------------------------
                                  /
                          ----------------
                        | Client Systems |
                          ----------------

You can run tcpdump/wireshark on both the server and a workstation and compare the time at which the workstation sends a packet with the time at which the server sees it. I would suggest trying this from 2-3 workstations; if you see the same results each time, the problem appears to be with the switch itself.

Also, it would be great if you could let us know whether the above diagram matches your infrastructure or differs from it, because if other switches are involved we have to look at them as well.

Is this happening for every system within the LAN or only for a few of them? Did you configure VLANs on the switch? If yes, how are they configured, and is this happening on a particular VLAN or on all of them?

Quote:

Originally Posted by T3RM1NVT0R (Post 4376244)

Is every system connected to this switch? As of now the image of your network that appears in my mind is as follows:

Is this happening for every system within the LAN or only for a few of them? Did you configure VLANs on the switch? If yes, how are they configured, and is this happening on a particular VLAN or on all of them?

T3RM, Thank you for your response.

1. Yes, every system is connected to the switch.
2. Yes, network image is exactly right.
3. Yes, this is happening for each and every system within the LAN.
4. VLAN is not configured on the switch, and no other switches exist on the network.

Now I am getting a better picture. Try the following things:

1. As you said you have around 40-50 Windows clients, configure a shared folder on one of the clients and try to access it from another Windows machine. This will show us whether the problem occurs only when accessing data on the servers or regardless of that.

2. If you have an 8 port switch for testing (do this only if your 48 port switch is not acting as the DHCP server), put that switch between the servers and 2-3 clients and see if there is any difference, because as of now it looks like a hardware issue rather than a software one.

3. Install wireshark on the client and the server, perform a simple ping, and see at what time the client sends the ping request and at what time the server sees it. This will tell us the exact difference in time.

I hope this helps.

Quote:

Originally Posted by T3RM1NVT0R (Post 4376791)

Definitely will try this on Monday morning; easy enough and it won't disrupt other users.

Quote:

Originally Posted by T3RM1NVT0R (Post 4376791)

2. If you have an 8 port switch for testing (do this only if your 48 port switch is not acting as the DHCP server), put that switch between the servers and 2-3 clients and see if there is any difference, because as of now it looks like a hardware issue rather than a software one.

Ahh, great idea....so simple but as usual it's hard to pull yourself out and see the whole picture once you've been digging around in it for a few days. I will give this a try before 8 am or after 6 pm so as not to disrupt business hours.

Quote:

Originally Posted by T3RM1NVT0R (Post 4376791)

3. Install wireshark on the client and the server, perform a simple ping, and see at what time the client sends the ping request and at what time the server sees it. This will tell us the exact difference in time.

Will do -- I suppose I should try to sync time between the devices too...maybe setup ntpd on primary server for this testing.
I need to do some reading on how to use wireshark too.

T3RM1, thank you very much for the guidance.
Will report back once I complete the testing mentioned above.

You're welcome.

You can find information about wireshark at http://www.wireshark.org/ and you can also download it from the same site.

Edit: Forgot to mention that on linux you may also find it under its former name, ethereal (the project was renamed Wireshark in 2006): http://www.ethereal.com/download.html

Quote:

Originally Posted by T3RM1NVT0R (Post 4376791)

configure a shared folder on one of the clients and try to access it from another Windows machine. This will show us whether the problem occurs only when accessing data on the servers or regardless of that.

Completed; it works.

Great.

Now we know that the issue is not with communication among the client systems themselves. Since we already know that the server response is normal when accessed externally, this implies that the issue is not with the server either.

To further narrow down the issue we can try the following things:

1. Ping from the server to different clients and keep track of the reply times in ms.
2. Ping between clients to compare the latency.
3. If possible, connect the server to a different port on the switch and see the difference. This will rule out a problematic port, if any.
4. If the above steps do not give us any clue, then a LAN trace will be the only option to go deeper.
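
For steps 1 and 2, the reply times can be tallied with a short script instead of by eye. The following is only a minimal sketch, assuming ping output lines that contain `time=0.215 ms` (Linux style) or `time=2ms` / `time<1ms` (Windows style). The function name `summarize_ping` and the sample replies are made up for illustration and are not measurements from this network.

```python
import re
import statistics

# Matches "time=0.215 ms" (Linux) as well as "time=2ms" / "time<1ms" (Windows).
TIME_RE = re.compile(r"time[=<]\s*([0-9.]+)\s*ms")

def summarize_ping(output: str) -> dict:
    """Return count/min/avg/max/stdev of the reply times (ms) found in ping output."""
    times = [float(m.group(1)) for m in TIME_RE.finditer(output)]
    if not times:
        raise ValueError("no reply times found")
    return {
        "count": len(times),
        "min": min(times),
        "avg": statistics.mean(times),
        "max": max(times),
        "stdev": statistics.pstdev(times),
    }

if __name__ == "__main__":
    # Hypothetical sample, NOT captured on the network discussed here.
    sample = """\
64 bytes from 10.10.10.199: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 10.10.10.199: icmp_seq=2 ttl=64 time=0.198 ms
64 bytes from 10.10.10.199: icmp_seq=3 ttl=64 time=4.750 ms
"""
    print(summarize_ping(sample))
```

Running it on saved output from server-to-client and client-to-client pings makes it easy to compare the maximum and the spread, which matter more than the average when the slowness is intermittent.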

T3RM1,
Thanks for hanging in with me.
I couldn't stay late tonight to try the 8-port switch experiment; some personal demands to take care of.

I did try an experiment earlier in the day today --
I ran tcpdump on routerbox LAN_nic and Primary/Samba_Server_nic while I scanned a document.
I was able to date-sync the two boxes to within a second of each other and will try to piece together the segment-to-segment response from Scanner to LAN_nic to Samba_Server_nic.

Wouldn't it figure, though; I saw a noticeable difference in the scanner notification today -- it was much shorter -- more like 3-5 secs. today as opposed to 10-12 secs. the day before. This is real "ghost in the machine" stuff.

Quote:

Originally Posted by Sum1 (Post 4378282)

T3RM1,
I did try an experiment earlier in the day today --
I ran tcpdump on routerbox LAN_nic and Primary/Samba_Server_nic while I scanned a document.

Forgot to mention -- I did the tcpdump test above because I don't have xorg installed on the routerbox and servers.
Since tcpdump and TShark (command-line wireshark) both use pcap, I thought I would stick with tcpdump for now.

Here's the first timestamp of the document scanner hitting the routerbox LAN_nic:

Code:

2011-06-06 09:34:32.934962 IP (tos 0x0, ttl 64, id 15075, offset 0, flags [none], proto UDP (17), length 78)
    10.10.10.120.65387 > 10.10.10.255.netbios-ns: [udp sum ok]
>>> NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
TrnID=0x356C
OpCode=0
NmFlags=0x11
Rcode=0
QueryCount=1
AnswerCount=0
AuthorityCount=0
AddressRecCount=0
QuestionRecords:
Name=A1              NameType=0x20 (Server)
QuestionType=0x20
QuestionClass=0x1

The first timestamp and beginning transaction on the SERVER_nic:

Code:

2011-06-06 09:34:34.284477 IP (tos 0x0, ttl 64, id 15075, offset 0, flags [none], proto UDP (17), l$
    10.10.10.120.65387 > 10.10.10.255.netbios-ns: [udp sum ok]
>>> NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
TrnID=0x356C
OpCode=0
NmFlags=0x11
Rcode=0
QueryCount=1
AnswerCount=0
AuthorityCount=0
AddressRecCount=0
QuestionRecords:
Name=A1              NameType=0x20 (Server)
QuestionType=0x20
QuestionClass=0x1

2011-06-06 09:34:34.284578 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length $
    a1.server.com.netbios-ns > 10.10.10.120.65387: [bad udp cksum 2b14!]
>>> NBT UDP PACKET(137): QUERY; POSITIVE; RESPONSE; UNICAST
TrnID=0x356C
OpCode=0
NmFlags=0x58
Rcode=0
QueryCount=0
AnswerCount=1
AuthorityCount=0
AddressRecCount=0
ResourceRecords:
Name=A1              NameType=0x20 (Server)
ResType=0x20
ResClass=0x1
TTL=259200 (0x3f480)
ResourceLength=6
ResourceData=
AddrType=0x6000
Address=10.10.10.199

2011-06-06 09:34:34.552531 IP (tos 0x0, ttl 64, id 15076, offset 0, flags [none], proto TCP (6), le$
    10.10.10.120.65476 > a1.server.com.netbios-ssn: Flags [S], cksum 0x534b (correct), seq 307548068$
2011-06-06 09:34:34.552552 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 6$
    a1.server.com.netbios-ssn > 10.10.10.120.65476: Flags [S.], cksum 0x0ef6 (incorrect -> 0xc286), $
2011-06-06 09:34:34.552724 IP (tos 0x0, ttl 64, id 15077, offset 0, flags [none], proto TCP (6), le$
    10.10.10.120.65476 > a1.server.com.netbios-ssn: Flags [.], cksum 0xc081 (correct), seq 1, ack 1,$
2011-06-06 09:34:34.552922 IP (tos 0x0, ttl 64, id 15078, offset 0, flags [none], proto TCP (6), le$
    10.10.10.120.65476 > a1.server.com.netbios-ssn: Flags [P.], cksum 0xb377 (correct), seq 1:73, ac$
>>> NBT Session Packet
NBT Session Request
Flags=0x0
Length=68 (0x44)
Destination=A1              NameType=0x20 (Server)
Source=DOCUMENT_SCANNER    NameType=0x00 (Workstation)
2011-06-06 09:34:34.552931 IP (tos 0x0, ttl 64, id 11735, offset 0, flags [DF], proto TCP (6), leng$
    a1.server.com.netbios-ssn > 10.10.10.120.65476: Flags [.], cksum 0x0eee (incorrect -> 0x047c), s

As stated in previous post, I sync'd the router and primary server to within a second of each other.
From this output, there appears to be at most a 2 second delay between them.
Of course, this is frustrating because it seems to undermine what I am seeing with my own eyes and what other users on the network are reporting to me.
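
The arithmetic behind that estimate can be written out. This is a minimal sketch using only the timestamps printed in the two captures above. The helper name `capture_gap` is made up for illustration, and the 1 second clock error is an assumption taken from the "within a second" sync described in the post.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def capture_gap(first: str, second: str) -> float:
    """Seconds elapsed between two tcpdump timestamps."""
    delta = datetime.strptime(second, FMT) - datetime.strptime(first, FMT)
    return delta.total_seconds()

# Same broadcast name query (TrnID=0x356C) as seen on the two boxes.
router_seen = "2011-06-06 09:34:32.934962"
server_seen = "2011-06-06 09:34:34.284477"
gap = capture_gap(router_seen, server_seen)

# Assumption: the clocks agree only to within about one second,
# so the real delay could be anywhere in this range.
clock_error = 1.0
print(f"apparent gap: {gap:.6f} s "
      f"(plausible range {gap - clock_error:.2f} to {gap + clock_error:.2f} s)")

# Both of these timestamps come from the server's own clock, so no sync error applies:
# name query response sent -> scanner's TCP SYN arrives.
name_reply = "2011-06-06 09:34:34.284578"
syn_arrives = "2011-06-06 09:34:34.552531"
print(f"name reply to SYN: {capture_gap(name_reply, syn_arrives):.6f} s")
```

The cross-box gap comes out at about 1.35 seconds, which is the same size as the clock uncertainty, so it cannot confirm or rule out a delay on the wire. Intervals measured within a single capture do not have that problem.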

I'll try to perform the "switch test" after 6 pm today.

I've experienced general network slowdowns in the past. One time it turned out that one of the DNS servers (the primary) wasn't running properly, so it was taking several seconds to fail over to the secondary. Once the issue was corrected, bam, all network services started to work again.

Quote:

Originally Posted by linuxguy7820 (Post 4378832)

turned out that one of the DNS servers (primary) wasn't running properly

Linuxguy,
Thanks for your response.
I'm doing some research to find out about using tcpdump to discover dns problems.

To check for a dns server problem you can also use nslookup and see which server the answer comes from; if there is something wrong with the primary dns server, the answer will come from the secondary dns. You can use the following commands:

nslookup - to get into nslookup
set debug - this will show you the query you are performing and the response you are getting, will also display how much time it took to resolve the query.

You can use "server xxx.xxx.xxx.xxx" or "server dns_name" (without quotes) at the nslookup prompt to change the dns server that your query is sent to.
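
Lookup times can also be measured from a script. The following is a minimal sketch, not a replacement for nslookup: it goes through the system resolver, so it cannot pick a specific dns server. The helper names `timed` and `report` are made up for illustration. Reverse lookups are included on the assumption that some services look up the connecting address at login, which may or may not apply here.

```python
import socket
import time

def timed(resolver, arg):
    """Call resolver(arg) and return (elapsed_seconds, result_or_exception)."""
    start = time.perf_counter()
    try:
        result = resolver(arg)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        result = exc
    return time.perf_counter() - start, result

def report(targets):
    """Print one line per (label, resolver, argument) tuple."""
    for label, resolver, arg in targets:
        elapsed, result = timed(resolver, arg)
        print(f"{label:<16}{arg:<18}{elapsed:7.3f} s  {result!r}")

if __name__ == "__main__":
    # Stand-in resolver so the sketch runs without a network.
    def fake_resolver(name):
        time.sleep(0.05)
        return "10.10.10.199"

    report([("forward (stub)", fake_resolver, "a1.server.com")])
```

On the real LAN the call would look like `report([("forward", socket.gethostbyname, "a1.server.com"), ("reverse", socket.gethostbyaddr, "10.10.10.199")])`, run from a client and from each server. A lookup that takes several seconds, or fails only after a long pause, would be worth comparing with the delays users are seeing.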

Quote:

Originally Posted by T3RM1NVT0R (Post 4379053)

nslookup - to get into nslookup
set debug - this will show you the query you are performing and the response you are getting, will also display how much time it took to resolve the query.

It appears there's no problem with dns on my network.
Testing from several different clients obtains the same answer (the ISP's primary dns) with no delay.

sidenote: still cannot complete the substitute switch test due to some co-workers staying late to finish a project.

Quote:

Originally Posted by T3RM1NVT0R (Post 4376791)

3. Install wireshark on the client and the server, perform a simple ping, and see at what time the client sends the ping request and at what time the server sees it. This will tell us the exact difference in time.

T3RM1,

Wireshark on client and tcpdump on server.
Shown below is the last request/reply in a 500 count series.
No packet loss.

Clock Sync Issue: unable to sync clocks between client/server; at any given time there was an observable difference of 1-3 seconds.

Client Ping to Server:

Logged on Server nic (tcpdump):

10:34:33.291392 IP 10.10.10.185 > 10.10.10.199: ICMP echo request, id 512, seq 34818, length 40
10:34:33.291398 IP 10.10.10.199 > 10.10.10.185: ICMP echo reply, id 512, seq 34818, length 40

Logged on Client nic (Wireshark):

10.10.10.185(Client) 10.10.10.199(Server) ICMP 74 Echo (ping) request id=0x0200, seq=34818/648, ttl=128
Arrival Time: Jun 8, 2011 10:34:30.823649000 Eastern Daylight Time
10.10.10.199(Server) 10.10.10.185(Client) ICMP 74 Echo (ping) reply id=0x0200, seq=34818/648, ttl=64
Arrival Time: Jun 8, 2011 10:34:30.823797000 Eastern Daylight Time
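
Because the clocks could not be synced, only intervals taken within one capture are trustworthy. The following is a minimal sketch of that arithmetic using the timestamps shown above. The helper names `parse` and `micros_between` are made up for illustration.

```python
from datetime import datetime, timedelta

def parse(stamp: str) -> datetime:
    """Parse HH:MM:SS.fraction, trimming Wireshark's 9-digit fraction to 6 digits."""
    whole, _, frac = stamp.partition(".")
    return datetime.strptime(f"{whole}.{frac[:6] or '0'}", "%H:%M:%S.%f")

def micros_between(earlier: str, later: str) -> int:
    """Whole microseconds from one timestamp to the other (same clock assumed)."""
    return (parse(later) - parse(earlier)) // timedelta(microseconds=1)

# tcpdump-format pair: request seen, then reply sent, on one host's clock.
turnaround = micros_between("10:34:33.291392", "10:34:33.291398")

# Wireshark-format pair: request and reply on the other host's clock.
round_trip = micros_between("10:34:30.823649000", "10:34:30.823797000")

# Mixing the two clocks mostly measures the clock offset, not the network.
cross_clock = micros_between("10:34:30.823649000", "10:34:33.291392")

print(f"tcpdump pair:   {turnaround} us")
print(f"Wireshark pair: {round_trip} us")
print(f"cross-clock:    {cross_clock / 1_000_000:.6f} s (dominated by clock offset)")
```

The two same-clock intervals are 6 and 148 microseconds, so this particular ping shows no delay anywhere near the seconds users are reporting. The cross-clock figure of roughly 2.5 seconds falls inside the 1-3 second clock difference noted above, so it says nothing about the network.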