LinuxQuestions.org
Old 06-04-2011, 11:30 AM   #1
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Rep: Reputation: 30
Network Slowdown: difficult to diagnose


Something definitely disrupted my network over Memorial Day weekend May 28-30.
Upon arriving back to the office, I noticed that my usual tar backups and rsync
between primary server and backup server were not completed.

My devices are as follows:

1. slackware 13.0 router/firewall box
2. slackware 13.1 primary server with samba
3. gentoo backup/fail-over server (rsyncs hourly with
primary server and performs nightly backups)
4. copier/scanner that transfers scanned docs. directly to primary server.
5. 40-50 windows clients

During the past week, I've experienced the dreaded, nagging, "network slowdown" and have
not been able to discern the cause.


Here's the list of symptoms:

Copier/scanner used to complete transfer of scanned documents to
the samba server in less than a second;
now it takes a solid 10-12 seconds.

Windows clients get intermittent slowness when trying to open different directories and sub-folders
on the samba server.

When I ssh into the slackware router/firewall box from an external location over the internet,
there is no pause or delay.
When I ssh from inside the LAN to either the routerbox, primary server,
or backup server, I get the same 10-12 second delay at login: the connection opens
immediately, but after I provide my user name there is a 10-12 second delay
before I am prompted for the user password.


The weird thing is everything works; only slower:

I've pinged every box/server/client from every other box/server/client for hours with no packet loss.
Users can access all their directories and documents on the samba server.
Accounting people are using the data sets on the samba server.
Everyone has consistent internet access.
The rsync between primary and backup server works, as well as nightly backup.

I've cycled power on the 48 port network switch, and the routerbox, primary server,
backup server, scanner, and all client boxes.

I've run hdparm on routerbox, primary server, and backup server, and they all report
normal speeds for older SATA 1 raid drives:

Timing cached reads: 1842 MB in 2.00 seconds = 920.90 MB/sec
Timing buffered disk reads: 190 MB in 3.00 seconds = 63.30 MB/sec

I've run "top" for hours on each box and the cpu loads are typically
2%-3%, with only occasional spikes up to 30% on the samba server.
There's almost no use of swap.

I'm not a tcpdump expert, but I've been logging data traffic on the
routerbox and don't see any obvious culprit like
a piece of spyware that is crushing all the available bandwidth
like a denial of service attack.

The office tends toward the permissive side regarding internet use,
so I see a decent amount of data traffic from constantly updating
browser apps. from The Weather Channel, Facebook, Marketwatch, etc.

I've also run iptraf to observe packets and bytes per interface and per LAN client and I'm not seeing any obvious abuse of bandwidth.

-----------------------------------------
What is the next course of action?
Do I need to take a class in Wireshark?
I've hit the limit of what I know to look for.

Thank you for reading such a long post; any guidance
greatly appreciated.

Last edited by Sum1; 06-04-2011 at 11:31 AM.
 
Old 06-04-2011, 01:29 PM   #2
T3RM1NVT0R
Senior Member
 
Registered: Dec 2010
Location: Internet
Distribution: Linux Mint, SLES, CentOS, Red Hat
Posts: 2,385

Rep: Reputation: 477
@ Reply

Hi there,

As you said

Quote:
When I ssh into the slackware router/firewall box from an external location over the internet, there is no pause or delay.
The problem appears to be within the LAN. This also shows that the server is responding properly: if something were wrong with the server, the request would have taken a long time over the Internet as well.

You said that you have got a 48 port switch. Is every system connected to this switch? As of now the image of your network that appears in my mind is as follows:

Code:
                    -----------
                   |  Internet  |
                    ----------- 
                         |
                         |
          -------------------------------              --------------
         |slackware 13.0 router/firewall |------------|    switch    |
          -------------------------------             /--------------\
                                          ___________/        |       \       --------------
                                         /                    |        \-----| Copier/Scan  |
                                        /                    / \              --------------
                                       /                    /   \
                                      /                    /     \
                                     /        -------------      ---------------------------
                                    /        | Gentoo Back |    | Primary Server with Samba |
                                   /          -------------      ---------------------------
                                  /
                          ----------------
                         | Client Systems |
                          ----------------
You can also run tcpdump/wireshark on both the server and a workstation and compare the time at which the workstation sends a packet with the time at which the server sees it. I would suggest trying this from 2-3 workstations; if you see the same results from each, the problem is likely with the switch itself.

Also, it would be great if you could let us know whether the above diagram describes your infrastructure or whether there is a difference, because if other switches are involved we have to look at them as well.

Is this happening for each and every system within the LAN, or only for a few of the systems? Did you configure VLANs on the switch? If yes, how are they configured, and is this happening with a particular VLAN or with all of them?
 
Old 06-05-2011, 01:21 AM   #3
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by T3RM1NVT0R View Post
Is every system connected to this switch? As of now the image of your network that appears in my mind is as follows:

Is this happening for each and every system within the LAN, or only for a few of the systems? Did you configure VLANs on the switch? If yes, how are they configured, and is this happening with a particular VLAN or with all of them?
T3RM, Thank you for your response.

1. Yes, every system is connected to the switch.
2. Yes, network image is exactly right.
3. Yes, this is happening for each and every system within the LAN.
4. VLAN is not configured on the switch, and no other switches exist on the network.
 
Old 06-05-2011, 06:30 AM   #4
T3RM1NVT0R
Senior Member
 
Registered: Dec 2010
Location: Internet
Distribution: Linux Mint, SLES, CentOS, Red Hat
Posts: 2,385

Rep: Reputation: 477
@ Reply

Now I am getting a better picture. Try the following things:

1. As you have around 40-50 Windows clients, configure a shared folder on one client and try to access it from another Windows machine. This will show whether the problem occurs only when accessing data on the servers or regardless of that.

2. If you have a spare 8-port switch for testing, put it between the servers and 2-3 clients and see whether anything changes (do this only if your 48-port switch is not acting as the DHCP server). As of now this looks like a hardware issue rather than a software one.

3. Install wireshark on a client and on the server, send a simple ping, and compare the time at which the client sends the request with the time at which the server sees it. This will tell us the exact difference in time.

I hope this helps.
 
Old 06-05-2011, 08:46 AM   #5
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by T3RM1NVT0R View Post
Now I am getting a better picture. Try the following things:

1. As you have around 40-50 Windows clients, configure a shared folder on one client and try to access it from another Windows machine. This will show whether the problem occurs only when accessing data on the servers or regardless of that.
Definitely will try this on Monday morning; easy enough and it won't disrupt other users.

Quote:
Originally Posted by T3RM1NVT0R View Post
2. If you have a spare 8-port switch for testing, put it between the servers and 2-3 clients and see whether anything changes (do this only if your 48-port switch is not acting as the DHCP server). As of now this looks like a hardware issue rather than a software one.
Ahh, great idea....so simple but as usual it's hard to pull yourself out and see the whole picture once you've been digging around in it for a few days. I will give this a try before 8 am or after 6 pm so as not to disrupt business hours.

Quote:
Originally Posted by T3RM1NVT0R View Post
3. Install wireshark on a client and on the server, send a simple ping, and compare the time at which the client sends the request with the time at which the server sees it. This will tell us the exact difference in time.
Will do -- I suppose I should try to sync time between the devices too...maybe setup ntpd on primary server for this testing.
I need to do some reading on how to use wireshark too.

T3RM1, thank you very much for the guidance.
Will report back once I complete the testing mentioned above.
 
Old 06-05-2011, 08:49 AM   #6
T3RM1NVT0R
Senior Member
 
Registered: Dec 2010
Location: Internet
Distribution: Linux Mint, SLES, CentOS, Red Hat
Posts: 2,385

Rep: Reputation: 477
@ Reply

You're welcome.

You can get information about wireshark from here: http://www.wireshark.org/ , you can also download it from the same link.

Edit: Forgot to mention that on linux you may also find it packaged under its older name, ethereal (the project was renamed Wireshark in 2006): http://www.ethereal.com/download.html

Last edited by T3RM1NVT0R; 06-05-2011 at 08:54 AM.
 
Old 06-06-2011, 09:43 AM   #7
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by T3RM1NVT0R View Post
configure a shared folder on one client and try to access it from another Windows machine. This will show whether the problem occurs only when accessing data on the servers or regardless of that.
Completed; it works.
 
Old 06-06-2011, 02:03 PM   #8
T3RM1NVT0R
Senior Member
 
Registered: Dec 2010
Location: Internet
Distribution: Linux Mint, SLES, CentOS, Red Hat
Posts: 2,385

Rep: Reputation: 477
@ Reply

Great.

Now we know that the issue is not with communication among the client systems themselves. Since we already know that the server responds normally when accessed externally, this implies that the issue is not with the server either.

To narrow the issue down further, we can try the following:

1. Ping from the server to different clients and keep track of the reply times in ms.
2. Ping between clients to compare the latency.
3. If possible, connect the server to a different port on the switch and see whether anything changes. This will rule out a problematic port.
4. If the above steps do not give us any clue, then a LAN trace will be the only option to go deeper.
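
Steps 1 and 2 come down to collecting the per-reply times that ping prints and comparing them between host pairs. A minimal sketch of that bookkeeping in Python follows. The function name summarize_ping and the sample ping lines are made up for illustration; they are not output from this network.

```python
import re
import statistics

# Matches the per-reply time in Linux ("time=0.214 ms") and Windows
# ("time=3ms", "time<1ms") ping output. A "<1ms" reply is counted as 1.0,
# which is only an upper bound.
RTT_PATTERN = re.compile(r"time[=<]\s*([\d.]+)\s*ms", re.IGNORECASE)


def summarize_ping(output: str) -> dict:
    """Return count/min/avg/max (in ms) of the reply times found in ping output."""
    rtts = [float(m.group(1)) for m in RTT_PATTERN.finditer(output)]
    if not rtts:
        raise ValueError("no round-trip times found in ping output")
    return {
        "count": len(rtts),
        "min_ms": min(rtts),
        "avg_ms": statistics.mean(rtts),
        "max_ms": max(rtts),
    }


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the thread.
    sample = (
        "64 bytes from 10.10.10.199: icmp_seq=1 ttl=64 time=0.214 ms\n"
        "64 bytes from 10.10.10.199: icmp_seq=2 ttl=64 time=0.198 ms\n"
        "64 bytes from 10.10.10.199: icmp_seq=3 ttl=64 time=12.4 ms\n"
    )
    print(summarize_ping(sample))
```

Saving the output of each ping run to a file and feeding it through the same function makes server-to-client and client-to-client runs directly comparable; a large gap between min_ms and max_ms on one pair would be the thing to look at.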
 
Old 06-06-2011, 08:02 PM   #9
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Original Poster
Rep: Reputation: 30
T3RM1,
Thanks for hanging in with me.
I couldn't stay late tonight to try the 8-port switch experiment; some personal demands to take care of.

I did try an experiment earlier in the day today --
I ran tcpdump on routerbox LAN_nic and Primary/Samba_Server_nic while I scanned a document.
I was able to date-sync the two boxes to within a second of each other and will try to piece together the segment-to-segment response from Scanner to LAN_nic to Samba_Server_nic.

Wouldn't it figure, though; I saw a noticeable difference in the scanner notification today -- it was much shorter -- more like 3-5 secs. today as opposed to 10-12 secs. the day before. This is real "ghost in the machine" stuff.
 
Old 06-07-2011, 08:10 AM   #10
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by Sum1 View Post
T3RM1,
I did try an experiment earlier in the day today --
I ran tcpdump on routerbox LAN_nic and Primary/Samba_Server_nic while I scanned a document.
Forgot to mention -- I did the tcpdump test above because I don't have xorg installed on the routerbox and servers.
Since tcpdump and TShark (command-line wireshark) both use pcap, I thought I would stick with tcpdump for now.

Here's the first timestamp of the document scanner hitting the routerbox LAN_nic:
Code:
2011-06-06 09:34:32.934962 IP (tos 0x0, ttl 64, id 15075, offset 0, flags [none], proto UDP (17), length 78)
    10.10.10.120.65387 > 10.10.10.255.netbios-ns: [udp sum ok] 
>>> NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
TrnID=0x356C
OpCode=0
NmFlags=0x11
Rcode=0
QueryCount=1
AnswerCount=0
AuthorityCount=0
AddressRecCount=0
QuestionRecords:
Name=A1              NameType=0x20 (Server)
QuestionType=0x20
QuestionClass=0x1

The first timestamp and beginning transaction on the SERVER_nic:
Code:
2011-06-06 09:34:34.284477 IP (tos 0x0, ttl 64, id 15075, offset 0, flags [none], proto UDP (17), l$
    10.10.10.120.65387 > 10.10.10.255.netbios-ns: [udp sum ok]
>>> NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
TrnID=0x356C
OpCode=0
NmFlags=0x11
Rcode=0
QueryCount=1
AnswerCount=0
AuthorityCount=0
AddressRecCount=0
QuestionRecords:
Name=A1              NameType=0x20 (Server)
QuestionType=0x20
QuestionClass=0x1


2011-06-06 09:34:34.284578 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length $
    a1.server.com.netbios-ns > 10.10.10.120.65387: [bad udp cksum 2b14!]
>>> NBT UDP PACKET(137): QUERY; POSITIVE; RESPONSE; UNICAST
TrnID=0x356C
OpCode=0
NmFlags=0x58
Rcode=0
QueryCount=0
AnswerCount=1
AuthorityCount=0
AddressRecCount=0
ResourceRecords:
Name=A1              NameType=0x20 (Server)
ResType=0x20
ResClass=0x1
TTL=259200 (0x3f480)
ResourceLength=6
ResourceData=
AddrType=0x6000
Address=10.10.10.199

2011-06-06 09:34:34.552531 IP (tos 0x0, ttl 64, id 15076, offset 0, flags [none], proto TCP (6), le$
    10.10.10.120.65476 > a1.server.com.netbios-ssn: Flags [S], cksum 0x534b (correct), seq 307548068$
2011-06-06 09:34:34.552552 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 6$
    a1.server.com.netbios-ssn > 10.10.10.120.65476: Flags [S.], cksum 0x0ef6 (incorrect -> 0xc286), $
2011-06-06 09:34:34.552724 IP (tos 0x0, ttl 64, id 15077, offset 0, flags [none], proto TCP (6), le$
    10.10.10.120.65476 > a1.server.com.netbios-ssn: Flags [.], cksum 0xc081 (correct), seq 1, ack 1,$
2011-06-06 09:34:34.552922 IP (tos 0x0, ttl 64, id 15078, offset 0, flags [none], proto TCP (6), le$
    10.10.10.120.65476 > a1.server.com.netbios-ssn: Flags [P.], cksum 0xb377 (correct), seq 1:73, ac$
>>> NBT Session Packet
NBT Session Request
Flags=0x0
Length=68 (0x44)
Destination=A1              NameType=0x20 (Server)
Source=DOCUMENT_SCANNER     NameType=0x00 (Workstation)
2011-06-06 09:34:34.552931 IP (tos 0x0, ttl 64, id 11735, offset 0, flags [DF], proto TCP (6), leng$
    a1.server.com.netbios-ssn > 10.10.10.120.65476: Flags [.], cksum 0x0eee (incorrect -> 0x047c), s
As stated in previous post, I sync'd the router and primary server to within a second of each other.
From this output, it appears at most a 2 second delay between them.
Of course, this is frustrating because it seems to undermine what I am seeing with my own eyes and what other users on the network are reporting to me.

I'll try to perform the "switch test" after 6 pm today.
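
The "at most a 2 second delay" reading can be worked out from the two first timestamps above. The sketch below, in Python, subtracts them and then widens the result by the clock-sync error. The 1.0 second figure is the "within a second" sync quoted in the post; the function name delay_bounds is made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def delay_bounds(first_seen: str, second_seen: str, sync_error_s: float):
    """Return (measured, lowest, highest) delay in seconds between two
    captures whose clocks agree only to within sync_error_s seconds."""
    delta = datetime.strptime(second_seen, FMT) - datetime.strptime(first_seen, FMT)
    measured = delta.total_seconds()
    return measured, measured - sync_error_s, measured + sync_error_s


if __name__ == "__main__":
    router_nic = "2011-06-06 09:34:32.934962"  # first packet on the routerbox LAN_nic
    server_nic = "2011-06-06 09:34:34.284477"  # same packet (id 15075) on the SERVER_nic
    measured, low, high = delay_bounds(router_nic, server_nic, sync_error_s=1.0)
    print(f"measured {measured:.6f} s, plausible range {low:.6f} to {high:.6f} s")
```

The measured difference is about 1.35 seconds, so with a sync error of up to a second the real figure could be anywhere from roughly a third of a second to a little over two seconds. With clocks this loosely synced, the comparison cannot confirm or rule out a delay much smaller than that.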
 
Old 06-07-2011, 10:36 AM   #11
linuxguy7820
Member
 
Registered: Mar 2011
Distribution: CentOS, RHEL, Fedora
Posts: 35

Rep: Reputation: 0
I've experienced general network slowdowns in the past. One time it turned out that one of the DNS servers (the primary) wasn't running properly, so it was taking several seconds to fail over to the secondary. Once the issue was corrected, bam, all network services started to work again.
 
Old 06-07-2011, 12:30 PM   #12
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by linuxguy7820 View Post
turned out that one of the DNS servers (primary) wasn't running properly
Linuxguy,
Thanks for your response.
I'm doing some research to find out about using tcpdump to discover dns problems.
 
Old 06-07-2011, 03:04 PM   #13
T3RM1NVT0R
Senior Member
 
Registered: Dec 2010
Location: Internet
Distribution: Linux Mint, SLES, CentOS, Red Hat
Posts: 2,385

Rep: Reputation: 477
@ Reply

To check for a dns server problem you can also use nslookup and see which server the answer comes from. If there is something wrong with the primary dns server, the answer will come from the secondary dns. You can use the following commands:

nslookup - to get into nslookup
set debug - this will show you the query you are performing and the response you are getting, will also display how much time it took to resolve the query.

At the nslookup prompt you can use "server xxx.xxx.xxx.xxx" or "server dns_name" (without the quotes) to change the dns server that the query is sent to.
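
If nslookup's output is hard to time by eye, the same check can be done with a stopwatch around the system resolver. The following is a minimal Python sketch, assuming the host name and address that appear in the tcpdump output earlier in the thread (a1.server.com, 10.10.10.199) as placeholders. It times a reverse lookup as well as a forward one, since a LAN address is resolved along a different path from an Internet name; whether that matters here is not established by the thread.

```python
import socket
import time


def time_call(func, *args):
    """Run func(*args) and return (elapsed_seconds, result_or_exception)."""
    start = time.perf_counter()
    try:
        result = func(*args)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        result = exc
    return time.perf_counter() - start, result


def time_lookups(hostname: str, address: str):
    """Return (forward_seconds, reverse_seconds) using the system resolver."""
    forward_s, _ = time_call(socket.getaddrinfo, hostname, None)
    reverse_s, _ = time_call(socket.gethostbyaddr, address)
    return forward_s, reverse_s


if __name__ == "__main__":
    # Placeholder name and address; substitute hosts from your own LAN.
    forward_s, reverse_s = time_lookups("a1.server.com", "10.10.10.199")
    print(f"forward lookup: {forward_s:.3f} s")
    print(f"reverse lookup: {reverse_s:.3f} s")
```

A failed lookup is timed too, because a lookup that fails only after several seconds is exactly the kind of failover delay described above.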
 
Old 06-07-2011, 04:43 PM   #14
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by T3RM1NVT0R View Post
nslookup - to get into nslookup
set debug - this will show you the query you are performing and the response you are getting, will also display how much time it took to resolve the query.
It appears there's no problem with dns on my network.
Testing from several different clients obtains the same answer (the ISP's primary dns) with no delay.

sidenote: still cannot complete the substitute switch test due to some co-workers staying late to finish a project.
 
Old 06-08-2011, 01:16 PM   #15
Sum1
Member
 
Registered: Jul 2007
Distribution: Fedora, CentOS, and would like to get back to Gentoo
Posts: 332

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by T3RM1NVT0R View Post
3. Install wireshark on a client and on the server, send a simple ping, and compare the time at which the client sends the request with the time at which the server sees it. This will tell us the exact difference in time.
T3RM1,

Wireshark on client and tcpdump on server.
Shown below is the last request/reply in a 500 count series.
No packet loss.

Clock Sync Issue: unable to sync clocks between client/server; at any given time there was observable difference of 1-3 seconds.

Client Ping to Server:

Logged on Server nic (tcpdump):

10:34:33.291392 IP 10.10.10.185 > 10.10.10.199: ICMP echo request, id 512, seq 34818, length 40
10:34:33.291398 IP 10.10.10.199 > 10.10.10.185: ICMP echo reply, id 512, seq 34818, length 40

Logged on Client nic (Wireshark):

10.10.10.185(Client) 10.10.10.199(Server) ICMP 74 Echo (ping) request id=0x0200, seq=34818/648, ttl=128
Arrival Time: Jun 8, 2011 10:34:30.823649000 Eastern Daylight Time
10.10.10.199(Server) 10.10.10.185(Client) ICMP 74 Echo (ping) reply id=0x0200, seq=34818/648, ttl=64
Arrival Time: Jun 8, 2011 10:34:30.823797000 Eastern Daylight Time
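
Even with unsynced clocks, these four timestamps can still be compared, because the gap between request and reply within one capture does not depend on the other machine's clock. The Python sketch below works through that arithmetic. It assumes the capture with the shorter gap (the tcpdump lines) was taken on the host answering the ping and the other (the Wireshark lines) on the host sending it. The function name analyze is made up, and the offset formula is the usual NTP-style midpoint estimate, not something from the thread.

```python
from datetime import datetime, timedelta

US = timedelta(microseconds=1)


def parse(ts: str) -> datetime:
    """Parse an HH:MM:SS.ffffff capture timestamp."""
    return datetime.strptime(ts, "%H:%M:%S.%f")


def analyze(t1, t2, t3, t4):
    """t1/t4: request sent and reply received on the pinging host.
    t2/t3: request received and reply sent on the answering host.
    Returns (turnaround, round_trip, network, clock_offset) in microseconds."""
    turnaround = (t3 - t2) // US       # time spent inside the answering host
    round_trip = (t4 - t1) // US       # what the pinging host experienced
    network = round_trip - turnaround  # time spent on the wire and in the switch
    clock_offset = (((t2 - t1) + (t3 - t4)) / 2) // US
    return turnaround, round_trip, network, clock_offset


if __name__ == "__main__":
    # Wireshark lines (nanosecond field truncated to microseconds).
    t1 = parse("10:34:30.823649")
    t4 = parse("10:34:30.823797")
    # tcpdump lines.
    t2 = parse("10:34:33.291392")
    t3 = parse("10:34:33.291398")
    print(analyze(t1, t2, t3, t4))  # (6, 148, 142, 2467672)
```

Under those assumptions the ping spent about 142 microseconds on the network in both directions combined, and the two clocks differed by about 2.47 seconds, which is consistent with the 1-3 second sync difference noted above.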
 
  

