LinuxQuestions.org

emgee3 05-06-2009 09:01 PM

Trying to troubleshoot system (network?) pauses
 
I'm running Debian 5 PPC on an Apple G4 that I'm using as a router. Since I installed it, I have had a strange problem: every several minutes (I've seen gaps of between 4 and 6 minutes, but it's not strictly every 4-6 minutes) the system pauses and doesn't accept any input for 1-3 minutes. Then it continues on as if nothing happened.

I'm not running X, this is all from the command line. I noticed this once or twice when setting up the system, but didn't think much about it. After I installed the router, it's gotten a bit more annoying, as I ssh to a server behind the router and mid-typing all of a sudden I have to wait.

I suspect it's coming from one of the network cards, based on some Google searches I've done, but I can't tell if I'm describing the problem properly.

This is the lspci output that shows the network cards:
Code:

01:02.0 Ethernet controller: VIA Technologies, Inc. VT86C100A [Rhine] (rev 06)
01:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
01:04.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)

I've also checked various log files but I don't see any messages around that time period that are suspicious to me.

I've had this problem in a Vyatta 4 install as well (which is based on Debian Lenny), on completely different hardware, but the same brand network card.

rylan76 05-07-2009 02:41 AM

Ouch - these types of intermittent problems are the hardest to solve.

Have you tried "downing" one or all of the network cards, and then seeing if the problem persists?

(I don't know if it will have any effect - I don't know enough about Linux architecture. Even for a "downed" interface, the drivers and associated software are still present in the kernel... as far as I know.)

You do not mention it, but do you get these lockups at the physical console commandline for that system, or ONLY if you are SSH'ed in from a remote connection somewhere?

I. e. what I'd do is:

Code:

# /sbin/ifconfig eth0 down
# /sbin/ifconfig eth1 down
# /sbin/ifconfig eth2 down

Then, wait or use the system for at least an hour to see if the lockups persist. If it is suddenly working WITHOUT lockups, you know that at least one of the cards is the culprit.

If you only get the lockups when behind a SSH'ed connection, try and selectively disable the other cards. I. e. if you know you are using eth1 to SSH in (that's where you get the lockups) try downing eth0 and eth2 and seeing if the problem persists.

Also, you can try downing the cards and then unloading their modules with the "rmmod" command. Of course, try and use the simplest topology possible when testing - i. e. don't have other routers, gateways, proxies or firewalls between you and the server you are trying to fix - they will only complicate matters since the error might be almost -anywhere- if you have too many factors involved in it.
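
A minimal sketch of that "down, then rmmod" step. The module names are assumptions: 8139too and via-rhine are the usual drivers for the RTL-8139 and VIA Rhine cards in the lspci listing, but check lsmod for what is actually loaded. Both Realtek cards would share one module, so unloading it takes both offline. The unload_nic helper and its DRY_RUN switch are made up for this example.

```shell
#!/bin/sh
# unload_nic IFACE MODULE
# Take one interface down, then unload its driver module.
# Set DRY_RUN=1 to print the commands instead of running them.
unload_nic() {
    iface=$1
    module=$2
    for cmd in "/sbin/ifconfig $iface down" "/sbin/rmmod $module"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$cmd"
        else
            # Unquoted on purpose so the string splits into command + arguments.
            $cmd || return 1
        fi
    done
}

# Example (as root; module name is an assumption, confirm with lsmod):
#   DRY_RUN=1 unload_nic eth1 8139too
#   unload_nic eth1 8139too
```

Running it with DRY_RUN=1 first shows what would be executed, which helps avoid dropping the interface you are SSH'ed in through.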

What is the load on that system? I have encountered something similar once when the kernel got busy on an older system of mine. I did not compile that kernel with DMA for my motherboard, and when the system got busy I used to get "micro lockups" of about 30 to 45 seconds, exactly the way you describe yours. Are you sure you have DMA enabled, and that it works? No idea how this applies on an Apple, but I also noticed this on yet another kernel I was using - network throughput was slow, and although the system did not hang, it got sluggish if there was lots of network traffic - I had to recompile the kernel with DMA support for my motherboard chipset, and after that it was fixed...

You might have a network buffer or something that is overflowing, do you see anything relevant in dmesg or in the kernel's logs or your network logs? While the system is "input locked", does it still respond to network traffic / pings? Try, for example, FTP'ing in while it is input locked - does the FTP connection work and is it responsive? I. e. it might be a protocol or port that is getting blocked for some reason, if, for example, your SSH session is locked down but FTP is working...?
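
One rough way to compare protocols during a pause is to run timestamped checks from another machine and see which ones fail and when. This is only a sketch: the probe and monitor helpers are made up for this example, 192.0.2.1 is a placeholder for the router's address, date +%s assumes GNU date, and the nc line assumes a netcat that supports -z and -w.

```shell
#!/bin/sh
# probe CMD [ARGS...]
# Run one check and print "<epoch seconds> ok" or "<epoch seconds> FAIL".
probe() {
    if "$@" >/dev/null 2>&1; then
        echo "$(date +%s) ok"
    else
        echo "$(date +%s) FAIL"
    fi
}

# monitor LOGFILE CMD [ARGS...]
# Repeat the check about once a second, appending each result to LOGFILE.
# Runs until interrupted with Ctrl-C.
monitor() {
    log=$1
    shift
    while sleep 1; do
        probe "$@" >> "$log"
    done
}

# Example, one per terminal (address is a placeholder):
#   monitor ping.log ping -c 1 -W 2 192.0.2.1
#   monitor ssh.log  nc -z -w 2 192.0.2.1 22
#   monitor ftp.log  nc -z -w 2 192.0.2.1 21
```

If the ping log keeps saying ok while the TCP logs show FAIL over the same seconds, that would point at something protocol or port specific rather than the link as a whole.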

It can be just about anything.

You'll need to do some elimination here first, and the best way is to start at the simplest possible configuration and then slowly add complexity. The problem you describe can be caused by -very- many different factors, not all necessarily integral to your software, hardware or network infrastructure. It might be a combination of all three, one aspect only, or something else that might be exceedingly trivial, or extremely complex to solve.

Hope this gives you some ideas...

farslayer 05-07-2009 09:23 AM

Anything in the logs when this happens ?


NETDEV WATCHDOG: eth1: transmit timed out
eth1: Tx timed out, lost interrupt?


or anything else bizarre ?
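
A small sketch for searching the kernel messages for those two lines. The find_nic_timeouts name is made up for this example, and /var/log/kern.log is the usual Debian location but may differ on other setups.

```shell
#!/bin/sh
# find_nic_timeouts
# Read kernel log text on stdin and print only the lines that contain
# either of the two transmit-timeout messages quoted above.
find_nic_timeouts() {
    grep -E 'NETDEV WATCHDOG|Tx timed out'
}

# Example:
#   dmesg | find_nic_timeouts
#   find_nic_timeouts < /var/log/kern.log
```

No output just means neither message was logged; it does not rule out other driver problems.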

emgee3 05-07-2009 10:40 PM

Thanks rylan76 and farslayer for your replies.

Quote:

Anything in the logs when this happens ?
I checked all the logs in /var/log and there are no entries that coincide with the times of a pause. Which is just strange.

Quote:

Try, for example, FTP'ing in while it is input locked - does the FTP connection work and is it responsive? I. e. it might be a protocol or port that is getting blocked for some reason, if, for example, your SSH session is locked down but FTP is working...?
Another factor is that established network connections stay connected during this time, but no traffic makes it through. The exception is if the timeout is set differently. I've had some connections, usually large http transfers, time out during a pause. But SSH connections and most everything else stay connected, but with no traffic, during the pause.

-----

I can only SSH in from the main interface, and that's mostly how I connect in to work on it.

The place the computer is physically located is not easy for me to get to or stay for any length of time, but it's not being used for production work, so I'm going to get it and see if I still get the lockups from the console.

The DMA enabled is also a good thing to check, but I've got no idea how or if that applies to a PPC box either.

But you've given a couple things to test, so that's good. I'm going to try at the console with the ethernet cards down, and leave a terminal running top open and see if anything happens then.

emgee3 05-08-2009 03:04 PM

Ok, a bit more troubleshooting done. The pauses occur even with the two PCI ethernet cards downed.

Also, when there's a pause, the console is still responsive. So it does seem to be networking related, not the whole system.

I left a top open during a freeze and there was nothing out of the ordinary. CPU usage never seems to climb above 10%.

farslayer 05-08-2009 03:14 PM

Install and run itop to see if there's an interrupt issue causing the pauses.. maybe you have a piece of hardware that is freaking out and sending tons of interrupts..


itop -a

emgee3 05-08-2009 04:21 PM

So using itop, I got the following:

Code:

 18 [         MESH]    0 Ints/s    (max:    0)
 20 [   NMI - XMON]    0 Ints/s    (max:    0)
 21 [      pcilynx]    0 Ints/s    (max:    0)
 24 [         eth1]  183 Ints/s    (max:  250)
 25 [         eth2]    0 Ints/s    (max:   92)
 26 [         ide1]    0 Ints/s    (max:   57)
 27 [         PMac]    0 Ints/s    (max:    0)
 28 [ohci_hcd:usb1]    0 Ints/s    (max:    0)
 29 [  PMac Output]    0 Ints/s    (max:    0)
 30 [   PMac Input]    0 Ints/s    (max:    0)
 31 [        SWIM3]    0 Ints/s    (max:    0)
 33 [          ADB]    0 Ints/s    (max:    0)
 34 [         ide0]    0 Ints/s    (max:    0)
 36 [   BMAC-txdma]   64 Ints/s    (max:  127)
 37 [   BMAC-rxdma]  120 Ints/s    (max:  139)
 42 [    BMAC-misc]    0 Ints/s    (max:    0)

This was during a large http transfer from behind the router. When the freeze came, eth1 dropped down to 2 Ints/s and everything else dropped down to 0.

I'm not sure what the normal range of these are.
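
For a record that can be compared later, similar per-second figures can be worked out from two snapshots of /proc/interrupts. This is a sketch under assumptions: irq_rates is a made-up helper, it only handles numbered IRQ lines, and it identifies each line by IRQ number rather than device name.

```shell
#!/bin/sh
# irq_rates BEFORE AFTER SECONDS
# BEFORE and AFTER are saved copies of /proc/interrupts taken SECONDS apart.
# Prints "<irq> <interrupts per second>" for every numbered IRQ in both files.
irq_rates() {
    awk -v secs="$3" '
        $1 ~ /^[0-9]+:$/ {
            irq = $1
            sub(/:$/, "", irq)
            # Sum the count columns (one per CPU); stop at the first
            # non-numeric field, which is the interrupt controller name.
            total = 0
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
            if (FNR == NR)
                before[irq] = total          # still reading the first file
            else if (irq in before)
                printf "%s %d\n", irq, (total - before[irq]) / secs
        }' "$1" "$2"
}

# Example:
#   cat /proc/interrupts > before.txt
#   sleep 5
#   cat /proc/interrupts > after.txt
#   irq_rates before.txt after.txt 5
```

Taking one pair of snapshots during normal traffic and another during a freeze gives numbers that can be compared side by side.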

farslayer 05-10-2009 10:03 PM

Hmm, not familiar enough with the tool, but I was expecting something to go haywire with interrupts when the pause occurred if that was the issue..

This blog has some other interesting tools, that might be worth looking at.
http://prefetch.net/blog/index.php/c...lities/page/2/
be sure to scroll back through previous posts, for a lot of additional tools that can be used for diagnostics.


Without knowing the source of the problem, how does one figure out what to look at.. I guess that is the ultimate question..

emgee3 05-10-2009 10:56 PM

Quote:

This blog has some other interesting tools, that might be worth looking at.
http://prefetch.net/blog/index.php/c...lities/page/2/
be sure to scroll back through previous posts, for a lot of additional tools that can be used for diagnostics.
That looks like a great set of resources. Unfortunately for my G4, I dumpstered it earlier today, after:
  1. Removing all unnecessary hardware
  2. Swapping all the network cards
  3. Compiling a custom kernel
  4. Reinstalling Debian

Thank you farslayer and rylan76 for your assistance.

emgee3 05-21-2009 01:11 AM

must be a Debian bug...
 
So, I installed the same setup from scratch on a Celeron 600 I had lying around, using different network cards. The crazy thing is it started doing the same thing!

I figured I'd list the software I installed on it in case it gives any clues:

base debian lenny install
firehol
dansguardian
squid

firehol is set to reroute http and https to port 8080, where dansguardian listens, and dansguardian in turn uses squid.

All fine and dandy, but it's not just http traffic that's getting pauses. Even SSH and FTP sessions pause.

Here's the odd thing. I installed SUSE on the same computer, set up the same software and no pauses. I'm not sure what more data to collect for a bug report.
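
As a starting point for a report, here is a sketch that gathers the kernel version, the versions of the packages involved, and the ethernet devices into one text file. The collect_report helper is made up for this example, and what the maintainers actually want may differ. Debian's reportbug tool can then be used to file the report itself.

```shell
#!/bin/sh
# collect_report [PACKAGE...]
# Print kernel, package and ethernet device details for a bug report.
# With no package names, dpkg-query lists every installed package.
collect_report() {
    echo "== kernel =="
    uname -a
    echo "== packages =="
    if command -v dpkg-query >/dev/null 2>&1; then
        # One "name version" line per package named on the command line.
        dpkg-query -W -f='${Package} ${Version}\n' "$@" 2>/dev/null || true
    else
        echo "dpkg-query not found"
    fi
    echo "== ethernet devices =="
    if command -v lspci >/dev/null 2>&1; then
        lspci | grep -i ethernet || true
    fi
}

# Example:
#   collect_report firehol dansguardian squid > report.txt
```

Running the same collection on the SUSE install (with rpm in place of dpkg-query) would give a side-by-side view of kernel and package versions between the working and non-working setups.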


All times are GMT -5.