Trying to troubleshoot system (network?) pauses
I'm running Debian 5 PPC on an Apple G4 that I'm using as a router. Since I installed it, I've had a strange problem where every several minutes (I've seen it between 4 and 6 minutes, but not every 4-6 minutes) the system pauses and doesn't accept any input for 1-3 minutes. Then it continues on as if nothing happened.
I'm not running X, this is all from the command line. I noticed this once or twice when setting up the system, but didn't think much about it. After I installed the router, it's gotten a bit more annoying, as I ssh to a server behind the router and mid-typing all of a sudden I have to wait.
I suspect it's coming from one of the network cards, based on some Google searches I've done, but I can't tell if I'm describing the problem properly.
This is the lspci output that shows the network cards:
I've had this problem in a Vyatta 4 install as well (which is based on Debian Lenny), on completely different hardware, but the same brand network card.
Ouch - these types of intermittent problems are the hardest to solve.
Have you tried "downing" one or all of the network cards, and then seeing if the problem persists?
(I don't know if it will have any effect - I don't know enough about Linux architecture. Even for a "downed" interface, the drivers and associated software are still present in the kernel... as far as I know.)
You do not mention it, but do you get these lockups at the physical console command line for that system, or ONLY if you are SSH'ed in from a remote connection somewhere?
I.e. what I'd do is:
If you only get the lockups over an SSH'ed connection, try to selectively disable the other cards. I.e. if you know you are using eth1 to SSH in (that's where you get the lockups), try downing eth0 and eth2 and see if the problem persists.
Also, you can try downing the cards and then unloading their modules with the "rmmod" command. Of course, try to use the simplest topology possible when testing - i.e. don't have other routers, gateways, proxies or firewalls between you and the server you are trying to fix - they will only complicate matters, since the error might be almost -anywhere- if you have too many factors involved.
What is the load on that system? I encountered something similar once when the kernel got busy on an older system of mine. I had not compiled that kernel with DMA for my motherboard, and when the system got busy I used to get "micro lockups" of about 30 to 45 seconds, exactly the way you describe yours. Are you sure you have DMA enabled, and that it works? No idea how this applies on an Apple, but I also noticed this on yet another kernel I was using - network throughput was slow, and although the system did not hang, it got sluggish if there was lots of network traffic - I had to recompile the kernel with DMA support for my motherboard chipset, and after that it was fixed...
You might have a network buffer or something that is overflowing. Do you see anything relevant in dmesg, the kernel's logs or your network logs? While the system is "input locked", does it still respond to network traffic / pings? Try, for example, FTP'ing in while it is input locked - does the FTP connection work and is it responsive? I.e. it might be a protocol or port that is getting blocked for some reason, if, for example, your SSH session is locked up but FTP is working...?
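If you'd rather not sit there retrying FTP and SSH by hand, something like the following could do the checking for you. This is only a rough sketch in Python, run from another machine on the LAN: it repeatedly tries to open a TCP connection to each port and then groups consecutive failures into "stalls". The function names, the 5-second interval and the 3-failure threshold are all my own assumptions, not anything from your setup.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_stalls(results, min_failures=3):
    """Group consecutive failed probes into (first_failure, last_failure) pairs.

    results is a list of (timestamp, ok) tuples in time order. Runs shorter
    than min_failures are ignored so a single dropped probe is not reported.
    """
    stalls = []
    run = []
    for ts, ok in results:
        if not ok:
            run.append(ts)
            continue
        if len(run) >= min_failures:
            stalls.append((run[0], run[-1]))
        run = []
    # A stall that is still in progress at the end of the capture counts too.
    if len(run) >= min_failures:
        stalls.append((run[0], run[-1]))
    return stalls


def watch(host, ports, interval=5.0, rounds=120):
    """Probe every port once per interval and return the history per port."""
    history = {port: [] for port in ports}
    for _ in range(rounds):
        now = time.time()
        for port in ports:
            history[port].append((now, probe(host, port)))
        time.sleep(interval)
    return history
```

Calling watch() with the router's address and ports 21 and 22, then passing each port's history to find_stalls(), should show whether SSH and FTP stall at the same moments or independently. If both go quiet together, a single blocked protocol or port looks less likely.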
It can be just about anything.
You'll need to do some elimination here first, and the best way is to start at the simplest possible configuration and then slowly add complexity. The problem you describe can be caused by -very- many different factors, not all necessarily integral to your software, hardware or network infrastructure. It might be a combination of all three, one aspect only, or something else that might be exceedingly trivial, or extremely complex to solve.
Hope this gives you some ideas...
Anything in the logs when this happens?
NETDEV WATCHDOG: eth1: transmit timed out
eth1: Tx timed out, lost interrupt?
or anything else bizarre?
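To save scrolling through the logs by eye, a small Python sketch like this could pull out the lines of interest. The two patterns are built only from the two messages quoted above; the function names and the idea of scanning a saved copy of the log are my own assumptions, so adjust to wherever your kernel messages actually end up.

```python
import re

# Patterns based on the two example messages above; extend as needed.
PATTERNS = [
    re.compile(r"NETDEV WATCHDOG: (?P<iface>\w+): transmit timed out"),
    re.compile(r"(?P<iface>\w+): Tx timed out"),
]


def find_timeouts(lines):
    """Return (line_number, interface, text) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((number, match.group("iface"), line.strip()))
                break  # one hit per line is enough
    return hits


def scan_file(path):
    """Scan a saved log file, for example dmesg output redirected to a file."""
    with open(path, errors="replace") as handle:
        return find_timeouts(handle)
```

If the same interface name keeps turning up in the hits around the times of the pauses, that card or its driver would be the first thing to look at.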
Thanks rylan76 and farslayer for your replies.
I can only SSH in from the main interface, and that's mostly how I connect in to work on it.
The place the computer is physically located is not easy for me to get to or stay for any length of time, but it's not being used for production work, so I'm going to get it and see if I still get the lockups from the console.
Whether DMA is enabled is also a good thing to check, but I've got no idea how or if that applies to a PPC box either.
But you've given a couple things to test, so that's good. I'm going to try at the console with the ethernet cards down, and leave a terminal running top open and see if anything happens then.
Ok, a bit more troubleshooting done. The pauses occur even with the two PCI ethernet cards downed.
Also, when there's a pause, the console is still responsive. So it does seem to be networking related, not the whole system.
I left a top open during a freeze and there was nothing out of the ordinary. CPU usage never seems to climb above 10%.
Install and run itop to see if there's an interrupt issue causing the pauses. Maybe you have a piece of hardware that is freaking out and sending tons of interrupts.
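If itop isn't handy, the raw counters in /proc/interrupts can be compared directly. Here is a rough Python sketch that takes two snapshots and reports which IRQs grew the most in between. It assumes the usual layout of that file (a header row of CPU names, then one "label: counts... description" row per IRQ); I haven't checked it on a PPC box, and the function names are made up for illustration.

```python
def parse_interrupts(text):
    """Map each IRQ label to its count summed over all CPUs."""
    totals = {}
    lines = text.splitlines()
    if not lines:
        return totals
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Only the first cpu_count fields are counters; the rest is description.
        counts = [int(f) for f in fields[:cpu_count] if f.isdigit()]
        if counts:
            totals[label.strip()] = sum(counts)
    return totals


def busiest(before, after, top=5):
    """Return the `top` IRQ labels with the largest increase between snapshots."""
    deltas = {irq: after[irq] - before.get(irq, 0) for irq in after}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]


def snapshot(path="/proc/interrupts"):
    """Read and parse the current interrupt counters."""
    with open(path) as handle:
        return parse_interrupts(handle.read())
```

Take one snapshot() during a quiet spell and another a fixed time later, then do the same across a pause, and compare the busiest() output for the two intervals. That gives a baseline for what "normal" looks like on this particular machine.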
So using itop, I got the following:
I'm not sure what the normal range of these are.
Hmm, I'm not familiar enough with the tool, but I was expecting something to go haywire with interrupts when the pause occurred, if that was the issue.
This blog has some other interesting tools that might be worth looking at.
Be sure to scroll back through previous posts for a lot of additional tools that can be used for diagnostics.
Without knowing the source of the problem, how does one figure out what to look at? I guess that is the ultimate question.
Thank you farslayer and rylan76 for your assistance.
must be a Debian bug...
So, I installed the same setup from scratch on a Celeron 600 I had lying around, using different network cards. The crazy thing is it started doing the same thing!
I figured I'd list the software I installed on it in case it gives any clues:
base debian lenny install
firehol is set to reroute HTTP and HTTPS to port 8080, where dansguardian listens; dansguardian in turn uses squid.
All fine and dandy, but it's not just HTTP traffic that pauses. SSH and FTP do too.
Here's the odd thing. I installed SUSE on the same computer, set up the same software and no pauses. I'm not sure what more data to collect for a bug report.