Linux system stalling every few minutes, yet no errors??
I have a Gentoo Linux system with a VERY annoying, hard-to-track problem: it freezes up for 2-15 secs every minute or so. I have gone through a few ideas, such as watching dmesg/messages and using the performance tools to look for problems.
No log errors of any kind, and I am not sure that this I/O and CPU utilization is normal (it is constantly this high). This system is a net-flow receiver and our MRTG system, so it has a healthy in and out of bursty net traffic. MRTG is run from cron every 5 mins and watches about 60 devices. There are about 10 net-flow sources hitting me too. I would just like some help on learning where I can look next. :)
Nick

Every 2.0s: iostat                                   Sat Dec 30 07:36:34 2006

Linux 2.6.19-gentoo-r2 (poindexter)     12/30/06

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.41    0.00    4.64   37.75    0.00   46.20

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              89.21        98.59      1398.02     866602   12288704
sdb               2.11        33.94        14.54     298318     127808

Every 2.0s: mpstat                                   Sat Dec 30 07:36:53 2006

Linux 2.6.19-gentoo-r2 (poindexter)     12/30/06

07:36:53  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:36:53  all   11.40    0.00    4.28   37.85    0.06    0.30    0.00   46.10    409.99
|
You can check to see how your syslog daemon is configured to report errors. You will find that information in the /etc/syslog.conf file. Look for a section like this:
Code:
# Kernel logging
Even though the CPU and I/O appear to be in normal range, you could still check to see what is eating up these resources using the top utility. Run top in a command line window or in a text console.
You can use Ethereal or tcpdump to see what network traffic is coming to your machine. Network traffic would be handled with a high priority. If you are getting a lot of bogus network connect requests, that would show up as degraded interactive response.
It is entirely possible that you have a broken hardware device. It could be a hard disk with a lot of bad blocks, a NIC that is broken but appears to be working, a bad network cable, a bad power supply, or whatever.
I hope that I've provided some useful ideas. |
Hey Stress_Junkie,
I use syslog_ng (my favorite), and I do have debug messages (really, everything other than the stuff I log separately) going to /var/log/debug.log_ng, and it is clean. When I watch top, it stalls as well, and when it frees up I am never sure whether I missed what stalled it. I will look at my network card... that's a thought. The Cisco 3550G-12 switch shows no port errors. But as you say, if I am getting a lot of network traffic, that would degrade me, and the spurts of net-flow and outgoing SNMP from MRTG may be it. I will also try blocking that traffic and shutting off MRTG to see if it frees me up. Thanks! Nick
|
One thing about Linux is that it has terrible memory management. How long has this machine been running? Can you reboot it to see if that improves the performance? It is possible that your normal work load is simply more than the memory manager can handle. I know that on a workstation, if I start Firefox, run some video streaming content, and run badblocks from a console, the performance of the machine will eventually degrade. I think it is because the video streaming content puts too much of a strain on the memory manager, and this is made visible by the badblocks utility writing to the disks. In other words, running a memory hog (video streaming) plus a real-time utility (badblocks) brings out the weaknesses in the job controller and in the memory manager. Your workload may just be more than Linux can handle. After all, Linux is good but it isn't in the same league as Solaris.
Device-level troubleshooting is fairly simple but very tedious and time consuming. I think that I would start taking hardware devices out of the machine to see if removing one of them fixes the problem. You can start by just unplugging the network cable. See if that helps. If so, replace the network cable. If things are still looking good, throw the original network cable in the trash. And so you go on with all of the machine's hardware. If it is possible to swap one device for another, all the better. If not, just remove the device if possible. You can always boot a live CD when you disconnect the hard disks, for example. Even though the technique is simple, it is not so simple to find the problem. If device swapping doesn't help, then of course you have to move on to looking at software. |
Funny you should mention that: I had just done a rebuild of the entire portage tree (600 apps) and rebooted. It's a weekly routine, as this is a work server for just the network operations area. I noticed the slowdown right away after the reboot, as I went straight to editing a config file for syslog_ng of all things, trying to separate out the bash logger into its own log.
I am going to VPN into work and kill MRTG and Net-Flow for 10 mins. As I normally get stalls every other minute reliably, this ought to tell us something. Will post in a few. Nick
|
Found a possible problem. Your tip on memory had me try "watch free" while I was editing files (that seemed to trip it fastest). My system has 1 Gig, but rarely has over 60 megs free.
I tried turning off a few things, mainly nessusd. Free memory went to 109 megs, and it was a long time before I saw any sign of even a slight stall. Memory is cheap; perhaps just adding a bit more will help. The network card was clean of errors, BTW. I am going to kill a few more unnecessary items, maybe just Apache first, while leaving my collectors running. Nick |
Mmmmm - I'd be thinking swap contending with normal I/O, especially with high wait times. Try running vmstat across a time period where you see a slow-down - it'll give you an idea of I/O load and swap load. See if they correlate.
|
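The correlation check suggested above can also be scripted. What follows is only a rough sketch, not what vmstat itself does: it assumes a Linux /proc filesystem that exposes the pswpin/pswpout/pgpgin/pgpgout counters in /proc/vmstat and an iowait field on the aggregate cpu line of /proc/stat. The function names (parse_vmstat, parse_iowait, delta, watch) are made up for this example.

```python
import time


def parse_vmstat(text):
    """Return {name: int} from the 'name value' lines of /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def parse_iowait(stat_text):
    """Return cumulative iowait ticks from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Fields after the label: user nice system idle iowait ...
            return int(fields[5])
    raise ValueError("no aggregate cpu line found")


def delta(before, after, keys=("pswpin", "pswpout", "pgpgin", "pgpgout")):
    """Per-interval change of the swap and paging counters."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


def read(path):
    with open(path) as fh:
        return fh.read()


def watch(interval=2.0, samples=30):
    """Print one line per interval; stalls should show up as iowait spikes."""
    prev_vm = parse_vmstat(read("/proc/vmstat"))
    prev_io = parse_iowait(read("/proc/stat"))
    for _ in range(samples):
        time.sleep(interval)
        vm = parse_vmstat(read("/proc/vmstat"))
        io = parse_iowait(read("/proc/stat"))
        print(time.strftime("%H:%M:%S"),
              "iowait_ticks=%d" % (io - prev_io), delta(prev_vm, vm))
        prev_vm, prev_io = vm, io


if __name__ == "__main__":
    watch()
```

If the iowait spikes line up with non-zero pswpin/pswpout deltas, swap is a likely suspect; if the swap deltas stay at zero while iowait climbs, ordinary disk I/O is the more probable cause.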
OK, I will look at that when I get into work in the morning. Thanks!
If I do see a correlation, any suggestions? Would more RAM lessen the need for swap? (I am assuming so) Nick |
Short answer, Yes.
Paging (swapping if you will) is just part of a well managed system. When it interferes with the real work, then it's a problem. In a normal environment, minimizing paging is a sensible goal. Easiest way to do that is to provide more (real) memory - on any recent x86 hardware (with PAE), all the way up to 64 Gig unless I'm mistaken. Whether it's actually necessary, and the cost/benefit is part of the fun ... |
Ok, took a real good look at memory this morning.
When I came in:

top - 06:42:23 up 3 days,  1:32,  3 users,  load average: 3.07, 2.58, 2.69
Tasks:  85 total,   2 running,  83 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.0%us,  4.0%sy,  0.0%ni, 41.8%id, 45.5%wa,  0.2%hi,  0.5%si,  0.0%st
Mem:   1035208k total,   925428k used,   109780k free,   115548k buffers
Swap:  2008116k total,      144k used,  2007972k free,   629988k cached

After a reboot, with all services up for 10 mins (two MRTG sweeps, several complete net-flows recorded):

top - 07:14:44 up 15 min,  2 users,  load average: 0.03, 0.09, 0.15
Tasks:  76 total,   1 running,  75 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.0%sy,  0.0%ni, 98.8%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1035208k total,   369316k used,   665892k free,    39976k buffers
Swap:  2008116k total,        0k used,  2008116k free,   233584k cached

Before I bounced the box, I slowly killed each running app to see which one would release the most memory. I was down to system essentials and still had only 157 megs free. How would I go about tracking down what is eating so much RAM? (I assume this must be a leak?) Nick
|
Ok, it's been running for almost 90 mins and another 447M has disappeared... Do I have a colander for a memory manager??
Forgive the longer complete TOP posting, but it shows a few of the processes I run.

top - 08:27:42 up  1:27,  2 users,  load average: 0.07, 0.05, 0.25
Tasks:  76 total,   1 running,  75 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.0%sy,  0.0%ni, 98.5%id,  1.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1035208k total,   816904k used,   218304k free,   185380k buffers
Swap:  2008116k total,        0k used,  2008116k free,   497544k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 8610 root      15   0  6244 4712  480 S    1  0.5   0:23.27 flow-capture
 5974 root      15   0  2160 1092  820 R    0  0.1   0:00.34 top
    1 root      15   0  1532  520  452 S    0  0.1   0:00.96 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.29 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.28 migration/1
    5 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1
    6 root      10  -5     0    0    0 S    0  0.0   0:00.00 events/0
    7 root      10  -5     0    0    0 S    0  0.0   0:00.00 events/1
    8 root      10  -5     0    0    0 S    0  0.0   0:00.00 khelper
    9 root      12  -5     0    0    0 S    0  0.0   0:00.00 kthread
   60 root      10  -5     0    0    0 S    0  0.0   0:00.22 kblockd/0
   61 root      10  -5     0    0    0 S    0  0.0   0:00.23 kblockd/1
   62 root      16  -5     0    0    0 S    0  0.0   0:00.00 kacpid
  139 root      10  -5     0    0    0 S    0  0.0   0:00.00 kseriod
  140 root      16  -5     0    0    0 S    0  0.0   0:00.00 ata/0
  141 root      17  -5     0    0    0 S    0  0.0   0:00.00 ata/1
  142 root      17  -5     0    0    0 S    0  0.0   0:00.00 ata_aux
  143 root      17  -5     0    0    0 S    0  0.0   0:00.00 ksuspend_usbd
  146 root      10  -5     0    0    0 S    0  0.0   0:00.00 khubd
  159 root      16  -5     0    0    0 S    0  0.0   0:00.00 khpsbpkt
  182 root      21   0     0    0    0 S    0  0.0   0:00.00 pdflush
  183 root      15   0     0    0    0 S    0  0.0   0:01.29 pdflush
  184 root      17  -5     0    0    0 S    0  0.0   0:00.00 kswapd0
  185 root      17  -5     0    0    0 S    0  0.0   0:00.00 aio/0
  186 root      18  -5     0    0    0 S    0  0.0   0:00.00 aio/1
  793 root      15  -5     0    0    0 S    0  0.0   0:00.00 kpsmoused
  832 root      10  -5     0    0    0 S    0  0.0   0:00.00 scsi_eh_0
  833 root      10  -5     0    0    0 S    0  0.0   0:00.00 scsi_eh_1
  880 root      15  -5     0    0    0 S    0  0.0   0:00.00 reiserfs/0
  881 root      10  -5     0    0    0 S    0  0.0   0:00.05 reiserfs/1
 1061 root      15  -4  1852  600  344 S    0  0.1   0:00.87 udevd
 6567 root      15   0  2024  744  444 S    0  0.1   0:00.04 syslog-ng
 6641 named     18   0 14336  11m 1896 S    0  1.2   0:01.69 named
 6778 mysql     19   0  138m  26m 3788 S    0  2.6   0:00.17 mysqld
 6879 root      15   0  3880 1012  712 S    0  0.1   0:00.00 sshd
 6957 root      18   0 17992 6008 3076 S    0  0.6   0:00.16 apache2
 6959 apache    20   0 17020 2600  836 S    0  0.3   0:00.00 apache2
 7189 apache    18   0 17992 3788  844 S    0  0.4   0:00.00 apache2
 7190 apache    19   0 17992 3788  844 S    0  0.4   0:00.00 apache2
 7191 apache    19   0 17992 3788  844 S    0  0.4   0:00.00 apache2
 7192 apache    19   0 17992 3788  844 S    0  0.4   0:00.00 apache2
 7193 apache    19   0 17992 3788  844 S    0  0.4   0:00.00 apache2
 8008 messageb  15   0  2092  744  604 S    0  0.1   0:00.00 dbus-daemon
 8142 root      15   0  1712  600  500 S    0  0.1   0:00.00 crond
 8209 haldaemo  18   0  8824 7304 1608 S    0  0.7   0:00.37 hald
 8210 root      18   0  2816 1016  864 S    0  0.1   0:00.00 hald-runner
 8216 haldaemo  15   0  1928  788  680 S    0  0.1   0:00.00 hald-addon-acpi
 8232 root      18   0  1736  596  524 S    0  0.1   0:00.01 hald-addon-stor
 8363 root      19   0  5296 2652 1156 S    0  0.3   0:00.00 nessusd
 8475 root      18   0  6252 1756 1336 S    0  0.2   0:00.00 master
 8517 postfix   18   0  6288 1748 1340 S    0  0.2   0:00.00 pickup
 8518 postfix   15   0  6336 1800 1384 S    0  0.2   0:00.00 qmgr
 8547 root      18   0  2188  832  668 S    0  0.1   0:00.01 xinetd
 8612 root      15   0  3572 1976  480 S    0  0.2   0:06.21 flow-capture
 8614 root      18   0  1668  524  392 S    0  0.1   0:00.00 cdp-send
 8618 root      24   0  1644  328  256 S    0  0.0   0:00.00 pamsmbd
 8619 root      17   0  3636 1040  576 S    0  0.1   0:00.00 mount.smbfs
 8626 root      10  -5     0    0    0 S    0  0.0   0:00.10 smbiod
 8639 root      18   0  1568  612  528 S    0  0.1   0:00.00 agetty
 8640 root      18   0  1564  608  528 S    0  0.1   0:00.00 agetty
 8641 root      18   0  1568  612  528 S    0  0.1   0:00.00 agetty
 8642 root      18   0  1568  612  528 S    0  0.1   0:00.00 agetty
 8643 root      18   0  1564  608  528 S    0  0.1   0:00.00 agetty
 8644 root      18   0  1568  612  528 S    0  0.1   0:00.00 agetty
 8656 root      18   0 12948 9944 1356 S    0  1.0   0:00.85 smokeping
 8657 root      17   0  6696 2152 1732 S    0  0.2   0:00.01 sshd
 8662 e19425    15   0  6836 1452 1000 S    0  0.1   0:04.85 sshd
 8663 e19425    16   0  2988 1560 1240 S    0  0.2   0:00.00 bash
 8672 root      18   0  2260 1008  776 S    0  0.1   0:00.00 su
 8673 root      15   0  2608 1576 1260 S    0  0.2   0:00.03 bash
 8694 root      17   0  6700 2144 1732 S    0  0.2   0:00.01 sshd
 8699 monitor   15   0  6700 1440 1004 S    0  0.1   0:00.00 sshd
 8700 monitor   18   0  2864 1284 1068 S    0  0.1   0:00.00 bash
 8708 monitor   15   0  1536  428  356 S    0  0.0   0:00.08 tail
 8709 monitor   18   0  3368 1600 1292 S    0  0.2   0:00.03 tacacs-watch
|
Ok, I think I found it. I killed every app that I did not NEED and rebooted: 977 megs free. 90 mins later, I had lost 2 megs... Whoo hoo, so it's one of my apps. I started one instance of flow-capture. 15 mins later I had lost over 100 megs, and it drops 100K every 10 secs.
I am going to reboot with all apps back on minus flow-capture. See what I get. |
I have now rebooted with everything running save flow-capture, and the memory drop is faster. I disabled MRTG and Smokeping (monitoring tools) and the decrease slowed to a crawl and even went the other way a few times. Those three apps are my big network users. Maybe a leak in the NIC driver? (Broadcom Corporation NetXtreme BCM5704 Gigabit)
< > Alteon AceNIC/3Com 3C985/NetGear GA620 Gigabit support
< > D-Link DL2000-based Gigabit Ethernet support
<M> Intel(R) PRO/1000 Gigabit Ethernet support
[ ]   Use Rx Polling (NAPI)
[ ]   Disable Packet Split for PCI express adapters
< > National Semiconductor DP83820 support
< > Packet Engines Hamachi GNIC-II support
< > Packet Engines Yellowfin Gigabit-NIC support (EXPERIMENTAL)
< > Realtek 8169 gigabit ethernet support
< > SiS190/SiS191 gigabit ethernet support
< > New SysKonnect GigaEthernet support
< > SysKonnect Yukon2 support (EXPERIMENTAL)
< > Marvell Yukon Chipset / SysKonnect SK-98xx Support (DEPRECATED)
< > VIA Velocity support
<*> Broadcom Tigon3 support
< > Broadcom NetXtremeII support
< > QLogic QLA3XXX Network Driver Support

I know I do not have a NetXtreme II, so I grabbed the Tigon. Perhaps I have the wrong one?

lspci
00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 09)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 09)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 09)
00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 09)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
01:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
01:04.0 System peripheral: Compaq Computer Corporation Integrated Lights Out Controller (rev 01)
01:04.2 System peripheral: Compaq Computer Corporation Integrated Lights Out Processor (rev 01)
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
06:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
06:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
|
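One hedged way to probe the driver-leak idea: memory a kernel driver allocates through the slab allocator does not show up in any process's RES column, but it is counted in the Slab line of /proc/meminfo. The sketch below assumes that file has a Slab field (it does on 2.6 kernels as far as I know); the names parse_meminfo and watch_slab are invented for the example, and allocations a driver makes outside the slab allocator would not be visible this way.

```python
import time


def parse_meminfo(text):
    """Return {field: kB} from 'Field:   12345 kB' lines of /proc/meminfo."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def watch_slab(interval=10.0, samples=30, path="/proc/meminfo"):
    """Print MemFree, Buffers, Cached and Slab so their trends can be compared."""
    for _ in range(samples):
        with open(path) as fh:
            info = parse_meminfo(fh.read())
        print(time.strftime("%H:%M:%S"),
              " ".join("%s=%dk" % (k, info.get(k, 0))
                       for k in ("MemFree", "Buffers", "Cached", "Slab")))
        time.sleep(interval)


if __name__ == "__main__":
    watch_slab()
```

If Slab keeps climbing while the network tools run and never comes back down, a kernel-side leak becomes more plausible. If instead Buffers and Cached are what grow as MemFree shrinks, the kernel is most likely just caching file data.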
It sounds like you're not quite sure which program has the memory leak. I'd recommend using 'top' to help find this out.
(1) Start top
(2) Hit 'G' then '3' to switch to memory view (that must be an uppercase 'G', not lowercase)
(3) Hit 'x' then 'b' to turn on various highlighting
(4) You will probably find that the "%MEM" column is highlighted (the highlighted column is your sort column). You want to sort on %MEM and look (over time) for the process that is consuming more and more memory.
(5) If %MEM is not your default sort column in step (4), use your '<' and '>' keys to move the highlight to the %MEM column.
|
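The "look over time" part of the steps above can also be automated. This is a minimal sketch, assuming a Linux /proc where each /proc/<pid>/status file carries Name and VmRSS lines (kernel threads have no VmRSS and are treated as 0). The helper names (parse_rss_kb, parse_name, snapshot, growth) are hypothetical and not part of top or any library.

```python
import os
import time


def parse_rss_kb(status_text):
    """Return VmRSS in kB from /proc/<pid>/status text, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0


def parse_name(status_text):
    """Return the process name from the 'Name:' line, or '?' if absent."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            parts = line.split(None, 1)
            return parts[1].strip() if len(parts) > 1 else "?"
    return "?"


def snapshot(proc_root="/proc"):
    """Map pid -> (name, rss_kb) for every process currently visible."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited between listdir and open
        result[int(entry)] = (parse_name(text), parse_rss_kb(text))
    return result


def growth(before, after):
    """List (delta_kb, pid, name) for pids in both snapshots, biggest growth first."""
    rows = []
    for pid, (name, rss) in after.items():
        if pid in before:
            rows.append((rss - before[pid][1], pid, name))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    first = snapshot()
    time.sleep(60)  # sampling interval is an arbitrary choice
    for delta_kb, pid, name in growth(first, snapshot())[:10]:
        print("%8d kB  %6d  %s" % (delta_kb, pid, name))
```

A process that keeps appearing at the top of this list across several runs is the one to look at. If nothing grows while free memory still shrinks, the memory is going somewhere other than process RSS.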
IMHO you do not have a memory problem - your issue lies elsewhere. Linux attempts to maximize the memory used, for efficiency - after all, what's the point of having it all just lying around idle? It may be a driver issue, or it might be something else - it will take some legwork (like you are doing) to determine. |
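To put numbers on that point, here is a small worked calculation using the three top summaries posted earlier in the thread. It is an approximation that assumes buffers and page cache can be reclaimed on demand (not every cached page can be); the function name app_memory_kb is made up for the example.

```python
def app_memory_kb(total, free, buffers, cached):
    """Memory that is neither free nor held as buffers/page cache, in kB."""
    return total - free - buffers - cached


# (total, free, buffers, cached) in kB, copied from the top output in this thread
samples = [
    ("before reboot (up 3 days)", 1035208, 109780, 115548, 629988),
    ("15 min after reboot", 1035208, 665892, 39976, 233584),
    ("87 min after reboot", 1035208, 218304, 185380, 497544),
]

for label, total, free, buffers, cached in samples:
    used = app_memory_kb(total, free, buffers, cached)
    print("%-26s free=%7dk  buffers+cache=%7dk  apps/kernel=%7dk"
          % (label, free, buffers + cached, used))
```

Between the second and third samples "free" fell by about 447 MB, but memory outside buffers and cache grew only from 95756k to 133980k, roughly 38 MB. Nearly all of the "lost" memory went into buffers and cache, which is consistent with the reply above rather than with a large leak.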