Server with high load average and no obvious reason.
I'm running a DB server (actually a number of DB servers). One of the servers has a load average of 12.00; yesterday it was 11. Looking back to July, the load average has constantly inched upward for no obvious reason.
I've suggested rebooting, but it is a production server, so that is not an easy alternative at this juncture.
Has anyone else ever seen the load average go high while the CPUs are 98% idle, with no other indicators of what might be causing the load?
sar -q shows 1, 2, or 5 processes in the queue but reports a load average of 12.00. Crazy.
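For reference, this is roughly what I'm looking at - the numbers below are illustrative, and the exact columns vary with the sysstat version:
Code:
# queue length and load averages, sampled every 5 seconds, 3 times
sar -q 5 3
#            runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
# 14:10:01         2       341     12.02     12.00     11.97
# the run queue is tiny, yet the load averages sit around 12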
Any ideas on where to look to solve this one?
So what does cat /proc/loadavg tell you? I'd imagine it's a problem with the proc file not getting properly updated. How long has this system been running without a reboot? And is /proc/loadavg getting updated at all? Check its latest timestamp.
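Something along these lines (just a rough sketch) will show the raw values and whether they actually move between reads:
Code:
# fields: 1-, 5-, 15-minute averages, runnable/total tasks, most recent PID
cat /proc/loadavg
sleep 60
cat /proc/loadavg   # the averages (and usually the last field) should change between reads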
Try restarting the services that you know are safe to restart. The only other option would probably be to reboot and see if the problem comes back; schedule some downtime, since it's a production machine. I've seen this myself - a rather large load average that wasn't accurate; a reboot fixed it and I never saw it come back.
Quote:
I've seen this myself - a rather large load average that wasn't accurate; a reboot fixed it and I never saw it come back.
What makes you think it wasn't accurate?
Loadavg (in Linux) is not just the run queue - it also includes tasks in uninterruptible sleep. This is usually disk wait, but not necessarily. Poorly designed code will place threads in uninterruptible sleep and "forget" about them.
I use the following to track down anything like this - stick it in a loop if needed.
Code:
top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'
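Wrapped in a loop it looks something like this - the interval and log path are just examples, adjust to taste:
Code:
# sample D-state tasks every 30 seconds and append to a log
while true; do
    date
    top -b -n 1 | awk '{if (NR <= 7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'
    sleep 30
done >> /tmp/dstate.log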
I have seen a bogus load average before, and a reboot cleared it, so I understand where TK is coming from.
TK, what services would you restart?
I can schedule a downtime, but management thinks of that as sweeping it under the rug. If we have an issue, we would like to find it rather than hide it, only to have it rear its head again after the reboot.
Since no reason has been found for the load, I'm leaning toward the reboot camp, but I have agreed to look further and ask folks like you whether you've seen anything like this. We run 20+ DB servers running an Oracle database. These servers are in what Oracle calls a RAC environment (much like a cluster). In this particular RAC I have three database servers running exactly the same code, but only one of them shows the high load average symptom.
On the contrary, it directly explains the load average. That status of "D" is uninterruptible sleep; loadavg = (runq + uninterruptible).
If you constantly have, say, 12 "D" tasks, the loadavg can never drop much below 12 - add a couple of runnable tasks and you're at 13 or 14.
Better check where those IBMDup processes are being generated - there must be a hell of a lot of them; look at the PIDs.
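If you want to see the arithmetic for yourself, count the task states and compare against the load average - a rough sketch (in ps state codes, R = runnable, D = uninterruptible sleep):
Code:
# count runnable and uninterruptible tasks, then show the load averages
ps -eo stat= | awk '/^R/ {r++} /^D/ {d++} END {print "runnable:", r+0, " uninterruptible:", d+0}'
cat /proc/loadavg
# over time the 1-minute average settles near (runnable + uninterruptible)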
Edit: as this illustrates, an unusual loadavg isn't necessarily an indicator of a (performance) problem - at least under Linux.
Sure, there's a problem, but it likely isn't directly impacting your ability to serve your users. However, if it's a symptom of something else (a flaky disk, say), you'd do well to pay it some attention.
Last edited by syg00; 10-22-2008 at 04:49 PM.
Reason: Musings
Thank you very much, syg00! You are correct in saying that it does not affect overall system performance, but we were concerned and wondering if it was bogus.
Looks as if we do indeed have an issue. I sure appreciate your help!
I have since found out that those processes are part of Dell OpenManage. Duh! At first I thought they were part of the Oracle DB we have running on that server.