Server with high load average and no obvious reason.
I'm running a DB server (actually a number of DB servers). One of the servers has a load average of 12.00; yesterday it was 11. Looking back to July, the load average has constantly inched upward for no obvious reason.
I've suggested rebooting, but it is a production server, so that is not an easy alternative at this juncture.
Has anyone else ever seen the load average go high while the CPUs are 98% idle, with no other indicators of what might be causing the load?
sar -q shows 1, 2, or 5 processes in the queue but reports a load average of 12.00. Crazy.
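For reference, this is roughly what I'm looking at - the numbers below are illustrative, and the exact columns vary with the sysstat version:
Code:
# queue length and load averages, sampled every 5 seconds, 3 times
sar -q 5 3
#            runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
# 14:10:01         2       341     12.02     12.00     11.97
# the run queue is tiny, yet the load averages sit around 12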
Any ideas on where to look to solve this one?
So what does cat /proc/loadavg tell you? I'd imagine it's a problem with the proc file not getting properly updated. How long has this system been running without a reboot? And is /proc/loadavg getting updated at all? Check its latest timestamp.
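Something along these lines (just a rough sketch) will show the raw values and whether they actually move between reads:
Code:
# fields: 1-, 5-, 15-minute averages, runnable/total tasks, most recent PID
cat /proc/loadavg
sleep 60
cat /proc/loadavg   # the averages (and usually the last field) should change between reads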
Try restarting the services that you know are safe to restart. The only other option would probably be to reboot and see if the problem comes back; schedule some downtime, since it's a production machine. I've seen this myself - a rather large load average that wasn't accurate; a reboot fixed it and I never saw it come back.
Quote:
I've seen this myself - a rather large load average that wasn't accurate; a reboot fixed it and I never saw it come back.
What makes you think it wasn't accurate?
Loadavg (in Linux) is not just the run queue - it also includes tasks in uninterruptible sleep. This is usually disk wait, but not necessarily. Poorly designed code will place threads in uninterruptible sleep and "forget" about them.
I use the following to track down anything like this - stick it in a loop if needed.
Code:
top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'
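Wrapped in a loop it looks something like this - the interval and log path are just examples, adjust to taste:
Code:
# sample D-state tasks every 30 seconds and append to a log
while true; do
    date
    top -b -n 1 | awk '{if (NR <= 7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'
    sleep 30
done >> /tmp/dstate.log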
I have seen a bogus load average before, and a reboot cleared it, so I understand where TK is coming from.
TK, what services would you restart?
I can schedule a downtime, but management thinks of that as sweeping it under the rug. If we have an issue, we would like to find it rather than hide it, only to have it rear its head again after the reboot.
Since no reason has been found for the load, I'm leaning toward the reboot camp, but I have agreed to look further and ask folks like you whether you've seen anything like this. We run 20+ DB servers running an Oracle database. These servers are in what Oracle calls a RAC environment (much like a cluster). In this particular RAC I have three database servers running exactly the same code, but only one of them shows the high load average symptom.
On the contrary, it directly explains the load average. That status of "D" is uninterruptible sleep; loadavg = (runq + uninterruptible).
If you constantly have, say, 12 "D" tasks, the loadavg can never drop much below 12 - add a couple of runnable tasks and you're at 13 or 14.
Better check where those IBMDup processes are being generated - there must be a hell of a lot of them; look at the PIDs.
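If you want to see the arithmetic for yourself, count the task states and compare against the load average - a rough sketch (in ps state codes, R = runnable, D = uninterruptible sleep):
Code:
# count runnable and uninterruptible tasks, then show the load averages
ps -eo stat= | awk '/^R/ {r++} /^D/ {d++} END {print "runnable:", r+0, " uninterruptible:", d+0}'
cat /proc/loadavg
# over time the 1-minute average settles near (runnable + uninterruptible)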
Edit: as this illustrates, an unusual loadavg isn't necessarily an indicator of a (performance) problem - at least under Linux.
Sure, there's a problem, but it likely isn't directly impacting your ability to serve your users. However, if it's a symptom of something else (a flaky disk, say), you'd do well to pay it some attention.
Last edited by syg00; 10-22-2008 at 04:49 PM.
Reason: Musings
Thank you very much, syg00! You are correct in saying that it does not affect overall system performance, but we were concerned and wondering if it was bogus.
Looks as if we do indeed have an issue. I sure appreciate your help!
I have since found out that those processes are part of Dell OpenManage. Duh! At first I thought they were part of the Oracle DB we have running on that server.