LinuxQuestions.org
Old 06-13-2011, 10:21 AM   #1
mk27
Member
 
Registered: Sep 2008
Distribution: fedora, gentoo, ubuntu
Posts: 148

Rep: Reputation: 23
Occasional, unexplainable load average rocketing


I've been a Linux user for more than a decade. My load average is almost always low (<0.2). However, lately I have been having a problem with the average suddenly soaring up to 6-8+, which completely locks the system.

This lasts a minute or two, and happens once or twice a day. I cannot find any explanation for it -- in top, there is no evidence of sudden spawning, etc. I.e., this is not due to a real increase in the number of processes.

That implies to me something strange is going on with the kernel run queue, but that is only a slightly educated guess.

Anyone know how I can track this down?
 
Old 06-13-2011, 11:58 AM   #2
amani
Senior Member
 
Registered: Jul 2006
Location: Kolkata, India
Distribution: Debian 64-bit GNU/Linux, Kubuntu64, Fedora QA, Slackware,
Posts: 2,766

Rep: Reputation: Disabled
Check (and keep) your logs.

SELinux is known to cause load increases for some applications in recent Fedora releases.
 
Old 06-13-2011, 12:27 PM   #3
mk27
Member
 
Registered: Sep 2008
Distribution: fedora, gentoo, ubuntu
Posts: 148

Original Poster
Rep: Reputation: 23
There's nothing in the logs at all. Mebbe I'll up the kernel log level to debug.
 
Old 06-13-2011, 11:41 PM   #4
Valery Reznic
ELF Statifier author
 
Registered: Oct 2007
Posts: 676

Rep: Reputation: 137
Quote:
Originally Posted by mk27 View Post
There's nothing in the logs at all. Mebbe I'll up the kernel log level to debug.
Maybe you have something in crontab?
 
Old 06-14-2011, 12:41 AM   #5
ssrameez
Member
 
Registered: Oct 2006
Location: bangalore
Distribution: Fedora, Ubuntu, Debian, Redhat
Posts: 82

Rep: Reputation: 6
I would suggest setting up a small script that takes a snapshot of the system and runs every 2 minutes.
The script can capture:
vmstat
top
ps -eaf
Send the output to log files with a timestamp.

Analyze the files after two or three days to find the culprit process.
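
A minimal sketch of such a snapshot script, in Python. The command flags (`top -b -n 1` for one batch-mode iteration), the one-log-per-command layout, and the names `COMMANDS` and `take_snapshot` are assumptions made for illustration; only the three commands themselves come from the suggestion above.

```python
import datetime
import pathlib
import subprocess

# The three commands named in the post; the flags are assumptions.
COMMANDS = {
    "vmstat": ["vmstat"],
    "top": ["top", "-b", "-n", "1"],  # batch mode, one iteration
    "ps": ["ps", "-eaf"],
}


def take_snapshot(log_dir, commands=COMMANDS, now=None):
    """Run each command once and append its output, under a timestamp
    header, to <log_dir>/<name>.log. Returns the list of files written."""
    now = now or datetime.datetime.now()
    log_dir = pathlib.Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
            output = result.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the whole snapshot.
            output = f"failed to run {argv}: {exc}\n"
        path = log_dir / f"{name}.log"
        with path.open("a") as fh:
            fh.write(f"===== {now.isoformat(timespec='seconds')} =====\n")
            fh.write(output)
        written.append(path)
    return written
```

To run it every 2 minutes, one option is a crontab entry that calls a small wrapper around `take_snapshot("/some/log/dir")`; the directory is a placeholder. Note that each run starts new processes, which matters if the system cannot fork during the event.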
 
Old 06-14-2011, 05:47 AM   #6
mk27
Member
 
Registered: Sep 2008
Distribution: fedora, gentoo, ubuntu
Posts: 148

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by ssrameez View Post
I would suggest setting up a small script that takes a snapshot of the system and runs every 2 minutes.
I have had top running before when it's happening, and there are no clues there.

I can't use a script, because that will involve new processes. One of the symptoms is that the active process is frozen for the duration (whatever window I'm working in when it starts). Other windows/existing processes are usable (eg, the browser), but you cannot start another process (eg, a terminal "works", but any command you issue must wait until the event is over). So a single process that can do the necessary monitoring might work, but writing that is not a minor task.
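
For illustration, a rough sketch of what a single-process monitor might look like. It has to be started before the event, and it reads `/proc` directly instead of spawning `top` or `ps`. The function names, the 5-second interval, and the load threshold of 2.0 are all assumptions; output goes to the terminal by default so that reporting does not depend on a disk write.

```python
import os
import time


def read_task_states(proc_root="/proc"):
    """Return {pid: (state, comm)} by reading <proc_root>/<pid>/stat
    directly, without starting any new process."""
    states = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # the task exited between listdir() and open()
        # comm sits in parentheses and may contain spaces;
        # the one-letter state follows the last ')'.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        states[int(entry)] = (state, comm)
    return states


def monitor(interval=5.0, threshold=2.0, samples=None, proc_root="/proc", out=print):
    """Poll the 1-minute load average; whenever it exceeds `threshold`,
    report every task currently in state R or D."""
    count = 0
    while samples is None or count < samples:
        with open(os.path.join(proc_root, "loadavg")) as fh:
            load1 = float(fh.read().split()[0])
        if load1 > threshold:
            busy = {pid: sc for pid, sc in read_task_states(proc_root).items()
                    if sc[0] in "RD"}
            out(f"{time.strftime('%H:%M:%S')} load1={load1} tasks={busy}")
        count += 1
        if samples is None or count < samples:
            time.sleep(interval)
```

Calling `monitor()` with no arguments loops until interrupted; `samples=N` stops after N polls.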

However, changing the configuration of rlogd got me the appropriate kernel output in /var/messages:

Code:
Jun 14 06:28:34 kernel: [ 3479.708078] ata4: lost interrupt (Status 0x50)
Jun 14 06:28:34 kernel: [ 3479.708100] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 14 06:28:34 kernel: [ 3479.708108] ata4.00: failed command: WRITE DMA
Jun 14 06:28:34 kernel: [ 3479.708118] ata4.00: cmd ca/00:88:4f:02:ec/00:00:00:00:00/ec tag 0 dma 69632 out
Jun 14 06:28:34 kernel: [ 3479.708120]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 14 06:28:34 kernel: [ 3479.708125] ata4.00: status: { DRDY }
Jun 14 06:28:34 kernel: [ 3479.708139] ata4: soft resetting link
Jun 14 06:28:34 kernel: [ 3479.868631] ata4.00: configured for UDMA/133
Jun 14 06:28:34 kernel: [ 3479.868641] ata4.00: device reported invalid CHS sector 0
Jun 14 06:28:34 mint kernel: [ 3479.868656] ata4: EH complete
This pattern repeats every ~30s for 5 minutes, during which the lock-up is constant. I've had the same event cause boot failures and application crashes lately; it's a hard drive failure. I can't see the kernel itself depending on disk writes, but obviously something bad potentially happens when an active process gets caught in this.

Hopefully it's just some bad blocks...
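
To check the "every ~30s" observation against a longer log, something like the following sketch could pull the kernel timestamps of the `lost interrupt` lines and compute the gaps between them. The regular expression is an assumption based only on the line format shown above, and `ata_events` and `gaps` are made-up names.

```python
import re

# Matches e.g. "... kernel: [ 3479.708078] ata4: lost interrupt (Status 0x50)"
LINE_RE = re.compile(
    r"kernel: \[\s*(?P<ts>\d+\.\d+)\]\s+(?P<port>ata\d+)(?:\.\d+)?: (?P<msg>.*)"
)


def ata_events(lines, needle="lost interrupt"):
    """Return (kernel_timestamp, port) for every ATA line containing `needle`."""
    events = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and needle in match.group("msg"):
            events.append((float(match.group("ts")), match.group("port")))
    return events


def gaps(events):
    """Seconds between consecutive events, rounded to milliseconds."""
    times = [ts for ts, _ in events]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]
```

Feeding it the lines of the log file (for example `gaps(ata_events(open(path)))`, with `path` pointing at wherever the kernel messages are written) gives the spacing of the resets.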
 
Old 06-14-2011, 07:36 AM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Quote:
Originally Posted by mk27 View Post
I can't see the kernel itself depending on disk writes, but obviously something bad potentially happens when an active process gets caught in this.
Oh yeah? Where is the kernel loaded from - and where is /var/messages?

As for your loadavg problem, that is almost a classic disk error situation. But the loadavg is a symptom, not the problem itself. Get a new disk.
 
Old 06-14-2011, 08:44 AM   #8
mk27
Member
 
Registered: Sep 2008
Distribution: fedora, gentoo, ubuntu
Posts: 148

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by syg00 View Post
Oh yeah? Where is the kernel loaded from
A compressed image on disk, but once it is loaded, it's all in RAM.

Quote:
- and where is /var/messages?
1) Until I reconfigured rlogd there was no file logging at all during the event.

2) The kernel does not log to disk anyway; it uses printk to the console, which is captured by a userspace tool (rlogd, or whatever). AFAIK the kernel does not do any disk I/O at all except for swap (and my swap is not active) and on behalf of userland, which is not critical to its functioning (userland needs the kernel, the kernel does not need userland).

3) The logging of the error presumes the error has occurred, so logging the error cannot be the cause of the error.

My point is, if this failure happens because of a disk write by a userspace application (which seems to be the case) it should not, IMO, cause craziness with the kernel run queue.

Quote:
As for your loadavg problem, that is almost a classic disk error situation.
Thanks for confirming that. Still curious as to why/how a disk error would lead to a loadavg problem, tho. The fact that it reports exactly 10 times 30 seconds apart implies to me there is some intentional error handling that compensates for the issue in the end.

Last edited by mk27; 06-14-2011 at 08:50 AM.
 
Old 06-14-2011, 08:54 AM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
It has nothing to do with the run queue - that's Unix thinking, not Linux.
Linux loadavg comprises runnable tasks plus those in uninterruptible sleep. The latter are usually (but not exclusively) tasks waiting on disk I/O.
Usually that doesn't matter one iota to other tasks. But if kernel threads get hung up (and they can - even kswapd and the bdi's) then you are history until the outstanding I/O clears. If it doesn't clear, goodbye ...
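
To see how a handful of blocked tasks can produce the numbers in the first post, here is a floating-point sketch of the exponentially damped average behind the 1-minute figure. It assumes the kernel samples the count of runnable plus uninterruptible tasks roughly every 5 seconds; the real implementation uses fixed-point arithmetic, and the figure of 8 stuck tasks is purely hypothetical.

```python
import math


def simulate_load(active_counts, period=60.0, tick=5.0, start=0.0):
    """Apply load = load * e^(-tick/period) + n * (1 - e^(-tick/period))
    once per sample and return the load after each sample."""
    decay = math.exp(-tick / period)
    load = start
    history = []
    for n in active_counts:
        load = load * decay + n * (1.0 - decay)
        history.append(load)
    return history


# Hypothetical scenario: 8 tasks stuck in state D for 2 minutes (24 samples).
during_stall = simulate_load([8] * 24)
```

Starting from an idle system, the last value of `during_stall` is 8 * (1 - e^-2), about 6.9, even though none of those tasks uses any CPU.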
 
Old 06-14-2011, 09:22 AM   #10
mk27
Member
 
Registered: Sep 2008
Distribution: fedora, gentoo, ubuntu
Posts: 148

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by syg00 View Post
It has nothing to do with the run queue - that's Unix thinking, not Linux.
Linux loadavg comprises runnable tasks plus those in uninterruptible sleep.
Not to get too picky, lol, but "runnable tasks" are the run queue (that's what it's called in the linux kernel scheduler), so it has everything (as opposed to "nothing") to do with it.

It's a little hard to believe that sleeping processes contribute anything to the load average; you'll have to give me a source for that because it is oxymoronic (sleeping processes do not use the CPU). Sleeping processes do have a load weight akin to the "nice" value, which determines their priority if they re-enter the run queue, but load average is about actual (not potential) activity. Load average will affect load weight, but not vice versa.

Maybe you should read:
http://www.linuxjournal.com/article/9001
http://luv.asn.au/overheads/NJG_LUV_2002/luvSlides.html
et al.

Last edited by mk27; 06-14-2011 at 09:27 AM.
 
Old 06-14-2011, 06:47 PM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Quote:
Originally Posted by mk27 View Post
It's a little hard to believe that sleeping processes contribute anything to the load average; you'll have to give me a source for that because it is oxymoronic (sleeping processes do not use the CPU).
"man proc" and look for loadavg.
The first of your links is reasonably good - I have referred people to it myself. You'll note it also refers to uninterruptible, but only obliquely - and not strictly correctly.
The second link is for Unix, not Linux.
 
Old 06-15-2011, 07:28 AM   #12
mk27
Member
 
Registered: Sep 2008
Distribution: fedora, gentoo, ubuntu
Posts: 148

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by syg00 View Post
"man proc" and look for loadavg.
By coincidence, I'm working on a process logger, so I've been looking at that page quite a bit. Here's the part you reference, but don't cite...

Quote:
Originally Posted by man proc (for kernel 2.6+)
/proc/loadavg
The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as the load average numbers given by uptime(1) and other programs. The fourth field consists of two numbers separated by a slash (/). The first of these is the number of currently executing kernel scheduling entities (processes, threads); this will be less than or equal to the number of CPUs. The value after the slash is the number of kernel scheduling entities that currently exist on the system. The fifth field is the PID of the process that was most recently created on the system.
Completely unequivocal and unambiguous: the load average is "the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes".

I don't see anything about sleeping processes (state S) here, because, once again, that would make no sense.

[edit: state D is uninterruptible sleep, keep reading if you care]
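
For anyone writing a logger against this file, the five fields described in the man page excerpt above can be split out as in the sketch below. The `LoadAvg` type and its field names are invented for illustration; the layout follows the quoted description.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    load1: float    # 1-minute average
    load5: float    # 5-minute average
    load15: float   # 15-minute average
    running: int    # currently executing scheduling entities
    total: int      # scheduling entities that exist on the system
    last_pid: int   # PID of the most recently created process


def parse_loadavg(text):
    """Parse the single line found in /proc/loadavg."""
    fields = text.split()
    running, total = fields[3].split("/")
    return LoadAvg(float(fields[0]), float(fields[1]), float(fields[2]),
                   int(running), int(total), int(fields[4]))


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the file; `path` is overridable for testing."""
    with open(path) as fh:
        return parse_loadavg(fh.read())
```

On a Linux system, `read_loadavg().load1` gives the same first figure that uptime(1) prints.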

Quote:
The first of your links is reasonably good - I have referred people to it myself. You'll note it also refers to uninterruptible, but only obliquely - and not strictly correctly.
No, it does not do so even obliquely. This has nothing to do with sleeping processes. Honestly.

Quote:
The second link is for Unix, not Linux.
It says, quite clearly in the title: Linux Load Average. If you actually read it, you might notice this is derived from the author's input into the development of the CFS scheduler in the linux kernel, which is what manages the run queue.

Last edited by mk27; 06-15-2011 at 10:09 AM.
 
Old 06-15-2011, 08:29 AM   #13
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3939
It sounds like some kind of mutex or application-deadlock problem. If commands can be typed at the keyboard, etc, then the basic operating system is working just fine. It could be an XWindows interface issue. In other words, "the system is not hung ... the workloads are waiting on something, and probably timing out." The clues are probably being logged, all right, just not in the logs that you're looking at.

What happens, for example, if you do the ol' Ctrl+Alt+F4 thing to move to an actual terminal-window, bypassing the windowed interface completely?
 
Old 06-15-2011, 08:49 AM   #14
mk27
Member
 
Registered: Sep 2008
Distribution: fedora, gentoo, ubuntu
Posts: 148

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by sundialsvcs View Post
It sounds like some kind of mutex or application-deadlock problem. If commands can be typed at the keyboard, etc, then the basic operating system is working just fine. It could be an XWindows interface issue. In other words, "the system is not hung ... the workloads are waiting on something, and probably timing out." The clues are probably being logged, all right, just not in the logs that you're looking at.

What happens, for example, if you do the ol' Ctrl+Alt+F4 thing to move to an actual terminal-window, bypassing the windowed interface completely?
Oh I did find it in the logs, qv. post #6.

I'm sure now it is because of the disk error, tho not sure why that has to be the case, or how it resolves itself after a few minutes. I'm also still hoping it is some bad blocks so I don't have to replace the HD (I'm going to run a scan today).

[later: e2fsck -c did fix the problem]

I suppose the original post is mostly solved, but -- for posterity, because everyone including me consults stuff like this via google -- I did not want to leave syg00's authoritative-seeming but erroneous info unchallenged. Stuff like that can follow a telephone-game-like pattern, whereby a year from now I see it mutate into "the load average is the number of sleeping processes divided by the number of uninterruptible sleeping processes", or something.

Last edited by mk27; 06-17-2011 at 02:04 PM.
 
Old 06-15-2011, 09:18 AM   #15
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Your arrogance approaches your ignorance. State "D" is uninterruptible sleep. Period.

See the source for sched.c to educate yourself.
 
  


All times are GMT -5.