Very high CPU load, but nothing significant in top

rjbathgate · 06-03-2014, 08:12 PM

I'm running Ubuntu Linux 12.04.1, with VirtualMin 4.08.gpl GPL and 2 CPU cores.

Pretty much all the time for the last few weeks, it's been running at a load average well above 5, usually closer to 10, sometimes reaching 20.

Right now, CPU load averages: 9.20 (1 min) 8.20 (5 mins) 7.81 (15 mins)

At the same time, VirtualMin returns:

Virtual Memory: 996 MB total, 15.44 MB used
Real Memory: 3.80 GB total, 972.43 MB used
Local disk space: 915.94 GB total, 116.03 GB used

Have restarted (shutdown -rf now) the machine a few times and sure enough sooner or later we're back up with high CPU loads.

Running top (or htop) shows nothing significant using much CPU. Watching it for a few minutes, the highest item would maybe hit 3% CPU.

Top returns this also:

Cpu(s): 2.2%us, 1.2%sy, 0.0%ni, 0.0%id, 96.5%wa, 0.0%hi, 0.2%si, 0.0%st

The %wa concerns me as it's so high - seems to stay up above 80%.

I understand this is % in wait, but not sure what that means in practical terms.

Where can I start to debug this and figure out what's causing the high CPU load?

Thanks in advance

chrism01 · 06-04-2014, 06:39 AM

The load avg tells you about the jobs in a runnable state or in uninterruptible sleep ('D'), not whether they are CPU-bound (a different question).
A high %wa means the CPUs are idle while waiting on I/O, probably disk and/or DB access; long-running SQL queries are a typical cause.
Check top cmd and look for processes in 'S' or (worse) 'D' state
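
For what it's worth, here is a rough sketch of one way to list the 'D' tasks together with the kernel function they are blocked in. The helper name `d_state_tasks` is made up for illustration; it only filters `ps` output fed to it on stdin and assumes the state is the first column.

```shell
#!/bin/sh
# Hypothetical helper (not from any package): keep the header line plus
# every task whose state column starts with "D" (uninterruptible sleep).
# Expects input shaped like the output of: ps -eo state,pid,wchan:32,comm
d_state_tasks() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# On a live system (left commented so the block only defines the helper):
# ps -eo state,pid,wchan:32,comm | d_state_tasks
```

If most of the lines that come back are blocked in filesystem or block-layer functions, that points at storage rather than CPU.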

http://slack-linux.blogspot.com.au/2...ate-codes.html
http://blog.scoutapp.com/articles/20...-load-averages
https://prutser.wordpress.com/2012/0...verage-part-1/

HTH

syg00 · 06-04-2014, 07:21 AM

As Chris says, loadavg != CPU%.

However sleeping tasks are of no interest either, just "D". Run this for an idea of what is contributing to both the %wa and loadavg

Code:

top -b -n 1 | awk '{if (NR <=7) print; else if ($8 ~ /[RD]/) {print; count++} } END {print "Total: "count}'

rjbathgate · 06-04-2014, 04:08 PM

Thanks for replies.

top with that suggested command returns:

top - 09:06:33 up 6 days, 19:55, 4 users, load average: 20.79, 17.90, 13.76
Tasks: 232 total, 1 running, 208 sleeping, 23 stopped, 0 zombie
Cpu(s): 4.4%us, 9.3%sy, 1.3%ni, 10.8%id, 73.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 3983680k total, 1878180k used, 2105500k free, 378640k buffers
Swap: 1019900k total, 21000k used, 998900k free, 594768k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
235 root 20 0 0 0 0 D 0 0.0 43:59.91 flush-8:0
12488 root 20 0 4312 968 668 D 0 0.0 0:02.70 updatedb.mlocat
21169 root 20 0 65208 58m 1948 D 0 1.5 0:15.33 /usr/share/webm
27808 munin 20 0 22268 9892 1640 D 0 0.2 0:00.14 /usr/share/muni
28859 root 20 0 4536 1008 716 D 0 0.0 0:00.13 chown
28904 root 20 0 4472 764 656 D 0 0.0 0:00.03 chown
28905 root 20 0 4472 760 656 D 0 0.0 0:00.03 chown
29099 root 20 0 4472 764 656 D 0 0.0 0:00.01 chown
29103 root 20 0 4472 764 656 D 0 0.0 0:00.00 chown
29107 root 20 0 4472 760 656 D 0 0.0 0:00.03 chown
29110 root 20 0 2848 1196 864 R 0 0.0 0:00.00 top
29162 root 20 0 4472 764 656 D 0 0.0 0:00.00 chown
29165 root 20 0 4472 764 656 D 0 0.0 0:00.00 chown
29166 root 20 0 4472 760 656 D 0 0.0 0:00.00 chown
29168 root 20 0 4472 764 656 D 0 0.0 0:00.00 chown
29172 root 20 0 4472 764 656 D 0 0.0 0:00.00 chown
29173 root 20 0 4472 760 656 D 0 0.0 0:00.00 chown
29175 root 20 0 4472 760 656 D 0 0.0 0:00.00 chown
29176 root 20 0 4472 760 656 D 0 0.0 0:00.00 chown
29178 root 20 0 4472 760 656 D 0 0.0 0:00.00 chown
Total: 20

The first line, flush-8:0, seems a bit dubious, with a TIME+ of about 44 minutes of CPU time (43:59.91)... Not sure what this is or what to do about it though...

Also...
itop returns:

INT NAME RATE MAX
42 [MSI-edge ahci] 107 Ints/s (max: 416)
43 [MSI-edge eth0] 11 Ints/s (max: 93)

That's it...

The rate fluctuates between about 40 and 160 for INT 42, and between 3 and 25 for INT 43.

No idea what this means sorry!

Thanks

Habitual · 06-04-2014, 04:39 PM

Quote:

Originally Posted by rjbathgate

The first line, flush-8:0, seems a bit dubious, with a TIME+ of about 44 minutes of CPU time (43:59.91)... Not sure what this is or what to do about it though...

Code:

lsof -p 28859 | less

and have a look-see.

rjbathgate · 06-04-2014, 04:47 PM

lsof -p 28859 | less

returns nothing...

lsof -p 235 | less (235 = flush-8 process id) returns:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
flush-8:0 235 root cwd DIR 8,1 4096 2 /
flush-8:0 235 root rtd DIR 8,1 4096 2 /
flush-8:0 235 root txt unknown /proc/235/exe

Thanks

syg00 · 06-04-2014, 07:40 PM

Is this a virtual instance? What kernel level are you running?
Have a look at your primary "disk" (probably /dev/sda) with sar or similar. The flush (kernel) tasks are just that: they flush pending I/O, and they are started as needed, hence the PID changing. Your disk isn't responding, by the looks of it.
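
The "8:0" in flush-8:0 is the major:minor number of the block device being flushed, and 8:0 is conventionally the first SCSI/SATA disk. A small sketch for confirming which name that maps to on a given box; `dev_name_for` is a made-up helper that reads /proc/partitions-style text on stdin.

```shell
#!/bin/sh
# Hypothetical helper: look up a block device name by major/minor number.
# Reads /proc/partitions-style text on stdin
# (columns: major minor #blocks name).
dev_name_for() {
    awk -v maj="$1" -v min="$2" '$1 == maj && $2 == min { print $4 }'
}

# On a live system:
# dev_name_for 8 0 < /proc/partitions
```

The name it prints is the device to point sar or iostat at.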

rjbathgate · 06-04-2014, 07:48 PM

Hey

Sorry forgive my ignorance, I'm a bit lost here...

Is this a virtual instance?
Hmmz, it's a physical machine, running VirtualMin for a heap of VirtualHosts.

What kernel level are you running?
Does this help? Kernel and CPU: Linux 3.2.0-63-generic-pae on i686

Have a look at your primary "disk" - probably /dev/sda - with sar or similar.
What do I need to look at?

The flush (kernel) tasks are just that, they flush pending I/O - they are started as needed hence the PID changing. Your disk isn't responding by the looks of it.
Disk is responding ok, we can (and do) access it all the time as we have PCs mapping the home directory as network drives, as we use it for a development server - i.e. we work directly on the files on the server / HDD. Sometimes it hangs a bit when accessing files, hence me starting to look into the high load issues.

Thanks

syg00 · 06-04-2014, 09:40 PM

Quote:

Originally Posted by rjbathgate

Your disk isn't responding by the looks of it.
Disk is responding ok, we can (and do) access it all the time as we have PCs mapping the home directory as network drives, as we use it for a development server - i.e. we work directly on the files on the server / HDD. Sometimes it hangs a bit when accessing files, hence me starting to look into the high load issues.

Sorry, poorly worded by me. I meant the disk isn't responding appropriately (in computer metrics, not human), not that it isn't responding at all.
The sysstat package has iostat as a component. Look at the manpage(s) for help, but you want to know the average read/write rates and response times for each. There are other more finely sampled tools available, collectl for instance. The mere mention of it will likely prod the author to appear with helpful hints. Always good to get knowledgeable input.
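
If sysstat isn't installed yet, a rough response-time figure can be worked out by hand from two samples of /proc/diskstats, which as far as I know are the same counters iostat reads. This is only a sketch: `avg_wait_ms` is a made-up helper, and the device name sda in the usage comment is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: average time per completed I/O (in ms) between two
# /proc/diskstats samples of the same device, passed as $1 (earlier) and
# $2 (later). Columns used: $4 reads completed, $7 ms spent reading,
# $8 writes completed, $11 ms spent writing.
avg_wait_ms() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { ios[NR] = $4 + $8; ms[NR] = $7 + $11 }
        END {
            d = ios[2] - ios[1]
            if (d > 0) printf "%.1f\n", (ms[2] - ms[1]) / d
            else print "0.0"
        }'
}

# On a live system, something like:
# a=$(grep -w sda /proc/diskstats); sleep 5; b=$(grep -w sda /proc/diskstats)
# avg_wait_ms "$a" "$b"
```

With sysstat installed, `iostat -dx 5` reports the same kind of figure per device along with the read/write rates.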

Some thoughts (without a lot of hard data to back them up):
- all those status "D" tasks are probably waiting on disk I/O, and they count directly toward loadavg as well as %wa.
- it looks like you only have one (active) physical disk. That's a bottleneck - spread your I/O load over more disks.
- check SMART data for the disk to ensure it isn't starting to fail, as well as the data from software like sar/collectl/whatever.
- don't run updatedb when anything else is hitting the disk if possible. 02:00 is usually ok for non-worldwide access.
- 32-bit PAE kernels are so last century. Get onto 64-bit hardware (you may be already) and current 64-bit kernel if possible.

Basically, from here it's a matter of checking all the data.
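
On the 64-bit point, a quick sketch for checking whether an x86 CPU can run a 64-bit kernel: look for the "lm" (long mode) flag in /proc/cpuinfo. `cpu_is_64bit_capable` is a made-up helper name, and it assumes the x86 layout where the line starts with "flags".

```shell
#!/bin/sh
# Hypothetical helper: print "yes" if /proc/cpuinfo-style text on stdin
# lists the "lm" (long mode) CPU flag, otherwise "no".
# -w keeps flags such as lahf_lm from counting as a match.
cpu_is_64bit_capable() {
    if grep -m1 '^flags' | grep -qw lm; then
        echo yes
    else
        echo no
    fi
}

# On a live system:
# cpu_is_64bit_capable < /proc/cpuinfo
# uname -m    # shows what the running kernel was built for, e.g. i686
```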

rjbathgate · 06-04-2014, 11:16 PM

I ran the short test on SMART tools and it seems to get stuck with 10% remaining.

It doesn't report progress, but it indicates a 2-minute run time; after 10 minutes it still hasn't reported any results.

Then I started the short test again, and the original test now appears in the log as 'aborted' (presumably because I started a new one), with 10% remaining.

Have done this three times, and all seem to hang at 10% remaining:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               10%      7259         -
# 2  Short offline       Aborted by host               10%      7259         -
# 3  Short offline       Aborted by host               10%      7259         -

Is this a bad sign?!

I could run a long test overnight...
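
In case it helps anyone keeping an eye on repeated runs, a small sketch that tallies a self-test log like the one above. `selftest_summary` is a made-up helper, and it assumes clean runs are logged with the status "Completed without error"; /dev/sda in the usage comment is a guess at the device.

```shell
#!/bin/sh
# Hypothetical helper: summarise a smartctl self-test log given on stdin.
# Lines starting with "#" are log entries; any status other than
# "Completed without error" is counted under "other".
selftest_summary() {
    awk '/^#/ { n++; if ($0 ~ /Completed without error/) ok++ }
         END { printf "entries=%d clean=%d other=%d\n", n, ok, n - ok }'
}

# On a live system (needs root and smartmontools):
# smartctl -l selftest /dev/sda | selftest_summary
```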

And I'm currently shopping to potentially replace it with HP ProLiant MicroServer Gen8 as a result of all this...

Thanks

rjbathgate · 06-04-2014, 11:22 PM

Also, re: "32-bit PAE kernels are so last century. Get onto 64-bit hardware (you may be already) and current 64-bit kernel if possible."

The CPU is 64bit compatible... how do I go about changing to 64 bit kernel? Or at least ensuring I get a new server running on the 64 bit kernel?

EDIT: sorry that's a dumb question, have figured that one out!

markseger · 06-06-2014, 03:04 PM

I think the CPU iowait, or just wa in top terms, is one of the most confusing metrics there is. In short, all it tells you is that there is some I/O going on somewhere and the CPU isn't busy; it's spending most of its idle time waiting for I/O.

Another way to look at this is on a completely idle system, iowait should be at or close to zero. Now fire up a process that creates or maybe copies a large file while watching it with collectl, had to get that in for syg00.

Since this is almost exclusively I/O bound you know it won't use much cpu time, yet iowait goes to a very high number, at least on the cpu doing the I/O.
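
To make the number a bit less mysterious, here is a sketch of how a %wa figure can be derived from two samples of the aggregate "cpu" line in /proc/stat: the growth of the iowait counter divided by the growth of all the counters together. `iowait_pct` is a made-up helper, and top's own arithmetic may differ in detail.

```shell
#!/bin/sh
# Hypothetical helper: iowait share (%) between two aggregate "cpu" lines
# sampled from /proc/stat, passed as $1 (earlier) and $2 (later).
# Columns after the "cpu" label:
#   user nice system idle iowait irq softirq steal ...
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            t[NR] = total
            w[NR] = $6                             # iowait counter
        }
        END {
            dt = t[2] - t[1]
            if (dt > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / dt
            else print "0.0"
        }'
}

# On a live system, something like:
# a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
# iowait_pct "$a" "$b"
```

Run it while the large file copy is going and again on an idle box to see the difference.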

If you were to look at a busy NFS server, it typically has a high load average because so many processes are active, though waiting on I/O, and it also shows a high iowait.

Does this help or make it more confusing?

-mark