LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 06-03-2014, 08:12 PM   #1
rjbathgate
LQ Newbie
 
Registered: Jun 2014
Posts: 6

Rep: Reputation: Disabled
Very high CPU load, but nothing significant in top


I'm running Ubuntu Linux 12.04.1, with VirtualMin 4.08.gpl GPL and 2 CPU cores.

Pretty much all the time for the last few weeks, it's been running at a load average well above 5, usually closer to 10, sometimes reaching 20.

Right now, CPU load averages: 9.20 (1 min) 8.20 (5 mins) 7.81 (15 mins)

At the same time, VirtualMin returns:

Virtual Memory: 996 MB total, 15.44 MB used
Real Memory: 3.80 GB total, 972.43 MB used
Local disk space: 915.94 GB total, 116.03 GB used

Have restarted (shutdown -rf now) the machine a few times and sure enough sooner or later we're back up with high CPU loads.

Running top (or htop) shows nothing significant using much CPU at all - in fact, watching it for a few minutes, the highest item might reach 3% CPU.

Top returns this also:

Cpu(s): 2.2%us, 1.2%sy, 0.0%ni, 0.0%id, 96.5%wa, 0.0%hi, 0.2%si, 0.0%st

The %wa concerns me as it's so high - seems to stay up above 80%.

I understand this is % in wait, but not sure what that means in practical terms.

Where can I start to debug this and figure out what's causing the high CPU load?

Thanks in advance
 
Old 06-04-2014, 06:39 AM   #2
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 7.7 (?), Centos 8.1
Posts: 17,848

Rep: Reputation: 2584
The load avg counts the jobs that are runnable or in uninterruptible sleep, not whether they are CPU bound (a different question).
A high %wa means the CPUs are idle while waiting, probably for disk and/or DB access, e.g. long-running SQL queries are typical.
Check the top command and look for processes in 'S' or (worse) 'D' state.

http://slack-linux.blogspot.com.au/2...ate-codes.html
http://blog.scoutapp.com/articles/20...-load-averages
https://prutser.wordpress.com/2012/0...verage-part-1/
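
To put a load average in context, it helps to divide it by the number of CPU cores. A minimal sketch, assuming a POSIX shell with awk; the function name load_per_core is made up for illustration:

```shell
# Divide a load average by the core count. Values well above 1.0 per core
# mean tasks are queuing, either for CPU or (in D state) for I/O.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# The figures from the first post: 1-minute load of 9.20 on 2 cores.
load_per_core 9.20 2
```

On a live Linux system the same helper could be fed real values, for example load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)".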

HTH
 
Old 06-04-2014, 07:21 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,587

Rep: Reputation: 3507
As Chris says, loadavg != CPU%.

However, sleeping ("S") tasks are of no interest either, just "D". Run this for an idea of what is contributing to both the %wa and loadavg:
Code:
top -b -n 1 | awk '{if (NR <=7) print; else if ($8 ~ /[RD]/) {print; count++} } END {print "Total: "count}'
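
A similar view can probably be had from ps, which can also show the kernel function each task is sleeping in. This is only a sketch, assuming procps ps and awk; the helper name rd_tasks is hypothetical:

```shell
# Keep only tasks whose state starts with R (runnable) or D (uninterruptible
# sleep), the two states that count toward the Linux load average.
# Expects lines on stdin whose first field is the process state.
rd_tasks() {
    awk '$1 ~ /^[RD]/ { print; count++ } END { print "Total: " (count + 0) }'
}
```

Typical use would be ps -eo state,pid,user,wchan:32,comm | rd_tasks. The ps header line starts with "S", so the filter drops it.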
 
Old 06-04-2014, 04:08 PM   #4
rjbathgate
LQ Newbie
 
Registered: Jun 2014
Posts: 6

Original Poster
Rep: Reputation: Disabled
Thanks for replies.

top with that suggested command returns:

top - 09:06:33 up 6 days, 19:55, 4 users, load average: 20.79, 17.90, 13.76
Tasks: 232 total, 1 running, 208 sleeping, 23 stopped, 0 zombie
Cpu(s): 4.4%us, 9.3%sy, 1.3%ni, 10.8%id, 73.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 3983680k total, 1878180k used, 2105500k free, 378640k buffers
Swap: 1019900k total, 21000k used, 998900k free, 594768k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  235 root      20   0     0    0    0 D    0  0.0  43:59.91 flush-8:0
12488 root      20   0  4312  968  668 D    0  0.0   0:02.70 updatedb.mlocat
21169 root      20   0 65208  58m 1948 D    0  1.5   0:15.33 /usr/share/webm
27808 munin     20   0 22268 9892 1640 D    0  0.2   0:00.14 /usr/share/muni
28859 root      20   0  4536 1008  716 D    0  0.0   0:00.13 chown
28904 root      20   0  4472  764  656 D    0  0.0   0:00.03 chown
28905 root      20   0  4472  760  656 D    0  0.0   0:00.03 chown
29099 root      20   0  4472  764  656 D    0  0.0   0:00.01 chown
29103 root      20   0  4472  764  656 D    0  0.0   0:00.00 chown
29107 root      20   0  4472  760  656 D    0  0.0   0:00.03 chown
29110 root      20   0  2848 1196  864 R    0  0.0   0:00.00 top
29162 root      20   0  4472  764  656 D    0  0.0   0:00.00 chown
29165 root      20   0  4472  764  656 D    0  0.0   0:00.00 chown
29166 root      20   0  4472  760  656 D    0  0.0   0:00.00 chown
29168 root      20   0  4472  764  656 D    0  0.0   0:00.00 chown
29172 root      20   0  4472  764  656 D    0  0.0   0:00.00 chown
29173 root      20   0  4472  760  656 D    0  0.0   0:00.00 chown
29175 root      20   0  4472  760  656 D    0  0.0   0:00.00 chown
29176 root      20   0  4472  760  656 D    0  0.0   0:00.00 chown
29178 root      20   0  4472  760  656 D    0  0.0   0:00.00 chown
Total: 20

The first line, flush-8:0, seems a bit dubious, with a TIME+ of nearly 44 minutes of CPU time (43:59.91)... Not sure what this is or what to do about it though...

Also...
itop returns:

INT NAME RATE MAX
42 [MSI-edge ahci] 107 Ints/s (max: 416)
43 [MSI-edge eth0] 11 Ints/s (max: 93)

That's it...

Rate fluctuates between 40ish and 160ish for INT 42, and between 3 and 25 for INT 43

No idea what this means sorry!

Thanks
 
Old 06-04-2014, 04:39 PM   #5
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by rjbathgate
The first line, flush-8:0, seems a bit dubious, with a TIME+ of nearly 44 minutes of CPU time (43:59.91)... Not sure what this is or what to do about it though...
Code:
lsof -p 28859 | less
and have a look-see.
 
Old 06-04-2014, 04:47 PM   #6
rjbathgate
LQ Newbie
 
Registered: Jun 2014
Posts: 6

Original Poster
Rep: Reputation: Disabled
lsof -p 28859 | less

returns nothing...

lsof -p 235 | less (235 = flush-8 process id) returns:

COMMAND   PID USER FD  TYPE    DEVICE SIZE/OFF NODE NAME
flush-8:0 235 root cwd DIR     8,1    4096     2    /
flush-8:0 235 root rtd DIR     8,1    4096     2    /
flush-8:0 235 root txt unknown                      /proc/235/exe


Thanks
 
Old 06-04-2014, 07:40 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,587

Rep: Reputation: 3507
Is this a virtual instance? What kernel level are you running?
Have a look at your primary "disk" - probably /dev/sda - with sar or similar. The flush (kernel) tasks are just that: they flush pending I/O. They are started as needed, hence the PID changing. Your disk isn't responding by the looks of it.
 
Old 06-04-2014, 07:48 PM   #8
rjbathgate
LQ Newbie
 
Registered: Jun 2014
Posts: 6

Original Poster
Rep: Reputation: Disabled
Hey

Sorry forgive my ignorance, I'm a bit lost here...

Is this a virtual instance?
Hmmz, it's a physical machine, running VirtualMin for a heap of VirtualHosts.

What kernel level are you running?
This help...? Kernel and CPU Linux 3.2.0-63-generic-pae on i686

Have a look at your primary "disk" - probably /dev/sda - with sar or similar.
What do I need to look at?

The flush (kernel) tasks are just that, they flush pending I/O - they are started as needed hence the PID changing. Your disk isn't responding by the looks of it.
Disk is responding ok, we can (and do) access it all the time as we have PCs mapping the home directory as network drives, as we use it for a development server - i.e. we work directly on the files on the server / HDD. Sometimes it hangs a bit when accessing files, hence me starting to look into the high load issues.

Thanks
 
Old 06-04-2014, 09:40 PM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,587

Rep: Reputation: 3507
Quote:
Originally Posted by rjbathgate
Your disk isn't responding by the looks of it.
Disk is responding ok, we can (and do) access it all the time as we have PCs mapping the home directory as network drives, as we use it for a development server - i.e. we work directly on the files on the server / HDD. Sometimes it hangs a bit when accessing files, hence me starting to look into the high load issues.
Sorry, poorly worded by me. I meant the disk isn't responding appropriately (in computer metrics, not human), not that it isn't responding at all.
The sysstat package has iostat as a component - look at the manpage(s) for help, but you want to know the avg read/write rates and response times for each. There are other more finely sampled tools available - collectl for instance. The mere mention of it will likely prod the author to appear with helpful hints. Always good to get knowledgeable input.
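
Before installing anything, a rough since-boot average time per I/O can be read from /proc/diskstats. This is a sketch only, assuming the field layout described in the kernel's iostats documentation (field 3 device name, 4 reads completed, 7 ms spent reading, 8 writes completed, 11 ms spent writing); avg_io_ms is a made-up helper name:

```shell
# Print "<device> <average ms per completed I/O>" for every device that has
# completed at least one read or write. Input: /proc/diskstats format on stdin.
# The figure is an average since boot, not a current rate.
avg_io_ms() {
    awk '($4 + $8) > 0 { printf "%s %.1f\n", $3, ($7 + $11) / ($4 + $8) }'
}
```

It would be run as avg_io_ms < /proc/diskstats. For per-interval numbers rather than since-boot averages, iostat -dx 5 from sysstat is the better tool.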

Some thoughts (without a lot of hard data to back them up):
- all those status "D" tasks are probably waiting on disk I/O - and count directly toward loadavg, as well as %wa.
- it looks like you only have one (active) physical disk. That's a bottleneck - spread your I/O load over more disks.
- check SMART data for the disk to ensure it isn't starting to fail. As well as software like sar/collectl/whatever.
- don't run updatedb when anything else is hitting the disk if possible. 02:00 is usually ok for non-worldwide access.
- 32-bit PAE kernels are so last century. Get onto 64-bit hardware (you may be already) and current 64-bit kernel if possible.

Basically, from here it's a matter of checking all the data.
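
For the SMART check in the list above, something like the following might serve as a starting point. It is a sketch, assuming smartmontools is installed and the disk is /dev/sda (an assumption, adjust as needed); both function names are hypothetical:

```shell
# Run the usual smartmontools queries against one disk (needs root).
smart_check() {
    dev="${1:-/dev/sda}"          # assumed device path
    smartctl -H "$dev"            # overall health self-assessment
    smartctl -A "$dev"            # vendor attributes (reallocated/pending sectors etc.)
    smartctl -l selftest "$dev"   # log of previous self-tests
}

# Count self-test log entries that did not complete cleanly.
# Reads "smartctl -l selftest" output on stdin.
selftest_problems() {
    awk '/^# *[0-9]+/ && !/Completed without error/ { n++ } END { print n + 0 }'
}
```

A short self-test is started with smartctl -t short /dev/sda, and smartctl -l selftest /dev/sda | selftest_problems then gives a quick count of entries worth a closer look. An aborted or still-running test is counted too, so a non-zero result is a prompt to read the log, not proof of a failing disk.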
 
Old 06-04-2014, 11:16 PM   #10
rjbathgate
LQ Newbie
 
Registered: Jun 2014
Posts: 6

Original Poster
Rep: Reputation: Disabled
I ran the short test with the SMART tools and it seems to get stuck with 10% remaining.

It doesn't report progress, but it indicates a 2 minute run time. After 10 minutes, it still hasn't reported any results.

Then I ran the short test again, and the original test now appears in the log as 'aborted' (presumably because I started a new one), aborted with 10% remaining.

Have done this three times, and all seem to hang at 10% remaining:


Num  Test_Description  Status           Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Aborted by host  10%        7259             -
# 2  Short offline     Aborted by host  10%        7259             -
# 3  Short offline     Aborted by host  10%        7259             -

Is this a bad sign?!

I could run a long test overnight...

And I'm currently shopping to potentially replace it with HP ProLiant MicroServer Gen8 as a result of all this...

Thanks
 
Old 06-04-2014, 11:22 PM   #11
rjbathgate
LQ Newbie
 
Registered: Jun 2014
Posts: 6

Original Poster
Rep: Reputation: Disabled
Also, re: "32-bit PAE kernels are so last century. Get onto 64-bit hardware (you may be already) and current 64-bit kernel if possible."

The CPU is 64bit compatible... how do I go about changing to 64 bit kernel? Or at least ensuring I get a new server running on the 64 bit kernel?

EDIT: sorry that's a dumb question, have figured that one out!

Last edited by rjbathgate; 06-04-2014 at 11:26 PM.
 
Old 06-06-2014, 03:04 PM   #12
markseger
Member
 
Registered: Jul 2003
Posts: 244

Rep: Reputation: 26
I think the CPU iowait, or just wa in top terms, is one of the most confusing metrics there is. In short, all it tells you is there is some I/O going on somewhere and the cpu isn't busy; it's spending most of its idle time waiting for I/O.

Another way to look at this is on a completely idle system, iowait should be at or close to zero. Now fire up a process that creates or maybe copies a large file while watching it with collectl, had to get that in for syg00. Since this is almost exclusively I/O bound you know it won't use much cpu time, yet iowait goes to a very high number, at least on the cpu doing the I/O.
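
For anyone wanting to watch this without extra tools: the %wa figure in top is derived from the iowait counter on the "cpu" line of /proc/stat. Below is a rough sketch of computing it between two samples, assuming the usual column order (user, nice, system, idle, iowait, irq, softirq, steal); iowait_pct is a hypothetical helper name:

```shell
# Print iowait as a percentage of all CPU time elapsed between two "cpu ..."
# lines from /proc/stat, read from stdin (first line = earlier sample).
# Assumes the two samples differ; identical samples would divide by zero.
iowait_pct() {
    awk '
        { total = 0; for (i = 2; i <= 9; i++) total += $i; iow = $6 }
        NR == 1 { t0 = total; w0 = iow }
        NR == 2 { printf "%.1f\n", 100 * (iow - w0) / (total - t0) }
    '
}
```

Sampling five seconds apart would look like { grep '^cpu ' /proc/stat; sleep 5; grep '^cpu ' /proc/stat; } | iowait_pct. Running that while a large file copy is in progress should show the jump described above.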

If you were to look at a busy nfs server, it typically has a high load average because so many processes are active, though waiting on I/O, and also shows a high iowait.

Does this help or make it more confusing?

-mark
 
  



All times are GMT -5.
