High Load Average, Low CPU, Low IO Wait

ordaolmayanadam · 07-14-2010, 07:34 AM

Hi,
I need help with the strange output of the top command
on my Debian server. We see a high load average with relatively
low CPU usage. Also, iowait seems normal.
I think this has something to do with Java socket threading, but
I don't know how to find out exactly what is causing the issue
or how to fix it.

We are running a Java socket server on this Debian machine.
It has a Dual-Core AMD Opteron(tm) Processor 1210
with 8GB RAM.

and java -version output:
java version "1.5.0_16"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_16-b02)
Java HotSpot(TM) Server VM (build 1.5.0_16-b02, mixed mode)

and the top output:
top - 15:18:06 up 125 days, 2:41, 4 users, load average: 17.94, 15.81, 16.38
Tasks: 554 total, 2 running, 551 sleeping, 0 stopped, 1 zombie
Cpu(s): 8.9%us, 2.6%sy, 0.0%ni, 87.1%id, 0.0%wa, 0.7%hi, 0.7%si, 0.0%st
Mem: 8315176k total, 6809904k used, 1505272k free, 484384k buffers
Swap: 1879596k total, 0k used, 1879596k free, 3920524k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16654 root      16   0 1600m 287m  11m S    9  3.5   9:52.00 java
19461 www-data  15   0 36496 7464 3372 S    1  0.1   0:00.10 apache2
14302 root      16   0  2824 1448  864 S    1  0.0   0:57.34 top
 4386 www-data  15   0 36752 6832 2744 S    1  0.1   0:00.19 apache2
16244 www-data  15   0 36752 6832 2744 S    1  0.1   0:00.14 apache2
31830 www-data  20   0 36752 6848 2760 S    1  0.1   0:00.17 apache2
21110 www-data  18   0 36752 6688 2736 S    1  0.1   0:00.05 apache2
21451 www-data  15   0 36516 6488 2736 S    1  0.1   0:00.03 apache2
23991 root      15   0  2696 1420  860 R    1  0.0   0:00.12 top

Any help will be appreciated. Thanks in advance.

kbp · 07-14-2010, 07:51 AM

I don't think there's too much to worry about. There seem to be quite a few idle processes, so you could possibly tune things a little. Maybe httpd is configured to spawn lots of children ... ?

cheers

yooy · 07-14-2010, 07:52 AM

High bandwidth usage alone won't produce much CPU load.

Update: what I mean is that just downloading/uploading without using the hard drive won't use much CPU.
Example: a router can handle a lot of traffic on its slow processor.

ordaolmayanadam · 07-14-2010, 07:55 AM

When we shut down the Java socket server, the load average drops to 2.0-3.0,
so I thought it was caused by Java. Also, we have another server (with higher traffic)
with the same Apache configuration, and it seems fine.

ordaolmayanadam · 07-14-2010, 07:56 AM

yooy, so it is about network latency? Is there a way to measure it?

syg00 · 07-14-2010, 07:58 AM

yooy, what the hell is that supposed to mean?
OP, try this and post the output

Code:

top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'

Use code tags when posting output ...

ordaolmayanadam · 07-14-2010, 08:02 AM

This is the output:

Code:

top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'
top - 16:00:38 up 125 days,  3:23,  4 users,  load average: 20.40, 27.84, 26.75
Tasks: 328 total,   3 running, 325 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.0%us,  0.6%sy,  0.0%ni, 88.1%id,  0.0%wa,  0.2%hi,  0.2%si,  0.0%st
Mem:   8315176k total,  6428408k used,  1886768k free,   484464k buffers
Swap:  1879596k total,        0k used,  1879596k free,  3980404k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
Total status D:

syg00 · 07-14-2010, 08:10 AM

O.K., I'm confused - that should list all uninterruptible sleep tasks (which contribute to loadavg). I had a look at how loadavg is accumulated a while back - seemed straightforward. What kernel are you on ("uname -a")?
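
One thing that may be worth checking, offered as a sketch rather than a diagnosis: loadavg counts every runnable (R) or uninterruptible (D) task, and each thread of a Java process is a separate task, while top without -H prints only one line per process. Assuming a procps ps that supports -L (one line per thread), a per-thread count of those states could look like this. The helper name count_load_states is made up for illustration.

```shell
#!/bin/sh
# Count thread states that contribute to loadavg.
# Input: one state string per line (e.g. "R", "D", "Sl", "R+"), one line per thread.
# Only the first character is inspected, so flags such as "+" or "l" are ignored.
count_load_states() {
    awk '{ s = substr($1, 1, 1); if (s == "R" || s == "D") n[s]++ }
         END { printf "R=%d D=%d total=%d\n", n["R"], n["D"], n["R"] + n["D"] }'
}

# Example invocation (assumes procps ps; "-L" lists threads, "stat=" drops the header):
#   ps -eLo stat= | count_load_states
```

If the total stays far below the reported loadavg, the contributing threads are probably short-lived and a single snapshot is missing them.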

ordaolmayanadam · 07-14-2010, 08:14 AM

Kernel version: Linux 2.6.18-5-686-bigmem #1 SMP Sat Dec 1 23:58:00 UTC 2007 i686 GNU/Linux

syg00 · 07-14-2010, 08:18 AM

Keep trying that - in a loop maybe, redirecting to a file. You might have a lot of short-lived processes. Dunno at this point.
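
A rough sketch of such a loop, assuming the same top column layout as above (state in column 8, seven header lines). The function names, sample count, interval and log path are all hypothetical. The only change to the awk filter is printing (count + 0), so that an empty result shows 0 instead of a blank.

```shell
#!/bin/sh
# Keep the 7 header lines of "top -b", then only tasks whose state column (8) is "D".
filter_d_state() {
    awk 'NR <= 7 { print; next }
         $8 == "D" { print; count++ }
         END { print "Total status D: " (count + 0) }'
}

# Take $1 samples, $2 seconds apart, appending each filtered snapshot to file $3.
sample_d_state() {
    samples=$1; interval=$2; logfile=$3
    i=0
    while [ "$i" -lt "$samples" ]; do
        top -b -n 1 | filter_d_state >> "$logfile"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example invocation (hypothetical values): 300 samples, 1 second apart
#   sample_d_state 300 1 /tmp/dstate.log
```

Afterwards, "grep -c ' D ' /tmp/dstate.log" gives a rough idea of how often anything was caught in D state.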

ordaolmayanadam · 07-14-2010, 08:33 AM

I tried a few times and once caught something like this:

Code:

top - 16:31:07 up 125 days,  3:54,  4 users,  load average: 27.03, 17.73, 18.92
Tasks: 328 total,   1 running, 327 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.0%us,  0.6%sy,  0.0%ni, 88.1%id,  0.0%wa,  0.2%hi,  0.2%si,  0.0%st
Mem:   8315176k total,  6171800k used,  2143376k free,   484916k buffers
Swap:  1879596k total,        0k used,  1879596k free,  3721952k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31763 www-data  15   0 36736 6836 2756 D    2  0.1   0:00.10 apache2
 1105 root      10  -5     0    0    0 D    0  0.0   2:39.10 kjournald
20325 www-data  15   0 36516 6592 2760 D    0  0.1   0:00.10 apache2
Total status D: 3

markseger · 07-15-2010, 06:06 AM

That's one of the major problems with a tool like top - no history. You see what you see and that's it. PLUS you only see what top wants you to see.

You could always try out collectl, either interactively or as a daemon. By default in daemon mode it samples everything except processes every 10 seconds, and processes every 60 seconds because sampling them carries extra overhead.

BUT if you really want to see what's happening over a relatively short period of time edit /etc/collectl.conf and add "-i1:1" to the line 'DaemonCommands' and that will monitor everything once a second.

Run "service collectl start", let it run for a few minutes, and then run "service collectl stop". Now play back the data it collected. There are too many options to list, but if you run:

collectl -p /var/log/collectl/filename -sxxx -oT

you'll see data for the subsystems specified with 'xxx' along with time stamps. 'c' will show CPU, 'd' disk, etc. Run "collectl --showsubsys" for a complete listing.

if you want to look at your top processes over time, which is what got me started, you can:

collectl -p filename --top

and you'll see the top 10 processes for every second! If you want to see more or fewer of them, run "collectl -x" and look at the options for --top.

if you "collectl -p filename -sc --verbose -oT" you'll see the load averages along with the number of running processes AND the number of process creations/sec if that is a concern.
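
For a quick look without collectl, the kernel exposes similar figures in /proc/loadavg: the three load averages, then runnable/total scheduling entities (threads included), then the most recent PID. A minimal sketch of pulling those apart follows. The helper name parse_loadavg is made up, and this gives only an instantaneous reading with no history.

```shell
#!/bin/sh
# Parse one line in /proc/loadavg format: "1min 5min 15min runnable/total lastpid".
parse_loadavg() {
    awk '{ split($4, t, "/")
           printf "load1=%s load5=%s load15=%s runnable=%s tasks=%s\n", $1, $2, $3, t[1], t[2] }'
}

# Example invocation:
#   parse_loadavg < /proc/loadavg
```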

For more, just go to SourceForge and look at the documentation: http://collectl.sourceforge.net

have fun...

-mark

syg00 · 07-15-2010, 06:20 AM

Who is this guy ???.
Seems to want to push collectl pretty hard.
.
.
.
.
Hey Mark, back again ...

I too like his little toy - unfortunately not everyone seems to want to use it.

markseger · 07-15-2010, 06:55 AM

Hey back - yes, I know you're a fan. I've seen previous posts by you recommending it. I do realize not everyone is on board with it, but I also realize not everybody is convinced monitoring is important. I was talking to someone the other day who was a sar user. Nothing wrong with sar, just that people use a monitoring interval that's much too long. I suggested that they at least drop the monitoring interval down to 10 seconds, as 10 minutes is pretty worthless. They said their vendor told them not to go below a minute, and I told them their vendor is wrong! If collectl generates less than 0.1% CPU load running at a 10 second monitoring interval and it's written in Perl, sar has got to have a lighter footprint. But some people just don't get it.

I would wonder why people don't use collectl:
- they don't believe in proactive monitoring
- they're happy with what they have
- they're scared of it

If the first, they're flat out wrong. If the second, that's fine as long as they monitor frequently. If the third I can help if they ask.

I believe EVERYONE should continuously monitor their systems at 5-15 second intervals. There are very few situations where I've seen monitoring have an impact on performance: applications that run at 100% CPU load and are fine-grained parallel jobs running on 1000 cores or more. If you don't know what a fine-grained parallel job is, you don't have to worry about collectl! First of all, not many people run parallel jobs, let alone fine-grained ones, and even fewer run on 1K cores or more. Even those who do run on that many cores still find a slight performance hit is worth it to be able to have the data available if something goes wrong.

-mark