High Load Average, Low CPU, Low IO Wait
Hi,
I need help about the strange output of the top command in my debian server. We see high load average with relatively low cpu usage. Also iowait seems normal. I think this is something about Java socket threading but i don't know how to discover and fix exactly what is causing the issue. We are running a Java socket server on this Debian machine. It's a Dual-Core AMD Opteron(tm) Processor 1210 processor with 8GB RAM. and java -version output: java version "1.5.0_16" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_16-b02) Java HotSpot(TM) Server VM (build 1.5.0_16-b02, mixed mode) and the top output: top - 15:18:06 up 125 days, 2:41, 4 users, load average: 17.94, 15.81, 16.38 Tasks: 554 total, 2 running, 551 sleeping, 0 stopped, 1 zombie Cpu(s): 8.9%us, 2.6%sy, 0.0%ni, 87.1%id, 0.0%wa, 0.7%hi, 0.7%si, 0.0%st Mem: 8315176k total, 6809904k used, 1505272k free, 484384k buffers Swap: 1879596k total, 0k used, 1879596k free, 3920524k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 16654 root 16 0 1600m 287m 11m S 9 3.5 9:52.00 java 19461 www-data 15 0 36496 7464 3372 S 1 0.1 0:00.10 apache2 14302 root 16 0 2824 1448 864 S 1 0.0 0:57.34 top 4386 www-data 15 0 36752 6832 2744 S 1 0.1 0:00.19 apache2 16244 www-data 15 0 36752 6832 2744 S 1 0.1 0:00.14 apache2 31830 www-data 20 0 36752 6848 2760 S 1 0.1 0:00.17 apache2 21110 www-data 18 0 36752 6688 2736 S 1 0.1 0:00.05 apache2 21451 www-data 15 0 36516 6488 2736 S 1 0.1 0:00.03 apache2 23991 root 15 0 2696 1420 860 R 1 0.0 0:00.12 top Any help will be appreciated.. thanks in advance |
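Since a high load with idle CPU often comes from many blocked threads, one quick thing to check (a sketch, not from the thread) is how many threads that Java process is carrying — on 2.6 kernels each Java thread is a lightweight process, and a thread-per-connection socket server can accumulate hundreds of them. The `thread_count` helper name here is made up for illustration:

```shell
#!/bin/sh
# Sketch: print the number of threads (lightweight processes) for a PID.
# nlwp = "number of lightweight processes" in procps ps.
thread_count() {
    ps -o nlwp= -p "$1" | tr -d ' '
}

# Example: inspect the java PID from the top output (16654 there);
# shown here against the current shell so the snippet is self-contained.
thread_count $$
```

If the count is in the hundreds, the load average may simply reflect many threads waking and blocking, rather than CPU pressure.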
I don't think there's too much to worry about; there seem to be quite a few idle processes, so you could possibly tune things a little. Maybe httpd is configured to spawn lots of children?
cheers |
High bandwidth usage alone won't produce much processor load.

Update: it means that just downloading/uploading without use of the hard drive won't use much CPU. Example: a router can handle a lot of traffic on its slow processor. |
When we shut down the Java socket server, the load average decreases to 2.0-3.0,
so I thought it was Java-related. We also have another server (higher traffic) with the same Apache configuration and it seems fine. |
yooy, so is it about network latency? Is there a way to measure it?
|
yooy, what the hell is that supposed to mean?

OP, try this and post the output:
Code:
top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}' |
Quote:
Code:
top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}' |
O.K., I'm confused - that should list all uninterruptible-sleep tasks (which contribute to loadavg). I had a look at how loadavg is accumulated a while back - it seemed straightforward. What kernel are you on ("uname -a")?
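For background (my sketch, not part of the original reply): the Linux load average counts tasks that are runnable (state R) or in uninterruptible sleep (state D), which is why the awk filter above looks for state D. You can compare /proc/loadavg against a direct count of those states:

```shell
#!/bin/sh
# /proc/loadavg fields: 1-, 5-, 15-minute averages,
# running/total scheduling entities, and the most recently created PID.
cat /proc/loadavg

# Direct count of tasks currently in R (runnable) or D (uninterruptible
# sleep) state -- the two states that feed the load average:
ps -eo stat= | grep -c '^[RD]'
```

A load of ~18 with 87% idle CPU suggests most of those contributing tasks are sitting in D, not R.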
|
Kernel version: Linux 2.6.18-5-686-bigmem #1 SMP Sat Dec 1 23:58:00 UTC 2007 i686 GNU/Linux
|
Keep trying that - in a loop maybe, redirecting to a file. You might have a lot of short-lived processes. Dunno at this point.
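That loop might look something like the sketch below (the function name, log file path, and sample counts are my own choices, not from the thread). It repeatedly samples tasks in D state and appends them to a log so short-lived offenders get caught:

```shell
#!/bin/sh
# Sketch: sample D-state (uninterruptible sleep) tasks periodically,
# appending each sample to a log file for later inspection.
# Usage: sample_dstate <num_samples> <interval_seconds> <logfile>
sample_dstate() {
    samples=$1; interval=$2; logfile=$3
    n=0
    while [ "$n" -lt "$samples" ]; do
        date >> "$logfile"
        # List state, PID and command of every D-state task,
        # then a per-sample total (c+0 prints 0 when none found).
        ps -eo stat=,pid=,comm= | awk '$1 ~ /^D/ {print; c++}
            END {print "Total status D: " c+0}' >> "$logfile"
        sleep "$interval"
        n=$((n + 1))
    done
}

# Example: 60 one-second samples
# sample_dstate 60 1 /tmp/dstate.log
```

Afterwards, grep the log for samples where "Total status D" spikes and see which commands appear at those timestamps.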
|
I tried it a few times and once caught something like this:
Code:
top - 16:31:07 up 125 days, 3:54, 4 users, load average: 27.03, 17.73, 18.92 |
That's one of the major problems with a tool like top - no history. You see what you see and that's it. Plus, you only see what top wants you to see.

You could always try out collectl, either interactively or as a daemon. By default in daemon mode it samples everything except processes every 10 seconds, and processes every 60, to keep the extra overhead down. But if you really want to see what's happening over a relatively short period of time, edit /etc/collectl.conf and add "-i1:1" to the 'DaemonCommands' line, and that will monitor everything once a second. "service collectl start", let it run for a few minutes, and then "service collectl stop".

Now play back the data it collected - too many options to list - but if you run:
Code:
collectl -p /var/log/collectl/filename -sxxx -oT
you'll see data for the subsystems specified with 'xxx' along with timestamps. 'c' will show CPU, 'd' disk, etc. "collectl --showsubsys" for a complete listing.

If you want to look at your top processes over time, which is what got me started, you can run:
Code:
collectl -p filename --top
and you'll see the top 10 processes for every second! If you want to see more or fewer of them, "collectl -x" will show the options for --top.

If you run "collectl -p filename -sc --verbose -oT" you'll see the load averages along with the number of running processes AND the number of process creations/sec, if that is a concern.

For more, just go to SourceForge and look at the documentation: http://collectl.sourceforge.net

have fun...
-mark |
Who is this guy???
Seems to want to push collectl pretty hard... Hey Mark, back again ... ;) I too like his little toy - unfortunately not everyone seems to want to use it. |
hey back - yes, I know you're a fan. I've seen previous posts by you recommending it. I do realize not everyone is on board with it, but I also realize not everybody is convinced monitoring is important. I was talking to someone the other day who was a sar user. Nothing wrong with sar, just that people use a monitoring interval that's much too high. I suggested that they at least drop the monitoring interval down to 10 seconds, as 10 minutes is pretty worthless. They said their vendor told them not to go below a minute, and I told them their vendor is wrong! If collectl generates less than 0.1% CPU load running at a 10-second interval, and it's written in Perl, sar has got to have an even lighter footprint. But some people just don't get it. ;)
I would wonder why people don't use collectl:
- they don't believe in proactive monitoring
- they're happy with what they have
- they're scared of it

If the first, they're flat out wrong. If the second, that's fine as long as they monitor frequently. If the third, I can help if they ask. I believe EVERYONE should continuously monitor their systems at 5-15 second frequencies. There are very few situations where I've seen monitoring have an impact on performance - applications that run at 100% CPU load, and fine-grained parallel jobs running on 1000 cores or more. If you don't know what a fine-grained parallel job is, you don't have to worry about collectl! First of all, not many people run parallel jobs, let alone fine-grained ones, and even fewer run on 1K cores or more. Even those who do run on that many cores still find a slight performance hit is worth it to have the data available if something goes wrong.
-mark |