High load and high cpu kernel usage

enid · 09-17-2010, 06:00 AM

Hello to all,

On a Debian GNU/Linux 4.0 server running several services (DNS/BIND, sendmail, Apache, etc.), I'm seeing high load. The top command shows nothing abnormal, but in htop I can see kernel CPU usage at around 100% on all the cores (the bars are shown in red), and the total load average of the server is climbing above 100.

The number of processes and the RAM usage seem OK.

Where should I look for the cause of this?

Thanks,
Enid

adamwonski · 09-17-2010, 10:13 AM

So top doesn't show high load and htop does? Maybe your version of htop is broken? Maybe look here to tell which one is right:
% watch -n1 cat /proc/stat
- what's the maximum value of procs_running after some observation?
- same for procs_blocked
- which columns of the cpu* lines are growing at the fastest pace?

And what's the nr of processes?
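
In case watching the raw counters is hard to read, here is a rough sketch that subtracts two saved snapshots of /proc/stat and prints how much each cpu* column grew. The helper name stat_delta is made up for illustration, and the column names assume the order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# stat_delta: hypothetical helper, not part of any package.
# $1 = earlier /proc/stat snapshot, $2 = later snapshot.
# Prints one line per cpu* entry with the growth of each counter.
stat_delta() {
    awk '
        BEGIN { split("user nice system idle iowait irq softirq steal", name, " ") }
        # First file: remember every counter of every cpu* line.
        NR == FNR { if ($1 ~ /^cpu/) for (i = 2; i <= NF; i++) old[$1, i] = $i; next }
        # Second file: print the difference for the first eight counters.
        $1 ~ /^cpu/ {
            line = $1
            for (i = 2; i <= NF && i <= 9; i++)
                line = line " " name[i - 1] "=" ($i - old[$1, i])
            print line
        }
    ' "$1" "$2"
}

# Usage on a live system (the 5 second interval is an arbitrary choice):
#   cat /proc/stat > /tmp/stat.1; sleep 5; cat /proc/stat > /tmp/stat.2
#   stat_delta /tmp/stat.1 /tmp/stat.2
```

A large system= value would point at kernel time, a large iowait= value at processes waiting for disk.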

enid · 09-20-2010, 03:35 AM

Quote:

Originally Posted by adamwonski

So top doesn't show high load and htop does? Maybe your version of htop is broken? Maybe look here to tell which one is right:
% watch -n1 cat /proc/stat
- what's the maximum value of procs_running after some observation?
- same for procs_blocked
- which columns of the cpu* lines are growing at the fastest pace?

And what's the nr of processes?

Hi adamwonski,

The command watch -n1 cat /proc/stat shows that cpu2 and cpu3 are growing faster than cpu0 and cpu1.

What I meant about htop and top is that they show exactly the same load average, but the kernel CPU usage (shown in red in htop) isn't shown by the top command.

The maximum value of procs_running is below 10; procs_blocked is also below 10, and below the value of procs_running.

Because the server went into a kernel panic, I rebooted it. I suspect the high load is caused by I/O operations (the HDDs are configured as RAID5), and the number of procs is now around 800000.

Thanks,
Enid

adamwonski · 09-21-2010, 01:18 AM

800.000 processes?

If you can use sar, you can run this to observe CPU usage by I/O requests:

Code:

% sar -d 1 0

%util - percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.
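
If sar is not installed, a similar number can be approximated by hand. This is only a sketch: disk_util is a made-up helper name, and it assumes that column 13 of a whole-disk line in /proc/diskstats is the number of milliseconds the device spent doing I/O. Partition lines on older kernels carry fewer columns and will not give a meaningful value.

```shell
# disk_util: hypothetical helper, not part of sysstat.
# $1 = earlier /proc/diskstats snapshot, $2 = later snapshot,
# $3 = seconds elapsed between the two snapshots.
# Prints "device percent-busy" for every device present in both files.
disk_util() {
    awk -v secs="$3" '
        # First file: remember the busy-time counter per device name.
        NR == FNR { busy[$3] = $13; next }
        # Second file: busy milliseconds divided by elapsed milliseconds.
        ($3 in busy) {
            printf "%s %.1f\n", $3, ($13 - busy[$3]) / (secs * 1000) * 100
        }
    ' "$1" "$2"
}

# Usage on a live system (the 5 second interval is an arbitrary choice):
#   cat /proc/diskstats > /tmp/ds.1; sleep 5; cat /proc/diskstats > /tmp/ds.2
#   disk_util /tmp/ds.1 /tmp/ds.2 5
```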

enid · 09-21-2010, 04:58 AM

Hi adamwonski,

Indeed the %util is very close to 100 most of the time.
What exactly does this mean and how can it be improved?

Thanks again
Enid

adamwonski · 09-22-2010, 11:58 AM

that means your devices are saturated / overwhelmed with requests

Quote:

Originally Posted by enid

hdd's configured as Raid5

is that software RAID? Do you see any drive broken, or the RAID un-synced? syncing?

i think that having constant load of 10 procs for 4 CPUs is not too bad, although if most of them are also blocked all the time (as I understand from your previous post), then either you have a problem with disks, or your applications (or 1 of them) use them extensively. Is your disk space shrinking fast? You can run something like this to observe:

Code:

watch -n1 -dc df

maybe it's swapping?

Code:

vmstat 1

what do the swap si/so columns look like?
what does the 'free' command show in the Swap line?

add the -p parameter to get device names that are easier to understand:

Code:

sar -dp 1 0

which drives/partitions belong to RAID? which are most loaded? does reading or writing prevail? what other interesting numbers can you observe?

do you see anything particular in logs?

when exactly did the problems begin? did you change anything prior to that time? ANYTHING? even something completely unrelated in your opinion?
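
To see which processes are the blocked ones, something like the following might help. It is just a sketch: blocked_procs is a made-up helper name, and it assumes the input lines have the form "STAT PID COMMAND", where a STAT starting with D means uninterruptible sleep (usually waiting for I/O).

```shell
# blocked_procs: hypothetical helper.
# Reads "STAT PID COMMAND" lines on stdin and prints "PID COMMAND"
# for every process whose state starts with D (uninterruptible sleep).
blocked_procs() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Usage on a live system:
#   ps -eo stat=,pid=,comm= | blocked_procs
```

If the same command names keep showing up over several runs, those are the ones worth looking at first.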

enid · 09-23-2010, 09:03 AM

Quote:

Originally Posted by adamwonski

that means your devices are saturated / overwhelmed with requests

is that software RAID? Do you see any drive broken, or the RAID un-synced? syncing?

No, it is hardware RAID; all the drives show OK, and the RAID seems to be working OK.

Quote:

Originally Posted by adamwonski

i think that having constant load of 10 procs for 4 CPUs is not too bad, although if most of them are also blocked all the time (as I understand from your previous post), then either you have a problem with disks, or your applications (or 1 of them) use them extensively. Is your disk space shrinking fast? You can run something like this to observe:

Code:

watch -n1 -dc df

I see that especially the /var partition is growing faster than the others but not at a very high rate.

Quote:

Originally Posted by adamwonski

maybe it's swapping?

Code:

vmstat 1

what do the swap si/so columns look like?
what does the 'free' command show in the Swap line?

Most of the time si/so show zero, and free shows about 160MB of swap used out of 3800MB:

             total       used       free     shared    buffers     cached
Mem:       2060388    2030080      30308          0      29260     753480
-/+ buffers/cache:    1247340     813048
Swap:      3895720     167896    3727824

Quote:

Originally Posted by adamwonski

add the -p parameter to get device names that are easier to understand:

Code:

sar -dp 1 0

which drives/partitions belong to RAID? which are most loaded? does reading or writing prevail? what other interesting numbers can you observe?

do you see anything particular in logs?

when exactly did the problems begin? did you change anything prior to that time? ANYTHING? even something completely unrelated in your opinion?

As I said, the RAID is hardware, and all the hard drives (5 HDDs) are presented as one big HDD of ~1.3TB, split into several partitions.
I think writing prevails most of the time.

I should mention that I made some changes to /etc/fstab (added noatime and nodiratime to the /var and /home partitions).
This improved the performance significantly, but the problem has not gone away completely: the load keeps going to 100, but at a lower rate.
I also upgraded the popd/imapd server (dovecot), suspecting that it was causing the problem; it had been logging errors such as segfaults, and those have now gone away.

The problem began about two weeks ago, and I'm sure that no change was made to the server's configuration or anything else. The only thing I noticed was the /var and /home partitions growing (though not by much) while the load kept increasing (but always staying below 20 - 30, not 100).

Thanks,
Enid

adamwonski · 09-27-2010, 02:53 PM

If you have an ext2/ext3 file system and can install blktrace on the server, you can try to gather more info with it. The manual has examples; the simplest use is:

Code:

btrace /dev/sda

if you get an error about debugfs, mount debugfs:

Code:

mount -t debugfs debugfs /sys/kernel/debug

enid · 09-30-2010, 03:33 AM

I upgraded the kernel using vanilla kernel 2.6.35.5 (compile/make/make install), because I suspected a bug or a malfunctioning RAID driver.

Now the load average is lower, but when the memory usage increases, the I/O wait % of the CPU and the load average increase as well (at lower rates than before).
I also plan to add more RAM and see how it goes.

Regards,
Enid