Old 12-07-2009, 08:23 PM   #1
beardo265
LQ Newbie
 
Registered: Dec 2009
Posts: 1

Rep: Reputation: 0
Please help: unexplained spike in load average & drive usage


Hi, I'm looking for some help. I'm working with a friend on a Fedora 10 server that primarily runs several fairly high-traffic Apache/PHP/MySQL sites.

Recently a second hard drive was added for some additional storage (this problem may or may not be related to that addition).

Currently, we keep seeing big jumps in load average without any obvious reason (i.e., no CPU-heavy process or anything like that). What we have noticed, using iostat, is that right before the jump in load average, both drives' %util jumps to 100% and await/svctm shoot way up for several seconds. I haven't been able to track down what could cause this. Lately it has been happening every couple of minutes, usually with enough time in between for the load average to settle back down (it runs around 1-2 if left alone), but when it happens several times in a row, the server can almost grind to a halt. There's plenty of memory and CPU available, and we're not swapping.

Is there anything else I can look at, or other possible causes? I've been trying to track this down for several days now with no luck.

Below are the load averages and iostat output for several seconds where this took place.
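
For reference, output in the format shown below can be collected with something along these lines (just a sketch of one way to do it; the one-second interval is an assumption):

Code:
# print the 1/5/15-minute load averages with a timestamp, once a second
while sleep 1; do
    echo "load average: $(cut -d' ' -f1-3 /proc/loadavg)  $(date +%H:%M:%S)"
done
Code:
# extended per-device statistics (sysstat package), timestamped, 1-second interval
iostat -xt 1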

Now, I'm no expert at this type of thing, so be nice. :-)

Thanks very much for your time! Any thoughts/comments/help would be greatly appreciated.


Code:
load average: 6.65, 6.63, 6.66 08:26:50
load average: 6.65, 6.63, 6.66 08:26:51
load average: 6.65, 6.63, 6.66 08:26:52
load average: 6.65, 6.63, 6.66 08:26:53
load average: 6.65, 6.63, 6.66 08:26:57
load average: 13.24, 7.99, 7.10 08:26:58
load average: 13.24, 7.99, 7.10 08:26:59

Code:
Time: 08:26:50 PM
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              23.00     0.00   26.00    0.00   912.00     0.00    35.08     0.17    6.65   4.19  10.90
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3             23.00     0.00   26.00    0.00   912.00     0.00    35.08     0.17    6.65   4.19  10.90
sdb               0.00    60.00    1.00    2.00     8.00   496.00   168.00     0.33  110.33 110.33  33.10
sdb1              0.00    60.00    1.00    2.00     8.00   496.00   168.00     0.33  110.33 110.33  33.10

Time: 08:26:51 PM
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              23.00     0.00   12.00    0.00   480.00     0.00    40.00     5.42   45.83  53.50  64.20
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3             23.00     0.00   12.00    0.00   480.00     0.00    40.00     5.42   45.83  53.50  64.20
sdb              24.00     0.00    2.00    0.00   112.00     0.00    56.00     2.22  242.00 326.50  65.30
sdb1             24.00     0.00    2.00    0.00   112.00     0.00    56.00     2.22  242.00 326.50  65.30

Time: 08:26:52 PM
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              26.00     0.00    1.00    0.00   136.00     0.00   136.00    18.73  912.00 1000.00 100.00
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3             26.00     0.00    1.00    0.00   136.00     0.00   136.00    18.73  912.00 1000.00 100.00
sdb               0.00     0.00    1.00    0.00    88.00     0.00    88.00     4.59  686.00 1000.00 100.00
sdb1              0.00     0.00    1.00    0.00    88.00     0.00    88.00     4.59  686.00 1000.00 100.00

Time: 08:26:53 PM
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    2.00    0.00   112.00     0.00    56.00    29.34 2084.00 500.00 100.00
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.00    2.00    0.00   112.00     0.00    56.00    29.34 2084.00 500.00 100.00
sdb               0.00     0.00    2.00    0.00    56.00     0.00    28.00     4.06 1927.00 500.00 100.00
sdb1              0.00     0.00    2.00    0.00    56.00     0.00    28.00     4.06 1927.00 500.00 100.00

Time: 08:26:54 PM
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    1.00    0.00     8.00     0.00     8.00    32.12 2719.00 1001.00 100.10
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.00    1.00    0.00     8.00     0.00     8.00    32.12 2719.00 1001.00 100.10
sdb              12.00     0.00    1.00    0.00     8.00     0.00     8.00     2.61 2633.00 1000.00 100.00
sdb1             12.00     0.00    1.00    0.00     8.00     0.00     8.00     2.61 2633.00 1000.00 100.00

Time: 08:26:58 PM
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               9.18   113.61   74.37   22.78  1746.84  1091.14    29.21    24.62  505.05  10.17  98.77
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              9.18   113.61   74.37   22.78  1746.84  1091.14    29.21    24.62  505.05  10.17  98.77
sdb               7.59     6.33    6.65    0.63   326.58    55.70    52.52     2.58  608.30  94.65  68.89
sdb1              7.59     6.33    6.65    0.63   326.58    55.70    52.52     2.58  608.30  94.65  68.89

Time: 08:26:59 PM
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00   30.00    4.00   792.00    32.00    24.24     0.39   11.53   6.12  20.80
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.00   30.00    4.00   792.00    32.00    24.24     0.39   11.53   6.12  20.80
sdb              12.00     0.00    5.00    0.00   168.00     0.00    33.60     0.04    8.80   8.80   4.40
sdb1             12.00     0.00    5.00    0.00   168.00     0.00    33.60     0.04    8.80   8.80   4.40
 
Old 12-07-2009, 10:05 PM   #2
flakblas
Member
 
Registered: Jun 2009
Location: Maryland
Distribution: Fedora, CentOS, RHEL, Ubuntu
Posts: 41

Rep: Reputation: 3
I think I just posted a response to this over on LinuxForums, lol. Anyway, for the sake of this thread's completeness, here's my reply:

Quote:
Just curious: when's the last time you took an outage and fsck'd all your partitions? Also, check out the output of smartctl on your drives (yum install smartmontools).

Code:
yum install smartmontools
Code:
smartctl -A /dev/sdx
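When you read the -A output, the raw values for Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable are the ones to eyeball first; nonzero raw values there usually mean the drive is on its way out. For a quick overall pass/fail verdict, something along these lines works too:
Code:
# overall SMART health self-assessment; replace sdx with your actual device
smartctl -H /dev/sdx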
Then from a recovery shell (no partitions mounted):
Code:
for i in /dev/sd*; do fsck -fy "$i"; done
That last one is a little ugly and will throw a few errors (it hits the whole-disk nodes as well as the partitions), but it's quick and easy and it will fsck every sd* device. Please post the output from these commands. I know an outage sucks, but if your partitions are healthy and not too big it shouldn't take long. smartctl doesn't need an outage, though.
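If you want to avoid that noise, narrowing the glob to the numbered partitions should do it; an untested sketch (a swap partition will still throw an error, which is harmless):
Code:
# fsck only the numbered partitions, skipping whole-disk nodes like /dev/sda
for i in /dev/sd[a-z][0-9]*; do fsck -fy "$i"; done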
 
  


