Troubleshooting server slowdown

DaRkBoDoM · 11-08-2011, 03:24 PM

Hi there!

Since some days, my Ubuntu Linux Home Server is experiencing extreme slowdown.

Here is what I know:
- Reboot doesn't help
- No unusual dmesg / log output
- Very high (8-22) load average
- High (75%+) CPU "wa" usage
- Very high response time and bad "feel"
- I haven't touched anything that could slow it down

I suspect some kind of hardware semi-failure. Can you help me troubleshoot it?

kbp · 11-08-2011, 04:30 PM

High wait (wa) is I/O related, commonly disk but not necessarily. Try starting there: are you seeing high disk utilisation?

With the load peaking at 22, I'm guessing you have a lot of processes in the run queue. What sort of services is this server providing?
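
One way to put a number on disk utilisation without extra packages is to sample /proc/diskstats twice and compare the "time spent doing I/O" counter. This is only a sketch, not something posted in this thread: the function name disk_util and the temporary file paths are made up, and it assumes the usual Linux diskstats column layout.

```shell
# disk_util BEFORE AFTER INTERVAL_MS
# BEFORE and AFTER are two copies of /proc/diskstats taken INTERVAL_MS apart.
# Column 13 (counting major, minor and device name as columns 1-3) is the
# total time in ms the device had I/O in flight, so its delta divided by
# the interval approximates how busy the device was, in percent.
disk_util() {
    awk -v interval="$3" '
        NR == FNR { busy[$3] = $13; next }   # first file: remember the baseline
        ($3 in busy) {
            printf "%s %.1f\n", $3, ($13 - busy[$3]) * 100 / interval
        }
    ' "$1" "$2"
}

# Example, run on the affected host:
#   cat /proc/diskstats > /tmp/ds1; sleep 5; cat /proc/diskstats > /tmp/ds2
#   disk_util /tmp/ds1 /tmp/ds2 5000
```

Values close to 100 for sda or sdb would mean the disks are busy almost all the time, which roughly corresponds to the %util column of iostat -x from the sysstat package.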

DaRkBoDoM · 11-08-2011, 04:59 PM

I've checked disk I/O and noticed nothing above normal.
Anyway, disk i/o is tremendously slow and simply touching a file may take a huge amount of time.

The disks are SATA in RAID1. There are no seek-error messages in dmesg and no SMART failures.

I've seen a lot of broken disks, but these are strangely "silent".

kbp · 11-08-2011, 05:05 PM

Quote:

Originally Posted by DaRkBoDoM

Anyway, disk i/o is tremendously slow and simply touching a file may take a huge amount of time.

This isn't normal behaviour ...

DaRkBoDoM · 11-08-2011, 05:33 PM

Yep, I know... but what can I do?
How can I detect what's wrong?

I have no error messages, and unplugging random hard drives to "see what happens" doesn't sound like a reasonable idea.

kbp · 11-08-2011, 06:31 PM

There isn't necessarily anything wrong with the drive; you may just have too many processes waiting on disk access and performing a lot of writes, or you could have several processes all waiting on the same file. Try using ps and lsof to see what files the waiting processes are attempting to access.
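
As a rough illustration of the ps part (not from the original post): processes stuck waiting on I/O show up in state D (uninterruptible sleep). The helper name blocked_tasks is hypothetical; it only filters the output of ps -eo pid,stat,comm.

```shell
# blocked_tasks: read "ps -eo pid,stat,comm" output on stdin and print the
# PID and command name of every task whose state starts with D.
blocked_tasks() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'   # NR > 1 skips the header line
}

# Example, run on the affected host:
#   ps -eo pid,stat,comm | blocked_tasks
# then, for a PID of interest (1234 is a placeholder):
#   lsof -p 1234
```

lsof -p then lists the files that process has open. Kernel threads such as the md or jbd2 workers generally show nothing useful there, so user-space processes in state D are the interesting ones.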

d3vrandom · 11-09-2011, 12:42 AM

Actually, your problem is entirely related to disk I/O. Linux counts processes blocked on disk access (uninterruptible sleep, state D) in its load average. So when you have high I/O wait you will also have high load numbers, even though your CPU might very well be idle! I suspect there is something wrong with your RAID array. Was there a disk failure, and is the array being rebuilt or something? If the answer is no, then you have to identify which process is causing the high I/O.
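
The rebuild question can be answered from /proc/mdstat. A small sketch, assuming the usual mdstat layout: a status line containing something like [2/2] [UU], where an underscore marks a missing member, and a recovery or resync progress line while a rebuild is running. The function name md_problems is made up for this example.

```shell
# md_problems [FILE]
# Print one line per md array that is degraded or currently syncing.
# Prints nothing when every array looks healthy. FILE defaults to /proc/mdstat.
md_problems() {
    awk '
        /^md[0-9]+ :/               { name = $1 }               # array header line
        /\[[U_]*_[U_]*\]/           { print name, "degraded" }  # e.g. [U_] or [_U]
        /(recovery|resync) *=/      { print name, "syncing" }   # rebuild in progress
    ' "${1:-/proc/mdstat}"
}

# Example, run on the affected host:
#   md_problems
```

No output means no array is degraded or rebuilding, so the RAID layer itself is probably not the cause and the search moves on to individual processes or the hardware underneath.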

One way to check would be to run top and then sort by the TIME column in descending order. That should tell you which processes have accumulated the most CPU time.

DaRkBoDoM · 11-09-2011, 03:25 AM

The only heavy I/O process is qemu, but it has always been there and it's not consuming that much disk bandwidth.
Even terminating it doesn't change things that much.

Disk I/O is also very slow after forcing a hard reboot: it took a long time just to replay the filesystem journal, when no processes were running at all.

Here is the top output sorted by TIME. Note that at the time of this "top" the server is running really FAST -_-'

Code:

top - 10:22:13 up 12:44,  1 user,  load average: 4.02, 4.80, 5.26
Tasks: 190 total,   1 running, 189 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  2.3%sy, 17.9%ni, 23.9%id, 55.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3090260k total,  2880424k used,   209836k free,   782748k buffers
Swap:  1952700k total,     6184k used,  1946516k free,   728200k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+    TIME COMMAND
 3577 valhalla  39  19  355m 181m 1792 S 21.5  6.0 120:02.80 120:02 qemu
 3265 nagios    25   5 39240 5404 2008 S  0.0  0.2   1:00.74   1:00 nagios3
 3378 asterisk -11   0  656m  27m 9552 S  0.3  0.9   0:51.89   0:51 asterisk
 1503 mysql     20   0  177m  41m 3364 S  0.0  1.4   0:42.39   0:42 mysqld
  359 root      20   0     0    0    0 S  0.0  0.0   0:29.36   0:29 md8_raid5
 1493 bind      20   0  149m  46m 1716 S  0.0  1.5   0:20.82   0:20 named
 3296 proxy     20   0 86776  21m 2816 S  0.3  0.7   0:20.04   0:20 squid3
  979 syslog    20   0  129m 1720 1112 S  0.3  0.1   0:16.91   0:16 rsyslogd
 3084 snmp      20   0 47456 3660 1548 S  0.0  0.1   0:10.53   0:10 snmpd
   21 root      20   0     0    0    0 S  0.0  0.0   0:10.21   0:10 kswapd0
   10 root      20   0     0    0    0 S  0.0  0.0   0:09.64   0:09 sync_supers
 3423 fetchmai  20   0 43508 3356 2432 S  0.0  0.1   0:08.29   0:08 fetchmail
 6289 root      20   0     0    0    0 S  0.0  0.0   0:07.12   0:07 jbd2/dm-1-8
 2647 postgres  20   0  108m 1704  484 S  0.0  0.1   0:06.70   0:06 postgres
 2706 postgres  20   0  208m 2332 1248 S  0.0  0.1   0:06.39   0:06 postgres
  283 root      20   0     0    0    0 D  0.0  0.0   0:05.56   0:05 md3_raid1
    3 root      20   0     0    0    0 S  0.0  0.0   0:04.92   0:04 ksoftirqd/0

Code:

root@transylvania:~ 0 1001# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md8 : active raid5 sdc1[5] sde1[7] sdf1[4] sdd1[6]
      2927845632 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 sdb1[0] sda1[1]
      2928576 blocks [2/2] [UU]

md2 : active raid1 sdb5[0] sda5[1]
      4881344 blocks [2/2] [UU]

md4 : active raid1 sdb7[0] sda7[1]
      2928576 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      1952704 blocks [2/2] [UU]

md3 : active raid1 sdb6[0] sda6[1]
      24413120 blocks [2/2] [UU]

md6 : active raid1 sdb8[0] sda8[1]
      451146624 blocks [2/2] [UU]

unused devices: <none>

My guess is that wa is high not because the number of I/O requests has risen above normal, but because I/O requests are being served at a much slower rate.

The problem is: what could be causing that behaviour?
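
One way to test that guess is to compare the request rate and the average time per request from two /proc/diskstats samples: a normal request rate combined with a high time per request would point at slow service rather than extra load. Sketch only; io_latency and the file paths are hypothetical, and it assumes the standard diskstats columns.

```shell
# io_latency BEFORE AFTER INTERVAL_MS
# BEFORE and AFTER are two copies of /proc/diskstats taken INTERVAL_MS apart.
# Columns used (major, minor, name are columns 1-3):
#   4 = reads completed,  7 = ms spent reading
#   8 = writes completed, 11 = ms spent writing
# Output per device: name, requests per second, average ms per request.
io_latency() {
    awk -v interval="$3" '
        NR == FNR { r[$3] = $4; rt[$3] = $7; w[$3] = $8; wt[$3] = $11; next }
        ($3 in r) {
            ios = ($4 - r[$3]) + ($8 - w[$3])      # requests finished in the interval
            ms  = ($7 - rt[$3]) + ($11 - wt[$3])   # total time those requests took
            if (ios > 0) {
                printf "%s %.1f %.1f\n", $3, ios * 1000 / interval, ms / ios
            }
        }
    ' "$1" "$2"
}

# Example, run on the affected host:
#   cat /proc/diskstats > /tmp/ds1; sleep 5; cat /proc/diskstats > /tmp/ds2
#   io_latency /tmp/ds1 /tmp/ds2 5000
```

The time per request includes time spent queued, so it is closer to what iostat calls await than to raw device service time. Comparing sda against sdb may also show whether one RAID1 member is much slower than the other.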

d3vrandom · 11-10-2011, 05:07 AM

I have an idea. Boot the server off a linux installation/live cd and run the badblocks program on each of the drives. On modern drives it takes about 1.5-2 hours for the read only test. If the program runs really slow you know something is wrong with your drives or the disk controller. If it runs normally but shows that you have bad blocks on your drives then your drives need to be replaced.

BTW, you could run badblocks without rebooting your server, i.e. from within your currently installed OS. But I want you to use a CD in order to rule out the current filesystem as a factor in the slowdown.
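
A minimal sketch of the read-only pass, assuming it is run as root from the live CD. badblocks performs a read-only test by default; -s shows progress and -v is verbose. The wrapper name scan_readonly is hypothetical, and /dev/sda is only an example device.

```shell
# scan_readonly DEVICE
# Run a read-only badblocks pass over DEVICE and report how long it took.
# Refuses anything that is not a block device.
scan_readonly() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    start=$(date +%s)
    badblocks -sv "$dev"          # no -n or -w, so the disk is only read
    status=$?
    echo "elapsed: $(( $(date +%s) - start )) s"
    return "$status"
}

# Example, one physical drive at a time:
#   scan_readonly /dev/sda
#   scan_readonly /dev/sdb
```

Running it against the whole drive (sda, sdb) rather than the md devices tests each RAID member separately, which makes it easier to see whether only one of them is slow.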

deep27ak · 11-10-2011, 05:36 AM

Would you mind telling me the RAM and swap memory of your system?

Code:

# free -m
(post the output)

Code:

# df -h
(post the output)

DaRkBoDoM · 11-10-2011, 06:06 AM

Quote:

Originally Posted by d3vrandom

I have an idea. Boot the server off a linux installation/live cd and run the badblocks program on each of the drives. On modern drives it takes about 1.5-2 hours for the read only test. If the program runs really slow you know something is wrong with your drives or the disk controller. If it runs normally but shows that you have bad blocks on your drives then your drives need to be replaced.

Nice idea. I'll do it tonight. Thanks.

Quote:

Originally Posted by deep27ak

Would you mind telling me the RAM and swap memory of your system

Code:

root@transylvania:~ 0 1001# free -m
             total       used       free     shared    buffers     cached
Mem:          3017       2748        269          0        696        784
-/+ buffers/cache:       1267       1750
Swap:         1906         12       1894

Code:

root@transylvania:~ 0 1002# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              2,8G  1,4G  1,3G  52% /
udev                  1,5G  8,0K  1,5G   1% /dev
tmpfs                 604M  1,2M  603M   1% /run
none                  5,0M  8,0K  5,0M   1% /run/lock
none                  1,5G     0  1,5G   0% /run/shm
/dev/md2              4,6G  1,3G  3,1G  30% /usr
/dev/md6              431G  171G  261G  40% /home
/dev/md3               24G   17G  7,0G  70% /var
/dev/md4              2,8G  833M  1,8G  32% /var/log
/dev/mapper/extras-extra2
                      2,8T  1,7T  1,1T  60% /mnt/extra

deep27ak · 11-10-2011, 06:21 AM

most of your swap memory is utilized and with such large harddisk size I would advise you to increase swap memory or RAM

DaRkBoDoM · 11-10-2011, 06:35 AM

Quote:

Originally Posted by deep27ak

most of your swap memory is utilized and with such large harddisk size I would advise you to increase swap memory or RAM

It seems to me that the system is hardly swapping at all (about 0,6% swap used).
Where am I wrong?
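
The 0,6% figure can be checked from the free -m output. A minimal sketch, assuming the older procps layout shown in this thread (with a "-/+ buffers/cache" line); mem_summary is a hypothetical helper name.

```shell
# mem_summary: read "free -m" output on stdin and print
#   apps_used_pct = memory used by applications (buffers and cache excluded)
#                   as a percentage of total RAM
#   swap_used_pct = swap used as a percentage of total swap
mem_summary() {
    awk '
        $1 == "Mem:"           { total = $2 }
        $1 == "-/+" && total   { printf "apps_used_pct %.1f\n", $3 * 100 / total }
        $1 == "Swap:" && $2    { printf "swap_used_pct %.1f\n", $3 * 100 / $2 }
    '
}

# Example:
#   free -m | mem_summary
```

With the numbers posted above (1267 of 3017 MB used by applications, 12 of 1906 MB swap used) this prints apps_used_pct 42.0 and swap_used_pct 0.6. Most of the 2748 MB reported as used is buffers and page cache, which the kernel can release when applications need memory.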

deep27ak · 11-10-2011, 06:57 AM

Quote:

Originally Posted by DaRkBoDoM

It seems to me that the system is hardly swapping at all (about 0,6% swap used).
Where am I wrong?

Well, 2.7 GB of your 3 GB of RAM is used, and you are running a system with more than 2.8 TB of storage?

unSpawn · 11-10-2011, 07:07 AM

Quote:

Originally Posted by DaRkBoDoM

Since some days, my Ubuntu Linux Home Server is experiencing extreme slowdown.

Maybe we could try looking at other stuff in parallel?
- I noticed you saying "since some days". So what happened since the machine last ran OK? Any system updates or reconfiguration? New users? Anything else we should know?
- Can you install Atop, reboot the machine to a sane state and have Atop store system and process activity for at least 24 hours? (I like Atop because it's easy to replay the binary log, provided a reasonable interval is used.)
- You stated logs don't show any anomalies but you didn't say what you've looked with. If it was a case of cursory visual inspection I suggest using Logwatch instead. It's helpful for finding leads you might have overlooked in log files it knows about.
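
For the Atop suggestion, a sketch of how the recording could be set up. atop takes an interval in seconds and a sample count, -w writes the raw log and -r replays it. The helper name atop_record_cmd and the default output path are assumptions for illustration only.

```shell
# atop_record_cmd HOURS INTERVAL_SECONDS [RAWFILE]
# Print the atop command line that records HOURS of activity,
# one sample every INTERVAL_SECONDS, into RAWFILE.
atop_record_cmd() {
    hours=$1
    interval=$2
    out=${3:-/var/tmp/atop-trace.raw}            # hypothetical default path
    samples=$(( hours * 3600 / interval ))       # number of samples to take
    echo "atop -w $out $interval $samples"
}

# Example: 24 hours at one sample per minute
#   atop_record_cmd 24 60
# Replay the recording afterwards with:
#   atop -r /var/tmp/atop-trace.raw
```

A 60 second interval over 24 hours gives 1440 samples. A shorter interval shows more detail around each slowdown at the cost of a larger raw file.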