LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 08-18-2009, 08:04 AM   #1
Keds
LQ Newbie
 
Registered: Aug 2009
Location: UK
Distribution: Centos
Posts: 9

Rep: Reputation: 0
Troubleshooting high load - Possible IO / RAID issue?


Hi Everybody,

This is my first post after quite a while of lurking. To LQ's credit, I can usually find what I need without posting.

I have a CentOS 5.3 server:

Quote:
Linux x.x.net 2.6.18-53.1.6.el5.028stab053.6 #1 SMP Mon Feb 11 20:14:31 MSK 2008 x86_64 x86_64 x86_64 GNU/Linux
The server has started to hit high load averages recently:

Quote:
top - 08:36:45 up 14 days, 13:50, 1 user, load average: 5.16, 3.27, 2.27
Tasks: 251 total, 1 running, 247 sleeping, 0 stopped, 3 zombie
Cpu(s): 1.3%us, 0.3%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4024680k total, 3897460k used, 127220k free, 241264k buffers
Swap: 4080488k total, 120k used, 4080368k free, 2615988k cached
CPU and memory utilisation actually seem pretty low, which made me think that maybe I have a disk / IO issue.
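On Linux the load average counts tasks in uninterruptible sleep (state D, usually waiting on disk) as well as runnable ones, so a high load with an idle CPU is consistent with an I/O problem. A minimal sketch for spotting such tasks follows. The function name `blocked_tasks` is made up for illustration, and it reads `ps`-style lines on stdin so it can also be tried on saved output:

```shell
#!/bin/sh
# Hypothetical helper: print tasks whose state starts with D
# (uninterruptible sleep, typically blocked on I/O).
# Expects lines of "STATE PID COMMAND", e.g. from: ps -eo state,pid,comm
blocked_tasks() {
    awk '$1 ~ /^D/'
}

# Usage on a live system:
#   ps -eo state,pid,comm | blocked_tasks
```

Running it a few times in a row while the load is high gives a rough idea of which processes are stuck waiting.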

The server is using software RAID1:

Quote:
cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[1]
154248000 blocks [2/2] [UU]

So I ran iostat -dx 5

Quote:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 45.20 0.00 46.60 0.00 739.20 15.86 1.78 38.22 1.09 5.06
sdb 0.00 45.20 0.00 39.00 0.00 640.00 16.41 4.17 37.12 7.80 30.42
md0 0.00 0.00 0.00 91.20 0.00 729.60 8.00 0.00 0.00 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 8.00 0.20 1.40 1.60 76.80 49.00 0.01 6.75 6.75 1.08
sdb 0.00 8.00 0.00 8.80 0.00 172.80 19.64 7.72 1132.70 109.05 95.96
md0 0.00 0.00 0.20 9.20 1.60 73.60 8.00 0.00 0.00 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.42 0.00 0.00 100.04
md0 0.00 0.00 0.20 0.00 1.60 0.00 8.00 0.00 0.00 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 28.60 0.00 7.40 0.00 292.80 39.57 0.04 4.84 2.89 2.14
sdb 0.00 28.60 0.40 7.60 3.20 296.00 37.40 1.51 426.25 119.85 95.88
md0 0.00 0.00 0.20 35.60 1.60 284.80 8.00 0.00 0.00 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 13.20 0.20 5.40 1.60 153.60 27.71 0.03 4.96 3.68 2.06
sdb 0.00 13.20 0.00 5.40 0.00 153.60 28.44 0.29 53.44 39.04 21.08
md0 0.00 0.00 0.20 17.80 1.60 142.40 8.00 0.00 0.00 0.00 0.00

I haven't used iostat before and I'm not entirely sure how to interpret these results. However, if I'm not mistaken, sdb looks like it is far busier (%util) and responding much more slowly (await) than sda.
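For reference, in this iostat layout await is the average time in milliseconds a request spends queued plus being serviced, and %util is the share of time the device was busy; neither is a CPU figure. A rough sketch for picking out slow devices from output like the above is shown here. The function name `slow_devices` and the 100 ms default are arbitrary choices, and the column positions assume the 12-column `iostat -dx` layout quoted in this thread (newer sysstat versions lay the columns out differently):

```shell
#!/bin/sh
# Hypothetical helper: flag devices whose await (ms) exceeds a threshold.
# Assumes the 12-column "iostat -dx" layout quoted in this thread:
#   Device rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
slow_devices() {
    limit="${1:-100}"    # await threshold in ms; 100 is an arbitrary default
    awk -v limit="$limit" '
        NF == 12 && $1 !~ /^Device/ && ($10 + 0) > (limit + 0) {
            printf "%s await=%s util=%s\n", $1, $10, $12
        }'
}

# Usage on a live system:
#   iostat -dx 5 | slow_devices 100
```

On the samples above this would single out sdb in the intervals where its await climbs into the hundreds of milliseconds while sda stays in single digits.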

Also, the disks don't seem to be carrying an equal load:

Quote:
avg-cpu: %user %nice %system %iowait %steal %idle
2.85 0.06 0.93 3.53 0.00 92.64

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 9.60 262.54 238.05 330705548 299849760
sdb 3.36 1.01 326.34 1267426 411068944
md0 29.60 18.63 230.51 23470842 290357512
I have a niggling suspicion that something is not as it should be with sdb and that it's affecting the performance of the server overall.

I really need to track down what is causing this high load. Could anyone give me some guidance?
 
Old 08-18-2009, 09:48 AM   #2
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
Hi.

Yup, that iostat output looks a bit suspect. Is there anything mounted on or otherwise using sdb2 or sdb3 etc?

Dave
 
Old 08-18-2009, 11:07 AM   #3
Keds
LQ Newbie
 
Registered: Aug 2009
Location: UK
Distribution: Centos
Posts: 9

Original Poster
Rep: Reputation: 0
Hi Dave,

Thanks for the response. I ran fdisk -l:


Quote:
Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 * 1 19203 154248066 fd Linux raid autodetect
/dev/sda2 19204 19457 2040255 82 Linux swap / Solaris

Disk /dev/sdb: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdb1 * 1 19203 154248066 fd Linux raid autodetect
/dev/sdb2 19204 19457 2040255 82 Linux swap / Solaris

Disk /dev/md0: 157.9 GB, 157949952000 bytes
2 heads, 4 sectors/track, 38562000 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md0 doesn't contain a valid partition table
sdb1 and sdb2 would appear to be the only devices / mounts on sdb...

Are there any other commands I can run to check this?

Thanks again,

Ked
 
Old 08-18-2009, 01:22 PM   #4
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
Hi again.

I'd check /var/log/messages for any scsi or md errors if you haven't already. Check 'dmesg' too.

Might be worth (if you're willing) failing the sdb1 partition to stop md0 using it and see if the load averages drop. Risk involved, obviously, but you'd know if the disk was causing the load issue. If memory serves (don't bank on it) the command would be:
# mdadm --fail /dev/md0 /dev/sdb1

Incidentally, what does 'swapon -s' show?
From what I can see from your posts you've only got sdX1 in raid, but you've got swap switched on, so one disk failure might still crash your host if the swap is on the raw partitions.

Dave

Last edited by ilikejam; 08-18-2009 at 01:55 PM. Reason: sdb -> sdb1
 
Old 08-19-2009, 12:19 AM   #5
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,348

Rep: Reputation: 2749
What have you got in /etc/fstab? As noted above, you've raided your boot/root/data partitions, but not your 2(!) swap partitions, so maybe you're only using one?
 
Old 08-20-2009, 05:27 AM   #6
Keds
LQ Newbie
 
Registered: Aug 2009
Location: UK
Distribution: Centos
Posts: 9

Original Poster
Rep: Reputation: 0
Hi Dave,

In dmesg I have:

Adding 2040244k swap on /dev/sdb2. Priority:-1 extents:1 across:2040244k
Adding 2040244k swap on /dev/sda2. Priority:-2 extents:1 across:2040244k

In /var/log/messages there are a couple of ata / scsi messages that I don't recognise:

Quote:
Aug 17 11:22:14 node1 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Aug 17 11:22:14 node1 kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
Aug 17 11:22:14 node1 kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 17 11:22:17 node1 kernel: ata2: soft resetting port
Aug 17 11:22:17 node1 kernel: ata2.00: configured for UDMA/133
Aug 17 11:22:17 node1 kernel: ata2: EH complete
Aug 17 11:22:17 node1 kernel: SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
Aug 17 11:22:17 node1 kernel: sdb: Write Protect is off
Aug 17 11:22:17 node1 kernel: SCSI device sdb: drive cache: write back
Quote:
Aug 19 14:52:14 node1 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Aug 19 14:52:14 node1 kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
Aug 19 14:52:14 node1 kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 19 14:52:17 node1 kernel: ata2: soft resetting port
Aug 19 14:52:17 node1 kernel: ata2.00: configured for UDMA/133
Aug 19 14:52:17 node1 kernel: ata2: EH complete
Aug 19 14:52:17 node1 kernel: SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
Aug 19 14:52:17 node1 kernel: sdb: Write Protect is off
Aug 19 14:52:17 node1 kernel: SCSI device sdb: drive cache: write back
Quote:
Aug 19 23:22:14 node1 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Aug 19 23:22:14 node1 kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
Aug 19 23:22:14 node1 kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 19 23:22:18 node1 kernel: ata2: soft resetting port
Aug 19 23:22:18 node1 kernel: ata2.00: configured for UDMA/133
Aug 19 23:22:18 node1 kernel: ata2: EH complete
Aug 19 23:22:18 node1 kernel: SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
Aug 19 23:22:18 node1 kernel: sdb: Write Protect is off
Aug 19 23:22:18 node1 kernel: SCSI device sdb: drive cache: write back
I'm guessing these would be the SCSI errors you were referring to?
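All three bursts hit the same port (ata2, which the follow-up lines suggest is sdb) with the same command. The `b0/d0` command bytes look like an ATA SMART READ DATA request, so the timeouts may coincide with periodic SMART polling; that is an inference from the log format rather than something confirmed in this thread. A small sketch for tallying these exceptions per port is below. The function name `ata_exceptions` is made up, and it only matches the `exception Emask` lines in the format quoted above:

```shell
#!/bin/sh
# Hypothetical helper: count libata "exception Emask" lines per port.
# Reads syslog-style lines on stdin, in the format quoted in this thread.
ata_exceptions() {
    awk '/ata[0-9]+\.[0-9]+: exception Emask/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+\.[0-9]+:$/) {
                     port = $i
                     sub(/:$/, "", port)   # drop the trailing colon
                     count[port]++
                 }
         }
         END { for (p in count) print p, count[p] }'
}

# Usage on a live system:
#   ata_exceptions < /var/log/messages
```

If smartmontools is installed, `smartctl -a /dev/sdb` would be the natural next thing to look at, assuming the drive answers at all.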

Failing the drive sounds like it will be the ultimate test here, although I'm a little reluctant to do so until I've gathered as much info as possible. Thanks for the pointer on the mdadm command to do this, I'll check it out in a bit more detail.

In your opinion, do you reckon I'm dealing with a duff sdb?

Thanks for the help and advice so far, I've learnt a lot about troubleshooting IO issues here
 
Old 08-20-2009, 05:34 AM   #7
Keds
LQ Newbie
 
Registered: Aug 2009
Location: UK
Distribution: Centos
Posts: 9

Original Poster
Rep: Reputation: 0
Chris / Dave,

I thought I'd separate my response into 2 distinct posts rather than one mega post...

cat /etc/fstab

Quote:
/dev/md0 / ext3 defaults 1 1
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sdb2 swap swap defaults 0 0
LABEL=SWAP-sda2 swap swap defaults 0 0
It looks like the swap is mirrored on sda2 / sdb2?

Although looking at swapon -s, it appears only one swap partition is being used:

Quote:
Filename Type Size Used Priority
/dev/sdb2 partition 2040244 120 -1
/dev/sda2 partition 2040244 0 -2
I'm a little beyond my current knowledge with this config - should the swap be mirrored as well in this situation?
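The fstab above defines two independent swap areas rather than a mirror, and the kernel fills the higher-priority one (-1, sdb2) before touching the other, which matches the swapon -s output. A minimal sketch for listing active swap areas that sit on raw partitions rather than on an md device is shown here. The function name `unmirrored_swap` is made up, and it assumes the `Filename Type Size Used Priority` layout shown above:

```shell
#!/bin/sh
# Hypothetical helper: list active swap partitions that are not md devices.
# Reads the "swapon -s" / /proc/swaps layout: Filename Type Size Used Priority
unmirrored_swap() {
    awk 'NR > 1 && $2 == "partition" && $1 !~ /^\/dev\/md/ { print $1 }'
}

# Usage on a live system:
#   unmirrored_swap < /proc/swaps
```

Anything it prints is swap that disappears with a single disk.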

Thanks for your help, guys,

Al
 
Old 08-20-2009, 05:35 AM   #8
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
I don't think I've seen those messages before either, but I'd say they're not too healthy.

Could be something as simple as a loose cable, but if this is a production machine I'd get that disk replaced. As far as I'm concerned, if a disk does anything even slightly odd it gets replaced. Hardware support contracts are a beautiful thing.

You /really/ need to raid1 those two sdX2 partitions and use that as swap instead of the two raw partitions. As it stands at the moment, your host will very probably crash if either of those disks fails (which is looking increasingly likely).
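A rough outline of what mirroring the swap might look like is sketched below. The script only prints the commands and runs nothing. The array name `/dev/md1` is an assumption (any unused md device would do), the function name `swap_mirror_plan` is made up, and with sdb already suspect it would make sense to replace the disk before trying any of this:

```shell
#!/bin/sh
# Sketch only: PRINTS the commands for moving swap onto a RAID1 array.
# Nothing is executed. Device names come from this thread; /dev/md1 is
# an assumed, currently unused array name.
swap_mirror_plan() {
    md="$1"; a="$2"; b="$3"
    cat <<EOF
swapoff $a $b
mdadm --create $md --level=1 --raid-devices=2 $a $b
mkswap $md
swapon $md
EOF
}

swap_mirror_plan /dev/md1 /dev/sda2 /dev/sdb2
```

The two LABEL=SWAP-... lines in /etc/fstab would then need replacing with a single entry for the new array, and the array would need to be assembled at boot (for example by setting the partition type to fd, as on sda1/sdb1, or by listing it in mdadm.conf).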

Dave

Last edited by ilikejam; 08-20-2009 at 05:40 AM.
 
Old 08-20-2009, 05:48 AM   #9
Keds
LQ Newbie
 
Registered: Aug 2009
Location: UK
Distribution: Centos
Posts: 9

Original Poster
Rep: Reputation: 0

Apologies for the extra posts - I'm not bumping, it's just my brain slowly computing what's going on.

I can see what you guys mean now with regard to the swap partitions. Let's see if I've got this right:

sda1 & sdb1 are mirrored to form md0
sda2 & sdb2 are not mirrored, and it would appear that sdb2 is the currently active swap partition.

From what you are saying, this is not a fault-tolerant config: if sdb goes down, the swap goes with it and crash goes the box.

Should I consider creating an md1 comprised of sda2 and sdb2 and use it for the swap partition?

I realise this is a separate issue from the health of the sdb drive.

Sorry if this feels a bit like pulling teeth, I didn't set this server up and it's a bit of a learning process (which I'm enjoying btw).

Al
 
Old 08-20-2009, 05:52 AM   #10
Keds
LQ Newbie
 
Registered: Aug 2009
Location: UK
Distribution: Centos
Posts: 9

Original Poster
Rep: Reputation: 0
Ah, looks like I was drafting my post as you replied.

That's great, you've definitely put me on the right track. I'll get that disk replaced and those swap partitions mirrored.

Thanks again!
 
Old 08-21-2009, 01:03 AM   #11
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,348

Rep: Reputation: 2749
No worries. As you've gathered, you have to actually raid (mirror in this case) the 2 swaps explicitly. Because your system is using very little swap

Swap: 4080488k total, 120k used, 4080368k free, 2615988k cached

it's only using one disk, but that is a partition, not a disk, so it's draining i/o bandwidth from the data/program i/o on the same physical disk.
In a big system, e.g. a major DB, you'd have data and swap on different disks (not partitions). In fact, they'd be on separate i/o buses as well.
 
Old 05-28-2010, 05:31 PM   #12
hohum
LQ Newbie
 
Registered: Jul 2007
Posts: 11

Rep: Reputation: 0
swap priority

A little off topic here, but if you give your swap partitions equal priorities then swap will stripe across the two disks.

mine is:

/dev/sda3 swap swap sw,pri=3 0 0
/dev/sdb3 swap swap sw,pri=3 0 0

You'd need to adjust yours manually. This should keep the swap load on each disk roughly the same.

What I don't know is what happens if a drive fails and the kernel loses half its swap.
 
Old 05-29-2010, 04:02 PM   #13
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
Quote:
What I don't know is what happens if a drive fails and the kernel loses half its swap.
Bad Things. Most likely some time after the disk failed, leading to much confusion and gnashing of teeth.

Dave
 
  

