Troubleshooting high load - Possible IO / RAID issue?
I haven't used iostat before and I'm not entirely sure how to interpret these results. However, if I'm not mistaken, sdb looks like it is running at much higher utilisation and responding much more slowly than sda.
Also, the disks don't seem to be carrying an equal load:
I'd check /var/log/messages for any scsi or md errors if you haven't already. Check 'dmesg' too.
Might be worth (if you're willing) failing the sdb1 partition to stop md0 using it and see if the load averages drop. Risk involved, obviously, but you'd know if the disk was causing the load issue. If memory serves (don't bank on it) the command would be:
# mdadm --fail /dev/md0 /dev/sdb1
Incidentally, what does 'swapon -s' show?
From what I can see from your posts you've only got sdX1 in raid, but you've got swap switched on, so one disk failure might still crash your host if the swap is on the raw partitions.
Dave
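For context, the full test cycle Dave is describing would look roughly like this (a sketch only; the device names /dev/md0 and /dev/sdb1 are taken from the thread, and all of this needs root):

```shell
# Mark the member as faulty so md0 stops using it
mdadm --fail /dev/md0 /dev/sdb1

# Watch load averages / iostat for a while, then check the array state
cat /proc/mdstat

# If the load drops and you want the member back in the array:
mdadm --remove /dev/md0 /dev/sdb1
mdadm --add /dev/md0 /dev/sdb1    # triggers a resync onto sdb1
```

Note that re-adding the partition kicks off a full resync, which will itself generate heavy I/O on both disks until it completes.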
Last edited by ilikejam; 08-18-2009 at 01:55 PM.
Reason: sdb -> sdb1
What have you got in /etc/fstab? As noted above, you've raided your boot/root/data partitions, but not your 2(!) swap partitions, so maybe you're only using one?
Adding 2040244k swap on /dev/sdb2. Priority:-1 extents:1 across:2040244k
Adding 2040244k swap on /dev/sda2. Priority:-2 extents:1 across:2040244k
In /var/log/messages there are a couple of ata / scsi messages that I don't recognise:
Aug 17 11:22:14 node1 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Aug 17 11:22:14 node1 kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
Aug 17 11:22:14 node1 kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 17 11:22:17 node1 kernel: ata2: soft resetting port
Aug 17 11:22:17 node1 kernel: ata2.00: configured for UDMA/133
Aug 17 11:22:17 node1 kernel: ata2: EH complete
Aug 17 11:22:17 node1 kernel: SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
Aug 17 11:22:17 node1 kernel: sdb: Write Protect is off
Aug 17 11:22:17 node1 kernel: SCSI device sdb: drive cache: write back
Aug 19 14:52:14 node1 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Aug 19 14:52:14 node1 kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
Aug 19 14:52:14 node1 kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 19 14:52:17 node1 kernel: ata2: soft resetting port
Aug 19 14:52:17 node1 kernel: ata2.00: configured for UDMA/133
Aug 19 14:52:17 node1 kernel: ata2: EH complete
Aug 19 14:52:17 node1 kernel: SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
Aug 19 14:52:17 node1 kernel: sdb: Write Protect is off
Aug 19 14:52:17 node1 kernel: SCSI device sdb: drive cache: write back
Aug 19 23:22:14 node1 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Aug 19 23:22:14 node1 kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
Aug 19 23:22:14 node1 kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 19 23:22:18 node1 kernel: ata2: soft resetting port
Aug 19 23:22:18 node1 kernel: ata2.00: configured for UDMA/133
Aug 19 23:22:18 node1 kernel: ata2: EH complete
Aug 19 23:22:18 node1 kernel: SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
Aug 19 23:22:18 node1 kernel: sdb: Write Protect is off
Aug 19 23:22:18 node1 kernel: SCSI device sdb: drive cache: write back
I'm guessing these would be the SCSI errors you were referring to?
Failing the drive sounds like it will be the ultimate test here, although I'm a little reluctant to do so until I've gathered as much info as possible. Thanks for the pointer on the mdadm command to do this, I'll check it out in a bit more detail.
In your opinion, do you reckon I'm dealing with a duff sdb?
Thanks for the help and advice so far, I've learnt a lot about troubleshooting IO issues here.
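One low-risk way to gather more information before failing anything (a suggestion of mine, not from the thread; smartctl comes from the smartmontools package and needs root) is to query the drive's SMART data directly:

```shell
# Overall health verdict (PASSED / FAILED) for the suspect disk
smartctl -H /dev/sdb

# Vendor attributes: look at Reallocated_Sector_Ct, Current_Pending_Sector,
# and UDMA_CRC_Error_Count (CRC errors often point at cabling, not the disk)
smartctl -A /dev/sdb

# Kick off a short offline self-test, then read the results a few minutes later
smartctl -t short /dev/sdb
smartctl -l selftest /dev/sdb
```

If the SMART commands themselves time out or hang, that is consistent with the ata2 timeouts in the logs and is itself a bad sign for the disk.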
I don't think I've seen those messages before either, but I'd say they're not too healthy.
Could be something as simple as a loose cable, but if this is a production machine I'd get that disk replaced. As far as I'm concerned, if a disk does anything even slightly odd it gets replaced. Hardware support contracts are a beautiful thing.
You /really/ need to raid1 those two sdX2 partitions and use that as swap instead of the two raw partitions. As it stands at the moment, your host will very probably crash if either of those disks fails (which is looking increasingly likely).
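For the record, converting the two raw swap partitions into mirrored swap would look roughly like this (a sketch; sda2/sdb2 are from the swapon output above, /dev/md1 is an assumed name, and this needs root plus enough free RAM to run briefly without swap):

```shell
# Take both raw swap partitions offline first
swapoff /dev/sda2
swapoff /dev/sdb2

# Mirror the two partitions and put swap on the resulting md device
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mkswap /dev/md1
swapon /dev/md1

# Update /etc/fstab so swap comes from /dev/md1 at boot, e.g.:
# /dev/md1  swap  swap  defaults  0 0
```

With swap mirrored, the loss of either disk no longer takes out pages the kernel has swapped, so a single-disk failure shouldn't crash the host.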
It's only using one disk at a time, but that's a partition, not a dedicated disk, so it's draining I/O bandwidth from the data/program I/O on the same physical disk.
In a big system, e.g. a major DB, you'd have data and swap on different disks (not just partitions). In fact, they'd be on separate I/O buses as well.