Hi all,
Our firm was looking for an inexpensive way to set up a Linux-based server (CentOS 5.2) for a small team of software developers in a new office we've begun leasing. We ended up settling on a cheap Dell PowerEdge 840 using the normal onboard disk controllers, with four hard disks arranged as two RAID-1 mirrors under mdadm (2x500GB and 2x250GB). On top of these software RAID arrays we put LVM, with a single volume group spread across most of the disk space (the exception being a 150MB ext3 /boot partition, which sits outside LVM).
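For reference, the layout is roughly equivalent to the following (a sketch only; device names, partition numbers, and LV sizes here are assumptions, not a record of the exact commands we ran):

```shell
# Two RAID-1 mirrors built with mdadm (assumed device names):
#   md0 = the 2x250GB pair, md1 = the 2x500GB pair
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdc2
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdd1

# One volume group spanning both mirrors; /boot (~150MB ext3) stays
# outside LVM on its own small partition (e.g. /dev/sda1)
pvcreate /dev/md0 /dev/md1
vgcreate vg0 /dev/md0 /dev/md1

# Example logical volume for the root filesystem
lvcreate -L 20G -n root vg0
mkfs.ext3 /dev/vg0/root
```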
After only having it set up for a month or so, one of the 250GB disks failed.
Somewhat unexpectedly, this caused some problems. Staff in that office suddenly found they could no longer write to the root LV (I didn't check whether they could still write to any of the other LVs). Checking the dmesg output and /var/log/messages, I found SMART output warning that one of the disks was about to die, along with a number of errors that look like this:
Code:
Mar 31 22:09:08 devsys kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 31 22:09:08 devsys kernel: ata1.00: BMDMA stat 0x64
Mar 31 22:09:08 devsys kernel: ata1.00: cmd c8/00:f8:83:40:d6/00:00:00:00:00/e0 tag 0 dma 126976 in
Mar 31 22:09:08 devsys kernel: res 51/40:00:4b:41:d6/00:00:00:00:00/00 Emask 0x9 (media error)
Mar 31 22:09:08 devsys kernel: ata1.00: status: { DRDY ERR }
Mar 31 22:09:08 devsys kernel: ata1.00: error: { UNC }
Mar 31 22:09:08 devsys kernel: ata1.00: configured for UDMA/133
Mar 31 22:09:08 devsys kernel: ata1.01: configured for UDMA/133
Mar 31 22:09:08 devsys kernel: ata1: EH complete
Mar 31 22:09:09 devsys kernel: SCSI device sda: 488281250 512-byte hdwr sectors (250000 MB)
Mar 31 22:09:09 devsys kernel: sda: Write Protect is off
Mar 31 22:09:09 devsys kernel: SCSI device sda: drive cache: write back
Mar 31 22:09:18 devsys kernel: SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB)
Mar 31 22:09:20 devsys kernel: sdb: Write Protect is off
Mar 31 22:09:20 devsys kernel: SCSI device sdb: drive cache: write back
The dmesg output also mentioned, after some of these errors, something along the lines of there being a problem with the journal, and that the filesystem was being remounted read-only! (I didn't keep a copy of the dmesg output, thinking most of it would be recorded in /var/log/messages, but I can't find that message in the saved copy of /var/log/messages now.)
After this happened, I decided to restart the server, run fsck, and make sure the bad drive was marked failed and removed from the array.
On restart I found we had some filesystem corruption! Thankfully fsck fixed the problems and we don't seem to have lost any important data.
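The recovery steps were roughly as follows (a sketch; the md device and partition names are assumptions, and the fsck was done with the filesystem unmounted, from single-user/rescue mode):

```shell
# Confirm which array is degraded and which member is faulty
cat /proc/mdstat
mdadm --detail /dev/md0

# Mark the bad member failed (if md hasn't already) and remove it
mdadm --manage /dev/md0 --fail /dev/sda2
mdadm --manage /dev/md0 --remove /dev/sda2

# With the LV unmounted, check and repair the filesystem
fsck.ext3 -f /dev/vg0/root

# After fitting a replacement disk and partitioning it to match,
# re-add the member and watch the mirror resync
mdadm --manage /dev/md0 --add /dev/sda2
watch cat /proc/mdstat
```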
Now I have a few questions:
1. Why didn't the RAID-1 layer underneath LVM prevent the filesystem corruption from occurring? I would have thought that even if one disk was going bad, md would still be writing the correct data to the other, good disk in the array, preventing these issues from happening.
2. Why did the filesystem get remounted read-only while this failure was occurring? My understanding is that in a software RAID-1 setup, md can attempt to read from either of the two disks, so perhaps it read some bad data from the failing disk, which I guess may have caused the ext3 driver to freak out and remount the FS read-only. Would that have been the cause?
3. In the errors pasted from /var/log/messages above, you can see the problem is with the disk at ata1.00 (which was /dev/sda). For some reason, though, you also see 'ata1.01: configured for UDMA/133' and 'SCSI device sdb...' mentioned. Why is this?

Also, on the first reboot the BIOS failed to detect both the failing 250GB drive and the 500GB drive, which was fine!?! It reminded me of the old IDE master/slave setups, where a faulty drive could interfere with the other drive on the same chain, but I was sure SATA didn't operate like that... (Is it possible that having the SATA ports set to IDE compatibility mode in the BIOS could make it act like this? I didn't check through all the BIOS settings.) Physically removing the failed 250GB disk from the machine made the 500GB drive detect fine again.

BTW, we are using the onboard SATA controllers, which lspci identifies as:
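For anyone wanting to double-check the port-to-device mapping and the drive's own view of the failure, something like this should show it (a sketch; device names are assumptions, and smartctl requires the smartmontools package):

```shell
# Kernel messages for the libata port in question
dmesg | grep -i 'ata1'

# The sysfs symlink shows which SCSI host/port /dev/sda hangs off
ls -l /sys/block/sda/device

# SMART attributes and the drive's internal error log
smartctl -a /dev/sda

# Optionally kick off a short drive self-test
smartctl -t short /dev/sda
```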
Code:
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) SATA IDE Controller (rev 01)
uname -a
Code:
Linux devsys 2.6.18-92.1.18.el5 #1 SMP Wed Nov 12 09:19:49 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
lsmod | grep ata
Code:
ata_piix 54981 6
libata 192345 1 ata_piix
scsi_mod 188665 4 usb_storage,sg,libata,sd_mod
If anyone could shed some light on this, please reply! This whole experience has left me quite concerned about the integrity of our data on software RAID, and about the amount of downtime needed to bring things back online when a failure does occur. From now on I'll try to get the purchasers of the hardware to spend a little more on a decent hardware RAID setup, as it could end up being the cheaper alternative in the event of a disk failure.
Regards,
Mick