LinuxQuestions.org
04-08-2009, 10:55 PM   #1
VenKamikaze
LQ Newbie
 
Registered: Jun 2008
Posts: 4

Rep: Reputation: 0
LVM over 2xRAID-1 (mdadm/software) partitions: FS corruption


Hi all,

Our firm was looking for an inexpensive way of setting up a Linux-based server (CentOS 5.2) for a small team of software developers in a new office we've begun leasing. We settled on a cheap Dell PowerEdge 840 server using the normal onboard disk controllers, with 4 hard disks set up as two RAID-1 arrays using mdadm (2x500GB, 2x250GB). On top of these software RAID arrays we put LVM, with a single volume group spread across most of the disk space (the exception being a 150MB ext3 /boot partition, which is not in LVM).

After having it set up for only a month or so, one of the 250GB disks failed.

Somewhat unexpectedly, this caused some problems. Staff in that office suddenly found they could no longer write to the root LV (I didn't check whether they could write to any other LVs). On checking the dmesg output and /var/log/messages, I found some SMART output warning that one of the disks was about to die, and a number of errors that look like this:

Code:
Mar 31 22:09:08 devsys kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 31 22:09:08 devsys kernel: ata1.00: BMDMA stat 0x64
Mar 31 22:09:08 devsys kernel: ata1.00: cmd c8/00:f8:83:40:d6/00:00:00:00:00/e0 tag 0 dma 126976 in
Mar 31 22:09:08 devsys kernel:          res 51/40:00:4b:41:d6/00:00:00:00:00/00 Emask 0x9 (media error)
Mar 31 22:09:08 devsys kernel: ata1.00: status: { DRDY ERR }
Mar 31 22:09:08 devsys kernel: ata1.00: error: { UNC }
Mar 31 22:09:08 devsys kernel: ata1.00: configured for UDMA/133
Mar 31 22:09:08 devsys kernel: ata1.01: configured for UDMA/133
Mar 31 22:09:08 devsys kernel: ata1: EH complete
Mar 31 22:09:09 devsys kernel: SCSI device sda: 488281250 512-byte hdwr sectors (250000 MB)
Mar 31 22:09:09 devsys kernel: sda: Write Protect is off
Mar 31 22:09:09 devsys kernel: SCSI device sda: drive cache: write back
Mar 31 22:09:18 devsys kernel: SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB)
Mar 31 22:09:20 devsys kernel: sdb: Write Protect is off
Mar 31 22:09:20 devsys kernel: SCSI device sdb: drive cache: write back
After some of these errors, the dmesg output also mentioned something along the lines of a problem with the journal, and the filesystem being remounted read-only! (I didn't keep a copy of the dmesg output, thinking most of it would also be recorded in /var/log/messages, but I can't find this message in my saved copy of /var/log/messages now.)
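
The ata error lines in the log excerpt above can be pulled out of a large /var/log/messages with a small filter. This is only a sketch: the function name unc_errors_by_device is made up for illustration, and it assumes the kernel logs uncorrectable read errors in exactly the "ataN.NN: error: { UNC }" form shown in the excerpt.

```shell
#!/bin/sh
# Read syslog-style text on stdin and print each ATA device that logged
# an uncorrectable read error (UNC), followed by the number of such lines.
# Assumes lines of the form "... kernel: ata1.00: error: { UNC }".
unc_errors_by_device() {
    grep -F 'error: { UNC' |
        sed -n 's/.*kernel: \(ata[0-9]*\.[0-9]*\): error:.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It would be run as: unc_errors_by_device < /var/log/messages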
After this happened, I decided to restart the server, run an fsck, and make sure the bad drive was marked as failed in the array.

On restart I found we had some filesystem corruption! Thankfully fsck fixed the problems and we didn't seem to lose any important data.
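
For reference, marking a bad member as failed and taking it out of an md array is done with mdadm's manage mode. The sketch below only prints the commands unless DRY_RUN=0 is set. The helper names (run, drop_member, add_member) and the device names in the comments are hypothetical, since the thread does not say which partitions belong to which array.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default here); run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Mark a member as faulty, then remove it from the array.
drop_member() {
    array=$1     # e.g. /dev/md1  (hypothetical)
    member=$2    # e.g. /dev/sda2 (hypothetical)
    run mdadm "$array" --fail "$member" &&
        run mdadm "$array" --remove "$member"
}

# Add a replacement partition; md then resyncs the mirror onto it.
add_member() {
    run mdadm "$1" --add "$2"
}
```

Reviewing the dry-run output first (for example, drop_member /dev/md1 /dev/sda2) is a cheap way to make sure the good disk is not the one being failed.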

Now I have a few questions:
1. Why didn't the RAID-1 setup under the LVM prevent the FS corruption? I would have thought that even if one disk is going bad, the correct data would still be written to the other, good disk in the array, preventing these issues.
2. Why did the filesystem get remounted read-only while this failure was occurring? My understanding is that in a software RAID-1 setup, mdadm could have attempted to read from either of the two disks and perhaps read some bad data from the failing disk, which I guess may have caused the ext3 driver to freak out and remount the FS read-only. Would that have been the cause?
3. In the pasted errors from /var/log/messages above, you can see the problem is with the disk at ata1.00 (which was /dev/sda). For some reason, though, 'ata1.01: configured for UDMA/133' and 'SCSI device sdb...' are also mentioned. Why is this? Also, on the first reboot the BIOS failed to detect both the failing 250GB drive and the 500GB drive, which was fine!?! It reminded me of the old IDE master/slave setups, where one faulty drive on the chain could interfere with the other drive on the same chain, but I was sure SATA didn't operate like that. (Is it possible that having the SATA ports set to IDE compatibility mode in the BIOS could cause this? I didn't check through all the BIOS settings.) Physically removing the failed 250GB disk from the machine made the 500GB drive detect fine again. BTW we are using the onboard SATA controllers, which lspci identifies as:
Code:
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) SATA IDE Controller (rev 01)
uname -a
Code:
Linux devsys 2.6.18-92.1.18.el5 #1 SMP Wed Nov 12 09:19:49 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
lsmod | grep ata
Code:
ata_piix               54981  6 
libata                192345  1 ata_piix
scsi_mod              188665  4 usb_storage,sg,libata,sd_mod
If anyone could shed some light on this, please reply! This whole experience has left me feeling quite concerned about the integrity of our data on software RAID, and the amount of downtime required to bring things back online when a failure does occur. From now on I'll try and get the purchasers of the hardware to spend a little more on a decent hardware RAID setup as it could end up being the cheaper alternative in the event of a disk failure.

Regards,
Mick
 
04-10-2009, 12:24 AM   #2
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,269

Rep: Reputation: 2028
You can run

mdadm --detail /dev/md0
mdadm --detail /dev/md1

and

cat /proc/mdstat

to get better info at that level.
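
To spot a degraded array quickly, the status string at the end of each array's "blocks" line in /proc/mdstat can be checked: [UU] means both mirror halves are up, and an underscore such as [_U] marks a missing or failed member. A rough sketch follows; the function name degraded_arrays is made up for illustration.

```shell
#!/bin/sh
# Read /proc/mdstat-format text on stdin and print the name and status
# string of every array that has at least one missing member ("_").
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }             # remember the current array
        /blocks/ && match($0, /\[[U_]+\]/) {    # status such as [UU] or [_U]
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_")) print name, status
        }
    '
}
```

It would be run as: degraded_arrays < /proc/mdstat
No output means every array listed has all of its members.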

As for your questions:
1. It depends on what caused the problem, e.g. a power spike.
2. Setting itself to RO is to prevent the problem getting worse, but it may take a short time before the system knows it's got a problem.
 
04-10-2009, 12:53 AM   #3
reptiler
Member
 
Registered: Mar 2009
Location: Hong Kong
Distribution: Fedora
Posts: 184

Rep: Reputation: 41
I think not many people are aware of this, but LVM also offers mirroring.
So I would suggest avoiding the double overhead (software RAID + LVM) and using the mirroring that LVM offers instead. That way you have only the LVM overhead, better performance, and probably fewer problems.
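
As a rough illustration of the LVM route, a two-way mirrored logical volume is created with lvcreate -m 1. The sketch below only prints the commands unless DRY_RUN=0 is set. The helper names and the vg0 / lv_data / 10G values are hypothetical, and it assumes the volume group already contains enough physical volumes for both mirror copies (and for the mirror log, where the LVM version in use needs one).

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default here); run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Create a logical volume with one extra copy (-m 1), then list the
# volume group with the devices backing each LV.
create_mirrored_lv() {
    vg=$1      # volume group name, e.g. vg0     (hypothetical)
    lv=$2      # logical volume name, e.g. lv_data (hypothetical)
    size=$3    # size, e.g. 10G                  (hypothetical)
    run lvcreate -m 1 -L "$size" -n "$lv" "$vg" &&
        run lvs -a -o +devices "$vg"
}
```

For example: create_mirrored_lv vg0 lv_data 10G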
 
04-16-2009, 06:59 AM   #4
VenKamikaze
LQ Newbie
 
Registered: Jun 2008
Posts: 4

Original Poster
Rep: Reputation: 0
Thanks for the responses. I'm curious about the mirroring offered through LVM. I chose software RAID-1 with LVM on top of it because I thought software RAID would be more reliable and better tested, as it's been around for so long. Has anyone had any experience with LVM mirroring? Is it reliable in the event of a disk failure?
 
  

