LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Server (http://www.linuxquestions.org/questions/linux-server-73/)
-   -   Software Raid 5 (md) recovery using mdadm (http://www.linuxquestions.org/questions/linux-server-73/software-raid-5-md-recovery-using-mdadm-551732/)

fakeroot 05-06-2007 12:07 PM

Software Raid 5 (md) recovery using mdadm
 
Hello,

After receiving the following errors from two of the four disks in my md0 array (my mdadm.conf first):
Code:

root:~# cat /etc/mdadm/mdadm.conf
DEVICE /dev/hde1 /dev/hdf1 /dev/hdg1 /dev/hdh1
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=5e01109a:5b458d4d:36b7faae:5aa8706c
  devices=/dev/hde1,/dev/hdf1,/dev/hdg1,/dev/hdh1

I've been unable to restart the array:
Code:

syslog:
hdh: dma_timer_expiry: dma status == 0x61
hdh: DMA timeout error
hdh: dma timeout error: status=0x7f { DriveReady DeviceFault SeekComplete DataRequest CorrectedError Index Error }
hdh: dma timeout error: error=0x7f { DriveStatusError UncorrectableError SectorIdNotFound TrackZeroNotFound AddrMarkNotFound }, LBAsect=150081181286271, high=8945535, low=8355711, sector=82055743
hdg: DMA disabled
hdh: DMA disabled
ide3: reset: success
hdh: dma_timer_expiry: dma status == 0x41
hdh: DMA timeout error
hdh: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }

hdh: dma_timer_expiry: dma status == 0x41
hdh: DMA timeout error
hdh: dma timeout error: status=0xd0 { Busy }

hdh: DMA disabled
ide3: reset: master: error (0x2c?)
hdh: status error: status=0x2c { DeviceFault DataRequest CorrectedError }

ide3: reset: master: error (0x2c?)
end_request: I/O error, dev hdh, sector 82057535
end_request: I/O error, dev hdh, sector 82057543
..................................................
end_request: I/O error, dev hdh, sector 82055855
end_request: I/O error, dev hdh, sector 82055863
hdg: status error: status=0x2c { DeviceFault DataRequest CorrectedError }

hdh: DMA disabled
ide3: reset: master: error (0x2c?)
hdg: status error: status=0x2c { DeviceFault DataRequest CorrectedError }

ide3: reset: master: error (0x2c?)
end_request: I/O error, dev hdg, sector 312576575
md: write_disk_sb failed for device hdg1
end_request: I/O error, dev hdg, sector 312576575
md: write_disk_sb failed for device hdg1
..................................................
end_request: I/O error, dev hdg, sector 312576575
md: write_disk_sb failed for device hdg1
end_request: I/O error, dev hdg, sector 312576575
md: write_disk_sb failed for device hdg1
RAID5 conf printout:
 --- rd:4 wd:3 fd:1
 disk 0, o:0, dev:hdh1
 disk 1, o:1, dev:hdg1
 disk 2, o:1, dev:hde1
 disk 3, o:1, dev:hdf1
RAID5 conf printout:
 --- rd:4 wd:3 fd:1
 disk 1, o:1, dev:hdg1
 disk 2, o:1, dev:hde1
 disk 3, o:1, dev:hdf1
end_request: I/O error, dev hdg, sector 78144063
XFS: device md0- XFS write error in file system meta-data block 0xa7ade18 in md0
xfs_force_shutdown(md0,0x2) called from line 959 of file fs/xfs/xfs_log.c.  Return address = 0xe0a19b15
lost page write due to I/O error on md0
message repeated 9 times
RAID5 conf printout:
 --- rd:4 wd:2 fd:2
 disk 1, o:0, dev:hdg1
 disk 2, o:1, dev:hde1
 disk 3, o:1, dev:hdf1
RAID5 conf printout:
 --- rd:4 wd:2 fd:2
 disk 2, o:1, dev:hde1
 disk 3, o:1, dev:hdf1
xfs_force_shutdown(md0,0x2) called from line 959 of file fs/xfs/xfs_log.c.  Return address = 0xe0a19b15
xfs_force_shutdown(md0,0x1) called from line 353 of file fs/xfs/xfs_rw.c.  Return address = 0xe0a19b15

Obviously there was an issue with the new rounded (airflow) IDE ribbon cables.
I've already tried to re-add the dirty disks, without success:
Code:

root:~# mdadm -a /dev/md0 /dev/hde1
mdadm: cannot get array info for /dev/md0

syslog:
md: md0 stopped.
md: unbind<hde1>
md: export_rdev(hde1)
md: unbind<hdf1>
md: export_rdev(hdf1)
md: unbind<hdg1>
md: export_rdev(hdg1)
md: unbind<hdh1>
md: export_rdev(hdh1)
md: bind<hdh1>
md: bind<hdg1>
md: bind<hdf1>
md: bind<hde1>


root:~# cat /proc/mdstat
md0 : inactive hde1[2] hdf1[3] hdg1[1] hdh1[0]
      625153024 blocks
unused devices: <none>

mdadm --examine reports the following:

Code:

/dev/hde1:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 5e01109a:5b458d4d:36b7faae:5aa8706c
  Creation Time : Tue Sep 28 06:29:56 2004
    Raid Level : raid5
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Fri May  4 19:05:55 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 3
  Spare Devices : 0
      Checksum : 84c312ae - correct
        Events : 0.315786

        Layout : left-symmetric
    Chunk Size : 128K

      Number  Major  Minor  RaidDevice State
this    2      33        1        2      active sync  /dev/hde1

  0    0      0        0        0      removed
  1    1      0        0        1      faulty removed
  2    2      33        1        2      active sync  /dev/hde1
  3    3      33      65        3      active sync  /dev/hdf1

Code:

/dev/hdf1:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 5e01109a:5b458d4d:36b7faae:5aa8706c
  Creation Time : Tue Sep 28 06:29:56 2004
    Raid Level : raid5
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Fri May  4 19:05:55 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 3
  Spare Devices : 0
      Checksum : 84c312f0 - correct
        Events : 0.315786

        Layout : left-symmetric
    Chunk Size : 128K

      Number  Major  Minor  RaidDevice State
this    3      33      65        3      active sync  /dev/hdf1

  0    0      0        0        0      removed
  1    1      0        0        1      faulty removed
  2    2      33        1        2      active sync  /dev/hde1
  3    3      33      65        3      active sync  /dev/hdf1

Code:

/dev/hdg1:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 5e01109a:5b458d4d:36b7faae:5aa8706c
  Creation Time : Tue Sep 28 06:29:56 2004
    Raid Level : raid5
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Fri May  4 19:04:23 2007
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : 84be4048 - correct
        Events : 0.315684

        Layout : left-symmetric
    Chunk Size : 128K

      Number  Major  Minor  RaidDevice State
this    1      34        1        1      active sync  /dev/hdg1

  0    0      34      65        0      active sync  /dev/hdh1
  1    1      34        1        1      active sync  /dev/hdg1
  2    2      33        1        2      active sync  /dev/hde1
  3    3      33      65        3      active sync  /dev/hdf1

Code:

/dev/hdh1:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 5e01109a:5b458d4d:36b7faae:5aa8706c
  Creation Time : Tue Sep 28 06:29:56 2004
    Raid Level : raid5
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Fri May  4 19:04:23 2007
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : 84be4086 - correct
        Events : 0.315684

        Layout : left-symmetric
    Chunk Size : 128K

      Number  Major  Minor  RaidDevice State
this    0      34      65        0      active sync  /dev/hdh1

  0    0      34      65        0      active sync  /dev/hdh1
  1    1      34        1        1      active sync  /dev/hdg1
  2    2      33        1        2      active sync  /dev/hde1
  3    3      33      65        3      active sync  /dev/hdf1

I know that two failed disks render the array unusable, but if there is any way to get this array back online, I would highly appreciate the help.
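For anyone landing here with the same symptoms: the usual first resort (not something the poster had tried at this point, so treat it as a hedged sketch against your own device names) is a forced assembly. With 0.90 superblocks, --force lets mdadm bump a slightly stale event counter instead of kicking that member out:

```shell
# Sketch only: device names match this thread's array; adapt to yours.
# --force rewrites superblock event counts, so if possible take dd
# images of the member partitions before running it.
mdadm --stop /dev/md0

# Force assembly; mdadm picks the members with the freshest superblocks
# and nudges the near-current stragglers back into the array.
mdadm --assemble --force /dev/md0 /dev/hde1 /dev/hdf1 /dev/hdg1 /dev/hdh1

# Inspect the result before mounting anything read-write.
cat /proc/mdstat
mdadm --detail /dev/md0
```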

ajg 05-06-2007 06:15 PM

Hmmmm ... messy. I lost a hardware RAID1 on a Promise controller (yes I know, I should have known better ... hindsight ... blah!). Burned hand teaches best so they say - I did quite a bit of testing on soft RAID1 and RAID5. Could not get soft RAID5 to be acceptable after pulling power cables out to check the results. RAID5 doesn't like it very much and has a tendency to refuse to mount the volume because the filesystem isn't clean and the RAID is still critical. I go for soft RAID1 or hardware RAID5.

If you've lost two drives, then the RAID set is dead. There are some specialist tools which can attempt to recover some data from a multiple disk RAID5 failure, but given the way the data is written, I wouldn't be too hopeful about what that would get back.

A multiple drive failure at one time is quite rare. Not unheard of though. Assuming that hdg and hdh are master and slave on a single IDE bus, it is possible that the failure of one of the drives is causing some weird bus errors and making it look like the other drive has problems too - I've seen that before. Could try removing each drive from the bus and trying to boot the system to see if either of the drives miraculously recovers, then replace the failed drive and rebuild the RAID.

fakeroot 05-15-2007 12:00 PM

Thanks for your recommendations. As you said: burned hand teaches best. Your assumptions regarding the drives were correct. There are four drives connected: two as masters and two as slaves, on two IDE buses. Unplugging each device one by one and trying to start the array didn't succeed.
Code:

~# /etc/init.d/mdadm-raid start
Starting raid devices: mdadm: /dev/md0 assembled from 1 drive - not enough to start the array.
done.

Even re-adding the drives into the array wasn't possible. Same error as mentioned in the first post.
Any other suggestions?

JimBass 05-16-2007 02:09 PM

Also, though multiple drive failures are uncommon, if you purchased all the drives at the same time, then when one fails the others are likely to follow soon after. I've seen RAID 5 on hardware controllers die twice during rebuilds. Generally, when one drive fails, go through the cycle and swap every drive in the array, or you may be sorry.

Or get enough disks that RAID 6 makes sense. That is all we use at work now, it is RAID 5 with an additional hot spare. Most hardware controllers will allow the hot spare to be any of the physical drives in the array, so when one goes bad the hot spare takes its place, then you pull the bad drive out, put a blank drive in, and set it as the new hot spare. Much safer. I've yet to see a RAID 6 failure.

Software RAID 5 sounds like a very bad idea to me. I am aware that it is possible, but any data important enough to be on a RAID 5 array is also important enough that the additional $300 or so is spent on a hardware controller.

Peace,
JimBass

fsbooks 05-19-2007 06:20 PM

I had (am having) a similar problem with a Silicon Image 3124 PCI-X Serial ATA controller on a Norco DS-500 storage array. The sata_sil24 driver with a port multiplier is still a bit experimental, and the only way I've been able to get it to work is with a patch on a 2.17.4 kernel. The controller timed out and two of the five drives in a RAID5 were lost. The drives were still good, but re-adding the "failed" drives would not work. I was, however, able to recover the array by re-creating it ( mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[d-h]1 ). If the IDE ribbon cables were the problem and the drives did not really "fail" (i.e., they were only marked failed in the array), this might work for you.
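A word of caution on this re-create trick: it only restores data if the level, chunk size, layout, metadata version and device order all match the original array exactly. A hedged sketch of a more defensive version (device names, chunk and layout values here are illustrative; read yours off the old superblocks first):

```shell
# Record the original geometry from the surviving superblocks
# before destroying them with --create.
mdadm --examine /dev/sd[d-h]1 | egrep 'Level|Chunk|Layout|Version|^this'

# Re-create over the same members in the same slot order.
# --assume-clean stops mdadm from starting an initial resync,
# so the (possibly recoverable) parity is not rewritten.
mdadm --create /dev/md0 --level=5 --raid-devices=5 \
      --chunk=128 --layout=left-symmetric --assume-clean \
      /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1

# Mount read-only and verify files before trusting the array again.
mount -o ro /dev/md0 /mnt
```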

I got the following messages during the process:
Code:
mdadm: /dev/sdd1 appears to contain an ext2fs file system
size=1953535744K mtime=Thu May 17 22:24:08 2007
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Sat May 5 15:19:07 2007
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Sat May 5 15:19:07 2007
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Sat May 5 15:19:07 2007
mdadm: /dev/sdg1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Sat May 5 15:19:07 2007
mdadm: /dev/sdh1 appears to contain an ext2fs file system
size=2005702402K mtime=Wed Nov 28 02:32:38 2007
mdadm: /dev/sdh1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Sat May 5 15:19:07 2007
Continue creating array?y
mdadm: array /dev/md0 started.

I was then able to mount the array and access all files.

good luck, Chris

oli 06-20-2007 11:00 AM

Hi all (first post here)!

I had identical problems this evening with my Fedora 7 server, which has 4 drives hanging off the Nvidia SATA controller in a RAID 5 array. The console started spitting out "ATA: Abnormal Status" errors, and after rebooting, the RAID 5 array would not mount. After attempting the maintenance/reassemble options with mdadm with no success, I turned to Google and stumbled across this thread.

Thanks to fsbooks's reply I have successfully re-created the array and mounted it without any problems.

I am not sure what caused this problem. I am running Fedora 7 with kernel: 2.6.21-1.3228.fc7.

Would be interesting to see how many others have come across this. I had the same issues as fakeroot with 2 drives showing the "removed, faulty removed" status when examined with mdadm.

garydale 06-12-2008 06:57 PM

Echoing the previous comment: I have a media server with a lot of large files on it - too much to back up effectively until Blu-ray comes down in price a lot. So I had this bright idea about using software RAID. I purchased 2 more 500GB SATA drives and created a RAID 5 array on them, with my original drive "missing". Then I copied my files onto it.

I verified that it survived rebooting. So far, so good. So I repartitioned my original drive and added it to the array. I left for work with it syncing nicely.

I came home to find that two drives had failed - probably a loose power cable (off a splitter from 1 IDE to 2 SATA) because reseating everything brought the drives back to life - but not the array.

Followed fsbooks's advice using the two drives I'd originally set up, in case the sync hadn't completed, then added the third drive. My files are there and the drives are once again syncing.

So far, so good....

fakeroot 06-13-2008 04:26 AM

Lucky garydale!

I've tried the advice from the previous posts on my drives (I'm the thread starter). I never got the files back. I think a loose power cable was the reason in my case too.

Thanks in advance.

cj_cheema 06-15-2008 12:53 PM

Hi

If the /dev/md0 RAID partition has already been created, is it possible to create /dev/md1 with other disks or partitions? If so,
why am I getting the error below?
[root@cjpunjabiradio ~]# mdadm -C /dev/md1 --level=5 --raid-devices=2 /dev/hda{14,15}
mdadm: error opening /dev/md1: No such file or directory
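That particular error usually means the /dev/md1 device node itself does not exist yet, not that the array is misconfigured. A hedged sketch of two common fixes (the third partition, hda16, is purely illustrative - RAID5 needs at least three members anyway):

```shell
# Option 1: let mdadm create the missing node itself.
mdadm -C /dev/md1 --auto=yes --level=5 --raid-devices=3 /dev/hda{14,15,16}

# Option 2: create the block device node by hand first.
# Major 9 is the md driver; minor 1 corresponds to md1.
mknod /dev/md1 b 9 1
```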


Also, have a look at my md0 configuration:

[root@cjpunjabiradio ~]# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sun Jun 15 17:50:54 2008
Raid Level : raid5
Array Size : 104320 (101.89 MiB 106.82 MB)
Device Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Jun 15 18:14:34 2008
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

UUID : b28a774d:6f5ec466:2611d4aa:2bdd8c95
Events : 0.10

Number Major Minor RaidDevice State
0 3 12 0 active sync /dev/hda12
1 3 13 1 active sync /dev/hda13

I'm new to Linux, so I have only a little knowledge of RAID. Please assist.



Thanks
Charanjit Cheema

mechgt 07-09-2008 07:37 AM

1) RAID5 requires a minimum of 3 drives (see RAID 5 on Wikipedia). With 2 drives you can set up RAID0 (which makes 2 drives look like 1 big drive, with no redundancy) or RAID1 (a mirrored pair) if you wish.

2) Here are the links I used to setup my mdadm RAID array:

http://tldp.org/HOWTO/Software-RAID-HOWTO.html
http://ubuntuforums.org/showthread.php?t=408461

I bookmarked these suckers and refer back anytime I have RAID questions.

mechgt 07-09-2008 07:44 AM

I've been having a problem with my RAID arrays (WD 250GB SATA drives) failing frequently (about once per month?), mostly under heavy load (they max out at about 5MB/sec). Two disks will fail simultaneously, but every time, I'm able to recover by re-creating the array. Any idea what this could be? Has anyone else had this happen?

On another note, I've also been able to successfully move the array from one computer to another (this is a data RAID5 array; the OS is installed on a separate single drive) by using the create command. mdadm will recognize the drives as already being part of an array and will recover them with my data intact.
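Worth noting: moving a software array between machines normally doesn't require the create command at all. The superblocks identify the members by UUID, so an assemble is enough - a sketch (the UUID shown is the one from this thread, purely as an example):

```shell
# Show what mdadm can see on the attached disks; this prints
# ARRAY lines including each array's UUID.
mdadm --examine --scan

# Assemble everything found, or name the array explicitly by UUID
# so the (possibly renamed) sdX device letters don't matter.
mdadm --assemble --scan
mdadm --assemble /dev/md0 --uuid=5e01109a:5b458d4d:36b7faae:5aa8706c
```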

Siftah 05-06-2009 01:02 PM

Quote:

Originally Posted by ajg (Post 2738765)
Hmmmm ... messy. I lost a hardware RAID1 on a Promise controller (yes I know, I should have known better ... hindsight ... blah!). Burned hand teaches best so they say - I did quite a bit of testing on soft RAID1 and RAID5. Could not get soft RAID5 to be acceptable after pulling power cables out to check the results. RAID5 doesn't like it very much and has a tendency to refuse to mount the volume because the filesystem isn't clean and the RAID is still critical. I go for soft RAID1 or hardware RAID5.

This is incorrect and misleading. Software RAID5 and the filesystem you choose to mount on it are two entirely separate things; if the *filesystem* won't mount after the RAID is rebuilt, then that's a filesystem issue, not a RAID one.

Generally, if the RAID has crashed then the filesystem will have a problem mounting, fsck the filesystem or switch to a journalled filesystem like ext3 to minimise that risk.
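A hedged sketch of that check-before-mount step (device name illustrative; the original poster's array held XFS, which has its own tools rather than plain fsck):

```shell
# For ext2/ext3: dry run first, then repair once you've seen the damage.
fsck -n /dev/md0   # -n: report problems, write nothing
fsck -y /dev/md0   # repair, answering yes to prompts

# For XFS, as on the array in this thread:
xfs_repair -n /dev/md0   # dry run
xfs_repair /dev/md0
```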

In my experience it's hardware RAID systems that are harder to recover, as you're limited to the tools available in the controller vendor's BIOS. With software RAID you're not so limited, and in many cases you can recover from situations where you'd be stuck if you were running hardware RAID.

(I've successfully recovered from a 2 disk failure in a software RAID5 array of 7 drives without losing much data, so it's certainly possible)

It's also *much easier* to pull the disks out of a machine and drop them into an entirely different system running a different Linux distribution and even a different architecture, e.g. PPC to x86 or SPARC. Doing that with a hardware RAID card can cause driver issues and all sorts of problems.

Software RAID is incredibly flexible.

Either spend lots of money on a hardware RAID5 controller with battery backed cache and a well trusted chipset or stick with Software RAID5. Anything else is a false economy.

Quote:

Originally Posted by ajg (Post 2738765)
If you've lost two drives, then the RAID set is dead. There are some specialist tools which can attempt to recover some data from a multiple disk RAID5 failure, but given the way the data is written, I wouldn't be too hopeful about what that would get back.

Not true at all, mdadm and fsck are usually all you'll need to recover from any sort of Software RAID5 issue in Linux. You will lose data, but depending on the amount of activity on the filesystem it can be surprisingly little.

Quote:

Originally Posted by ajg (Post 2738765)
A multiple drive failure at one time is quite rare. Not unheard of though. Assuming that hdg and hdh are master and slave on a single IDE bus, it is possible that the failure of one of the drives is causing some weird bus errors and making it look like the other drive has problems too - I've seen that before. Could try removing each drive from the bus and trying to boot the system to see if either of the drives miraculously recovers, then replace the failed drive and rebuild the RAID.

This bit I do agree with ;)

Siftah 05-06-2009 01:09 PM

Quote:

Originally Posted by JimBass (Post 2751475)
Or get enough disks that RAID 6 makes sense. That is all we use at work now, it is RAID 5 with an additional hot spare. Most hardware controllers will allow the hot spare to be any of the physical drives in the array, so when one goes bad the hot spare takes its place, then you pull the bad drive out, put a blank drive in, and set it as the new hot spare. Much safer. I've yet to see a RAID 6 failure.

Jim, I think you're confused.

What you're describing is just RAID5 with a hot-spare. You can achieve this with Software RAID5 under linux by defining one or more hot-spares. If a drive fails in the RAID5 set then the hot spare is automatically brought into the array and the array is rebuilt onto the hot-spare.

RAID6 is RAID5 with two parity blocks, rather than 1.
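The distinction can be seen directly in how the arrays are created - a sketch with illustrative device names (five disks either way):

```shell
# RAID5 with a hot spare: 4 active members plus 1 idle disk that md
# automatically rebuilds onto when a member fails. During the rebuild
# the array has NO redundancy.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      --spare-devices=1 /dev/sd[b-f]1

# RAID6: two parity blocks per stripe, so the array survives any two
# simultaneous drive failures with all five disks active.
mdadm --create /dev/md0 --level=6 --raid-devices=5 /dev/sd[b-f]1
```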

Quote:

Originally Posted by JimBass (Post 2751475)
Software RAID 5 sounds like a very bad idea to me. I am aware that it is possible, but any data important enough to be on a RAID 5 array is also important enough that the additional $300 or so is spent on a hardware controller.

Peace,
JimBass

That just sounds like you're confusing having a good backup strategy with volume/disk management strategy! ;)

RAID5 or any level of RAID is not a replacement for a good back-up strategy.

Hardware controllers don't give you any additional resilience or safety for the given raid level unless they utilise battery backed cache or similar.

There is nothing intrinsically wrong with Software RAID5 and in many cases it can be more flexible and resilient than a hardware controller. In almost all instances it's a better and safer bet than a hardware RAID5 controller without battery backup.

Cheers,
John

garydale 08-24-2009 10:38 PM

I trust software RAID more than hardware RAID
 
Echoing the previous comments, hardware RAID controllers don't offer anything that software RAID doesn't have. They just move it to a controller instead of letting the OS handle it. In practical terms, this means you've got an extra piece of hardware that can mess up, along with some special drivers that aren't exactly mass market items.

Software RAID on the other hand needs no extra hardware. And the software RAID drivers don't have to handle as much as the hardware RAID controllers.

Some hardware RAID controllers do have a battery backup to allow them to save unwritten data but this is not a substitute for a decent UPS and proper shutdown during a power outage.

jml48197 07-17-2010 01:56 AM

Similar problem... please help!
 
Hi there:

I have a similar problem: a RAID 5 array with 11 drives, where one drive (sdh) encountered problems. I tried to rebuild, but sdb failed midway and my array was degraded.

I followed some advice to recover the data using mdadm -C /dev/md0 /dev/sd[efghiabcdkj]1, both from the command line and from Webmin, but the original drive order (sde[0], sdf[2], sdg[3] ... sdk[9], sdj[10]) was lost and the array came back reordered as sda[0] ... sdk[10]. When I tried mounting it, I received a "VFS: ext3 file system not found" error...

I've tried for a week now to recover the data (which consists of personal data I've saved over the last 20 years and work data I've spent the last 2 years on), but to no avail. Any help is greatly appreciated. Thanks in advance.
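For this specific trap - re-creating with the drives in the wrong order - each member's superblock records which slot it occupied (the "this N ..." line in mdadm --examine, visible in the outputs earlier in this thread). A sketch of recovering the original order instead of guessing (device glob illustrative):

```shell
# Print "slot device" for every member, sorted by slot number.
# The sorted device list is the order to pass to mdadm --create.
for d in /dev/sd[a-k]1; do
    mdadm --examine "$d" | awk -v dev="$d" '/^this/ {print $2, dev}'
done | sort -n
```

Note this only helps while the old superblocks are intact; a --create in the wrong order followed by a resync overwrites them.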

