LinuxQuestions.org


somebox 09-12-2007 08:29 AM

Recovering a Raid 5 array, mdadm mess-up
 
I've been searching through this site for RAID answers but found nothing specific to my problem. This is my first post, so here goes :)

I have a Debian Etch server, and my /home partition is a RAID 5 array of four 250GB SATA disks (750GB usable). I recently returned from vacation and found the machine locked up. After rebooting, /home did not mount. Here's what showed up in syslog:

Code:

Sep 12 05:18:42 workshop kernel: md: bind<sdb1>
Sep 12 05:18:42 workshop kernel: md: bind<sda1>
Sep 12 05:18:42 workshop kernel: md: bind<sdd1>
Sep 12 05:18:42 workshop kernel: md: bind<sdc1>
Sep 12 05:18:42 workshop kernel: md: kicking non-fresh sda1 from array!
Sep 12 05:18:42 workshop kernel: md: unbind<sda1>
Sep 12 05:18:42 workshop kernel: md: export_rdev(sda1)
Sep 12 05:18:42 workshop kernel: md: kicking non-fresh sdb1 from array!
Sep 12 05:18:42 workshop kernel: md: unbind<sdb1>
Sep 12 05:18:42 workshop kernel: md: export_rdev(sdb1)
Sep 12 05:18:42 workshop kernel: md: md0: raid array is not clean -- starting background reconstruction
Sep 12 05:18:42 workshop kernel: raid5: device sdc1 operational as raid disk 2
Sep 12 05:18:42 workshop kernel: raid5: device sdd1 operational as raid disk 3
Sep 12 05:18:42 workshop kernel: raid5: not enough operational devices for md0 (2/4 failed)
Sep 12 05:18:42 workshop kernel: RAID5 conf printout:
Sep 12 05:18:42 workshop kernel:  --- rd:4 wd:2 fd:2
Sep 12 05:18:42 workshop kernel:  disk 2, o:1, dev:sdc1
Sep 12 05:18:42 workshop kernel:  disk 3, o:1, dev:sdd1
Sep 12 05:18:42 workshop kernel: raid5: failed to run raid set md0
Sep 12 05:18:42 workshop kernel: md: pers->run() failed ...
Sep 12 05:18:42 workshop kernel: Attempting manual resume
Sep 12 05:18:42 workshop kernel: EXT3-fs: INFO: recovery required on readonly filesystem.
Sep 12 05:18:42 workshop kernel: EXT3-fs: write access will be enabled during recovery.

So it seemed that two of the four disks had failed. I was hoping it was something transient (overheated drives, an unclean shutdown, etc.), because losing two drives out of a four-drive RAID 5 set is not good.
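
One way to sanity-check whether the drives themselves were dying (rather than the array just being out of sync) would be to look at their SMART data with smartctl from the smartmontools package; a rough sketch, using my device names:

Code:

# overall SMART health verdict for one member disk
smartctl -H /dev/sda
# recent drive-level error log entries, if any
smartctl -l error /dev/sda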

I captured the output of mdadm --examine for all the disks:

Code:

/dev/sda1:
          Magic : a92b4efc
        Version : 00.90.03
          UUID : 43e20969:a2d1e5ba:94f7c737:27a0793c
  Creation Time : Sat Apr 22 22:55:01 2006
    Raid Level : raid5
    Device Size : 244195904 (232.88 GiB 250.06 GB)
    Array Size : 732587712 (698.65 GiB 750.17 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Mon Sep  3 13:00:35 2007
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : e679baca - correct
        Events : 0.2488136

        Layout : left-symmetric
    Chunk Size : 64K

      Number  Major  Minor  RaidDevice State
this    1      8        1        1      active sync  /dev/sda1

  0    0      8      17        0      active sync  /dev/sdb1
  1    1      8        1        1      active sync  /dev/sda1
  2    2      8      33        2      active sync  /dev/sdc1
  3    3      8      49        3      active sync  /dev/sdd1
/dev/sdb1:
          Magic : a92b4efc
        Version : 00.90.03
          UUID : 43e20969:a2d1e5ba:94f7c737:27a0793c
  Creation Time : Sat Apr 22 22:55:01 2006
    Raid Level : raid5
    Device Size : 244195904 (232.88 GiB 250.06 GB)
    Array Size : 732587712 (698.65 GiB 750.17 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Mon Sep  3 13:00:35 2007
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : e679bad8 - correct
        Events : 0.2488136

        Layout : left-symmetric
    Chunk Size : 64K

      Number  Major  Minor  RaidDevice State
this    0      8      17        0      active sync  /dev/sdb1

  0    0      8      17        0      active sync  /dev/sdb1
  1    1      8        1        1      active sync  /dev/sda1
  2    2      8      33        2      active sync  /dev/sdc1
  3    3      8      49        3      active sync  /dev/sdd1
/dev/sdc1:
          Magic : a92b4efc
        Version : 00.90.03
          UUID : 43e20969:a2d1e5ba:94f7c737:27a0793c
  Creation Time : Sat Apr 22 22:55:01 2006
    Raid Level : raid5
    Device Size : 244195904 (232.88 GiB 250.06 GB)
    Array Size : 732587712 (698.65 GiB 750.17 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Mon Sep  3 13:02:51 2007
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
      Checksum : e653c444 - correct
        Events : 0.2488139

        Layout : left-symmetric
    Chunk Size : 64K

      Number  Major  Minor  RaidDevice State
this    2      8      33        2      active sync  /dev/sdc1

  0    0      0        0        0      removed
  1    1      0        0        1      faulty removed
  2    2      8      33        2      active sync  /dev/sdc1
  3    3      8      49        3      active sync  /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 00.90.03
          UUID : 43e20969:a2d1e5ba:94f7c737:27a0793c
  Creation Time : Sat Apr 22 22:55:01 2006
    Raid Level : raid5
    Device Size : 244195904 (232.88 GiB 250.06 GB)
    Array Size : 732587712 (698.65 GiB 750.17 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Mon Sep  3 13:02:51 2007
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
      Checksum : e653c456 - correct
        Events : 0.2488139

        Layout : left-symmetric
    Chunk Size : 64K

      Number  Major  Minor  RaidDevice State
this    3      8      49        3      active sync  /dev/sdd1

  0    0      0        0        0      removed
  1    1      0        0        1      faulty removed
  2    2      8      33        2      active sync  /dev/sdc1
  3    3      8      49        3      active sync  /dev/sdd1

Notice that the disks disagreed about the state of the array. I hoped that, at worst, only one disk was actually faulty.
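
A quick way to compare the members side by side (just a grep over the same --examine output) is something like:

Code:

# show only the device headers, update times and event counters
mdadm --examine /dev/sd[a-d]1 | egrep '^/dev/|Update Time|Events'

sda1 and sdb1 stopped at events 0.2488136, while sdc1 and sdd1 went on to 0.2488139 and marked the other two as failed.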

I decided from the above output that I should try to reassemble the array. In the past, mdadm was pretty smart about trying to resync the disks. However, I made a big mistake. I typed the following command:

Code:

# mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[a-d]

So mdadm took a long time rebuilding the array, and then I could not mount it. Rebooting didn't help. Here's the error from mount:

Code:

# mount /home
mount: wrong fs type, bad option, bad superblock on /dev/md0,
      missing codepage or other error
      In some cases useful info is found in syslog - try
      dmesg | tail  or so

Looking at /proc/mdstat:

Code:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sda[0] sdd[3] sdc[2] sdb[1]
      732595392 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
     
unused devices: <none>

In horror, I realized that mdadm had built the array using the whole disks, instead of partitions. I wanted /dev/sda1, /dev/sdb1, etc ... NOT /dev/sda, /dev/sdb, etc!
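
In hindsight, the first thing to do at that point is probably to stop the wrongly-created array so md releases the whole-disk devices before anything else touches them; something like:

Code:

# stop the array that was mistakenly built on the whole disks
mdadm --stop /dev/md0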

Here's where I got really confused. If I look at the disks with fdisk, the partitions are still there, but two of them now show up as plain Linux partitions (type 83) instead of Linux raid autodetect (fd):

Code:

$ fdisk -l /dev/sda

Disk /dev/sda: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

  Device Boot      Start        End      Blocks  Id  System
/dev/sda1              1      30401  244196001  fd  Linux raid autodetect

 $ fdisk -l /dev/sdb

Disk /dev/sdb: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

  Device Boot      Start        End      Blocks  Id  System
/dev/sdb1              1      30401  244196032  83  Linux

$ fdisk -l /dev/sdc

Disk /dev/sdc: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

  Device Boot      Start        End      Blocks  Id  System
/dev/sdc1              1      30401  244196001  fd  Linux raid autodetect

$ fdisk -l /dev/sdd

Disk /dev/sdd: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

  Device Boot      Start        End      Blocks  Id  System
/dev/sdd1              1      30401  244196032  83  Linux

But it gets even stranger... I no longer see the partitions in /dev:

Code:

$ ls /dev/sd*
/dev/sda  /dev/sdb  /dev/sdc  /dev/sdd

And when I try to assemble the array now, mdadm can't find those old partitions:

Code:

$ mdadm --assemble /dev/md0 --verbose /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sda1: No such file or directory
mdadm: /dev/sda1 has no superblock - assembly aborted

So I'm in a real bind. I don't know if my data is still on the drives (and of course I REALLY want to recover it; only some of it is backed up). I can't see the old partitions under /dev, even though fdisk still shows them.

Is it possible that my mdadm --create command wiped my disks somehow? I thought mdadm was careful to check for existing RAID partitions!

Any help would be greatly appreciated!

macemoneta 09-12-2007 11:27 AM

By running a create on an existing array, you've destroyed the superblocks; mdadm warns you about the existing contents of the drives when you run the command. Restore whatever data you have from backup.
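
For what it's worth, when members have merely been kicked as "non-fresh" after an unclean shutdown, the usual approach is a forced assemble of the existing member partitions, not a create; roughly:

Code:

# reassemble the original array, forcing in members whose
# event counters are slightly behind
mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

A create writes brand-new superblocks over whatever was there, which is why the original metadata is gone now.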

This has been said many times, but it bears repeating: RAID is not a substitute for backup. It's intended to increase uptime (data availability), and does not provide data archiving.

somebox 09-12-2007 11:52 AM

Oh Crap
 
Wow, this sucks. The thing is, I did not get any warning, because I specified the wrong devices (e.g. /dev/sda instead of /dev/sda1). Is there really no way to reconstruct this array now? I can still see partitions at /dev/sd[a-d]1 with fdisk, but I can't access them as they are not in /dev ... can anyone suggest something to try?
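
One thing I'm wondering about (just a guess on my part): with the bogus /dev/md0 stopped, would forcing the kernel to re-read the partition tables bring the /dev/sd[a-d]1 nodes back? Something like:

Code:

# release the whole disks, then re-read each partition table
mdadm --stop /dev/md0
for d in /dev/sd[a-d]; do blockdev --rereadpt "$d"; done

Even then, I realize the resync that ran after the bad create may already have overwritten whatever was on those partitions.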

fnaaijkens 10-17-2007 08:32 AM

the power of mdadm
 
I did something like that once.
I just created a new reiserfs on the raid disks.
Then I rebuilt everything with reiserfsck --rebuild-tree --scan-whole-partition.
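
Roughly, the commands were (from memory, so treat this as a sketch; my array was reiserfs, and it assumes the array device is /dev/md0):

Code:

# write a fresh reiserfs onto the array device
mkreiserfs /dev/md0
# rebuild the tree by scanning the whole device for leaf nodes;
# this is slow, but it digs up old and deleted data too
reiserfsck --rebuild-tree --scan-whole-partition /dev/md0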

I recovered almost 100% of the files, and some older versions of them, too.
At around 500,000 files it was a bit confusing, but in combination with a backup (that you restore OVER the recovered data) your recovery rate might be pretty good!

f

JimBass 10-17-2007 06:57 PM

And don't make the mistake of doing software RAID on something that is important. A hardware RAID card will cost about $300. I've had almost the identical setup to yours, four 250 GB SATA drives in RAID 5, but with a 3com controller card running it. Obviously it isn't the RAID's fault that you gave a bad command, but it always seems that if you care about your data, it's worth the additional cost.

Peace,
JimBass

