mdadm: no such device: md0 -- RAID doesn't work after system recovery

jlinkels · 11-28-2009, 07:10 PM

I am trying to establish a recovery procedure for my file server, but I have problems booting from RAID.

In the server I have a RAID 1 array with 2 sata disks, 5 partitions. I backed up the file server to tape by tarring the root directory.

To make a test recovery I did this on the test server:

booted the test server from USB using a live version of Debian Lenny. This live version is the same version as my live server, same kernel version 2.6.26-AMD64
created partitions on both sda and sdb using the dump output of sfdisk as acquired from the live server.
created the RAID1 array using: mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1. This is almost identical to what I did on the live server (see below). Created /dev/md1 .. /dev/md4 the same way. The arrays started to sync nicely.
formatted the boot partition with ext3, left md1 for the swap alone, formatted the other partitions md2 .. md4 as xfs.
created directory /mnt/restore. Mounted /dev/md0. Created /mnt/restore/home, /mnt/restore/vmbackup, etc. Mounted /dev/md2 on /mnt/restore/home, mounted /dev/md3 on /mnt/restore/vmbackup, etc. This is identical to what I do on the live server. But instead of mounting under /, directories are mounted in /mnt/restore/ and below.
restored the tar: tar -C /mnt/restore/ -xvf /dev/st0. The restore went flawlessly. Each partition holds the data it should hold, i.e. all data is restored to the correct partitions.
chrooted into /mnt/restore/. Installed grub on both hard disks: root (hd0,0); setup (hd0). Same for hd1.
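
A minimal sketch of the array-creation step above, written as a dry run that only prints the mdadm commands. The mapping of md1 .. md4 to partitions 2 .. 5 is an assumption; the post only states that md0 is built from /dev/sd[ab]1. The function name plan_arrays is made up for illustration.

```shell
#!/bin/sh
# Print (do not run) the mdadm commands that create /dev/md0 .. /dev/md4.
# ASSUMPTION: md<N> is built from partition N+1 on each disk
# (md0 <- sda1 + sdb1); only the md0 mapping is given in the post.
plan_arrays() {
    n=0
    while [ "$n" -le 4 ]; do
        part=$((n + 1))
        echo "mdadm --create /dev/md$n --level=1 --raid-devices=2 /dev/sda$part /dev/sdb$part"
        n=$((n + 1))
    done
}

plan_arrays
```

Once the printed commands have been checked against the real partition layout, they can be run by hand or piped into sh from the live system.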

Then I removed the memory stick and rebooted. Grub boots, shows the menu, continues, and then says:
mdadm: no such device: md0
mdadm: no such device: md2
mdadm: no such device: md3
mdadm: no such device: md4

and boots into the busybox shell.
Note that md1 (which is intended to be used as swap) is not among the error messages. In fact, I see a message that md1 is started successfully, but I don't recall the exact text. md1 is (like the other partitions) mentioned in fstab.

When I boot back into the live distro, which comes with RAID support, the arrays immediately resume syncing where they left off when I stopped the machine. When I mount the file systems again, the files are still there.

Booting again in the restored system brings me into busybox again. But in busybox I can issue: mdadm --assemble /dev/md0 /dev/sd[ab]1 and even there the arrays are started and resume syncing at the point where they were in the live distro.

So my conclusion is that the arrays are sound, can be recognized and will function. They do so in at least two booted and running Linux environments, but refuse to do so in the copy of the server file system.

There are some additional things:

when I created the array for the first time in the live server I did so from a running installation. Created a degraded RAID with a missing disk, copied the running installation to RAID, booted from RAID and added the missing disk. IMHO that should not make much difference.
while experimenting on the test server I did a couple of stupid things with the RAID arrays. Like formatting the partitions first before creating the array. That caused problems of course so I corrected that. A number of times I added, failed, removed and re-added disks on the arrays. A lot of things, but I don't recall them all. Eventually I got everything right again. By the time I finally rebooted into the restored system on the RAID arrays, I had done everything which Should Not Be Done. But at that time the test server did boot from the restored installation.
when I wanted to try again and note each step carefully for the real recovery procedure I wanted to start from scratch. Therefore I zeroed the first 100 MB of both sda and sdb. When I did the recovery after that the result was as mentioned above.

Since the errors that I get are from mdadm, I don't think this is a boot loader problem where partitions cannot be found. The RAID driver is obviously included in initramfs. So why oh why would mdadm give these errors during booting while the arrays seem to be sound? Is there any pointer to a document which describes in detail exactly at what moment mdadm is started to assemble the arrays and make them accessible? And where does it look? Can it be different from the place it looks while the system is running?

jlinkels

jlinkels · 11-30-2009, 08:14 PM

I am a few steps closer to a solution.

It seems that at boot time mdadm uses mdadm.conf to assemble the RAID arrays, and identifies each array by its UUID.

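Although no file systems are mounted at the time of booting, initramfs certainly is mounted. The mdadm.conf contained in initrd.img still lists the UUIDs of the arrays on the live server, while the arrays I just created on the empty disks have different UUIDs. That mismatch is exactly what causes this problem.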
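
A minimal sketch of how such a mismatch could be confirmed, assuming the mdadm.conf from the initrd (or the backup) and the output of mdadm --examine --scan have both been saved to files. The function names array_uuids and uuid_mismatches are hypothetical, and the sketch assumes the ARRAY lines spell the keyword as UUID= in upper case, as mdadm's own scan output does.

```shell
#!/bin/sh
# List the UUIDs named on ARRAY lines, one "device uuid" pair per line.
# Works on an mdadm.conf and on saved `mdadm --examine --scan` output.
# ASSUMPTION: the keyword is written "UUID=" (upper case).
array_uuids() {
    awk '$1 == "ARRAY" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^UUID=/) print $2, substr($i, 6)
    }' "$1"
}

# Print every array whose UUID in the config differs from the one on disk.
# $1: mdadm.conf from the initrd or backup
# $2: saved output of `mdadm --examine --scan`
uuid_mismatches() {
    array_uuids "$1" | while read -r dev conf_uuid; do
        disk_uuid=$(array_uuids "$2" | awk -v d="$dev" '$1 == d { print $2 }')
        if [ "$conf_uuid" != "$disk_uuid" ]; then
            echo "$dev: config=$conf_uuid disk=${disk_uuid:-none}"
        fi
    done
}
```

Calling uuid_mismatches with the two saved files prints one line per array that the boot-time mdadm.conf would fail to match; no output means the UUIDs agree.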

There are two possible solutions. One, I can access the mdadm.conf on the backup, extract the UUIDs of the md devices, then stop the arrays I just created on the empty disks and reassemble them using the options --update=uuid --uuid=nnn:nnn:nnn:nnn.

Two, I can disassemble the initrd.img file, do a scan on the newly created arrays, paste that into the mdadm.conf, and reassemble the initrd.img file.
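
A minimal sketch of the "paste the scan into mdadm.conf" part of this second option. It assumes the initrd is a gzip-compressed cpio archive and that its mdadm.conf lives under etc/mdadm/ inside the unpacked tree; both are assumptions about this Debian setup and are not confirmed in the post. The function name merge_conf is hypothetical.

```shell
#!/bin/sh
# Emit a new mdadm.conf: every non-ARRAY line of the old file, followed by
# the ARRAY lines from a fresh scan of the newly created arrays.
# $1: mdadm.conf taken from the unpacked initrd
# $2: saved output of `mdadm --detail --scan`
merge_conf() {
    grep -v '^[[:space:]]*ARRAY' "$1"
    grep '^ARRAY' "$2"
}

# Surrounding steps, shown as comments because they need root and real
# devices (ASSUMPTION: gzip + cpio "newc" initrd, conf at etc/mdadm/):
#   mkdir work && cd work && zcat ../initrd.img | cpio -id
#   mdadm --detail --scan > ../scan.txt
#   merge_conf etc/mdadm/mdadm.conf ../scan.txt > ../mdadm.conf.new
#   cp ../mdadm.conf.new etc/mdadm/mdadm.conf
#   find . | cpio -o -H newc | gzip > ../initrd.img.new
```

Writing the merged file to a separate name first leaves the original mdadm.conf untouched until the result has been checked.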

Both options work. The first is some more work with grep/awk/sed, the second requires the handling of initrd.img.
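
A minimal sketch of the awk work for the first option, written as a dry run that only prints the commands. It assumes that md<N> sits on /dev/sda<N+1> and /dev/sdb<N+1> (the post confirms this only for md0) and that the ARRAY lines in the backed-up mdadm.conf carry a UUID= field. The function name reassemble_cmds is hypothetical.

```shell
#!/bin/sh
# Print (do not run) the commands that give each freshly created array the
# UUID recorded in the backed-up mdadm.conf.
# ASSUMPTION: md<N> is built from /dev/sda<N+1> and /dev/sdb<N+1>;
# only the md0 mapping is confirmed by the post.
# $1: mdadm.conf extracted from the backup
reassemble_cmds() {
    awk '$1 == "ARRAY" {
        uuid = ""
        for (i = 3; i <= NF; i++)
            if ($i ~ /^UUID=/) uuid = substr($i, 6)
        if (uuid == "") next            # skip ARRAY lines without a UUID
        n = $2
        sub(/^\/dev\/md/, "", n)        # "/dev/md2" -> "2"
        part = n + 1
        printf "mdadm --stop %s\n", $2
        printf "mdadm --assemble %s --update=uuid --uuid=%s /dev/sda%d /dev/sdb%d\n", $2, uuid, part, part
    }' "$1"
}
```

The arrays have to be stopped before they can be reassembled, so the mdadm.conf needs to be copied somewhere outside the arrays first, and the printed commands reviewed before running them.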

Because I want to keep the live server and a backup server as identical as possible, I lean toward the first solution.

jlinkels