Old 05-13-2012, 03:47 PM   #1
-Thomas-
LQ Newbie
 
Registered: Jun 2011
Location: Texas
Posts: 25

Rep: Reputation: Disabled
Problem with software raid! inactive array, won't mount


I've been using two 500GB hard drives in raid1 for /home. The device is called md127 and is made up of sdb and sdc.

I ran emerge -uDN world yesterday and it broke a lot of stuff. The update didn't finish successfully, and after it failed I was unable to open most of my programs. When I rebooted I was shown a command line login prompt instead of GDM. I logged in as root, finished the update, ran revdep-rebuild, and now most things seem to be working.

After rebooting again, though, I could not log in as my normal user (I get an error that says something like "Could not update .ICEauthority file /home/thomas/.ICEauthority"). I'm logged in as root now, and it seems that there is a problem with the disk array that I use for /home, but I don't know how to figure out what the problem is, much less how to fix it.

Code:
#mount /home
mount: wrong fs type, bad option, bad superblock on /dev/md127,
       missing codepage or helper program, or other error
       (could this be the IDE device where you in fact use
       ide-scsi so that sr0 or sda or so is needed?)
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
The array seems to be inactive:
Code:
#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]  
md127 : inactive sdc[1](S) sdb[0](S)
      976771120 blocks super 1.2
Code:
#mdadm --detail /dev/md127
mdadm: md device /dev/md127 does not appear to be active.
And I get some worrisome information here:
Code:
#mdadm --examine /dev/sdb
/dev/sdb:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 8c17a3bc:36c7fdd8:dc36da0b:7de5bdb6
Name : localhost:1
Creation Time : Sun Feb 19 17:25:39 2012
Raid Level : -unknown-
Raid Devices : 0

Avail Dev Size : 976771120 (465.76 GiB 500.11 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 3e989469:81c06bef:7254bade:1f6f439a

Update Time : Sun May 13 12:54:06 2012
Checksum : baf969d7 - correct
Events : 1
Code:
#mdadm --examine /dev/sdc
/dev/sdc:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 8c17a3bc:36c7fdd8:dc36da0b:7de5bdb6
Name : localhost:1
Creation Time : Sun Feb 19 17:25:39 2012
Raid Level : -unknown-
Raid Devices : 0

Avail Dev Size : 976771120 (465.76 GiB 500.11 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 0a7d6857:70bb4bce:3426497c:211a23ad

Update Time : Sun May 13 12:54:06 2012
Checksum : 3a1cc656 - correct
Events : 1
Specifically, I do not like the look of "Raid Level : -unknown-"


When I open GParted I now get an error titled "Libparted Bug Found!" that reads "Could not stat device /dev/md/1 - No such file or directory." I did not get this error before the problem started.

Clearly I shouldn't have been using software RAID, understanding as little about it as I do. I stand to lose a large, well organized, and meticulously tagged media library, some vacation pictures, and my to-do list. Not the end of the world, but I would really appreciate any help any of you might be able to provide.

EDIT:
I've just noticed that trying
Code:
#mount /home
adds a line to the output of dmesg that says "EXT4-fs (md127): unable to read superblock"
That begins to explain the nature of the problem, but I still have no idea what happened, whether or not it can be reversed, or how I would go about doing that. Normally I'd spend a lot more time trying different commands and searching the Internet before posting here, but not being able to log in as my normal user is really bothering me! You can't run Chromium as root, which means I only have my netbook for Internet access...
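A read-only check that might say more, just as a sketch (I haven't tried it, and I don't know whether it would tell you anything beyond that dmesg line):
Code:
# -h prints only the superblock information it can find; nothing is written to the device
dumpe2fs -h /dev/md127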

EDIT:
I found a thread which is not about the problem I'm having, but which does describe a technique that I might be able to use to recover my data.
http://www.linuxquestions.org/questi...1-disk-723225/

It says that I should be able to mount either of my disks as a normal ext4 partition after using
Code:
mdadm --stop /dev/md127
to deactivate the array. I'm a little reluctant to try it because I'm not sure I could reverse it if it didn't work for some reason, and I'm not sure whether my unreadable-superblock problem would prevent it from working. Even if it does work and I get all of my data back, I still have to decide whether it is worth continuing to use software RAID for my personal files. Protection from hard drive failure is nice, but not so nice that I want to put up with having all of my data randomly made inaccessible. I would still appreciate any alternative methods I might be able to use to fix this, or any information about what caused this and whether this sort of thing can be prevented.

EDIT:

Code:
#mdadm --stop /dev/md127
mdadm: stopped /dev/md127
#mount -t ext4 /dev/sdb /home
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
#dmesg | tail
EXT4-fs (sdb): VFS: Can't find ext4 filesystem
Well, that didn't work. Interestingly:
Code:
#mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mdadm: /dev/sdb appears to be part of a raid array
       level=-unknown- devices=0 ctime=Sun Feb 19 17:25:39 2012
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: Note: this array has metadata at the start and
       may not be suitable as a boot device. If you plan to
       store '/boot' on this device please ensure that
       your boot-loader understands md/v1.x metadata, or use
       --metadata=0.90
mdadm: /dev/sdc appears to be part of a raid array
       level=-unknown- devices=0 ctime=Sun Feb 19 17:25:39 2012
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
Continue creating array? n
mdadm: create aborted.
I found a website that suggested that creating a new array out of the same disks as an old array would restore the old array, but that seems to be wrong. I hope I'm not digging myself a deeper hole here. I think the difference between me and the people whose success I'm trying to replicate is that they were using arrays made up of formatted partitions, whereas I was using a formatted array made up of two blank disks. I'm keeping this post updated because I'm still hoping that somebody who understands something about this stuff will come along and prevent me from doing something stupid and erasing all of my precious data, if I haven't already.

I used the --build option to put the array back together the way it was. Attempting to mount the array returns "mount: unknown filesystem type 'linux_raid_member'", which seems strange. /proc/mdstat reveals that the array is in the middle of a resyncing operation. I'm still shooting in the dark, here.
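For reference, the --build command was something like this (I'm reconstructing it here, so treat it as a sketch rather than an exact transcript):
Code:
# reassemble the two disks as a mirror; --build writes no md superblocks of its own
mdadm --build /dev/md127 --level=1 --raid-devices=2 /dev/sdb /dev/sdc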

UPDATE: The resync is finished, but the RAID device still won't mount. It still gives the error about the filesystem type, and using "-t ext4" just gives a different error. I could really use some help here! Testdisk now sees md127 and I've been experimenting a little, but I really don't know what I'm doing. I've tried a bunch of things that I haven't listed here, and I'm not sure how much of what I have listed is helpful or relevant. This is getting pretty long, so I'm going to stop updating it until I find a solution or give up. I still need help!

Last edited by -Thomas-; 05-14-2012 at 07:35 AM.
 
Old 05-16-2012, 10:18 PM   #2
Skinjob
LQ Newbie
 
Registered: May 2012
Posts: 3

Rep: Reputation: Disabled
I've experienced the exact same problem. When I went through etc-update, I kept my existing /etc/mdadm.conf file. The file had an explicit entry for the raid array which said:

ARRAY /dev/md0 UUID=2784fc16:a3d6f775:83fdf7c3:43964a75 devices=/dev/sdb1,/dev/sdc1

After I commented out the line and restarted mdraid and mdadm the array worked perfectly.
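In other words, roughly this (just a sketch; the sed line is only one way of commenting the entry out, and the service names are the OpenRC ones on my Gentoo box):
Code:
# comment out the stale ARRAY entry, then restart the RAID services
sed -i 's/^ARRAY/#ARRAY/' /etc/mdadm.conf
/etc/init.d/mdraid restart
/etc/init.d/mdadm restart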
 
Old 05-18-2012, 12:21 AM   #3
-Thomas-
LQ Newbie
 
Registered: Jun 2011
Location: Texas
Posts: 25

Original Poster
Rep: Reputation: Disabled
Thanks for the reply, Skinjob, and nice screen name. My mdadm.conf is empty below the comments, and always has been as far as I know. I've never changed it, nor have I noticed it in the etc-update list recently. Initially when I started poking around I thought that maybe its emptiness was the source of my problem, but one of the first lines in the comments says "mdadm will function properly without the use of a configuration file...." I would be very happy if it turned out that the solution was something as simple as that. I'm beginning to get frustrated! Sorry for the slow reply. With my desktop unable to browse the Internet I haven't been checking my email...

Since I posted this I've been able to browse through the files that were on the array using testdisk, both on the rebuilt array and on the individual drives. My attempts to copy them onto something else failed, however. To me, the fact that the directory structure and file names and everything are still there is a good sign. I'm hoping to find a way to use testdisk to make one of the disks mountable as the ext4 file system that was on the array. I don't really know what I'm doing, if it's not obvious, and in my frustration I've started to spend my free time playing Halo instead of working on this problem. I'm getting ready to just accept the loss so that I can put those drives back to work. The rest of my system is on an SSD and I need at least one of the big drives for /home. I'm still open to suggestions or explanations, though, if anyone has any!

Last edited by -Thomas-; 05-18-2012 at 12:50 AM.
 
Old 05-18-2012, 10:37 AM   #4
Skinjob
LQ Newbie
 
Registered: May 2012
Posts: 3

Rep: Reputation: Disabled
The Basics

Sorry that didn't work for you. The only things I can think of are the basics (a quick sketch of the first two follows the list):
  1. Is the ext4 driver working at all? Can you mount another partition that uses ext4?
  2. I didn't see anything about fsck in your post. Have you tried running fsck?
  3. You might be able to save your data by removing the array and recreating it using one of the disks.
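Something along these lines, just as a sketch (sda3 stands in for whatever other ext4 filesystem you already have; substitute your own):
Code:
grep ext4 /proc/filesystems      # 1. is the ext4 driver available in the running kernel?
mount -t ext4 /dev/sda3 /mnt     # 1. does mounting a known-good ext4 partition work?
fsck.ext4 -n /dev/md127          # 2. read-only check of the array; -n answers "no" to every fix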
 
Old 05-19-2012, 09:02 AM   #5
-Thomas-
LQ Newbie
 
Registered: Jun 2011
Location: Texas
Posts: 25

Original Poster
Rep: Reputation: Disabled
I have /boot, /, and swap partitions on an SSD, and my / partition is also ext4. I had tried fsck at one point, but I don't remember why it didn't work. I'll try it again. I haven't tried recreating the array with only one disk because of the warning I got when I tried to use --create to put the original array back together. I will try it with --build first, and if that doesn't work maybe I'll see what happens if I ignore that warning.

Update: Since my last reboot I get this:
Code:
#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] 
md126 : inactive sdc[0](S)
      488385560 blocks super 1.2
       
md127 : inactive sdb[0](S)
      488385560 blocks super 1.2
Running fsck on one of the raid devices gives me this:
Code:
#fsck /dev/md126
fsck from util-linux 2.21.1
e2fsck 1.42.1 (17-Feb-2012)
fsck.ext4: Invalid argument while trying to open /dev/md127

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
#fsck /dev/sdc  
e2fsck 1.42.1 (17-Feb-2012)
fsck.ext4: Device or resource busy while trying to open /dev/sdc
Filesystem mounted or opened exclusively by another program?
#mdadm --stop /dev/md126
mdadm: stopped /dev/md126
#fsck /dev/sdc
e2fsck 1.42.1 (17-Feb-2012)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/sdc

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
I don't know what the alternate superblock business is all about, but I tried using the one in the message and the ones listed by testdisk, to no avail.
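For anyone else who ends up here: the backup superblock locations depend on the block size the filesystem was created with, so the 8193 from the message is often the wrong one. A dry run of mke2fs will list the locations it would use without writing anything, roughly like this (a sketch with my device name):
Code:
mke2fs -n /dev/md126          # -n is a dry run: it prints the backup superblock locations and writes nothing
e2fsck -b 32768 /dev/md126    # then try e2fsck with one of the listed backups (32768 is typical for 4k blocks)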

Last edited by -Thomas-; 05-19-2012 at 09:43 AM.
 
Old 05-19-2012, 09:50 AM   #6
-Thomas-
LQ Newbie
 
Registered: Jun 2011
Location: Texas
Posts: 25

Original Poster
Rep: Reputation: Disabled
Something worked! Using --build to make an array out of only one of the disks resulted in having something called "localhost:1" automatically mounted, and it has my data in it! I don't know why that worked, or what the problem was in the first place, but I'm happy to have my data. I'll mark this thread as solved. Thanks for your help, Skinjob. I don't know whether or not people can continue to post in a thread once it's been marked "solved", but if so I would still appreciate an explanation from someone who knows more about this stuff than I do about what happened and why, and why that solution worked. It might be helpful to anyone who finds this post using google, which is what I usually do...
 
Old 05-19-2012, 10:57 AM   #7
Skinjob
LQ Newbie
 
Registered: May 2012
Posts: 3

Rep: Reputation: Disabled
Outstanding!
 
Old 05-28-2012, 05:08 PM   #8
node
LQ Newbie
 
Registered: Sep 2004
Posts: 18

Rep: Reputation: 0
I've been having the exact same issue. For some reason, RAID arrays randomly get their members marked as spare drives.

Fixing it can be either hard (I haven't gotten that far yet; that's when things get really broken) or easy, if you're lucky.

First, stop the broken RAID array.

mdadm --stop /dev/md127

Now the RAID members are available to manipulate again.

mdadm -A /dev/md127 /dev/sda1 /dev/sdb1 (use the appropriate drives/partitions). This may not work if you do not have an appropriate /etc/mdadm.conf.

You can try mdadm -A --scan to see what that spits out; if it contains the original array, append that output to your /etc/mdadm.conf.

If you do not have such an entry, create it manually:
ARRAY /dev/md127 metadata=1.2 UUID=nnnn name=localhost:root

You can obtain information like the UUID and metadata version from mdadm --examine /dev/sda1, for example.
(This again assumes that sda1 is still a correct partial array!)
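A sketch of chaining those two steps together (the awk field assumes the usual --examine output format, and the ARRAY line follows the example entry above):
Code:
UUID=$(mdadm --examine /dev/sda1 | awk '/Array UUID/ {print $4}')
echo "ARRAY /dev/md127 metadata=1.2 UUID=$UUID name=localhost:root" >> /etc/mdadm.conf
mdadm -A /dev/md127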

Now, you should be able to run mdadm -A /dev/md127

If all of this still fails, there's one last hope for an easy recovery.

mdadm --stop /dev/md127 (and others, if those were created from your disks)

mdadm -A /dev/md127 /dev/sda1 /dev/sdb1
Or, if one of the disks is broken or something, you can bring the array up in degraded mode:

mdadm -A /dev/md127 /dev/sda1 --run (or sdb1, or whatever device you use)

This should bring up a degraded array. You can mount it and pull backup data from it, etc.

You should be able to add the second disk without much fuss.

Just in case, run mdadm --zero-superblock /dev/sdb1 (WARNING: use this only on the SECOND disk, the bad one. In this example we created the degraded array using sda1, so sdb1 is the unused one! This operation cannot be undone and renders the second disk unusable as an array member. Granted, testdisk or the like can still recover data if needed, since the data itself is still there, but the md metadata is GONE! YOU HAVE BEEN WARNED.) Clearing the superblock makes sure the resync doesn't, for some reason, happen the other way around, i.e. empty, corrupted, or missing data being written from sdb1 to sda1 and destroying your data.

mdadm --add /dev/md127 /dev/sdb1

This final step re-adds the second disk to the array, which will then be resynced. The entire RAID array should be back up.
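Pulled together, the degraded-mode path looks roughly like this (same example device names as above; copy your data off before zeroing anything):
Code:
mdadm --stop /dev/md127
mdadm -A /dev/md127 /dev/sda1 --run     # assemble degraded on the good member only
mount -o ro /dev/md127 /mnt             # mount read-only and back up your data first
mdadm --zero-superblock /dev/sdb1       # wipe md metadata on the BAD member only
mdadm --add /dev/md127 /dev/sdb1        # re-add it; the array resyncs from sda1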



Having said all that, I'm personally in the situation where BOTH my drives were marked as spares, and I can't mount either one in degraded mode. I'm wondering if this is some weird Gentoo-specific bug or a kernel bug.
 
Old 05-29-2012, 01:49 AM   #9
-Thomas-
LQ Newbie
 
Registered: Jun 2011
Location: Texas
Posts: 25

Original Poster
Rep: Reputation: Disabled
Hi node,

That is a very useful guide, thanks for adding it. I did try using --assemble to put the array back together in its original form, but I don't know whether that failed because of my empty mdadm.conf or for some other reason, because I didn't know to try the next few steps you describe. I also don't know whether I could have used --assemble to bring a single drive up in degraded mode, because I don't remember whether that was one of the things I tried. Because of that it's hard to tell (for me at least) whether our problems are identical, but if they are, there are two things you might try:

The first solution that worked for me, which you may have already read about above, was to use --build to make an array out of a single disk:
Code:
mdadm --build /dev/md0 --raid-devices=1 /dev/sdb
The differences between --build and --assemble are not so clear to me, but I tried it this way because I thought that --build was more likely to ignore problems that might stop --assemble from working. It gave me a warning that it's unusual to make a RAID array out of a single disk and that I probably didn't know what I was doing; I ignored it, and a number of oddly named devices were created and mounted, two of which contained my data. I have no idea why that did what it did, or whether it is a safe thing to try. If it works for you, you can just copy all of your data to another drive and start your array over from scratch.

After I marked this thread as "solved", I came across another solution that permanently turned one of the RAID members into a regular old ext4 partition that can be mounted normally. I don't know whether this solution will work for everyone, though, and here's why: when I created my array, I used two completely empty, unformatted disks, then used GParted to write the partition table to the RAID device and format it. My impression is that it is more common to build an array out of formatted partitions, which is why your RAID members are /dev/sda1 and /dev/sdb1 while mine are simply /dev/sdb and /dev/sdc. Anyway, here's what happened: after I recovered my data using the method above, I opened GParted and started to format /dev/sdb. GParted told me that there was no partition table on the disk, so I created a standard one. After I did this, it suddenly saw an ext4 partition taking up the entire disk, and the drive was effectively converted from a RAID member to a standard drive, with no data loss. When I saw this I simply changed my fstab so that /home is mounted from /dev/sdb1, and everything is back to normal as far as I can tell, except that I'm no longer using RAID.
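The fstab change was just the usual one-line edit, something like this (from memory; mounting by UUID= would probably be more robust than the device name):
Code:
# /etc/fstab: /home now comes straight from the ext4 partition, no md device involved
/dev/sdb1   /home   ext4   defaults   0 2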

Neither of these methods saves you from having to remake the array from scratch, but they both gave me access to data that I thought I had lost. I don't know how specific they are to my situation, but hopefully they're helpful to somebody.

How would we begin to determine where the bug is that causes this problem?

Last edited by -Thomas-; 05-29-2012 at 01:56 AM.
 
Old 05-29-2012, 04:19 AM   #10
node
LQ Newbie
 
Registered: Sep 2004
Posts: 18

Rep: Reputation: 0
Apparently this is caused by a bug in older kernels (3.2.1 and 3.3 suffer from it; it was fixed in 3.4). For some reason the kernel decides to break the array during shutdown, ruining the metadata.

The only option is to re-create the array using the same parameters as the first time, but adding the --assume-clean flag. This keeps the existing data from being overwritten.
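A sketch of what that re-create would look like for the array in this thread (the parameters have to match the original creation exactly, which is what makes this risky; double-check the level, metadata version, and device order first):
Code:
# --assume-clean marks the members as already in sync, so no initial resync overwrites the data
mdadm --create /dev/md127 --assume-clean --level=1 --raid-devices=2 \
      --metadata=1.2 /dev/sdb /dev/sdc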

As for not finding your array when you're not using mdadm.conf: the kernel/udev assigns a more or less random name (number) to found arrays, starting at md127 and going down. So md126 would be a second found array (/proc/mdstat helps here).

Because this can become quite random, you can use mdadm to 'force' a name upon the array; /dev/md0 can make more sense. I think the mdadm manual still says that an array will try to get a name based on its last-used name, so when you stop the array as md0 it should become md0 again upon the next scan, but udev at least prevents that, if not the kernel itself.
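For example, stopping the auto-named array and re-assembling it under the name you actually want might look like this (a sketch, using the member devices from this thread):
Code:
mdadm --stop /dev/md127
mdadm -A /dev/md0 /dev/sdb /dev/sdc     # assemble the same members explicitly as /dev/md0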

As for getting your ext4 partition back after creating a partition table: that is pure luck. The md metadata at the start of the disk for a raid1 happens to take up about the same space as a partition table plus its alignment gap, so by creating a partition table you overwrote the md metadata, and the start of the new partition just coincidentally matched the start of the filesystem on your raid1 member.
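The numbers posted earlier in this thread line up with that: the --examine output showed "Data Offset : 2048 sectors", i.e. the filesystem starts 1 MiB into the disk, and a first partition created with the usual 1 MiB default alignment also starts at sector 2048. A sketch of where you would read those two figures (on a member that still has its metadata, and on the re-partitioned disk):
Code:
mdadm --examine /dev/sdX | grep 'Data Offset'   # "Data Offset : 2048 sectors" in the output posted above
fdisk -l /dev/sdb | grep '^/dev/sdb1'           # Start column: 2048, the default 1 MiB alignment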

When using raid1, you CAN actually mount a member without the RAID layer. mount -o ro -t ext4 /dev/sdb /mnt may have worked. If not, you can even tell mount to use an offset, i.e. tell mount where your data starts.
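A sketch of the offset variant, using the 2048-sector data offset from the --examine output earlier in the thread (mount's offset option goes through a loop device; read-only to be safe):
Code:
# 2048 sectors * 512 bytes = 1 MiB: skip straight to where the filesystem data starts
mount -o ro,loop,offset=$((2048 * 512)) -t ext4 /dev/sdb /mnt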

Since I'm using raid10 on two disks, one disk can actually be mounted on its own, but only one of the two, due to the layout on the disk.

Anyway, because you now have your data on a regular disk plus an unused disk, you can use the unused one to create a degraded array and copy your data over. Then simply hot-add the old disk and you're back in business.

Do try to use mdadm.conf, and raid10 may be more useful in your situation, even with two disks: it will give you the safety of raid1 with the speed of raid0. As for using or not using md/RAID, I think we just got extremely unlucky with a known kernel bug (unknown to us at the time), and it should not happen again.
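For completeness, a two-disk raid10 create might look something like this (just a sketch; the far-2 layout is one common choice for read speed, not something discussed above):
Code:
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 /dev/md0    # then put the filesystem on the md device as usual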
 
  

