Problem with software RAID! Inactive array, won't mount
I've been using two 500GB hard drives in raid1 for /home. The device is called md127 and is made up of sdb and sdc.
I ran emerge -uDN world yesterday and it broke a lot of stuff. The update didn't finish successfully, and after it failed I was unable to open most of my programs. When I rebooted I was shown a command line login prompt instead of GDM. I logged in as root, finished the update, ran revdep-rebuild, and now most things seem to be working. After rebooting again, though, I could not log in as my normal user (I get an error that says something like "Could not update .ICEauthority file /home/thomas/.ICEauthority").

I'm logged in as root now, and it seems that there is a problem with the disk array that I use for /home, but I don't know how to figure out what the problem is, much less how to fix it. Here's what I've tried:

Code:
#mount /home

Code:
#cat /proc/mdstat

Code:
#mdadm --detail /dev/md127

Code:
#mdadm --examine /dev/sdb

Code:
#mdadm --examine /dev/sdc

I get an error when I open GParted titled "Libparted Bug Found!" that reads "Could not stat device /dev/md/1 - No such file or directory." that I did not get before this problem. Clearly I shouldn't have been using software RAID, understanding as little about it as I do. I stand to lose a large, well organized, and meticulously tagged media library, some vacation pictures, and my to-do list. Not the end of the world, but I would really appreciate any help any of you might be able to provide.

EDIT: I've just noticed what happens when trying

Code:
#mount /home

That begins to explain the nature of the problem, but I still have no idea what happened, whether or not it can be reversed, or how I would go about doing that. Normally I'd spend a lot more time trying different commands and searching the Internet before posting here, but not being able to log in as my normal user is really bothering me! You can't run Chromium as root, which means I only have my netbook for Internet access...

EDIT: I found a thread which is not about the problem I'm having, but which does describe a technique that I might be able to use to recover my data: http://www.linuxquestions.org/questi...1-disk-723225/ It says that I should be able to mount either of my disks as a normal ext4 partition after using

Code:
mdadm --stop /dev/md127

EDIT:

Code:
#mdadm --stop /dev/md127

Code:
#mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

I used the --build option to put the array back together the way it was. Attempting to mount the array returns "mount: unknown filesystem type 'linux_raid_member'", which seems strange. /proc/mdstat reveals that the array is in the middle of a resync operation. I'm still shooting in the dark here.

UPDATE: The resync is finished, but the raid device still won't mount. It still gives the error about the filesystem type; using "-t ext4" just gives a different error. I could really use some help here! Testdisk now sees md127 and I've been experimenting a little bit, but I really don't know what I'm doing. I've tried a bunch of stuff that I haven't listed here, and I'm really not sure how much of what I have listed is helpful or relevant. This is getting pretty long, so I'm going to stop updating it until I find a solution or give up. I still need help! |
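For anyone who finds this later: one quick way to recognize this state is to look for `inactive` arrays and the `(S)` spare flag in /proc/mdstat. A small sketch on mock data — the mdstat contents below are assumed for illustration, not taken from this machine:

```shell
# Mock /proc/mdstat contents (assumed for illustration): an inactive array
# whose members have both been demoted to spares, marked with (S).
cat > /tmp/mdstat.example <<'EOF'
Personalities : [raid1]
md127 : inactive sdb[0](S) sdc[1](S)
      976762584 blocks super 1.2
unused devices: <none>
EOF

# Print the name of every array the kernel reports as inactive.
awk '$3 == "inactive" {print $1}' /tmp/mdstat.example
```

On a live system you would read /proc/mdstat directly instead of the mock file.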
I've experienced the exact same problem. When I went through etc-update, I kept my existing /etc/mdadm.conf file. The file had an explicit entry for the raid array which said:

Code:
ARRAY /dev/md0 UUID=2784fc16:a3d6f775:83fdf7c3:43964a75 devices=/dev/sdb1,/dev/sdc1

After I commented out that line and restarted mdraid and mdadm, the array worked perfectly. |
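In case it helps anyone scripting this fix: a sketch that comments out every explicit ARRAY line so auto-assembly takes over again. It works on an example copy (the path is made up; the ARRAY entry is the one quoted above), so nothing real is touched:

```shell
# Build an example copy of mdadm.conf containing the entry quoted above.
cat > /tmp/mdadm.conf.example <<'EOF'
# mdadm.conf -- example copy
ARRAY /dev/md0 UUID=2784fc16:a3d6f775:83fdf7c3:43964a75 devices=/dev/sdb1,/dev/sdc1
EOF

# Prefix '#' to every line that starts with ARRAY (GNU sed in-place edit).
sed -i 's/^ARRAY/#ARRAY/' /tmp/mdadm.conf.example

# Show the now-commented entry.
grep '^#ARRAY' /tmp/mdadm.conf.example
```

To apply it for real you would point the sed command at /etc/mdadm.conf and then restart the raid services as described above.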
Thanks for the reply, Skinjob, and nice screen name. My mdadm.conf is empty below the comments, and always has been as far as I know. I've never changed it, nor have I noticed it in the etc-update list recently. Initially when I started poking around I thought that maybe its emptiness was the source of my problem, but one of the first lines in the comments says "mdadm will function properly without the use of a configuration file...." I would be very happy if it turned out that the solution was something as simple as that. I'm beginning to get frustrated! Sorry for the slow reply. With my desktop unable to browse the Internet I haven't been checking my email...
Since I posted this I've been able to browse through the files that were on the array using testdisk, both on the rebuilt array and on the individual drives. My attempts to copy them onto something else failed, however. To me, the fact that the directory structure and file names and everything are still there is a good sign. I'm hoping to find a way to use testdisk to make one of the disks mountable as the ext4 file system that was on the array. I don't really know what I'm doing, if it's not obvious, and in my frustration I've started to spend my free time playing Halo instead of working on this problem. I'm getting ready to just accept the loss so that I can put those drives back to work. The rest of my system is on an SSD and I need at least one of the big drives for /home. I'm still open to suggestions or explanations, though, if anyone has any! |
The Basics
Sorry that didn't work for you. The only thing I can think of is the basics;
|
I have /boot, /, and swap partitions on an SSD, and my / partition is also ext4. I had tried fsck at one point, but I don't remember why it didn't work. I'll try it again. I haven't tried recreating the array with only one disk because of the warning I got when I tried to use --create to put the original array back together. I will try it with --build first, and if that doesn't work maybe I'll see what happens if I ignore that warning.
Update: Since my last reboot I get this:

Code:
#cat /proc/mdstat

Code:
#fsck /dev/md126 |
Something worked! Using --build to make an array out of only one of the disks resulted in something called "localhost:1" being automatically mounted, and it has my data in it! I don't know why that worked, or what the problem was in the first place, but I'm happy to have my data. I'll mark this thread as solved. Thanks for your help, Skinjob.

I don't know whether people can continue to post in a thread once it's been marked "solved", but if so, I would still appreciate an explanation from someone who knows more about this stuff than I do about what happened, why it happened, and why that solution worked. It might be helpful to anyone who finds this post using Google, which is what I usually do...
|
Outstanding!
|
Been having the exact same issue. For some reason, raid arrays randomly get marked as spare drives.

Fixing it can be either hard (haven't gotten that far yet, if things get really broken) or easy, if you're lucky. First, stop the broken raid array:

Code:
mdadm --stop /dev/md127

Now the raid members are available to manipulate again:

Code:
mdadm -A /dev/md127 /dev/sda1 /dev/sdb1

(Use the appropriate drives/partitions.) This may not work if you do not have an appropriate /etc/mdadm.conf. You can try using

Code:
mdadm -A --scan

to see what that spits out; if the output contains the original array, append it to your /etc/mdadm.conf. If you do not have such an entry, create it manually:

Code:
ARRAY /dev/md127 metadata=1.2 UUID=nnnn name=localhost:root

You can obtain information like the UUID and metadata version from, for example:

Code:
mdadm --examine /dev/sda1

(This again assumes that sda1 is still a correct partial array!) Now you should be able to run

Code:
mdadm -A /dev/md127

If this all still fails, there's one last hope for the easy recovery:

Code:
mdadm --stop /dev/md127

(and any other arrays that were created using your disks)

Code:
mdadm -A /dev/md127 /dev/sda1 /dev/sdb1

Or, if one of the disks is broken or something, you can bring the array up in degraded mode:

Code:
mdadm -A /dev/md127 /dev/sda1 --run

(or sdb1, or whatever device you use). This should bring up a degraded array. You can mount it and pull backup data from it, etc. You should then be able to add the second disk without much fuss. Just in case, first run:

Code:
mdadm --zero-superblock /dev/sdb1

(WARNING: use this on the SECOND disk, the bad one. In this example we created the degraded array using sda1, so sdb1 is unused. This operation cannot be undone and renders the second disk unusable as an array member. Granted, testdisk or the like can still recover data if needed — the data is still there — but the metadata is GONE! YOU HAVE BEEN WARNED.) Clearing the superblock makes sure that resyncing doesn't for some reason happen the other way around, i.e. empty, corrupted, or missing data written from sdb1 to sda1, destroying your data.

Code:
mdadm --add /dev/md127 /dev/sdb1

This final step re-adds the second disk to the array, which will then be resynced. The entire raid array should be back up.

Having said all that, I'm personally in the situation where BOTH my drives were marked as spares and I can't mount either in degraded mode. I'm wondering if this is some weird gentoo-specific bug or a kernel bug. |
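The "create the mdadm.conf entry manually" step in the guide above can be scripted by pulling the fields out of `mdadm --examine` output. A sketch on mock data — the UUID and name below are placeholders, not values from a real disk:

```shell
# Mock `mdadm --examine` output (placeholder values, for illustration only).
cat > /tmp/examine.example <<'EOF'
          Magic : a92b4efc
        Version : 1.2
     Array UUID : 0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9
           Name : localhost:root
EOF

# Fields look like "Label : value"; split on " : " so the colons inside
# the UUID and the name survive intact.
uuid=$(awk -F' : ' '/Array UUID/ {print $2}' /tmp/examine.example)
name=$(awk -F' : ' '/Name/ {print $2}' /tmp/examine.example)

# Emit the line to append to /etc/mdadm.conf.
echo "ARRAY /dev/md127 metadata=1.2 UUID=$uuid name=$name"
```

On a real system you would feed this the output of `mdadm --examine /dev/sda1` (a member that still has good metadata) instead of the mock file.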
Hi node,
That is a very useful guide, thanks for adding it. I did try using --assemble to put the array back together in its original form, but I don't know whether that failed because of my empty mdadm.conf or for some other reason, because I didn't know to try the next few steps there. I also don't know whether I could have used --assemble to bring a single drive up in degraded mode, because I don't remember whether that was one of the things I tried. Because of that it's hard to tell (for me at least) whether our problems are identical, but if they are, there are two things you might try.

The first solution that worked for me, which you may have already read about above, was to use --build to make an array out of a single disk:

Code:
mdadm --build /dev/md0 --level=1 --raid-devices=1 /dev/sdb

After I marked this thread as "solved", I came across another solution that permanently made one of the raid members into a regular old ext4 partition that can be mounted normally. I don't know whether this solution will work for everyone, though, and here's why: when I created my array, I used two completely empty, unformatted disks. I then used gparted to write the partition table to my raid device and format it. The impression I have is that it is more common to build an array out of formatted partitions, which is why your raid members are /dev/sda1 and /dev/sdb1 and mine are simply /dev/sdb and /dev/sdc.

Anyway, here's what happened: after I recovered my data using the method above, I opened gparted and started to format /dev/sdb. Gparted told me that there was no partition table on the disk, so I created the standard one. After I did this, it suddenly saw an ext4 partition taking up the entire disk, and the drive was effectively converted from a raid member to a standard drive, with no data loss. When I saw this I simply changed my fstab so that /home is mounted from /dev/sdb1, and everything is back to normal as far as I can tell, except that I'm no longer using RAID.

Neither of these methods saves you from having to remake the array from scratch, but they both gave me access to data that I thought I had lost. I don't know how specific they are to my situation, but hopefully they're helpful to somebody.

How would we begin to determine where the bug is that causes this problem? |
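For the record, here is that single-disk --build recovery as a dry-run sketch: it only prints each command (the real thing needs root and the actual member disk, and the device names are the ones from my setup):

```shell
#!/bin/sh
# Dry-run sketch of the single-disk recovery described above.
# Each command is only printed; drop the echo in run() to execute for real.
run() { echo "would run: $*"; }

# Stop whatever auto-assembled array is currently holding the disk.
run mdadm --stop /dev/md127

# Build a superblock-less raid1 from one member; --build writes no
# metadata, so the data already on /dev/sdb is left untouched.
run mdadm --build /dev/md0 --level=1 --raid-devices=1 /dev/sdb

# The filesystem inside the mirror should now mount read-only.
run mount -o ro -t ext4 /dev/md0 /mnt
```

Mounting read-only first is deliberate: it lets you confirm the data is intact before writing anything to the disk.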
Apparently this is caused by a kernel bug in older kernels (3.2.1 and 3.3 suffer from it; it was fixed in 3.4). The kernel decided for some reason during shutdown to break the array, ruining the metadata.

The only option is to re-create the array using the same parameters as the first time, but with the --assume-clean flag. This causes the data not to be overwritten.

As for not finding your array when not using mdadm.conf: the kernel/udev assigns a random name (number) to found arrays, starting at md127 and going down. So md126 would be a second found array (/proc/mdstat helps here). Because this can become quite random, with mdadm you can 'force' a name upon the array; /dev/md0 can make more sense :) I think the manual for mdadm still says the array will try to get a name based on its last used name, so when you stop the array as md0 it should become md0 again upon the next scan, but at least udev prevents that, if not the kernel itself.

As for getting your ext4 partition back after creating a partition table: that is pure luck. The 0.9 metadata for a raid1 may be about the same size as a partition table, so by creating a partition table you overwrote the md metadata, and the start of the new partition just coincidentally matched the start of your raid1 filesystem. When using raid1 you CAN actually mount a member without the raid component:

Code:
mount -o ro -t ext4 /dev/sdb /mnt

may have worked. If not, you can even tell mount to use an offset, i.e. tell mount where your data starts. Since I'm using raid10 on two disks, one disk can actually be mounted, but only one, due to the layout on the disk.

Anyway, because you now have your data on a regular disk, plus an unused disk, you can use the unused one to create a degraded array and copy your data over. Then simply hot-add the old disk and you're back in business ;) Do try to use mdadm.conf ;) And raid10 may be more useful in your situation, even with two disks: it gives you the safety of raid1 with the speed of raid0.

As for using or not using md/raid, I think we just got extremely unlucky with a known kernel bug (unknown to us at the time), and it should not happen again. |
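A note on the offset idea mentioned above: with 1.2 metadata the filesystem does not start at byte 0 of the member device; `mdadm --examine` reports a "Data Offset" in 512-byte sectors, and mount's loop-device offset option wants bytes. A sketch of the arithmetic, with an assumed example value rather than one measured from a real disk:

```shell
# "Data Offset" as reported by `mdadm --examine`, in 512-byte sectors.
# The value 2048 here is an assumed example, not from a real member.
data_offset_sectors=2048
offset_bytes=$((data_offset_sectors * 512))

# With 0.90 metadata the superblock sits at the END of the device, so the
# offset is 0 and a plain `mount -o ro -t ext4 /dev/sdb /mnt` can work.
# With 1.2 metadata you would mount via a loop device at the data offset:
echo "mount -o ro,loop,offset=$offset_bytes -t ext4 /dev/sdb1 /mnt"
```

This only prints the mount command; run it for real only against a member you are sure holds a complete copy of the data, and read-only first.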