Linux - Server
This forum is for the discussion of Linux software used in a server-related context.
I'm running a RAID 5 on an older machine using mdadm. I've had it running for a few years now and have been able to recover from every RAID failure so far, but this one has thrown me for a bit of a loop. Here's the issue:
Somehow four drives got kicked out of the array. It's happened before, no biggie: just restart the PC (to re-discover the drives; it's an old desktop) and re-add them. Voila. I've done it 2-3 times now. This time, however, this happened:
Code:
/dev/md0:
Version : 00.90
Creation Time : Thu Aug 2 20:58:10 2007
Raid Level : raid5
Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
Raid Devices : 7
Total Devices : 8
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Jan 3 00:52:31 2011
State : active, degraded, Not Started
Active Devices : 5
Working Devices : 8
Failed Devices : 0
Spare Devices : 3
Layout : left-symmetric
Chunk Size : 64K
UUID : bb177475:83977a04:26ebaf7a:a12071a2
Events : 0.1137414
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 1 3 active sync /dev/sda1
4 0 0 4 removed
5 8 65 5 active sync /dev/sde1
6 0 0 6 removed
7 8 97 - spare /dev/sdg1
8 8 81 - spare /dev/sdf1
9 8 113 - spare /dev/sdh1
For some reason, when I re-added the drives, they went in as spares. I'm thinking the array is hosed, but the Major/Minor numbers still line up, and when I do an --examine on one of the disks:
Code:
Magic : a92b4efc
Version : 00.90.00
UUID : bb177475:83977a04:26ebaf7a:a12071a2
Creation Time : Thu Aug 2 20:58:10 2007
Raid Level : raid5
Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
Array Size : 2930303616 (2794.56 GiB 3000.63 GB)
Raid Devices : 7
Total Devices : 8
Preferred Minor : 0
Update Time : Sat Jan 1 07:35:27 2011
State : active
Active Devices : 7
Working Devices : 8
Failed Devices : 0
Spare Devices : 1
Checksum : 60e67ffe - correct
Events : 1137414
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 5 8 113 5 active sync /dev/sdh1
0 0 8 17 0 active sync /dev/sdb1
1 1 8 1 1 active sync /dev/sda1
2 2 8 33 2 active sync /dev/sdc1
3 3 8 49 3 active sync /dev/sdd1
4 4 8 81 4 active sync /dev/sdf1
5 5 8 113 5 active sync /dev/sdh1
6 6 8 65 6 active sync /dev/sde1
7 7 8 97 7 spare /dev/sdg1
My question, I suppose, is this: can I force the --examine data from the good RAID drives onto the ones that somehow think they're spares now, in order to thrust them back into the RAID? Or maybe change the information from --detail? I've tried assembling and rebuilding, and even creating a new operating system partition and re-installing Debian. I'm not entirely sure where to go from here (besides to find a punching bag; I've got some frustration to work out!)
* A controller dies and takes two disks offline at the same time,
* All disks on one scsi bus can no longer be reached if a disk dies,
* A cable comes loose...
In short: quite often you get a temporary failure of several disks at once; afterwards the RAID superblocks are out of sync and you can no longer init your RAID array.
If using mdadm, you could first try to run:
mdadm --assemble --force
If that doesn't work, there's one thing left: rewrite the RAID superblocks with mkraid --force.
To get this to work, you'll need to have an up to date /etc/raidtab - if it doesn't EXACTLY match devices and ordering of the original disks this will not work as expected, but will most likely completely obliterate whatever data you used to have on your disks.
Look at the syslog produced by trying to start the array; you'll see the event count for each superblock. Usually it's best to leave out the disk with the lowest event count, i.e. the oldest one.
If you mkraid without failed-disk, the recovery thread will kick in immediately and start rebuilding the parity blocks - not necessarily what you want at that moment.
With failed-disk you can specify exactly which disks you want to be active and perhaps try different combinations for best results. BTW, only mount the filesystem read-only while trying this out... This has been successfully used by at least two guys I've been in contact with.
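For reference, with mdadm the forced assemble mentioned above looks something like this (a sketch only; /dev/md0 and /dev/sd[a-h]1 are placeholders for your own array and member partitions):

```shell
# Stop any half-assembled array first, then force assembly.
# --force tells mdadm to ignore modest event-count mismatches
# between member superblocks and start the array anyway.
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[a-h]1

# Check what came up before writing anything to it:
cat /proc/mdstat
mdadm --detail /dev/md0
```

As the quoted advice says, mount the filesystem read-only while you verify the result.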
I'll be looking into the use of mkraid when I get back - has anyone used it to any great (or not so great) success?
I had a similar problem once, running two drives in a software raid-1. I replaced one of the drives and it came back saying that it was a spare. After the array rebuilt to 100%, the drive was listed as an active mirror in the array rather than a spare. Check the status of the array and see if it is still rebuilding.
I was in the process of making an /etc/raidtab file (I didn't have one for some reason) and I noticed something strange.
Code:
ENTROPY:/home/captainmullet# mdadm --examine /dev/sd[a-h]1 | grep this
this 3 8 49 3 active sync /dev/sdd1
this 0 8 17 0 active sync /dev/sdb1
this 1 8 1 1 active sync /dev/sda1
this 2 8 33 2 active sync /dev/sdc1
this 5 8 113 5 active sync /dev/sdh1
this 8 8 81 -1 spare /dev/sdf1
this 7 8 65 -1 spare /dev/sde1
this 9 8 97 -1 spare /dev/sdg1
The disks got picked up in the wrong order. What the system sees as /dev/sda is configured as /dev/sdd on the disk. I'll try and get the disks put back in order, and then try and see if I can assemble it.
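A quick way to see that mismatch for every member at once (assuming the members are /dev/sd[a-h]1, as in this thread) is to print each partition next to the slot its superblock claims:

```shell
# For each member partition, show the device name alongside the
# "this" line from its 0.90 superblock (the slot and device name
# the disk itself thinks it has in the array).
for d in /dev/sd[a-h]1; do
    printf '%s: ' "$d"
    mdadm --examine "$d" | grep 'this'
done
```

If the device name on the left doesn't match the one printed on the right, that member has moved since its superblock was last written.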
Well, I got the drives all recognized and the raidtab file created, but alas, it seems that mkraid is a command I don't have; it's been replaced by mdadm for the most part.
This thread (external site): http://www.linuxforums.org/fo...id6-array.html
Mentions perhaps creating a new array over the old one, but before I do that I want to look around for other options. I'm not familiar with mddump at all, so I'll research that as well. I'm willing to do some experimenting because I have the absolutely critical data backed up elsewhere, but I would still very much like to be able to recover the data I don't have backed up.
I suppose I'm using this thread as a pseudo-blog, but I'll continue to document my blinded blundering if it might help someone else (or me!).
# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --create /dev/md0 --chunk=64 --level=raid5 --raid-devices=7 missing /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdh1 /dev/sde1
mdadm: /dev/sda1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Thu Aug 2 20:58:10 2007
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Thu Aug 2 20:58:10 2007
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Thu Aug 2 20:58:10 2007
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Thu Aug 2 20:58:10 2007
mdadm: /dev/sdh1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Thu Aug 2 20:58:10 2007
mdadm: /dev/sde1 appears to contain an ext2fs file system
size=-504831356K mtime=Sun Dec 5 23:00:10 1976
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Thu Aug 2 20:58:10 2007
Continue creating array? yes
Continue creating array? yes
mdadm: array /dev/md0 started.
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90
Creation Time : Mon Jan 10 21:06:47 2011
Raid Level : raid5
Array Size : 2930303616 (2794.56 GiB 3000.63 GB)
Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
Raid Devices : 7
Total Devices : 6
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Jan 10 21:06:47 2011
State : clean, degraded
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : d63ce4bb:dc035564:4542cb33:ddca6f8c (local to host ENTROPY)
Events : 0.1
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 1 1 active sync /dev/sda1
2 8 33 2 active sync /dev/sdc1
3 8 49 3 active sync /dev/sdd1
4 8 81 4 active sync /dev/sdf1
5 8 113 5 active sync /dev/sdh1
6 8 65 6 active sync /dev/sde1
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active (auto-read-only) raid5 sde1[6] sdh1[5] sdf1[4] sdd1[3] sdc1[2] sda1[1]
2930303616 blocks level 5, 64k chunk, algorithm 2 [7/6] [_UUUUUU]
unused devices: <none>
# mkdir /share
# mount -o ro /dev/md0 /share
# cd /share/
/share# dir
*ALL MY STUFF!*
/share#
It's in the process of transferring off now. Once I get it all backed up, I'm going to try re-adding the disk I set as 'missing', and then re-add the final disk as a spare. I'll update with final results!
I exploded in Skype when it worked, and my friend summed it up pretty well:
Quote:
i love how messed up computing is
that when something works the way you think it should, it's like you just slew a dragon
Yep, I was going to leave it open until I had finished backing everything up and restored the array to full working order, but everything is moving along at a pretty nice clip, so I can probably go ahead and mark this solved.
Seriously, I'd like to know that it was fully resolved, plus, if you're willing, an overview of what worked and what didn't. You would/will organize it your way; my approach would be: Problem / Solution / Blind Alleys.
Sorry about the delay, I've been all over the place recently.
Problem:
8 hard drives in a RAID 5 (7 active, 1 spare). Somehow, when the computer went off (power surge or me hitting the cord, the world will never know!) and came back up, the hard drives were picked up in a different order. I'm not sure how this happened, but if I did an mdadm --examine on /dev/sdb, the output would come back and tell me that it was /dev/sda in the RAID. Which meant that when I did an --assemble, everything was all out of order and it most certainly didn't work.
Solution:
What I eventually had to do was stop the bad array, and after VERY CAREFULLY making sure that all drives were recognized by the system as what the superblock thought they were, and using the 'most correct' looking superblock for the ordering of the disks, I used mdadm --create to re-create md0.
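Condensed, the sequence I used was the following (device names and ordering are specific to this array; --create with the wrong order or chunk size will scramble the data, so treat this strictly as a sketch of the technique):

```shell
# Stop the broken array, then re-create it over the same members.
# "missing" holds the slot of the failed/removed disk so no parity
# rebuild starts; the remaining order MUST match the old superblock
# slots exactly.
mdadm --stop /dev/md0
mdadm --create /dev/md0 --chunk=64 --level=raid5 --raid-devices=7 \
      missing /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdh1 /dev/sde1

# Mount read-only and verify the data before touching anything else.
mount -o ro /dev/md0 /share
```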
Blind Alleys:
I read online about using an older RAID toolset, mkraid, which seems to have been the norm before I started playing around with this stuff. It seems like most everything new doesn't use mkraid any more, and I couldn't find any packages for it. Even if I had, I'm not sure how well it would have worked. Also, I tried throwing in another hard drive and re-installing the operating system (which I was going to do anyway; I was having a lot of problems with the current install due to my shoddy managing). I thought maybe with a fresh install the RAID would just magically work. I am, however, no wizard, so this was not the case.
As a side note, after I re-created the RAID and copied everything off, I restarted the PC. When it came back up, the RAID was hosed again: the mdadm.conf file was not updated with the newly --created array, so every time I restarted the PC, I had to do a --create again. The PC running this is a new Debian install, but I have an x64 PC with the new Ubuntu distro that had the same issue with a mirror RAID. I found a command, mdadm --examine --scan --config=mdadm.conf >> /etc/mdadm/mdadm.conf, that will supposedly fix my problem, so we'll see!
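For what it's worth, another commonly suggested way to make a new array survive reboots on Debian-style systems (assuming the config file lives at /etc/mdadm/mdadm.conf) is to append the scanned ARRAY lines and rebuild the initramfs:

```shell
# --detail --scan prints one ARRAY line per running array, including
# the new UUID that --create generated.
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

# On Debian/Ubuntu, early boot reads mdadm.conf from the initramfs,
# so it has to be regenerated as well.
update-initramfs -u
```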
If anyone has any questions about anything I did, please feel free to ask! I promise I will be more diligent about responding in the future.
+7! Really good exposition of your solution. Thanks. Worth waiting for.
BTW, any suggestions about prevention? -- Anything you could have backed up that would have made the recovery process easier? Although I doubt they would have helped you, things like the MBR via dd, or partitioning info via sfdisk -d.
I will say that the one thing I wish I had available to me was the actual last working raid configuration, in terms of the ordering of the disks. When I went to re-create md0, in order to find out the ordering of the disks in the raid, I had to go through and --examine every disk until I found one that (very luckily) had a complete list of the disks in order. I really had no idea if it was the correct order (or if the ordering of the disks even matters, I assume it does). If I had taken a copy of a --detail of md0 before everything went wonky, I would have had a lot more confidence in re-creating it.
Come to think of it, I think when I originally created the array (back in '06 or '07) I backed up the mkfs info and the mdstat information, but after adding more disks it likely would not have been very helpful. I also had it stored on the raid array itself. I would not recommend doing that.
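For anyone wanting a prevention checklist from all this: a small snapshot like the following (the backup path is hypothetical; store it anywhere except the array itself) would have made the re-create step much less of a guessing game:

```shell
# Save the array layout and partition tables somewhere OFF the array.
mkdir -p /root/raid-backup
mdadm --detail /dev/md0        > /root/raid-backup/md0-detail.txt
mdadm --examine /dev/sd[a-h]1  > /root/raid-backup/md0-examine.txt
cat /proc/mdstat               > /root/raid-backup/mdstat.txt
sfdisk -d /dev/sda             > /root/raid-backup/sda-ptable.txt
```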