Old 01-06-2011, 10:53 PM   #1
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Rep: Reputation: 7
Rats! Degraded Software Raid 5 Issue


Hi there!

I'm running a RAID 5 on an older machine using mdadm. I've had it running for a few years now and have been able to recover from every RAID failure so far, but this one has thrown me for a bit of a loop. Here's the issue:

Somehow four drives got kicked out of the array. It's happened before, no biggie: just restart the PC (to re-discover the drives, it's an old desktop) and re-add. Voila, I've done it 2-3 times this way. This time, however, this happened:

Code:
/dev/md0:
        Version : 00.90
  Creation Time : Thu Aug  2 20:58:10 2007
     Raid Level : raid5
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 7
  Total Devices : 8
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Jan  3 00:52:31 2011
          State : active, degraded, Not Started
 Active Devices : 5
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 3

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : bb177475:83977a04:26ebaf7a:a12071a2
         Events : 0.1137414

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8        1        3      active sync   /dev/sda1
       4       0        0        4      removed
       5       8       65        5      active sync   /dev/sde1
       6       0        0        6      removed

       7       8       97        -      spare   /dev/sdg1
       8       8       81        -      spare   /dev/sdf1
       9       8      113        -      spare   /dev/sdh1
For some reason, when I re-added the drives, they went in as spares. I'm thinking the array is hosed, but the major/minor numbers still line up, and when I do an --examine on one of the disks:

Code:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : bb177475:83977a04:26ebaf7a:a12071a2
  Creation Time : Thu Aug  2 20:58:10 2007
     Raid Level : raid5
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 2930303616 (2794.56 GiB 3000.63 GB)
   Raid Devices : 7
  Total Devices : 8
Preferred Minor : 0

    Update Time : Sat Jan  1 07:35:27 2011
          State : active
 Active Devices : 7
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 60e67ffe - correct
         Events : 1137414

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8      113        5      active sync   /dev/sdh1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8        1        1      active sync   /dev/sda1
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       49        3      active sync   /dev/sdd1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8      113        5      active sync   /dev/sdh1
   6     6       8       65        6      active sync   /dev/sde1
   7     7       8       97        7      spare   /dev/sdg1
All the event counts seem to line up, too...

Code:
ENTROPY:/home/captainmullet# mdadm --examine /dev/sd[a-h]1 | grep Event
         Events : 1137414
         Events : 1137414
         Events : 1137414
         Events : 1137414
         Events : 1137414
         Events : 1137414
         Events : 1137414
         Events : 1137414
My question, I suppose, is this: can I force the --examine data from the good RAID drives onto the ones that somehow think they're spares now, in order to thrust them back into the array? Or maybe change the information from --detail? I've tried assembling and rebuilding, and even creating a new operating system partition and re-installing Debian. I'm not entirely sure where to go from here (besides to find a punching bag, I've got some frustration to work out!).
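
For reference, the assemble attempts I've made so far look roughly like this (no luck yet). It's only a sketch of what I've been typing; the device names are just the ones on my box:

Code:
# Stop the half-assembled array first
mdadm --stop /dev/md0

# Try to assemble from the existing superblocks, letting mdadm
# ignore small event-count mismatches
mdadm --assemble --force /dev/md0 /dev/sd[a-h]1

# If it ever starts, I'd mount read-only first and sanity-check the data
mount -o ro /dev/md0 /mnt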

Thanks in advance!
 
Old 01-07-2011, 11:21 AM   #2
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Original Poster
Rep: Reputation: 7
Alright, I was poking around and found this (taken from http://tldp.org/HOWTO/html_single/So...D-HOWTO/#ss8.1):

Quote:
8.1 Recovery from a multiple disk failure

The scenario is:

* A controller dies and takes two disks offline at the same time,
* All disks on one scsi bus can no longer be reached if a disk dies,
* A cable comes loose...

In short: quite often you get a temporary failure of several disks at once; afterwards the RAID superblocks are out of sync and you can no longer init your RAID array.

If using mdadm, you could first try to run:

mdadm --assemble --force

If not, there's one thing left: rewrite the RAID superblocks by mkraid --force

To get this to work, you'll need to have an up to date /etc/raidtab - if it doesn't EXACTLY match devices and ordering of the original disks this will not work as expected, but will most likely completely obliterate whatever data you used to have on your disks.

Look at the syslog produced by trying to start the array; you'll see the event count for each superblock. Usually it's best to leave out the disk with the lowest event count, i.e. the oldest one.

If you mkraid without failed-disk, the recovery thread will kick in immediately and start rebuilding the parity blocks - not necessarily what you want at that moment.

With failed-disk you can specify exactly which disks you want to be active and perhaps try different combinations for best results. BTW, only mount the filesystem read-only while trying this out... This has been successfully used by at least two guys I've been in contact with.
I'll be looking into the use of mkraid when I get back - has anyone used it with any great (or not so great) success?
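
If I do end up going the raidtab route, my understanding is the file would have to look something like the sketch below. The disk ordering here is just lifted from the --examine output in my first post and still needs to be verified, so treat it as a guess rather than a working config:

Code:
# /etc/raidtab (sketch only -- ordering taken from one disk's superblock)
raiddev /dev/md0
        raid-level              5
        nr-raid-disks           7
        nr-spare-disks          1
        persistent-superblock   1
        parity-algorithm        left-symmetric
        chunk-size              64
        device                  /dev/sdb1
        raid-disk               0
        device                  /dev/sda1
        raid-disk               1
        device                  /dev/sdc1
        raid-disk               2
        device                  /dev/sdd1
        raid-disk               3
        device                  /dev/sdf1
        raid-disk               4
        device                  /dev/sdh1
        raid-disk               5
        device                  /dev/sde1
        raid-disk               6
        device                  /dev/sdg1
        spare-disk              0
Per the HOWTO quoted above, a suspect member could be listed with "failed-disk N" instead of "raid-disk N" to keep the recovery thread from kicking in right away.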
 
Old 01-07-2011, 12:05 PM   #3
Noway2
Senior Member
 
Registered: Jul 2007
Distribution: Gentoo
Posts: 2,125

Rep: Reputation: 781
I had a similar problem once, running two drives in a software RAID 1. I replaced one of the drives and it came back saying that it was a spare. After the array rebuilt to 100%, the drive was listed as an active mirror in the array rather than a spare. Check the status of the array and see if it is still rebuilding.
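
Checking for an in-progress rebuild only takes a moment; something along these lines (just a sketch):

Code:
# A rebuild shows up here as a "recovery" line with a progress bar
cat /proc/mdstat

# Or pull the state and rebuild progress straight from mdadm
mdadm --detail /dev/md0 | grep -E 'State|Rebuild'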
 
Old 01-09-2011, 12:40 AM   #4
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Original Poster
Rep: Reputation: 7
I was in the process of making an /etc/raidtab file (I didn't have one for some reason) and I noticed something strange.

Code:
ENTROPY:/home/captainmullet# mdadm --examine /dev/sd[a-h]1 | grep this
this     3       8       49        3      active sync   /dev/sdd1
this     0       8       17        0      active sync   /dev/sdb1
this     1       8        1        1      active sync   /dev/sda1
this     2       8       33        2      active sync   /dev/sdc1
this     5       8      113        5      active sync   /dev/sdh1
this     8       8       81       -1      spare   /dev/sdf1
this     7       8       65       -1      spare   /dev/sde1
this     9       8       97       -1      spare   /dev/sdg1
The disks got picked up in the wrong order. What the system now sees as /dev/sda has a superblock saying it was /dev/sdd in the array. I'll try to get the disks recognized in the right order again, and then see if I can assemble it.
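
To keep track of which physical disk the system calls what versus what its superblock thinks it is, I've been running something like this (a quick sketch; adjust the device glob to suit):

Code:
# For each member, print the kernel's current name for the disk
# next to the slot its RAID superblock says it occupies
for d in /dev/sd[a-h]1; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'UUID|this'
done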
 
Old 01-10-2011, 08:25 AM   #5
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Original Poster
Rep: Reputation: 7
Well, I got the drives all recognized and the raidtab file created, but alas, it seems that mkraid is a command I don't have, and it's been replaced by mdadm for the most part.

This thread (external site): http://www.linuxforums.org/fo...id6-array.html
mentions perhaps creating a new array over the old one, but before I do that I want to look around for other options. I'm not familiar with mddump at all, so I'll research that as well. I'm willing to do some experimenting because I have the absolutely critical data backed up elsewhere, but I would still very much like to recover the data I don't have backed up.
I suppose I'm using this thread as a pseudo-blog, but I'll continue to document my blind blundering in case it might help someone else (or me!).
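
Before I try anything destructive, I'm at least saving a copy of every member's superblock details somewhere off the array. Something along these lines (a sketch; the /root/raid-notes path is just where I happen to be putting it):

Code:
# Keep a record of UUIDs, event counts and slot order for each member,
# stored off the array in case the superblocks get clobbered
mkdir -p /root/raid-notes
for d in /dev/sd[a-h]1; do
    mdadm --examine "$d" > /root/raid-notes/examine-$(basename "$d").txt
done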
 
Old 01-10-2011, 08:29 PM   #6
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Original Poster
Rep: Reputation: 7
Alrighty,

This site: http://www.issociate.de/board/post/3..._arrays_?.html
describes using mdadm --create to do what I think I want to do. So...

Code:
# mdadm --stop /dev/md0
mdadm: stopped /dev/md0

# mdadm --create /dev/md0 --chunk=64 --level=raid5 --raid-devices=7 missing /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdh1 /dev/sde1
mdadm: /dev/sda1 appears to be part of a raid array:
    level=raid5 devices=7 ctime=Thu Aug  2 20:58:10 2007
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=7 ctime=Thu Aug  2 20:58:10 2007
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=7 ctime=Thu Aug  2 20:58:10 2007
mdadm: /dev/sdf1 appears to be part of a raid array:
    level=raid5 devices=7 ctime=Thu Aug  2 20:58:10 2007
mdadm: /dev/sdh1 appears to be part of a raid array:
    level=raid5 devices=7 ctime=Thu Aug  2 20:58:10 2007
mdadm: /dev/sde1 appears to contain an ext2fs file system
    size=-504831356K  mtime=Sun Dec  5 23:00:10 1976
mdadm: /dev/sde1 appears to be part of a raid array:
    level=raid5 devices=7 ctime=Thu Aug  2 20:58:10 2007
Continue creating array? yes
Hold onto your butts...
 
Old 01-10-2011, 08:38 PM   #7
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Original Poster
Rep: Reputation: 7
Yahoo!

Code:
Continue creating array? yes
mdadm: array /dev/md0 started.

# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Jan 10 21:06:47 2011
     Raid Level : raid5
     Array Size : 2930303616 (2794.56 GiB 3000.63 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 7
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Jan 10 21:06:47 2011
          State : clean, degraded
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : d63ce4bb:dc035564:4542cb33:ddca6f8c (local to host ENTROPY)
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8        1        1      active sync   /dev/sda1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       8       81        4      active sync   /dev/sdf1
       5       8      113        5      active sync   /dev/sdh1
       6       8       65        6      active sync   /dev/sde1
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active (auto-read-only) raid5 sde1[6] sdh1[5] sdf1[4] sdd1[3] sdc1[2] sda1[1]
      2930303616 blocks level 5, 64k chunk, algorithm 2 [7/6] [_UUUUUU]

unused devices: <none>
# mkdir /share
# mount -o ro /dev/md0 /share
# cd /share/
/share# dir
*ALL MY STUFF!* 
/share#

It's in the process of transferring off now. Once I get it all backed up, I'm going to try re-adding the disk I set as 'missing', and then re-add the final disk as a spare. I'll update with final results!
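
For the record, the plan for that last part is just a couple of --add calls once the copy is finished. A sketch only; /dev/sdb1 and /dev/sdg1 are simply the two disks I left out of the --create, so I'll double-check the names before running anything:

Code:
# Add a disk into the slot that was created as "missing";
# mdadm starts rebuilding parity onto it
mdadm --add /dev/md0 /dev/sdb1

# Once that rebuild finishes, add the last disk; with all seven
# slots filled it should sit in the array as a hot spare
mdadm --add /dev/md0 /dev/sdg1

# Keep an eye on the rebuild
cat /proc/mdstat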

I exploded in Skype when it worked, and my friend summed it up pretty well:

Quote:
i love how messed up computing is
that when something works the way you think it should, it's like you just slew a dragon
 
Old 01-12-2011, 08:31 AM   #8
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234
"Yahoo!" == "[SOLVED]"?
 
Old 01-12-2011, 08:37 AM   #9
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Original Poster
Rep: Reputation: 7
Yep, I was going to leave it open until I had finished backing everything up and restored the array to full working order, but everything is moving along at a pretty nice clip, so I can probably go ahead and mark this solved.
 
Old 01-13-2011, 07:41 AM   #10
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234
Quote:
Originally Posted by aleinin View Post
I'll update with final results!
I'm waiting w/ bated breath.

Seriously, I'd like to know that it was fully resolved, plus, if you're willing, an overview of what worked & what didn't. You would/will organize it your way; my approach would be:
Problem
Solution
Blind Alleys

Last edited by archtoad6; 01-13-2011 at 07:46 AM. Reason: add serious
 
Old 02-03-2011, 10:29 PM   #11
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Original Poster
Rep: Reputation: 7

Alrighty!

Sorry about the delay, I've been all over the place recently.

Problem:
8 hard drives in a RAID 5 (7 active, 1 spare). Somehow, when the computer went off (power surge or me hitting the cord, the world will never know!) and came back up, the hard drives were picked up in a different order. I'm not sure how this happened, but if I did an mdadm --examine on /dev/sdb, the superblock would come back and tell me that it was /dev/sda in the RAID, which meant that when I did an --assemble, everything was out of order and it most certainly didn't work.

Solution:
What I eventually had to do was stop the bad array, VERY CAREFULLY make sure that every drive was recognized by the system as what its superblock thought it was, pick the 'most correct'-looking superblock for the ordering of the disks, and then use mdadm --create to re-create md0.
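
Condensed, the sequence that worked was roughly the following (the device names and their order came from the --examine output on my disks, so treat this as a sketch rather than something to copy blindly):

Code:
# Stop the broken array
mdadm --stop /dev/md0

# Confirm which slot each currently-named disk really belongs in
mdadm --examine /dev/sd[a-h]1 | grep -E 'UUID|this'

# Re-create the array with the same geometry (level, chunk size,
# device count) and the members listed in superblock order, with
# "missing" in place of one disk so no parity rebuild starts
mdadm --create /dev/md0 --chunk=64 --level=raid5 --raid-devices=7 \
    missing /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdh1 /dev/sde1

# Mount read-only and verify the data before touching anything else
mount -o ro /dev/md0 /share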

Blind Alleys:
I read online about using another RAID configuration tool, mkraid, which seems to have been the norm before I started playing around with this stuff. Most newer setups don't seem to use mkraid, and I couldn't find any packages to run it. Even if I had, I'm not sure how well it would have worked. Also, I tried throwing in another hard drive and re-installing the operating system (which I was going to do anyway; I was having a lot of problems with the current install due to my shoddy managing). I thought maybe with a fresh install the RAID would just magically work. I am, however, no wizard, so this was not the case.

As a side note, after I re-created the RAID and copied everything off, I restarted the PC. When it came back up, the RAID was hosed again: the mdadm.conf file had not been updated with the newly --created array. This repeated itself in that every time I restarted the PC, I had to do a --create again. The PC running this is a new Debian install, but I have an x64 PC with the new Ubuntu distro that had the same issue with a mirror RAID. I found a command, mdadm --examine --scan --config=mdadm.conf >> /etc/mdadm/mdadm.conf, that will supposedly fix my problem, so we'll see!
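
If anyone else hits the same "array gone after a reboot" problem, my understanding (on Debian, anyway, so treat this as a sketch) is that the fix amounts to regenerating the ARRAY line and then rebuilding the initramfs:

Code:
# Append an ARRAY line describing the current array to mdadm.conf
mdadm --examine --scan >> /etc/mdadm/mdadm.conf

# Debian/Ubuntu keep a copy of mdadm.conf inside the initramfs,
# so rebuild it so the array assembles correctly at boot
update-initramfs -u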

If anyone has any questions about anything I did, please feel free to ask! I promise I will be more diligent about responding in the future.
 
1 member found this post helpful.
Old 02-05-2011, 07:29 AM   #12
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234
+7! Really good exposition of your solution. Thanks. Worth waiting for.

BTW, any suggestions about prevention? Anything you could have backed up that would have made the recovery process easier? Although I doubt they would have helped you here, I'm thinking of things like the MBR via dd, or partitioning info via sfdisk -d.
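
For anyone reading along, those backups are only a command or two each. Something like this (paths & disk names are just examples):

Code:
# Dump the partition table of a member disk in a form sfdisk can restore
sfdisk -d /dev/sda > /root/sda-parttable.txt

# Save the MBR (boot code + partition table) of the boot disk
dd if=/dev/sda of=/root/sda-mbr.bin bs=512 count=1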
 
Old 02-05-2011, 01:28 PM   #13
aleinin
LQ Newbie
 
Registered: Jan 2011
Posts: 9

Original Poster
Rep: Reputation: 7
I will say that the one thing I wish I had available to me was the actual last working raid configuration, in terms of the ordering of the disks. When I went to re-create md0, in order to find out the ordering of the disks in the raid, I had to go through and --examine every disk until I found one that (very luckily) had a complete list of the disks in order. I really had no idea if it was the correct order (or if the ordering of the disks even matters, I assume it does). If I had taken a copy of a --detail of md0 before everything went wonky, I would have had a lot more confidence in re-creating it.
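
Going forward, I'll probably just dump the array layout somewhere off the array whenever it changes (or from cron). Something like this sketch; the /root/raid-backup path is just what I have in mind, not something I've set up yet:

Code:
# Record the array layout and each member's superblock off the array
mkdir -p /root/raid-backup
mdadm --detail /dev/md0 > /root/raid-backup/md0-detail.txt
mdadm --examine --scan > /root/raid-backup/mdadm-arrays.txt
mdadm --examine /dev/sd[a-h]1 > /root/raid-backup/members-examine.txt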

Come to think of it, I think when I originally created the array (back in '06 or '07) I backed up the mkfs info and the mdstat information, but after adding more disks it likely would not have been very helpful. I also had it stored on the raid array itself. I would not recommend doing that.
 
  

