OK, this is going to be a long one (lots of logs and stuff). Thanks in advance for reading it, and possibly helping me save my data.
I installed new disks into my system and wanted to set them up using RAID.
Here is my old config:
Quote:
/dev/hda: 200GB
/dev/hda1: 20GB (not used)
/dev/hda2: 180GB (mostly full - non essential data)
/dev/hdb: 30GB
/dev/hdb1: 20GB (root filesystem, mostly full)
/dev/hdb2: 2GB swap
/dev/hdb3: used to be a Windows install, not used anymore
Then I installed two new 200GB SATA drives. My new config would be:
Quote:
/dev/hda: 200GB
/dev/hda1: 20GB
/dev/hda2: 180GB
/dev/sda: 200GB
/dev/sda1: 180GB
/dev/sda2: 20GB
/dev/sdb: 200GB
/dev/sdb1: 180GB
/dev/sdb2: 20GB
/dev/md0: RAID1, 20GB (extra safe... I can lose two of the three drives and still boot up OK)
/dev/sda2
/dev/sdb2
/dev/hda1
/dev/md1: RAID5, 360GB (I can lose one of the three drives and still have all my data, plus it's combined into a single large volume)
/dev/sda1
/dev/sdb1
/dev/hda2
Now, in order to migrate my data, here was my plan:
1. Install the two new drives.
2. Create a RAID5 array, md1, from sda1 & sdb1, with 'missing' as the third device (so it runs in degraded mode).
3. Copy all my data from /dev/hda2 to /dev/md1.
4. Add /dev/hda2 to /dev/md1 and let it resync the parity (rough command sketch below).
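The commands for steps 2-4 were roughly these (reconstructed from memory, so the mount points and filesystem type are approximate):
Code:
# Step 2: create md1 degraded, with 'missing' standing in for the third disk
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 missing
# Step 3: put a filesystem on it and copy the data across
mkfs.ext3 /dev/md1
mount /dev/md1 /mnt/raid
cp -a /data/. /mnt/raid/    # /data is where hda2 was mounted
# Step 4: free up hda2 and add it as the real third member
umount /dev/hda2
mdadm /dev/md1 --add /dev/hda2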
Now this is where I got stuck (I haven't gotten to md0 yet). The data seemed to copy fine; I unmounted hda2 and then added it to the md1 array. It started resyncing and got to maybe 10% fine, but after a while something odd happened. Each time I ran cat /proc/mdstat it would cycle: one run would show the resync at 0%, the next would say "resync=DELAYED", and the next wouldn't show a resync at all.
And it started generating HUGE amounts of logs in /var/log/syslog and /var/log/messages, about 350 MB before I stopped sysklogd & klogd (because my disk is already almost full). Here is the end of the log output:
Code:
Dec 4 23:22:33 drorex kernel: ................<6>md: syncing RAID array md1
Dec 4 23:22:33 drorex kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Dec 4 23:22:33 drorex kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Dec 4 23:22:33 drorex kernel: md: using 128k window, over a total of 175783104 blocks.
Dec 4 23:22:33 drorex kernel: md: md1: sync done.
Dec 4 23:22:33 drorex kernel: ................<6>md: syncing RAID array md1
Dec 4 23:22:33 drorex kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Dec 4 23:22:33 drorex kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Dec 4 23:22:33 drorex kernel: md: using 128k window, over a total of 175783104 blocks.
Dec 4 23:22:33 drorex kernel: md: md1: sync done.
Dec 4 23:22:33 drorex kernel: ................<6>md: syncing RAID array md1
Dec 4 23:22:33 drorex kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Dec 4 23:22:33 drorex kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Dec 4 23:22:33 drorex kernel: md: using 128k window, over a total of 175783104 blocks.
Dec 4 23:22:33 drorex kernel: md: md1: sync done.
Dec 4 23:22:33 drorex kernel: ................<6>md: syncing RAID array md1
Dec 4 23:22:33 drorex kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Dec 4 23:22:33 drorex kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Dec 4 23:22:33 drorex kernel: md: using 128k window, over a total of 175783104 blocks.
Dec 4 23:22:33 drorex kernel: md: md1: sync done.
Dec 4 23:22:33 drorex exiting on signal 15
It's basically the same thing repeating over and over.
So then I stopped the array and tried restarting it, but now it says my disks have failed or something?
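For reference, I stopped it with something like this (typed from memory, so take the exact invocation with a grain of salt):
Code:
umount /dev/md1
mdadm --stop /dev/md1
Then the reassembly attempt: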
Code:
root@drorex:~# mdadm --assemble /dev/md1 /dev/sda1 /dev/sdb1 /dev/hda2
mdadm: /dev/md1 assembled from 1 drive and 1 spare - not enough to start the array.
root@drorex:~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md1 : inactive sda1[0] hda2[3] sdb1[1]
527373440 blocks
unused devices: <none>
Here is the output of mdadm --examine for all of the partitions in md1:
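(I ran it against all three devices in one go; as far as I know --examine accepts a list:)
Code:
mdadm --examine /dev/sda1 /dev/sdb1 /dev/hda2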
Code:
/dev/sda1:
Magic : a92b4efc
Version : 00.90.01
UUID : aef026ea:a658dd3d:d83036ce:4ad342a2
Creation Time : Sun Dec 4 18:27:00 2005
Raid Level : raid5
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Update Time : Sun Dec 4 23:24:16 2005
State : clean
Active Devices : 1
Working Devices : 2
Failed Devices : 3
Spare Devices : 1
Checksum : b33aa022 - correct
Events : 0.1049267
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 0 8 1 0 active sync /dev/sda1
0 0 8 1 0 active sync /dev/sda1
1 1 0 0 1 faulty removed
2 2 0 0 2 faulty removed
3 3 3 2 2 spare /dev/hda2
/dev/sdb1:
Magic : a92b4efc
Version : 00.90.01
UUID : aef026ea:a658dd3d:d83036ce:4ad342a2
Creation Time : Sun Dec 4 18:27:00 2005
Raid Level : raid5
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Update Time : Sun Dec 4 22:57:33 2005
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 1
Spare Devices : 1
Checksum : b31b8bf7 - correct
Events : 0.31676
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 1 8 17 1 active sync /dev/sdb1
0 0 8 1 0 active sync /dev/sda1
1 1 8 17 1 active sync /dev/sdb1
2 2 0 0 2 faulty removed
3 3 3 2 2 spare /dev/hda2
/dev/hda2:
Magic : a92b4efc
Version : 00.90.01
UUID : aef026ea:a658dd3d:d83036ce:4ad342a2
Creation Time : Sun Dec 4 18:27:00 2005
Raid Level : raid5
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Update Time : Sun Dec 4 23:24:16 2005
State : clean
Active Devices : 1
Working Devices : 2
Failed Devices : 3
Spare Devices : 1
Checksum : b33aa01f - correct
Events : 0.1049267
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 3 3 2 3 spare /dev/hda2
0 0 8 1 0 active sync /dev/sda1
1 1 0 0 1 faulty removed
2 2 0 0 2 faulty removed
3 3 3 2 3 spare /dev/hda2
It looks like something weird is going on, with the conflicting reports of 'faulty' and 'spare'. Also, sda1 and hda2 both show Events : 0.1049267, while sdb1 is stuck at 0.31676 with an update time of 22:57, so sdb1's superblock apparently stopped being updated well before the others.
Again, the raid array was initially created with /dev/sda1 & /dev/sdb1, then /dev/hda2 was added to it.
Is there some way I can reset the flags so the disks aren't marked as 'failed'?
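From what I've read, forcing the assembly might get mdadm to ignore the stale superblock on sdb1, but I haven't dared to run it without a second opinion (the --force flag here is my guess at the right approach, not something I've tested):
Code:
mdadm --stop /dev/md1
mdadm --assemble --force /dev/md1 /dev/sda1 /dev/sdb1 /dev/hda2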
I'd really like a way to do this without losing my data; I know it's all in there somewhere.
Thanks again to anyone who can help. I have a bunch of stuff in there that I don't want to lose.