recover dirty, degraded software raid 1 after power failure

Rascale · 07-29-2008, 11:26 AM

Hi,

I have a Red Hat ES 3 box with a software RAID 1 array that lost power (my boss was trying to figure out what was plugged into the UPS. Oops, he found out!). It booted back up, but the array came up dirty and degraded.

----------------- dmesg -----------------
md: superblock update time inconsistency -- using the most recent one
md: freshest: sdc1
md: kicking non-fresh sdb1 from array!

----------------------------------------
# mdadm -Q --detail /dev/md0
/dev/md0:
Version : 00.90.00
Creation Time : Tue Feb 15 07:30:50 2005
Raid Level : raid1
Array Size : 35881024 (34.22 GiB 36.74 GB)
Device Size : 35881024 (34.22 GiB 36.74 GB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Thu Jul 17 01:59:43 2008
State : dirty, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0

UUID : 335f77df:7b4443c4:45a84a39:94c75971
Events : 0.56

    Number   Major   Minor   RaidDevice State
       0       0        0        0      faulty removed
       1       8       33        1      active sync   /dev/sdc1
----------------------------------------

/dev/sdb1 thinks everything is OK, although its superblock was last updated in May 2007 (events 0.48):

# mdadm -E /dev/sdb1
/dev/sdb1:
Magic : a92b4efc
Version : 00.90.00
UUID : 335f77df:7b4443c4:45a84a39:94c75971
Creation Time : Tue Feb 15 07:30:50 2005
Raid Level : raid1
Device Size : 35881024 (34.22 GiB 36.74 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0

Update Time : Tue May 29 03:24:41 2007
State : dirty
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Checksum : bcd14504 - correct
Events : 0.48

      Number   Major   Minor   RaidDevice State
this     0       8       17        0      active sync   /dev/sdb1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8       33        1      active sync   /dev/sdc1
----------------------------------------

/dev/sdc1 complains:

# mdadm -E /dev/sdc1
/dev/sdc1:
Magic : a92b4efc
Version : 00.90.00
UUID : 335f77df:7b4443c4:45a84a39:94c75971
Creation Time : Tue Feb 15 07:30:50 2005
Raid Level : raid1
Device Size : 35881024 (34.22 GiB 36.74 GB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 0

Update Time : Thu Jul 17 01:59:43 2008
State : dirty
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Checksum : bef44f37 - correct
Events : 0.56

      Number   Major   Minor   RaidDevice State
this     1       8       33        1      active sync   /dev/sdc1

   0     0       0        0        0      faulty removed
   1     1       8       33        1      active sync   /dev/sdc1
----------------------------------------
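
The two superblocks above disagree on the event counter (0.48 on sdb1, 0.56 on sdc1), which lines up with the dmesg message about sdb1 being non-fresh. Here is a minimal shell sketch for pulling out and comparing those counters. It assumes the 0.90 superblock format prints the counter as two dot-separated integers, as in the output above. The function names md_events and compare_events are made up for illustration and are not part of mdadm.

```shell
# Print the "Events" value from `mdadm -E` output read on stdin.
md_events() {
    awk -F' : ' '$1 ~ /^[ \t]*Events$/ { print $2; exit }'
}

# Compare two event counters of the form "N.M" (assumed format).
# Prints "first", "second" or "equal" depending on which is higher.
compare_events() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "[.]"); split(b, y, "[.]")
        for (i = 1; i <= 2; i++) {
            if (x[i] + 0 > y[i] + 0) { print "first"; exit }
            if (x[i] + 0 < y[i] + 0) { print "second"; exit }
        }
        print "equal"
    }'
}

# Possible usage (needs root, device names taken from this thread):
#   old=$(mdadm -E /dev/sdb1 | md_events)
#   new=$(mdadm -E /dev/sdc1 | md_events)
#   compare_events "$old" "$new"
```

With the values posted here (0.48 versus 0.56) the comparison reports the second device as the fresher one, so sdc1 holds the data to keep.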

What's the best way to recover from this? Can I just force a fresh new superblock onto /dev/sdb1? I'm pretty sure the drive is OK. I do have a spare drive available.

Thanks!

->R

mostlyharmless · 07-29-2008, 12:51 PM

This looks like a good time to take a backup first.

I don't *know*, but I would think that using --add to put /dev/sdb1 back into /dev/md0 should work. Using the spare first might be safer. I'm interested to see if anyone else has a better idea.

kenoshi · 07-30-2008, 03:06 PM

Do a spot backup first before you do anything else.

Rebuild the array. A power outage causes all kinds of problems unless you have a controller with a powered write cache.

If /dev/sdb is more than 3 years old, replace it with the spare. You should rotate out old drives once every 3 years anyway.

Forgot to add... tell your boss to fire himself.

Rascale · 07-31-2008, 12:00 PM

Thanks for your comments. Here's what I did to recover.
1. Shut down all processes and databases using the array. lsof /dev/md0 is your friend.
2. Take a full backup, in addition to the usual nightly ones.
3. Stop the array:
mdadm -S /dev/md0
4. Add the drive back into the array. In this case:
mdadm /dev/md0 --add /dev/sdb1
5. Sit back and watch the progress:
watch -n 1 cat /proc/mdstat
6. Restart. dmesg says:
raid1: device sdc1 operational as mirror 1
raid1: device sdb1 operational as mirror 0
raid1: raid set md0 active with 2 out of 2 mirrors
md: ... autorun DONE.
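
Step 5 can also be scripted instead of watched. Below is a rough sketch, assuming /proc/mdstat shows a "recovery = N%" line while the mirror resyncs. The function names rebuild_progress and wait_for_rebuild are hypothetical helpers, not part of mdadm. One caveat: as far as I know, mdadm --add works on a running array, so an array stopped as in step 3 would presumably need to be assembled again before step 4. The sketch does not cover that part.

```shell
# Print the recovery percentage from /proc/mdstat-style text on stdin.
# Prints nothing when no recovery line is present.
rebuild_progress() {
    sed -n 's/.*recovery *= *\([0-9.]*\)%.*/\1/p' | head -n 1
}

# Poll /proc/mdstat until no recovery line is left.
wait_for_rebuild() {
    while pct=$(rebuild_progress < /proc/mdstat) && [ -n "$pct" ]; do
        echo "recovery at ${pct}%"
        sleep 10
    done
    echo "no recovery in progress"
}

# Possible usage, after the drive has been added back:
#   wait_for_rebuild && cat /proc/mdstat
```

The loop only tells you that the resync has stopped being reported. Checking that /proc/mdstat shows both mirrors up afterwards is still worth doing by eye.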

Leave work early and make the boss do Helpdesk. Life is good.