Help with RAID LVM Rescue

skydiverscott · 04-14-2008, 04:42 PM

Here is my situation:
I have an old Intel PIII 500MHz box as our home Samba server running Fedora Core 5. I have 4 Maxtor HDD's setup in RAID/LVM running on two add-on Promise IDE Cards. The system has an IDE DVRAM drive on /dev/hda (Motherboard IDE Channel 0)and the four HDD's are on the promise card as Cable Select Masters
/dev/hde - Promise card #1 IDE1 master
/dev/hdg - Promise card #1 IDE2 master
/dev/hdi - Promise card #2 IDE1 master
/dev/hdk - Promise card #2 IDE2 master

Each drive is partitioned as follows:
/dev/hde1 - RAID 1 (/dev/md0) /dev/hd[eg]1 Contains /boot partition
/dev/hde2 - RAID 5 (/dev/md1) with hd[egik]2 with LVM of VolGrp00
/dev/hdg1 - RAID 1 (/dev/md0) /dev/hd[eg]1 Contains /boot partition
/dev/hdg2 - RAID 5 (/dev/md1) with hd[egik]2 with LVM of VolGrp00
/dev/hdi1 - Linux Swap
/dev/hdi2 - RAID 5 (/dev/md1) with hd[egik]2 with LVM of VolGrp00
/dev/hdk1 - Linux Swap
/dev/hdk2 - RAID 5 (/dev/md1) with hd[egik]2 with LVM of VolGrp00

The LVM has all the other partition including the primary SMB-Shared data partition containing about 350GB of data. The system has been up and running for over a year and half as a samba domain controller and file server to three Windoze XP boxes with no major problems.

My system overheated last week to the point that the plastic removable chassis drawers containing the hard drives actually got scorched and deformed due to the heat. /dev/hdk actually started emitting a continuous beep noise that alerted me to a problem and when i checked the unit, it was overheating and I could barley touch the thing. Needless to say I had issues. I put a box fan blowing on the drives to cool them off until I could get a new case and drive and everything worked for about 2 days.

When the system locked up after two days of running with the fan, it finally refused to boot stating only 2/4 devices in the RAID array were working.

The result (after my panic subsided enough for me read the howtos) was a broken RAID set. I started by running SPINRITE (from GRC.com) on ALL of the RAID member HDD's to try to repair any damage. The devices marked hd[egi] checked out AOK with only one of the drives having ANY sectors required recovery and were reported as having been successfully recovered. The overheated drive (hdk) took three days with about 10 sectors that were partially recovered and about 5 that were fully recovered.

After a trip to Microcenter to buy new fan cooled aluminum drive carriers and a 250GB drive to replace the hdk drive and two 500GB drives to make a local backup of the array once I got it back up and running, I reassembled the computer and attempted to start to go about rebuilding my RAID ARRAYS.

My first attempt was to try to boot a Live CD but for some reason, I cannot get any flavor of Linux Live CD distributions to run, probably due to the combination of (unused) SCSI on the Mother board in addition to the add-on IDE PCi cards. I finally dug up my original FC5 install disk and have been working in recover mode.

After getting to the bash prompt in recover mode I set to work to try to get the array up again via mdadm with the help of the LinuxJournal article #8874 on "Recovery of RAID and LVM2 Volumes". I double checked the geometry in sfdisk and mdadm --examine --scan and everthing looked good there. I used this command output to manually create the mdadm.conf file in my rescue session to set about re-creating the MD1 RAID volume. After attempting to do a mdadm -A -s to start the RAID array, I got an error reporting the two drives, one the hdk2 device and the other the hdg2 device.

After doing a mdadm -A -s -f, the error count on the hdg drive was reset and everything appeared to check out with the hdk device reporting as a spare but all the other three devices showing as up and working. I issued the mdadm --add for the spinrite repaired hdk drive for the md1 array and it started to rebuild the array. I thought I was golden and sat back while monitoring the status via cat /pros/mdstat. It got to about 25% when I next check the status and the hdg2 drive showed as having been dropped from the array and resyncing onto hdk2 stopped. After a reboot and re-building the mdadm.conf again, I attempted again by way of the --force option to reset the errors and this time got to about 30% before the same thing again.

When the drop happens while resyncing, hdg2 reports as faulty spare. My questions are:
Is there a way to force the resyncing?

Would it be possible to do a DD copy of hddg to a new HDD and attempt again with the new copy or is it probably that I have a problem with the data on the drive?

Could this be caused by problems with the target tape drive hdk? I haven't replaced this HDD yet with one of the new drives I purchased since it was now reporting as 100% usable in Spinrite, albeit the current data being useless after the data recovery.

If I did indeed lose two of the disks have I lost everything?

Any guidance would be accepted. Once I get the RAID set rebuilt, I will set about getting to the LVM group.

I need to get to my tax info and my edited wedding photos before my wife kills me!!

Thanks in advanced for any suggestions and help.

Simon Bridge · 04-15-2008, 02:11 AM

I put a box fan blowing on the drives to cool them off until I could get a new case and drive and everything worked for about 2 days.[/quote]And during those two days, nobody thought to do a backup?! How much warning do you need?

Quote:

When the system locked up after two days of running with the fan, it finally refused to boot stating only 2/4 devices in the RAID array were working.

That would normally be recoverable - but you need to know which devices are working.

Quote:

The overheated drive (hdk) took three days with about 10 sectors that were partially recovered and about 5 that were fully recovered.

Looks like you need to replace hdk.

Quote:

After attempting to do a mdadm -A -s to start the RAID array, I got an error reporting the two drives, one the hdk2 device and the other the hdg2 device.

That sentence isn't complete. Try just quoting the errors. But it looks like hdg and hdk are the drives to be replaced.

Quote:

Is there a way to force the resyncing?

No. That's not what RAID is for.

Quote:

Would it be possible to do a DD copy of hdg to a new HDD and attempt again with the new copy or is it probably that I have a problem with the data on the drive?

Your problem is with the physical surface of the drive. Anyway, hdg will have different stripes.

You could try wiping the misbehaving drive, and repartitioning. Hopefully the partition+format steps will work around the unusable bits.

Your recovery strategy should go as follows:

1. recover the raid1 group - that's mirrored, so you need only one partition. Create a backup. This should be easy.

2. recover the LVM - pull the data: create a backup. This will be harder - especially if you have damage to some parts of the volume. It is possible that the repairwork already attempted will have irretrievably damaged it.

3. rebuild the raid5 - you now have a lot of freedom to mess with the data. Either reformat the hdk or use the new drive.

4. restore data from backups where needed.

5. Implement a backup policy.

Notice that I am not saying anything about spinrite. This is because the claims made by the vendors do not pass a basic snake-oil test.
http://www.grc.com/sroverview.htm (etc)

Quote:

SpinRite prevents mass storage systems from crashing

This is not possible.
The method claimed is that the software "exercizes" the disk space with multiple read/writes ... which can be expected to increase the likelihood of a crash.

Quote:

why can't we make a failure-proof disk drive? The answer, of course, is that we could

Wrong: the answer is that it is physically impossible.
And so on - there is a huge mound of technobabble mixed with pseudoscience. Maybe it's a good product? The trouble is, I cannot tell.

It claims to be 100% guaranteed - but they don't say what in being guaranteed - you can test that by writing them and saying that you have a drive reported as 100% recovered by spinrite but which is not useable.

skydiverscott · 04-15-2008, 10:07 AM

Thank you Simon for your guidance...

Please pardon my not posting exact error or status messages. I am posting this from work (where I spend most of my waking hours) and posting by memory. Besides I am working on this recovery from within a Linux Rescue boot session and cannot cut & paste the errors and commands status output to post here. I have several notebook pages of all the commands I have been working with and output from those commands documented as well in scrupulous detail in pencil (I suggest for others trying to learn from this to always do the same).

I am not a complete Newbie with Linux (been a user for about 5 years now) and CAN and DO read the MAN files, so I generally don't need help with command strings of Linux tools. I try to learn them by research and what better way than when under pressure

What I am looking for is a general progression of recovery steps. I will lookup the exact commands once I know the methodology.

Your comments on my first report suggested:
Me:

Quote:

When the system locked up after two days of running with the fan, it finally refused to boot stating only 2/4 devices in the RAID array were working.

You:

Quote:

That would normally be recoverable - but you need to know which devices are working.

Are you saying that if I can assemble 2 of the original 4 partitions of the RAID5 md1 set I should be able to pull off the LVM onto backup media? I know which devices are good (hd[ei] are solid with hdg the one that drops out when trying to rebuild the array)

When I asked about forcing a re-syncing you responded

Quote:

No. That's not what RAID is for.

The question was in regard to re-syncing the replacement drive (hdk) into the RAID 5 array. That IS what RAID is for. The issue is that is starts to re-sync or maybe a better term is re-add the replacement drive into the array and gets to about 30-45% when the hdg device drops out of the RAID array and the synchronization terminates.

Your suggestions for progression where:

Quote:

1. recover the raid1 group - that's mirrored, so you need only one partition. Create a backup. This should be easy.

Recovery of the RAID1 set is done on /dev/hd[eg]. This partition only contains the /boot partition

Quote:

2. recover the LVM - pull the data: create a backup. This will be harder - especially if you have damage to some parts of the volume. It is possible that the repair work already attempted will have irretrievably damaged it.

I thought that I cannot recover the LVM until I have the RAID5 set back online. I have been trying to re-add the 4th device in the set to bring it back online first. Are you suggesting to try to pull the LVM off the set with only two or three of the devices up and basically punt on the resyncing of the replacement drive in position hdk? I haven't considered that...

Quote:

3. rebuild the raid5 - you now have a lot of freedom to mess with the data. Either reformat the hdk or use the new drive.

Already assumed but I placed this before the step above.

Quote:

4. restore data from backups where needed.

Will do so from what i already have

Quote:

5. Implement a backup policy.

I didn't have a local alternative for backup due to lack of storage space. I have an offline policy in place albeit a slow one in progress via Jungledisk. With the purchase of two shiny new 500GB drives on my trek to Microcenter, I now have the ability to have a local copy.

Based on your feedback above, here is what I have in my recovery plan now:

1. Add one of the 500GB drives into my system and format it as one partition.
2. Mount the 500GB drive
3. Re-Build the md1 RAID array with the three drives that are still working
4. Attempt to copy the LVM on the md1 device to the 500GB device.

Are there any tricks to doing this or once I mount the MD1 device will the LVM appear as one big file?

If I have similar issues with one of three devices dropping off during the copying of the LVM can I still pull data with only two of the devices working? As I understand RAID the answer is no but your response above suggested that I will.

----------------------------------------------------------------------------------

FYI - I have an offline backup solution implemented via jungledisk but due to Comcast throttling, I was still about 24 days away from completing my offsite backup of the data on this RAID array.

Also for the record: After taking the computer offline after discovering the HDD screaming for help, I ran Spinrite in data recovery mode on all four disks in the set to verify the damage done. After running Spinrite in data recovery mode and the hd[egi] drives coming up clean with only the drive (hdk) that was suffering the beeping having un-recoverable errors, I addressed the underlying cause o the crash (heat) and brought the system back online to start a local backup. Since I didn't have an extra 350GB of storage space lying around I backed up what I could of the 350GB onto the 150GB of media that I had until I could secure more media to backup to and replace the failed drives in the ARRAY with. It was during this process the system, finally locked up and then failed to re-boot showing the 2/4 message.

I appreciate your your interpretation of the claims on GRC's site but all I have to say is that this is a tool that has proved itself to me on numerous occasions, recovering data on drives that were pronounced completely DOA to my customers by many different high-dollar data recovery companies when I was a consultant. I suggest that you read a bit more about how Spinrite works before evaluating the claims on it's website, but it is inconsequential to the matter at hand and I really don't want to start a flame war.

Thank you for your assistance, I look forward to more of your seasoned advice.

Simon Bridge · 04-16-2008, 11:10 PM

I understood that your LVM volumes exist seperate from the raid volumes - alongside them.