Thank you Simon for your guidance...
Please pardon my not posting exact error or status messages: I am posting this from work (where I spend most of my waking hours) and working from memory. Besides, I am doing this recovery from within a Linux rescue boot session, so I cannot cut and paste the errors and command output to post here. I do have several notebook pages documenting, in scrupulous detail and in pencil, every command I have run and its output (I suggest others trying to learn from this always do the same).
I am not a complete newbie with Linux (I have been a user for about five years now) and I CAN and DO read the man pages, so I generally don't need help with the command strings of Linux tools. I try to learn them by research, and what better way than under pressure?
What I am looking for is a general progression of recovery steps. I will look up the exact commands once I know the methodology.
Your comments on my first report suggested:
Me:
Quote:
When the system locked up after two days of running with the fan, it finally refused to boot stating only 2/4 devices in the RAID array were working.
You:
Quote:
That would normally be recoverable - but you need to know which devices are working.
Are you saying that if I can assemble two of the original four partitions of the RAID5 md1 set, I should be able to pull the LVM off onto backup media? I know which devices are good: hd[ei] are solid, with hdg the one that drops out when trying to rebuild the array.
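To make my question concrete, here is the sort of forced assembly I have in mind (the "2" partition suffix is my guess at the layout, and as I understand it a four-disk RAID5 needs at least three members to start):

```shell
# Check what each member's superblock last recorded
mdadm --examine /dev/hde2 /dev/hdg2 /dev/hdi2 /dev/hdk2

# Force-assemble the RAID5 from the members that still respond;
# --run starts it degraded (three of four members is the minimum for RAID5)
mdadm --assemble --force --run /dev/md1 /dev/hde2 /dev/hdg2 /dev/hdi2

cat /proc/mdstat   # confirm md1 is up, even if degraded
```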
When I asked about forcing a re-sync, you responded:
Quote:
No. That's not what RAID is for.
The question was in regard to re-syncing the replacement drive (hdk) into the RAID5 array; that IS what RAID is for. The issue is that it starts to re-sync (or perhaps a better term is re-add) the replacement drive into the array, gets to about 30-45%, and then the hdg device drops out of the array and the synchronization terminates.
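For reference, the re-add sequence I have been attempting looks roughly like this (device names per my setup; the partition suffix is an assumption):

```shell
# Re-add the replacement drive into the running (degraded) array;
# this is the step where hdg has been dropping out at 30-45%
mdadm /dev/md1 --add /dev/hdk2

# Watch the rebuild progress
watch -n 5 cat /proc/mdstat
```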
Your suggestions for progression were:
Quote:
1. recover the raid1 group - that's mirrored, so you need only one partition. Create a backup. This should be easy.
Recovery of the RAID1 set is done on /dev/hd[eg]. This array contains only the /boot filesystem.
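Here is roughly how I plan to pull the /boot backup from a single mirror half (the md0 device name, the hde1 partition, and the backup paths are my assumptions; I will adjust to what the rescue session actually shows):

```shell
# Bring up the mirror from one good half and back up /boot read-only
mdadm --assemble --run /dev/md0 /dev/hde1
mkdir -p /mnt/boot
mount -o ro /dev/md0 /mnt/boot
tar czf /backup/boot.tar.gz -C /mnt/boot .
umount /mnt/boot
```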
Quote:
2. recover the LVM - pull the data: create a backup. This will be harder - especially if you have damage to some parts of the volume. It is possible that the repair work already attempted will have irretrievably damaged it.
I thought I could not recover the LVM until I had the RAID5 set back online, which is why I have been trying to re-add the fourth device first. Are you suggesting I try to pull the LVM off the set with only two or three of the devices up, and basically punt on re-syncing the replacement drive in position hdk? I hadn't considered that...
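If punting on the re-sync is the idea, I assume activating the LVM layer on top of a degraded md1 would look something like this (volume group names will be whatever my system actually uses):

```shell
# With md1 up (even degraded), the LVM layer should activate on top of it
pvscan           # should find the physical volume on /dev/md1
vgchange -ay     # activate the volume group(s)
lvscan           # list the logical volumes now available for mounting
```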
Quote:
3. rebuild the raid5 - you now have a lot of freedom to mess with the data. Either reformat the hdk or use the new drive.
Already planned, though I had placed this step before the one above.
Quote:
4. restore data from backups where needed.
Will do, from what I already have.
Quote:
5. Implement a backup policy.
I didn't have a local alternative for backup due to lack of storage space. I do have an offsite policy in place, albeit a slow one still in progress via Jungledisk. With the purchase of two shiny new 500GB drives on my trek to Microcenter, I now have the ability to keep a local copy.
Based on your feedback above, here is what I have in my recovery plan now:
1. Add one of the 500GB drives into my system and format it as one partition.
2. Mount the 500GB drive
3. Rebuild the md1 RAID array with the three drives that are still working
4. Attempt to copy the LVM on the md1 device to the 500GB drive.
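Sketched out in commands, the plan above would look something like this (the new drive's device name /dev/hdl1 and the volume group name VolGroup00 are placeholders for whatever my system actually reports):

```shell
# 1-2. Format and mount the new 500GB drive (device name assumed)
mkfs.ext3 /dev/hdl1
mkdir -p /mnt/rescue
mount /dev/hdl1 /mnt/rescue

# 3. Bring md1 up degraded with the three working members
mdadm --assemble --force --run /dev/md1 /dev/hde2 /dev/hdg2 /dev/hdi2

# 4. Activate the LVM and copy each logical volume's contents off
vgchange -ay
mkdir -p /mnt/lv
for lv in /dev/VolGroup00/*; do
    mount -o ro "$lv" /mnt/lv && \
    rsync -a /mnt/lv/ "/mnt/rescue/$(basename "$lv")/" && \
    umount /mnt/lv
done
```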
Are there any tricks to doing this, or once I bring the md1 device up will the LVM appear as one big file?
If I have a similar issue with one of the three devices dropping off during the copy of the LVM, can I still pull data with only two of the devices working? As I understand RAID, the answer is no, but your response above suggests that I can.
----------------------------------------------------------------------------------
FYI: I do have an offsite backup solution implemented via Jungledisk, but due to Comcast throttling I was still about 24 days away from completing the offsite backup of the data on this RAID array.
Also, for the record: after taking the computer offline once I discovered the HDD screaming for help, I ran Spinrite in data recovery mode on all four disks in the set to verify the damage done. The hd[egi] drives came up clean, with only the drive that had been beeping (hdk) showing unrecoverable errors. I then addressed the underlying cause of the crash (heat) and brought the system back online to start a local backup. Since I didn't have an extra 350GB of storage space lying around, I backed up what I could of the 350GB onto the 150GB of media I had, until I could secure more media to back up to and drives to replace the failed ones in the array. It was during this process that the system finally locked up and then failed to reboot, showing the 2/4 message.
I appreciate your interpretation of the claims on GRC's site, but all I have to say is that this is a tool that has proved itself to me on numerous occasions, recovering data from drives that many different high-dollar data recovery companies had pronounced completely DOA to my customers back when I was a consultant. I suggest reading a bit more about how Spinrite works before evaluating the claims on its website, but this is inconsequential to the matter at hand and I really don't want to start a flame war.
Thank you for your assistance; I look forward to more of your seasoned advice.