Recover Data From Striped Logical Volume Group With Failing Drive
Hello,
I am working on a CentOS 6 server that has 7 physical drives in a striped logical volume group. The server will not boot; it fails with a kernel panic. I booted it with a live CentOS CD, and in the GUI under Utilities->Drives one of the drives shows in red with the message "Disk Likely to Fail Soon". The files I am looking for are in the /var/www/html directory.
My initial thought was to mount the LVG read-only and just copy the files in that directory to an external drive, but the entire /var directory doesn't even appear when I run "ls" on the mounted LVG. It does list other folders, though, e.g. /etc, /boot, /usr.
So my question is what would be the next step to try to recover the data?
Also, being that it's a "striped" logical volume, would I be able to just replace the failing drive to repair the system or would that make matters worse?
I'm new to working with logical volumes and would appreciate any help,
Thanks
Quote:
My initial thought was to mount the LVG read-only and just copy the files in that directory to an external drive, but the entire /var directory doesn't even appear when I run "ls" on the mounted LVG. It does list other folders, though, e.g. /etc, /boot, /usr.
That might indicate /var was mounted using a separate LV, maybe even in a separate VG. Can you get to /etc/fstab? Let's see it.
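If the live CD can activate the volume group, the original fstab will show whether /var lived on its own LV. A sketch under that assumption; the "VolGroup/lv_root" name is a placeholder, so substitute whatever lvs actually prints:

```shell
# Activate any LVM volume groups the live environment can see
sudo vgscan
sudo vgchange -ay
sudo lvs -o vg_name,lv_name,lv_size      # list the logical volumes found

# Mount the root LV read-only; "VolGroup/lv_root" is a guess --
# substitute the name lvs printed on your system
sudo mkdir -p /mnt/sysroot
sudo mount -o ro /dev/VolGroup/lv_root /mnt/sysroot

# Any fstab line whose mount point (field 2) is /var means /var is a separate volume
awk '$2 == "/var"' /mnt/sysroot/etc/fstab
```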
Quote:
Also, being that it's a "striped" logical volume, would I be able to just replace the failing drive to repair the system or would that make matters worse?
No, you don't want to do that - to quote the Linux RAID wiki:
Quote:
RAID-0 has no redundancy, so when a disk dies, the array goes with it.
Here's another from that site
Quote:
It is, however, very important to understand that RAID is not a general substitute for good backups.
I don't have any good news for you if you don't have backups - you may get lucky and be able to retrieve your data, but it's not looking good.
If you were using one of the higher RAID levels (5, 6, 10, ...) you would be much better placed to fail that disk and replace it - but I understand a "striped" volume to be RAID 0.
Some data MAY be recovered IF you can mount your RAID (boot from live media!) with the failing HDD still attached, though that might mean very slow operations and HDD link resets. A "failing" HDD usually means a drive that has SMART errors logged; typically the Reallocated_Event_Count and Current_Pending_Sector attributes are non-zero.
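Those SMART attributes can be read with smartmontools; a sketch, assuming smartctl is installed and /dev/sdd is the suspect disk:

```shell
# Overall health verdict from the drive's own self-assessment
sudo smartctl -H /dev/sdd

# Pull out the attributes that signal reallocated/pending bad sectors
sudo smartctl -A /dev/sdd \
  | grep -E 'Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector'
```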
Quote:
Some data MAY be recovered IF you can mount your RAID (boot from live media!) with the failing HDD still attached, though that might mean very slow operations and HDD link resets. A "failing" HDD usually means a drive that has SMART errors logged; typically the Reallocated_Event_Count and Current_Pending_Sector attributes are non-zero.
Well, at this point anything is worth a shot; I can still mount the failing drive. What would be the commands?
I had the idea of taking a dd image of the LVG, but it would copy everything including empty space (which would take forever). Just for kicks, could I dd the individual failing disk and try some sort of data recovery technique, or would that only give partial data because the data is striped over 7 drives?
I would use ddrescue (note: not dd) to image the bad drive from a live CD. Rerun it as necessary - see the documentation; it will try to "fill in" what it missed previously.
Then introduce that drive to the array and see what happens. Nothing to lose by trying really.
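A typical two-stage ddrescue run might look like this; the device and image paths are examples, and the map file is what lets a rerun pick up where the previous pass stopped:

```shell
# First pass: copy everything readable quickly, skipping bad areas (-n)
sudo ddrescue -n /dev/sdd /mnt/ext/sdd.img /mnt/ext/sdd.map

# Rerun with the SAME image and map file: retry the bad sectors 3 times (-r3)
sudo ddrescue -r3 /dev/sdd /mnt/ext/sdd.img /mnt/ext/sdd.map
```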
Ok guys,
Thanks for the help so far; here's where I'm at. I created an image of the failing drive using ddrescue and stored it on an external HDD. Now here's my question: the drive was part of a logical volume group written across 7 drives, encrypted with a LUKS passphrase (which I found a way to unlock using cryptsetup luksOpen). From this point, can I mount the image ddrescue made like a regular drive and search for the lost files, or do I have to somehow add it back to the LVG array and then search that?
Any help would be appreciated; I'm really confused about what approach to take.
It depends on just where the encryption layer was. You can encrypt the partition or full drive and put the LVM PV inside the encrypted container, or you can encrypt the LV. The output from lsblk would be useful here, and also indicate exactly what you used as the source volume for ddrescue. Depending on just how you did things, I see the possibility of an array with 6 stripes encrypted and 1 not, which would be a mess.
The simplest thing to do is probably to unplug the failing drive and boot with the external drive connected. That should allow LVM to assemble the array. Having both the failing and external drives connected means you have 2 PVs with the same UUID, and that complicates things.
Ok, so I'll attach what I'm looking at in the GUI and describe what I did. As you can see, I have 7 1 TB physical drives, with one (/dev/sdd) in red. I clicked the lock icon and unlocked it with the passphrase. Then I mounted the 2 TB drive (an NTFS drive connected via USB) under the directory /mnt/win.
I then ran the command "ddrescue -n -N -vvv /dev/sdd /mnt/win/sdd.img /mnt/win/sdd.log"
That ran all night, and now I have the resulting 1 TB image. It appears the encryption was copied over as well, because when I tried to mount the sdd.img file it said the file system type is LUKS.
Here is the output of lsblk:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 931.5G 0 disk
├─sda1 8:1 0 102M 0 part
└─sda2 8:2 0 931.4G 0 part
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
sdd 8:48 0 931.5G 0 disk
└─sdd1 8:49 0 931.5G 0 part
  └─luks-25db43f5-fa88-4ab8-8568-0e439e1b62df
    253:3 0 931.5G 0 crypt
sde 8:64 0 931.5G 0 disk
└─sde1 8:65 0 931.5G 0 part
sdf 8:80 0 931.5G 0 disk
└─sdf1 8:81 0 931.5G 0 part
sdg 8:96 0 931.5G 0 disk
sdh 8:112 0 1.8T 0 disk
├─sdh1 8:113 0 100M 0 part
└─sdh2 8:114 0 1.8T 0 part
sr0 11:0 1 696M 0 rom /run/initramfs/live
loop0 7:0 0 20K 1 loop
loop1 7:1 0 4.2M 1 loop
└─live-osimg-min 253:2 0 8G 1 dm
loop2 7:2 0 626.1M 1 loop
loop3 7:3 0 8G 1 loop
├─live-rw 253:0 0 8G 0 dm /
└─live-base 253:1 0 8G 1 dm
  └─live-osimg-min 253:2 0 8G 1 dm
loop4 7:4 0 512M 0 loop
└─live-rw 253:0 0 8G 0 dm /
I hope that helps explain the situation I'm in now.
That is different from what I expected to see. When you said, "striped logical volume group," I thought you meant you had used LVM to do the striping. What I see in that lsblk output makes sense only for a hardware or software RAID array that is then encrypted as a single unit. My guess is that it's an MD RAID array with a version 0.9 or 1.0 superblock.
This would have been easier if you had used a raw disk drive rather than a file as the ddrescue destination. Trying to assemble a RAID array from 6 devices and 1 file would be difficult, perhaps impossible if that array is essential for booting.
What are your plans for reconstructing the system? If you already have a suitable replacement drive, the simplest thing to do would be to install that drive in place of the failing /dev/sdd (I presume that's the failing drive) and copy the image back to the new drive:
Code:
dd if=/mnt/win/sdd.img of=/dev/sdd bs=256k
Then, everything should "just work." Do be absolutely sure that the "of=" is going to the right drive; it should be identifiable by its lack of a partition table prior to restoring the image. Get that wrong, and all is lost. The "bs=256k" is a fairly arbitrary block size; you just want something substantially larger than the default 512 bytes, or the operation will be very slow.
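Those target-drive checks can be scripted before the copy; a sketch, assuming /dev/sdd is the blank replacement (if blkid finds any signatures on it, stop and re-check):

```shell
# Confirm which device is the blank replacement before writing anything
lsblk -o NAME,SIZE,TYPE /dev/sdd

# blkid exits non-zero when no filesystem/partition signatures are found,
# so this warning only fires if the drive is NOT blank
sudo blkid /dev/sdd && echo "WARNING: signatures found -- is this really the new drive?"

# Write the rescued image back; bs=256k just keeps the copy fast
sudo dd if=/mnt/win/sdd.img of=/dev/sdd bs=256k
sync
```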
Quote:
What I see in that lsblk output makes sense only for a hardware or software RAID array that is then encrypted as a single unit
That makes perfect sense, because when the OS was installed it was set up through the CentOS installation DVD, specifying to use all the detected SATA drives as one file system and to encrypt it. There is no hardware RAID controller; all the drives are plugged directly into the motherboard.
The goal for me is to just extract any retrievable data from the /var/www/html directory, I do have a replacement drive for the failed one.
I wanted to consult with those who have more experience in this area before risking my chance of recovering any data, so I made the ddrescue image file in case the drive failed completely.
Replacing the drive and copying the image over to a new physical drive and putting it in place of the failed one does seem like the most logical solution, so my last questions are:
1.) Once I copy the ddrescue image of the failed drive to a new drive and introduce it into the array, will the LVG become confused and need to be rebuilt? If so, where should I look for how to do that? (Given that it's a LUKS-encrypted image, will the procedure differ from a standard LVG rebuild?)
2.) I've read in some other postings about multiple passes with ddrescue. Would it be worth the trouble to do that, or should I just try to recover with the image I already copied from the failed drive? Here is the post I'm referring to:
Either way, I really appreciate everyone's help, and if I solve this I will post a full breakdown of what it took to accomplish it, to help someone else out.
1. The system won't see any difference with the new drive (except that it works, of course). Do be sure to unplug the old drive. You really don't want to have two identically labeled drives in the system.
2. How much was ddrescue able to recover? What did the final status display look like? If there are still unrecovered sectors, you can rerun it with a nonzero number for --retry-passes. Be sure to use the same image and log files, and ddrescue will pick up where it left off.
Last edited by rknichols; 05-01-2016 at 10:11 PM.
Reason: Add, "Do be sure ..."
The log file only shows the start time and the finish time. From the original CLI everything seemed OK; nothing printed said "unrecoverable" or anything of that nature. So from this point, would the procedure be to copy the ddrescue image over to the new disk, then use the CentOS DVD in recovery mode to mount the LVG as if it were never disturbed?
The statistics are just displayed on the screen, not recorded in the log file. The only indication that some sectors were not recovered is a non-zero number for "errors: " in the stats. If you just run the ddrescue command again (without requesting retries) you can see it. Or, you can look through the log file for lines with a status other than "+". From the manpage, the meaning of those status characters is
Since ddrescue did finish, I believe you should not see any status other than "+" or "-".
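That check for lines with a status other than "+" can be scripted; a sketch, assuming the same log/map file path used earlier (data lines in a ddrescue map file have the form "pos size status"):

```shell
# Print any data regions whose status is not '+' (finished),
# then summarize how many there were
awk 'NF == 3 && $1 ~ /^0x/ && $3 != "+" { print; n++ }
     END { print (n + 0), "unrecovered region(s)" }' /mnt/win/sdd.log
```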
Back in #11, I gave the command for copying the image back to the new drive. Then you can assemble the array, unlock the encryption, and run lvscan to find the LVM logical volumes. I recommend running "fsck -f -n" on each of the filesystems to verify its condition. Do include the "-n" option so that fsck won't try to "fix" anything. That could be disastrous if there are unrecovered blocks in the filesystem metadata. You need to see the extent of any damage first.
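Put together, the sequence after the image is restored might look like the sketch below. Every device, VG, and LV name here is a guess (/dev/md0, VolGroup/lv_var, the mount point), so substitute the names from your own lsblk and lvscan output:

```shell
# Assemble the MD software RAID array, if MD is in use
sudo mdadm --assemble --scan

# Unlock the encrypted container; /dev/md0 is a guess for the array device
sudo cryptsetup luksOpen /dev/md0 cryptroot

# Activate and list the LVM volumes inside it
sudo vgchange -ay
sudo lvscan

# Read-only filesystem check first: -n means report, never repair
sudo fsck -f -n /dev/VolGroup/lv_var

# Then mount read-only and copy www/html off to safety
sudo mkdir -p /mnt/recover
sudo mount -o ro /dev/VolGroup/lv_var /mnt/recover
```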