root filesystem LV disappeared during power failure - server will not boot
This system has two drives in a software RAID-1 mirror. Boot is on /dev/md0 and the root filesystem is an LV in /dev/Volume00. After an extended power outage that outlasted the UPS, the machine crashed. On bootup everything looks normal until it's time to mount the root filesystem. The error message is something like "failed to mount root filesystem /dev/Volume00/RootVol on /mnt. No such filesystem." It drops down to a command prompt, and I can see and mount any of the other filesystems in /dev/Volume00... there are 4 other filesystems. It's as if /dev/Volume00/RootVol disappeared; it doesn't show up in /dev/Volume00 or /dev/mapper at all. Where does the OS get the list of mountable LVM volumes from at bootup - is it metadata on the disk?
There are no LVM tools in my /boot partition, and this is an old machine which cannot boot via USB. I'm currently downloading an ISO of Knoppix so I can have access to tools. It is a testament to the stability of Linux that it's been doing its thing virtually unmaintained and unattended for the past 5 years without a single issue (it serves multiuser accounting and reporting software).
What I'm thinking of doing is retrieving /etc/lvm/backup/Volume00 off the most recent tape backup onto one of the mountable volumes in Volume00 (I have no non-RAID, non-LVM volumes available) and using vgcfgrestore from the Knoppix distro... hopefully the hard drives all come up under Knoppix with the correct device numbers/names.
I'm wondering if anyone can comment on this strategy. I have full tape backups of the root filesystem but am unsure how exactly I'd create the MD/PV/LV structure I need to restore onto from scratch... would vgcfgrestore do everything I need? I don't really want to go this route if I don't have to...
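As I understand it, the restore itself would be something along these lines once I'm booted into Knoppix - assuming it sees the VG under the same name and that I've copied the backup file somewhere reachable like /tmp, both of which are assumptions on my part:
Code:
# restore the VG metadata from the saved backup file, then activate and check
vgcfgrestore -f /tmp/Volume00 Volume00
vgchange -ay Volume00
lvs Volume00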
The list of volume groups to activate and the identity of the root filesystem is passed by GRUB in the kernel command line. The exact format varies somewhat among releases (it's all processed by scripts in the initrd, so anything is possible), but if you look at the GRUB menu it should be apparent.
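From the emergency shell you can also check what was actually passed on the last boot, and what the bootloader has configured - the menu.lst path below assumes GRUB legacy, so adjust for your setup:
Code:
cat /proc/cmdline
grep -i 'root=' /boot/grub/menu.lst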
Restoring the LVM configuration for that VG should be a sound strategy, but you might want to save a vgcfgbackup file so that you can get back to what you have now if something goes awry.
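Something like this before you change anything (the output file name here is only an example):
Code:
# snapshot the current metadata so you can roll back if the restore makes things worse
vgcfgbackup -f /root/Volume00-before-restore.vg Volume00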
I have only one volume group; it's being activated, since I can access other logical volumes within the group, and the kernel is receiving the proper device to mount (/dev/Volume00/RootVol) but is unable to find/mount it - so it sounds like everything else is working as it should. I'll try restoring the volume group config and see if the root LV turns up.
If restoring the volume group config does not work, is there a "how-to" for disaster recovery of a RAID/LVM machine? I am thinking if it's only my root filesystem LV that's missing, I can manually re-create an LV with the same name, mount it, restore my "/" filesystem to it, and all should work... the boot stuff is the hard part and that part seems to be working...
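If I do end up recreating it by hand, I'm picturing something along these lines - the size and filesystem type below are guesses on my part, and I'd pull the real values out of the LVM backup file first:
Code:
# recreate the root LV, put a filesystem on it, then restore "/" from tape into /mnt
lvcreate -L 20G -n RootVol Volume00
mkfs -t ext3 /dev/Volume00/RootVol
mount /dev/Volume00/RootVol /mnt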
And, will the knoppix distro recognize the md/lvm volumes without manual intervention? Guess I'll find out.
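If it doesn't pick them up on its own, I'm assuming I can do it manually with something like:
Code:
# assemble any arrays it can find, then scan for and activate the volume group
mdadm --assemble --scan
vgscan
vgchange -ay Volume00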
More detail in this (closed) thread.
That array is degraded - one device is missing. Presumably missing metadata if fdisk sees both disks - let's see
Code:
lsblk -f
I would expect some messages. You can force the array to assemble and activate degraded by using "--run" - presumably SystemRescueCd does this, and your initrd doesn't. This probably should be in the Slackware forum, as they will be better aware of what is in the initrd. Hit the "Request" button on your initial post and ask to have it moved.
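Something along these lines from the rescue environment ought to start it degraded - the member partitions below are a guess, so use whatever lsblk/fdisk actually show:
Code:
# stop any partial assembly, then force the array to start even though it's degraded
mdadm --stop /dev/md0
mdadm --assemble --run /dev/md0 /dev/sda1 /dev/sdb1
vgchange -ay Volume00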
I booted a Live USB system recovery CD. Right away I could see that RootVol showed up (the logical volume that does not exist when I try to boot normally), was mountable and looks fine. So I started looking at the raid array.
Not what I expected. My two hard drives are /dev/sda and /dev/sdb, no errors in /var/log/messages about them although I have no ability to tweak loglevels in the Live CD version I am running. Why does /proc/mdstat not show actual devices? What are /dev/dm-# devices?
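I figure the dm-# names are just device-mapper nodes, and I can see what they map to with:
Code:
ls -l /dev/mapper
dmsetup ls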
Output of mdadm --detail /dev/md0 is:
Code:
root@sysresccd /mnt/rootvol/etc % dmadm -D /dev/md0
zsh: correct 'dmadm' to 'mdadm' [nyae]? y
/dev/md0:
Version : 0.90
Creation Time : Thu Dec 3 11:53:48 2009
Raid Level : raid1
Array Size : 488287488 (465.67 GiB 500.01 GB)
Used Dev Size : 488287488 (465.67 GiB 500.01 GB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sun Jul 10 12:00:57 2016
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Number Major Minor RaidDevice State
0 253 1 0 active sync /dev/dm-1
2 0 0 2 removed
I am guessing that either a) I have a failed disk or b) the array /dev/md0 is not synched, maybe thinks a disk has failed?
At any rate, the machine definitely will not boot from this state, and I can't figure out which, if any, of my hard disks is the problem, nor how to fix this mess. This is a production server with full backups... I could rebuild it, but really would rather not as it's a pretty tedious process... there's nothing wrong with the data nor, I'm guessing, either of the disks.
There is no mdadm.conf.
fdisk -l shows both disks as Linux Raid Autodetect, everything looks normal.
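Before I do anything drastic, I figure I can compare the superblocks on the members and, if one of them simply dropped out of the array, re-add it and let it resync. The partition names below are my guess - I still need to confirm which devices are really the md members:
Code:
mdadm --examine /dev/sda1 /dev/sdb1
# if, say, sdb1 is the one that fell out of the array:
mdadm /dev/md0 --add /dev/sdb1
cat /proc/mdstat    # watch the resync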