Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
I have a VG called BU with 5 LVs and >20 PVs.
I once tried to start up the VG while a group of 13 PVs was not yet online, and vgchange failed, as could be expected. After bringing the group of 13 PVs online, the vgchange -ay BU command still failed, suggesting the use of --activationmode partial.
I then ran
vgchange -ay --activationmode partial BU
which was successful. All 5 LVs were activated and could be mounted correctly.
While this would be a usable way to start up my VG, it prevents me from making changes or adding a new LV to the VG, as the 13 PVs are still identified as missing (in fact they are not!). Message:
Code:
WARNING: Missing device /dev/sdu1 reappeared, updating metadata for VG BU to version 83.
WARNING: Device /dev/sdu1 still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.
Cannot change VG BU while PVs are missing.
Consider vgreduce --removemissing.
Cannot process volume group BU
I tried vgreduce --removemissing BU to no avail.
I would like to get rid of the error messages and fully use my VG (which is currently working OK as a single activation). Any suggestions would be greatly appreciated.
[root@san ~]# pvs -v --segments
There are 13 physical volumes missing.
PV VG Fmt Attr PSize PFree Start SSize LV Start Type PE Ranges
/dev/sda1 BU lvm2 a-m <1,82t 0 0 476931 AV 668852 linear /dev/sda1:0-476930
/dev/sdc2 clearos lvm2 a-- <36,27g 4,00m 0 512 swap 0 linear /dev/sdc2:0-511
/dev/sdc2 clearos lvm2 a-- <36,27g 4,00m 512 8772 root 0 linear /dev/sdc2:512-9283
/dev/sdc2 clearos lvm2 a-- <36,27g 4,00m 9284 1 0 free
/dev/sdd1 BU lvm2 a-- <931,51g 0 0 238466 DVD 715382 linear /dev/sdd1:0-238465
/dev/sde1 BU lvm2 a-- <1,54t <1,54t 0 402823 0 free
/dev/sdf1 BU lvm2 a-m <1,82t 0 0 476931 AV 191921 linear /dev/sdf1:0-476930
/dev/sdg1 BU lvm2 a-m <1,82t 0 0 476931 Jeugd 0 linear /dev/sdg1:0-476930
/dev/sdi1 BU lvm2 a-m <931,51g 0 0 238466 DVD 476916 linear /dev/sdi1:0-238465
/dev/sdk1 BU lvm2 a-m <931,51g 0 0 238466 DVD 953848 linear /dev/sdk1:0-238465
/dev/sdl1 BU lvm2 a-- 1,36t 1,36t 0 357699 0 free
/dev/sdm1 BU lvm2 a-m <931,51g 0 0 238466 BU 503072 linear /dev/sdm1:0-238465
/dev/sdn1 BU lvm2 a-m <931,51g 0 0 238466 PC 238466 linear /dev/sdn1:0-238465
/dev/sdp1 BU lvm2 a-m <931,51g 0 0 238466 PC 0 linear /dev/sdp1:0-238465
/dev/sdr1 BU lvm2 a-- <465,70g <465,70g 0 119219 0 free
/dev/sds1 BU lvm2 a-m 1,36t 0 0 357683 BU 145389 linear /dev/sds1:0-357682
/dev/sdt1 BU lvm2 a-- <931,45g <931,45g 0 238451 0 free
/dev/sdu1 BU lvm2 a-m <1,82t 0 0 476916 DVD 0 linear /dev/sdu1:0-476915
/dev/sdv1 BU lvm2 a-m 1,36t 0 0 164937 AV 1145783 linear /dev/sdv1:0-164936
/dev/sdv1 BU lvm2 a-m 1,36t 0 164937 47357 Jeugd 476931 linear /dev/sdv1:164937-212293
/dev/sdv1 BU lvm2 a-m 1,36t 0 212294 145389 BU 0 linear /dev/sdv1:212294-357682
/dev/sdw1 BU lvm2 a-m 1,36t 0 0 189443 AV 2478 linear /dev/sdw1:0-189442
/dev/sdw1 BU lvm2 a-m 1,36t 0 189443 118406 DVD 1192314 linear /dev/sdw1:189443-307848
/dev/sdw1 BU lvm2 a-m 1,36t 0 307849 47356 PC 476932 linear /dev/sdw1:307849-355204
/dev/sdw1 BU lvm2 a-m 1,36t 0 355205 2478 AV 0 linear /dev/sdw1:355205-357682
/dev/sdx1 BU lvm2 a-m 1,36t 197,83g 0 262144 PC 524288 linear /dev/sdx1:0-262143
/dev/sdx1 BU lvm2 a-m 1,36t 197,83g 262144 44894 BU 741538 linear /dev/sdx1:262144-307037
/dev/sdx1 BU lvm2 a-m 1,36t 197,83g 307038 50645 0 free
[root@san ~]#
Side-issue: The 13 PVs are external USB drives. I need to disconnect the USB cable before starting up the server, or else the server freezes at the initial splash screen, even before POST. I have been struggling with this for a long time, but disconnecting and reconnecting the USB cable after POST was a minor issue to me, until I once forgot to reconnect the USB cable in time, causing the above problem. Can you think of a fix for this minor side problem too, please? I have already tried various BIOS changes.
Have you tried "pvscan --cache" ? That should cause a re-scan of the physical drives, ignoring the cached metadata and passing new metadata to the lvmetad daemon. Since you are apparently able to mount the affected filesystems, I'm guessing that the BU volume group is actually OK and that the problem is just bad cached metadata.
Quote from the pvscan manpage:
"When lvmetad is used, LVM commands avoid scanning disks by reading metadata from lvmetad. When new disks appear, they must be scanned so their metadata can be cached in lvmetad. This is done by the command pvscan --cache, which scans disks and passes the metadata to lvmetad."
Thanks. As always, the Linux manpages are very concise.
pvscan --cache didn't solve the problem, but your excellent explanation made me think. If VG=BU is OK (as it is) and the lvm2 info on the relevant PVs is OK (it must be, otherwise I could not activate the VG), but lvmetad is still cluttered with odd info, then that info could come from PVs belonging to VG=BU that currently do not contain any LVs, i.e. /dev/sde1, /dev/sdl1, /dev/sdr1 and /dev/sdt1.
My suggestion would be trying to pvremove these currently unused PVs and vgreduce the VG accordingly. If that doesn't help either, I would suggest running pvcreate --uuid --restorefile on all PVs, then pvremove the same, and vgcfgrestore BU. What do you think?
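For the first step, I was thinking of something like this, using the four free PVs from the listing above (not tried yet):
Code:
# the VG has to be reduced first; only then can the PV labels be removed
vgreduce BU /dev/sde1 /dev/sdl1 /dev/sdr1 /dev/sdt1
pvremove /dev/sde1 /dev/sdl1 /dev/sdr1 /dev/sdt1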
The "pvscan -v --segments" output is showing the "m" (missing) flag on 13 PVs that do have LVs allocated. I'm pretty much at a loss for things to suggest. I don't understand how the system could be successfully mapping the LVs on those "missing" PVs. You couldn't mount those filesystems without that. What does "lsblk -f" report?
You might try stopping the lvm2-lvmetad service (perhaps rebooting with that service disabled/masked) and see if that changes anything. Somehow I doubt it, but it's one more straw to grasp for.
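If you want to try that, something like the following should do it (note that the socket has to be stopped as well, otherwise it will just re-activate the service):
Code:
systemctl stop lvm2-lvmetad.socket lvm2-lvmetad.service
systemctl mask lvm2-lvmetad.socket lvm2-lvmetad.service
# set use_lvmetad = 0 in /etc/lvm/lvm.conf as well, so the LVM tools
# don't keep trying to talk to the (now absent) daemon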
BTW, I hope you have some redundancy in those LVs. With 20+ physical devices the likelihood of a device failure is significant, and losing any part of an LV means its entire filesystem is lost.
I will try running the server without lvmetad after first trying vgreduce -a; report follows.
Regarding redundancy: Yes, I know. This system really is my backup system for personal files, but it's also intended to teach me about LVM by using it heavily. The disks are former hard disks from my production system that were replaced over the (many) years. Sometimes there are I/O errors that ddrescue can resolve. If an HD suddenly fails entirely, I know I will have to rebuild the LV. It's not the most efficient way to back up files, but I learn a lot. Thanks for your help, I really appreciate it.
Edit:
vgreduce -a BU :
Cannot change VG BU while PVs are missing.
Consider vgreduce --removemissing.
----- I think that could be a dangerous thing to do as it might wipe the PVs currently active
[root@san ~]# systemctl stop lvm2-lvmetad.service
Warning: Stopping lvm2-lvmetad.service, but it can still be activated by:
lvm2-lvmetad.socket
[root@san ~]# pvscan --cache
[root@san ~]# pvs -v --segments
----- Same output as before with "m" attributes. Haven't tried a reboot yet.
I see things that are amiss. The first is that the pvs output shows /dev/sdg1 as an LVM2 PV, but lsblk shows it formatted as swap space and not LVM at all. Contrast that with the way that your actual swap space on /dev/sdc2 shows up as an LV that contains swap space.
Another that I notice (not looking in any particular order) is /dev/sds1, which pvs thinks is an LVM2 member but lsblk does not.
It looks like the LVM header on some of the partitions has become corrupted. One thing to note is that the entire structure of the VG is recorded on each of the PVs. That's all of the information that you can see in /etc/lvm/backup/BU, also in ASCII, just not as nicely formatted.
It would be interesting to see what "file -s /dev/sd?1" reports. I suspect it's going to say that some of those supposedly LVM partitions are not LVM2 members. If that's the case, you need to copy /etc/lvm/backup/BU to another location and then use that copy as the restorefile for pvcreate (with the -ff, --uuid, and --restorefile options) for each affected partition, and then a vgcfgrestore. You'll almost certainly have to unmount all of the LVs in the BU VG while doing that.
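For one affected partition that would look roughly like this (the copy name is just an example, and the UUID placeholder has to be replaced with the UUID recorded for that device in the backup file):
Code:
cp /etc/lvm/backup/BU /root/BU-restore.vg
# repeat for each damaged PV, e.g. /dev/sdg1, using its UUID from the backup file
pvcreate -ff --uuid "<uuid-from-backup-file>" --restorefile /root/BU-restore.vg /dev/sdg1
# vgcfgrestore just gets run once, after all the pvcreate steps
vgcfgrestore -f /root/BU-restore.vg BU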
My fault, I was too quick in responding to your earlier request and had not noticed the difference in device names after a reboot. Normally I get the same device names, but not this time.
Your latest suggestion was exactly what I already intended to do. I have tried 1 PV and found it nearly impossible to pvremove a PV before pvcreating it. pvcreate/lvmetad takes more than an hour to process a single command, reflecting the complicated state the VG is in. My suggestion is (a rough sketch in commands follows the list):
- list current device names with their respective PV UUIDs (and do not reboot !)
- remove all LVM info on all BU-disks, not by pvremove/lvmetad but through dd
- pvcreate all disks with --uuid --restorefile
- vgcfgrestore BU
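In commands, the plan would look roughly like this (the UUIDs have to come from the list made in the first step; not tested yet):
Code:
# 1. record the current device name <-> PV UUID mapping (and do NOT reboot afterwards)
pvs -o pv_name,pv_uuid,vg_name > /root/pv-uuid-map.txt
# 2. wipe the LVM label and metadata area at the start of each BU partition, e.g.
dd if=/dev/zero of=/dev/sda1 bs=1M count=1
# 3. recreate each PV with its old UUID from the saved list
pvcreate -ff --uuid "<uuid-from-saved-list>" --restorefile /etc/lvm/backup/BU /dev/sda1
# 4. restore the VG metadata once, after all PVs exist again
vgcfgrestore BU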
If "pvcreate -ff ..." is taking a long time, I guess it's worth removing the LVM signature with dd, though I'm not sure that's going to help. LVM will still try to find and update the state of all the devices. You might have to resort to having just one disk at a time online. Otherwise, I don't see anything better than your plan.
If I first do a dd on all VG=BU disks, clearing the LVM signatures of all PVs in the VG, then pvcreate/LVM wouldn't have much to check anymore. I would expect it to quickly report 20 PVs missing. After doing that for each of the 16 disks that carry data, it would probably report 4 PVs still missing, these being the devices that were part of the original VG=BU but not actively used. Those missing devices could then be removed by vgreduce -a --removemissing BU, followed by vgcfgrestore BU.
I will try this anyway, and if it takes 20 times more than an hour, that's OK. If it takes much longer, I will connect 1 disk at a time, running pvdisplay -m each time, in order to know the exact PV UUID that each device should be linked to.
I will report back as soon as I'm done.
I will leave this for a while. Running nearly any LVM command on VG=BU is blocked because of missing PVs. All LVs are still OK, so I can use the system except for making changes to the VG. Clearing all LVM signatures seems risky as I may not be able to use pvcreate to re-create the PVs or re-extend them to the VG.
Thanks anyway for your help, it has been a good learning experience. I may come back with more news.
I kind of solved the problem, but a small problem remains.
Original problem: My server with 17 external USB devices forming an LVM VG=BU had been started up with only 4 of the 17 USB disks. After reconnecting the 13 external disks, LVM reported the 13 PVs as "reappearing" but kept them marked as missing. All 5 LVs of the VG worked fine, but I could not change metadata until the missing-PV problem was solved. I left the problem until I needed to extend 1 of the 5 LVs.
Solution: edit /etc/lvm/backup/BU manually, removing all "MISSING" and "LOCKED" flags, then run vgcfgrestore BU.
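In commands this amounted to roughly the following (the edit itself was done by hand in a text editor):
Code:
# remove the words MISSING and LOCKED from the flags lines by hand
vi /etc/lvm/backup/BU
# vgcfgrestore reads /etc/lvm/backup/BU by default
vgcfgrestore BU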
This worked fine, I could still use all 5 LVs.
The next step was to extend /dev/BU/AV from 5 TiB to 7 TiB, using
lvextend -L +2T BU/AV
lvextend reported that the size was correctly increased from 5 TiB to 7 TiB but seemed to stall while changing the ext4 filesystem.
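(For reference, growing the filesystem is a separate step that can be done once the LV is active again; roughly:)
Code:
# grow the ext4 filesystem to fill the extended LV
resize2fs /dev/BU/AV
# lvextend can also do both steps at once with the -r/--resizefs option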
I tried a reboot; now 4 out of 5 LVs are active, but the extended LV=AV remains inactive.
Code:
[root@san ~]# lvscan
ACTIVE '/dev/clearos/swap' [2,00 GiB] inherit
ACTIVE '/dev/clearos/root' [<34,27 GiB] inherit
ACTIVE '/dev/BU/DVD' [5,00 TiB] inherit
ACTIVE '/dev/BU/PC' [3,00 TiB] inherit
inactive '/dev/BU/AV' [7,00 TiB] inherit
ACTIVE '/dev/BU/Jeugd' [2,00 TiB] inherit
ACTIVE '/dev/BU/BU' [3,00 TiB] inherit
[root@san ~]# lvchange -ay BU/AV
device-mapper: reload ioctl on (253:6) failed: Invalid argument
[root@san ~]#
I guess I can go back to a former version of the metadata but then I still don't have the extension to BU/AV.
Would it be possible to solve the above ioctl error?
Edit:
I solved this problem by going back to an earlier *.vg file, redoing the above procedure and rebooting. Funnily enough, it worked this time.