Issue with duplicate PVs and failed / faulty dm devices

lhiggie1 · 06-19-2012, 09:20 PM

We are in the process of migrating from a Hitachi SAN to a EMC VNX SAN. I split the HBAs and set up mirroring. After about 1 month of stability, I have stopped the mirroring and am now cleaning up the old Hitachi device files.

We are currently running Red Hat Enterprise Linux AS release 4 (Nahant Update 9)

Here is a lvs listing:
[root@rpsior06 mapper]# lvs /dev/sda: read failed after 0 of 4096 at 0: Input/output error /dev/sdq: read failed after 0 of 4096 at 0: Input/output error
/dev/sdag: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-14: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-15: read failed after 0 of 4096 at 0: Input/output error
/dev/sdb: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-16: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-17: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-18: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-19: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-21: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-22: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-23: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-24: read failed after 0 of 4096 at 0: Input/output error
Found duplicate PV vqZhTX2fyxODUXgb50PvIn26SZUKx3Wp: using /dev/dm-26 not /dev/dm-10
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error
/dev/sds: read failed after 0 of 4096 at 0: Input/output error
/dev/sdd: read failed after 0 of 4096 at 0: Input/output error
/dev/sde: read failed after 0 of 4096 at 0: Input/output error
/dev/sdu: read failed after 0 of 4096 at 0: Input/output error
/dev/sdf: read failed after 0 of 4096 at 0: Input/output error
/dev/sdg: read failed after 0 of 4096 at 0: Input/output error
/dev/sdw: read failed after 0 of 4096 at 0: Input/output error
/dev/sdh: read failed after 0 of 4096 at 0: Input/output error
/dev/sdx: read failed after 0 of 4096 at 0: Input/output error
/dev/sdi: read failed after 0 of 4096 at 0: Input/output error
/dev/sdy: read failed after 0 of 4096 at 0: Input/output error
/dev/sdj: read failed after 0 of 4096 at 0: Input/output error
/dev/sdk: read failed after 0 of 4096 at 0: Input/output error
/dev/sdaa: read failed after 0 of 4096 at 0: Input/output error
/dev/sdac: read failed after 0 of 4096 at 0: Input/output error
/dev/sdo: read failed after 0 of 4096 at 0: Input/output error
/dev/sdae: read failed after 0 of 4096 at 0: Input/output error
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
LogVol00 VolGroup00 -wi-ao 10.00G
LogVol01 VolGroup00 -wi-ao 16.00G
LogVol02 VolGroup00 -wi-ao 50.00G
LogVol03 VolGroup00 -wi-ao 10.00G
LogVol04 VolGroup00 -wi-ao 10.00G
LogVol05 VolGroup00 -wi-ao 10.00G
LogVol00 VolGroup01 -wi-ao 100.00G
LogVol00 VolGroup02 -wi-ao 650.00G
LogVol00 VolGroup03 -wi-ao 650.00G
LogVol00 VolGroup04 -wi-ao 650.00G
LogVol00 VolGroup05 -wi-ao 50.00G
LogVol00 VolGroup06 -wi-ao 650.00G
LogVol00 VolGroup07 -wi-ao 100.00G
LogVol00 VolGroup08 -wi-ao 100.00G
LogVol00 VolGroup09 -wi-ao 65.00G
LogVol00 VolGroup10 -wi-ao 65.00G
LogVol00 VolGroup11 -wi-ao 65.00G
LogVol00 VolGroup12 -wi-ao 100.00G
[root@rpsior06 mapper]#

Here is a multipath listing:
[root@rpsior06 mapper]# multipath -v2
remove: mpath27 (dup of mpath18)
mpath27: map in use
remove: mpath28 (dup of mpath17)
mpath28: map in use
remove: mpath29 (dup of mpath27)
mpath29: map in use
remove: mpath30 (dup of mpath14)
mpath30: map in use
remove: mpath32 (dup of mpath37)
mpath32: map in use
remove: mpath33 (dup of mpath35)
mpath33: map in use
remove: mpath23 (dup of mpath31)
mpath23: map in use
remove: mpath24 (dup of mpath15)
mpath24: map in use
remove: mpath25 (dup of mpath28)
mpath25: map in use
[root@rpsior06 mapper]#

Here is a multipath -ll listing:

[root@rpsior06 mapper]# multipath -ll

mpath2 (360060e8005650e000000650e00004502)
[size=650 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:2 sdc 8:32 [failed][faulty]

mpath23 (3600601608ef02c009485af40eb6de111)
[size=656 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:0 sdl 8:176 [active][ready]
mpath1 (360060e8005650e000000650e00004501)
[size=650 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:1 sdb 8:16 [failed][faulty]
mpath0 (360060e8005650e000000650e00004500)
[size=650 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:0 sda 8:0 [failed][faulty]
mpath33 (3600601608ef02c00debb7685eb6de111)
[size=100 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:10 sdv 65:80 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:1:10 sdag 66:0 [active][ready]
mpath9 (360060e8005650e000000650e00004509)
[size=65 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:9 sdj 8:144 [failed][faulty]
mpath32 (36006016030f02c00e21b007feb6de111)
[size=65 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:1:9 sdaf 65:240 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:0:9 sdu 65:64 [active][ready]
mpath8 (360060e8005650e000000650e00004508)
[size=65 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:8 sdi 8:128 [failed][faulty]
mpath29 (3600601608ef02c0008426f78eb6de111)
[size=100 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:6 sdr 65:16 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:1:6 sdac 65:192 [active][ready]
mpath31 (3600601608ef02c00e0e4a77eeb6de111)
[size=65 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:8 sdt 65:48 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:1:8 sdae 65:224 [active][ready]
mpath7 (360060e8005650e000000650e00004507)
[size=65 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:7 sdh 8:112 [failed][faulty]
mpath28 (36006016030f02c00d0c45c72eb6de111)
[size=100 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:1:5 sdab 65:176 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:0:5 sdq 65:0 [active][ready]
mpath30 (36006016030f02c00e67b8978eb6de111)
[size=65 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:1:7 sdad 65:208 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:0:7 sds 65:32 [active][ready]
mpath6 (360060e8005650e000000650e00004506)
[size=100 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:6 sdg 8:96 [failed][faulty]
mpath12 (3600605b00377425015c15d7923d84bce)
[size=407 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1][active]
\_ 2:2:1:0 sdai 66:32 [active][ready]
mpath27 (3600601608ef02c00a84b476aeb6de111)
[size=656 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:4 sdp 8:240 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:1:4 sdaa 65:160 [active][ready]
mpath5 (360060e8005650e000000650e00004505)
[size=100 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:5 sdf 8:80 [failed][faulty]
mpath26 (36006016030f02c001ce40964eb6de111)
[size=50 GB][features="1 queue_if_no_path"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:1:3 sdz 65:144 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:0:3 sdo 8:224 [active][ready]
mpath4 (360060e8005650e000000650e00004504)
[size=650 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:4 sde 8:64 [failed][faulty]
mpath25 (3600601608ef02c00f4109259eb6de111)
[size=656 GB][features="0"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:2 sdn 8:208 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:1:2 sdy 65:128 [active][ready]
mpath10 (360060e8005650e000000650e0000450a)
[size=100 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:10 sdk 8:160 [failed][faulty]
mpath3 (360060e8005650e000000650e00004503)
[size=50 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
\_ 0:0:0:3 sdd 8:48 [failed][faulty]
mpath24 (36006016030f02c00accd9649eb6de111)
[size=65 GB][features="1 queue_if_no_path"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:8 sdt 65:48 [active][ready]
\_ round-robin 0 [enabled]
\_ 1:0:1:8 sdae 65:224 [active][ready]
[root@rpsior06 mapper]#

Here is the df -h:

[root@rpsior06 mapper]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00 9.7G 263M 9.0G 3% /
/dev/sdah1 190M 15M 166M 9% /boot none 12G 0 12G 0% /dev/shm
/dev/mapper/VolGroup00-LogVol02 49G 199M 46G 1% /home
/dev/mapper/VolGroup00-LogVol05 9.7G 59M 9.2G 1% /tmp
/dev/mapper/VolGroup00-LogVol04 9.7G 2.4G 6.9G 26% /usr
/dev/mapper/VolGroup00-LogVol03 9.7G 214M 9.0G 3% /var
/dev/mapper/VolGroup01-LogVol00 97G 3.5G 89G 4% /oracle_binaries
/dev/mapper/VolGroup02-LogVol00 640G 428G 180G 71% /nebsqa
/dev/mapper/VolGroup03-LogVol00 640G 420G 189G 70% /nebsdev
/dev/mapper/VolGroup04-LogVol00 640G 425G 183G 70% /nebsuat
/dev/mapper/VolGroup05-LogVol00 50G 6.0G 41G 13% /d05
/dev/mapper/VolGroup06-LogVol00 640G 312G 296G 52% /d08
/dev/mapper/VolGroup07-LogVol00 99G 125M 94G 1% /d26
/dev/mapper/VolGroup08-LogVol00 99G 180M 94G 1% /d27
/dev/mapper/VolGroup09-LogVol00 64G 2.7G 59G 5% /u01
/dev/mapper/VolGroup10-LogVol00 64G 5.3G 56G 9% /u02
/dev/mapper/VolGroup11-LogVol00 64G 3.5G 58G 6% /u03
/dev/mapper/VolGroup12-LogVol00 99G 2.4G 92G 3% /u04
[root@rpsior06 mapper]#

I'm trying to fix the failed and faulty problems and the duplicate PVs and cannot find out how to fix them. Any and all help would be greatly appreciated! If you need any further information please feel free to ask. Sincerely, Lee

anomie · 06-25-2012, 01:22 PM

Three things, in order of relative importance:

Are you using LVM2 with SAN LUNs? If so, it requires special care (and is extremely dangerous to your data, if that care is not followed). LVM2 is not cluster-aware.
RHEL4's official life cycle has ended. That has little or nothing to do with your question, but keep it in mind if you aren't already.
Please use code tags when posting command output.

If the answer to the first question is "yes", a good rule of thumb to follow is "any time you make PV, VG, or LV changes, reboot all systems that access the affected LUN(s)". I know that sounds excessive. But - at least in my case - data integrity is the most important consideration.

References:

If the answer to the first question is "no", then never mind.

lhiggie1 · 06-28-2012, 10:20 AM

anomie,

Thank you for your reply. Here are your answers:

1. We are using LVM2 with SAN LUNs. However, we are not in a clustered environment. I'm in the process of migrating the data from one SAN (Hitachi) to a new SAN (EMC VNX). I'm using LVM mirroring and when I stopped the mirror and disconnected the fibre from the Hitachi and connected the fibre to the EMC is when I received these errors.

2. Yes, I'm fully aware of the lifecycle of RHEL4. I had no choice in the matter due to the outdated version of Oracle eBS we are running.

3. I apologize for not using code tags. I've never used them, so a learning curve has just been eclipsed. I'll look into it.

I understand about the rule of thumb, however, with production business systems that isn't always a possibility. Fortunately, this is not a production server, however, I still have to schedule reboots due to development work being done on the systems.

To continue about my question, after the split of the fibre, I lost one of my vgs and had to return to the split fibre between the two SANs. Now, I'm trying to clean it up and like you stated earlier, before I go any further, a reboot is in order. I will post back the results after I can get the reboot scheduled.

Again, thank you for your reply and the references!

Sincerely,
Lee