Software RAID: replace failed disk (mdadm, grub and lvm)?
Hello Admins,
One of the SuSE SLES 12 Linux servers has reported a disk failure. Fortunately the database server has software RAID, so the system is still up and running. But, as recommended, we would like to replace the failed disk with a new one and rebuild the software RAID on it. System information: 4 internal disks in total (sda, sdb, sdc and sdd). The fdisk partitions are:
Software RAID --> sda + sdb (sda is the failed disk)
Software RAID --> sdb + sdc
Please note that it also has 2 VGs defined as shown below:
1. VG system (/dev/md2)
2. VG ora_db (/dev/md3)
The grub.conf shows (relevant part):
Please let me know if I have missed something. I guess the 2 important points are how to take care of LVM and grub in this case:
1. Do I have to do anything extra for LVM, or will the command sfdisk -d /dev/sdb | sfdisk /dev/sda take care of it as well?
2. How should I take care of grub? grub.conf shows entries pertaining to LVM as well as mdadm. Do I have to change anything here before I shut down the system?
I understand the system has 2 layers to take care of, mdadm + LVM, which complicates things. Or would it be easier to set up a completely new system? Kindly guide me to the relevant information. Thanks. Regards, Admin
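For reference, my understanding of the general replacement sequence is roughly the sketch below. The member names (sda1, md2) are just placeholders from my side - I would confirm the real ones against /proc/mdstat before running anything:

# check which arrays sda belongs to and their current state
cat /proc/mdstat
mdadm --detail /dev/md2 /dev/md3
# mark the failed sda member(s) and remove them from the array(s)
mdadm /dev/md2 --fail /dev/sda1 --remove /dev/sda1
# after physically swapping the disk, copy the partition table from the healthy mirror
sfdisk -d /dev/sdb | sfdisk /dev/sda
# add the new partition(s) back and let the array rebuild
mdadm /dev/md2 --add /dev/sda1
watch cat /proc/mdstat

Does that look right as far as the mdadm side goes?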
After rebooting, the names of the disks can change depending on the order in which they are found by the system. Make sure you copy the correct disk. The rest looks good.
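For example, something like this (read-only, just to identify the disks) confirms you have the right one before copying:

# persistent names stay the same even if the kernel renames sdX after a reboot
ls -l /dev/disk/by-id/
# model and serial number of the disk you think is healthy (needs smartmontools)
smartctl -i /dev/sdb
# and double-check which member mdadm has actually marked as failed
cat /proc/mdstat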
Typically mdadm is pretty robust - the initrd can be a different matter; the (re-)boot may fail if the initrd hasn't been built to handle RAID in degraded mode. Only a boot will tell in all likelihood.
As for LVM, it won't care.
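If you want to play it safe on the initrd, regenerating it while the array is degraded should bake in the current mdadm state. On SLES 12 that is dracut; a rough sketch, assuming the running kernel:

# rebuild the initrd for the running kernel (-f overwrites the existing one)
dracut -f
# SUSE's mkinitrd is a wrapper that does the same for all installed kernels
mkinitrd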
Thanks smallpond.

------------------------------------------
Thanks syg00.

LVM is now missing the sda disk altogether, so why do you think that would not create a problem?
Your LVM is built on top of mdadm - LVM neither knows nor cares about the physical devices; it cares about /dev/md[012]. Those messages are from layers below mdadm.
The pv size (of RAID1) is unchanged by losing 1 device, so LVM carries on regardless. Likewise filesystem UUIDs are not dependent on adding new device(s) to the array.
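You can see that from the LVM side yourself - these are all read-only status commands:

# the PVs should be /dev/md2 and /dev/md3, not the raw sdX disks
pvs
vgs
lvs
# the degraded/rebuilding state only shows up one layer down, at md level
cat /proc/mdstat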
fstab having UUIDs is the right way to do it. The device names can change but the UUID won't.
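Easy enough to verify on your box - both commands are read-only:

# list the UUIDs of the block devices (the filesystems sit on the LVs, not on md2/md3 directly)
blkid
# see what fstab actually references
grep -i uuid /etc/fstab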
I have one more query. Is it required to install grub on the non-failed disk? One of the instructions states:

Is it required in our case or not? Thanks in advance.
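From what I have read, the command would be something like the following - assuming SLES 12 is on GRUB2, which I believe is its default; please correct me if this box is really on legacy grub, given that it has a grub.conf:

# write the GRUB2 boot code to the MBR of the surviving disk
grub2-install /dev/sdb
# legacy grub equivalent, if that is what the system actually uses
# grub-install /dev/sdb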
One more update,
I ran the command (# dd bs=512 count=1 if=/dev/sda 2>/dev/null | strings), adapted for /dev/sdb, to check whether grub is installed on /dev/sdb. But it did not give any results, so it looks like grub is not installed on sdb, or am I missing something? If it is really not installed, then, as per your recommendations, I should run grub-install /dev/sdb before rebooting this host. Is that a safe command? I mean, the system is up and running now, and I do not wish to disturb it until I have a disk or system replacement plan.
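To be sure, I was planning to repeat the check on both disks like this (reading the first sector should be harmless; only grub-install writes anything):

# look for the GRUB signature in the MBR of each disk (read-only)
for d in /dev/sda /dev/sdb; do
    echo "== $d =="
    dd if=$d bs=512 count=1 2>/dev/null | strings | grep -i grub
done
# "file -s" also reports whether the first sector contains boot code
file -s /dev/sdb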
You should have grub installed on both, but you may have to manually change the boot order if sda fails.