Old 01-09-2021, 06:19 AM   #1
LinuGeek
Member
 
SuSE Linux RAID Faulty disk replacement


Hello Experts,

We have an important database server running SUSE Linux Enterprise Server 12. The previous admin set it up as follows.

4 internal disks:

1+1 -- RAID-1 software RAID --> root partitions
1+1 -- RAID-1 software RAID --> data partitions with the database

The root RAID additionally has LVM on top of it, which is then sliced into logical volumes for /usr, /boot, etc.

So there are two volume groups: 1. the System VG and 2. the Data VG.

There are four disks: sda+sdb and sdc+sdd.

Recently we noticed that one of the disks in the System software RAID group has gone bad, and the server continued to work without any problem (thanks to RAID 1 mirroring).
As shown below, three software RAID partitions (md0, md1 and md2, which are the system partitions) are marked as failed/degraded; md3 is for the database.
So the failed members are sda1, sda2 and sda3.

Code:
#cat /proc/mdstat
Personalities : [raid1]

md0 : active raid1 sdb1[1] sda1[0](F)	<<<<<-------------
      1051584 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md1 : active raid1 sdb2[1] sda2[0](F)	<<<<<-------------
      18876288 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md2 : active raid1 sdb3[1] sda3[0](F)	<<<<<-------------
      956832576 blocks super 1.0 [2/1] [_U]
      bitmap: 2/8 pages [8KB], 65536KB chunk

md3 : active raid1 sdc1[0] sdd1[1]
      976760640 blocks super 1.0 [2/2] [UU]
      bitmap: 2/8 pages [8KB], 65536KB chunk

unused devices: <none>

We have to replace the faulty disk (sda) so that the arrays rebuild back to the original structure.

I have come up with the following plan. Please suggest modifications.

1. Shut down the server, which will also take down the database.
2. Take out the faulty disk.
3. Replace it with a new one.
4. Restart the server.
5. The automatic rebuild, mirroring the new disk from the existing one, should start.

This sounds like a mostly automated process.


If this does not work, then we can manually do a few more steps.
Quote:
Question: can we do this at the current runlevel without any problem?
1. Mark the partitions as failed if they are not already marked (F) by the system.

Code:
# mdadm --manage /dev/md0 --fail /dev/sda1
# mdadm --manage /dev/md1 --fail /dev/sda2
# mdadm --manage /dev/md2 --fail /dev/sda3
To verify that the partitions are marked as failed, check /proc/mdstat.

2. Remove the failed partitions from the arrays with mdadm:
Code:
# mdadm --manage /dev/md0 --remove /dev/sda1
# mdadm --manage /dev/md1 --remove /dev/sda2
# mdadm --manage /dev/md2 --remove /dev/sda3
3. Replace the disk.
Quote:
Question: how do we identify the faulty disk?
4. Copy the partition table to the new disk.
(Caution: this sfdisk command will replace the entire partition table on the target disk with that of the source disk; use an alternative approach if you need to preserve other partition information.)

Code:
# sfdisk -d /dev/sdb | sfdisk /dev/sda
5. Add the new partitions back to re-create the mirrors:

Code:
# mdadm --manage /dev/md0 --add /dev/sda1
# mdadm --manage /dev/md1 --add /dev/sda2
# mdadm --manage /dev/md2 --add /dev/sda3
6. To check the array status, enter the command below:

Code:
# /sbin/mdadm --detail /dev/md0
7. The following command will show the current progress of the mirror rebuild:

Code:
# cat /proc/mdstat
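(To keep an eye on the rebuild while it runs, something like the following could be used; just a convenience sketch, assuming the watch utility is available:)

Code:
# Refresh the rebuild status every 5 seconds
watch -n 5 cat /proc/mdstat
# Or query the detailed state of a single array
/sbin/mdadm --detail /dev/md2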
A system backup is in place.
Please give your valuable inputs.
Quote:
Is there any better option?

Thank you in advance.


Regards,
Admin

 
Old 01-09-2021, 08:22 AM   #2
Ser Olmy
Senior Member
 
Quote:
Originally Posted by LinuGeek View Post
1. Shut down the server, which will also take down the database.
2. Take out the faulty disk.
3. Replace it with a new one.
4. Restart the server.
If you have to. Most SATA controllers/drivers support hotplugging, but unless the drive is in a hotplug tray, you'd better shut down the server first.

Also, I'd add this step from your alternative routine:

0. Remove the failed partitions with:
Code:
mdadm --manage /dev/md0 --remove /dev/sda1
mdadm --manage /dev/md1 --remove /dev/sda2
mdadm --manage /dev/md2 --remove /dev/sda2
You should definitely do this before powering down the server.

And you will obviously also have to find a way to identify the failed drive before powering down. If it isn't obvious which drive is which, run smartctl on the working drives and make a note of the serial numbers.
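If it helps, a minimal sketch of that check (assuming the smartmontools package is installed; the drive names here are just examples):

Code:
# Record make, model and serial number of each remaining healthy drive
smartctl -i /dev/sdb | grep -E 'Device Model|Serial Number'
smartctl -i /dev/sdc | grep -E 'Device Model|Serial Number'
smartctl -i /dev/sdd | grep -E 'Device Model|Serial Number'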
Quote:
Originally Posted by LinuGeek View Post
5. The automatic rebuild, mirroring the new disk from the existing one, should start.
Nope. This isn't hardware RAID.

Since the RAID components are partitions rather than drives, you'll have to create the partitions manually and then run mdadm --manage /dev/mdx --add /dev/sday for each RAID device and partition. That will start the rebuild process.
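As a rough sketch (assuming the replacement drive comes up as /dev/sda again and /dev/sdb is the surviving mirror, as in your layout):

Code:
# Copy the partition table from the surviving disk to the replacement
sfdisk -d /dev/sdb | sfdisk /dev/sda
# Re-add the new partitions; each array starts rebuilding as soon as its member is added
mdadm --manage /dev/md0 --add /dev/sda1
mdadm --manage /dev/md1 --add /dev/sda2
mdadm --manage /dev/md2 --add /dev/sda3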

 
Old 01-10-2021, 03:29 AM   #3
LinuGeek
Member
 
Original Poster
Quote:
Originally Posted by Ser Olmy View Post
Also, I'd add this step from your alternative routine:

0. Remove the failed partitions with:
Code:
mdadm --manage /dev/md0 --remove /dev/sda1
mdadm --manage /dev/md1 --remove /dev/sda2
mdadm --manage /dev/md2 --remove /dev/sda2
You should definitely do this before powering down the server.
You meant
Code:
mdadm --manage /dev/md2 --remove /dev/sda3
?
And isn't that the same as step 2 in my post? Or is the order not correct?
 
Old 01-10-2021, 03:32 AM   #4
LinuGeek
Member
 
Original Poster
One more question,

Since I have to reboot the system a couple of times while one of the system disks (sda) is not in place, will that create any problem when booting the OS?

My GRUB configuration is as follows:

cat /boot/grub2/grub.cfg (relevant part only):

Quote:
insmod part_msdos msdos
insmod diskfilter mdraid1x lvm
insmod ext2
set root='lvmid/m7AEp0-79EG-D2Vi-ELzE-BTzh-C8mN-CLxrpz/S0eZEl-PlBX-E1ZL-oCwL-SmUx-4Qe4-Mz9NHX'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='lvmid/m7AEp0-79EG-D2Vi-ELzE-BTzh-C8mN-CLxrpz/S0eZEl-PlBX-E1ZL-oCwL-SmUx-4Qe4-Mz9NHX' 7c2e3a9c-5f5b-47e3-8a0a-d1e66f12747c
else
search --no-floppy --fs-uuid --set=root 7c2e3a9c-5f5b-47e3-8a0a-d1e66f12747c
fi
font="/share/grub2/unicode.pf2"
fi

if loadfont $font ; then
set gfxmode=auto
load_video
insmod gfxterm
set locale_dir=$prefix/locale
set lang=POSIX
insmod gettext
fi
terminal_output gfxterm
insmod part_msdos msdos
insmod diskfilter mdraid1x
insmod ext2
set root='mduuid/531cd341e2c7d5a71c542ad04d9ea589'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/531cd341e2c7d5a71c542ad04d9ea589' 96c11697-c3b7-4f11-90fc-3aef207db526
else
search --no-floppy --fs-uuid --set=root 96c11697-c3b7-4f11-90fc-3aef207db526
fi
insmod gfxmenu
loadfont ($root)/grub2/themes/SLE/DejaVuSans-Bold14.pf2
loadfont ($root)/grub2/themes/SLE/DejaVuSans10.pf2
loadfont ($root)/grub2/themes/SLE/DejaVuSans12.pf2
loadfont ($root)/grub2/themes/SLE/ascii.pf2
insmod png
set theme=($root)/grub2/themes/SLE/theme.txt
export theme
if [ x${boot_once} = xtrue ]; then
set timeout=0
elif [ x$feature_timeout_style = xy ] ; then
set timeout_style=menu
set timeout=8
# Fallback normal timeout code in case the timeout_style feature is
# unavailable.
else
set timeout=8
fi
### END /etc/grub.d/00_header ###

### BEGIN /etc/grub.d/10_linux ###
menuentry 'SLES12' --class sles12 --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-690785da-f0f0-4250-b693-5a008acbba10' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos msdos
insmod diskfilter mdraid1x
insmod ext2
set root='mduuid/531cd341e2c7d5a71c542ad04d9ea589'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/531cd341e2c7d5a71c542ad04d9ea589' 96c11697-c3b7-4f11-90fc-3aef207db526
else
search --no-floppy --fs-uuid --set=root 96c11697-c3b7-4f11-90fc-3aef207db526
fi
echo 'Loading Linux 3.12.28-4-default ...'
linux /vmlinuz-3.12.28-4-default root=UUID=690785da-f0f0-4250-b693-5a008acbba10 resume=/dev/md1 splash=silent quiet crashkernel=232M-:116M showopts
echo 'Loading initial ramdisk ...'
initrd /initrd-3.12.28-4-default
}
submenu 'Advanced options for SLES12' --hotkey=1 $menuentry_id_option 'gnulinux-advanced-690785da-f0f0-4250-b693-5a008acbba10' {
menuentry 'SLES12, with Linux 3.12.28-4-default' --hotkey=2 --class sles12 --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-3.12.28-4-default-advanced-690785da-f0f0-4250-b693-5a008acbba10' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos msdos
insmod diskfilter mdraid1x
insmod ext2
set root='mduuid/531cd341e2c7d5a71c542ad04d9ea589'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/531cd341e2c7d5a71c542ad04d9ea589' 96c11697-c3b7-4f11-90fc-3aef207db526
else
search --no-floppy --fs-uuid --set=root 96c11697-c3b7-4f11-90fc-3aef207db526
fi
echo 'Loading Linux 3.12.28-4-default ...'
linux /vmlinuz-3.12.28-4-default root=UUID=690785da-f0f0-4250-b693-5a008acbba10 resume=/dev/md1 splash=silent quiet crashkernel=232M-:116M showopts
echo 'Loading initial ramdisk ...'
initrd /initrd-3.12.28-4-default
}
menuentry 'SLES12, with Linux 3.12.28-4-default (recovery mode)' --hotkey=3 --class sles12 --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-3.12.28-4-default-recovery-690785da-f0f0-4250-b693-5a008acbba10' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos msdos
insmod diskfilter mdraid1x
insmod ext2
set root='mduuid/531cd341e2c7d5a71c542ad04d9ea589'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/531cd341e2c7d5a71c542ad04d9ea589' 96c11697-c3b7-4f11-90fc-3aef207db526
else
search --no-floppy --fs-uuid --set=root 96c11697-c3b7-4f11-90fc-3aef207db526
fi
echo 'Loading Linux 3.12.28-4-default ...'
linux /vmlinuz-3.12.28-4-default root=UUID=690785da-f0f0-4250-b693-5a008acbba10 showopts apm=off noresume edd=off powersaved=off nohz=off highres=off processor.max_cstate=1 nomodeset x11failsafe crashkernel=232M-:116M
echo 'Loading initial ramdisk ...'
initrd /initrd-3.12.28-4-default
}
}

Secondly, I think it is also necessary to install GRUB onto the new drive, as shown below.

For GRUB 2, running grub-install on the new drive should be enough. For example:

Quote:
grub-install /dev/sda
Will it be enough then? Or am I missing something?
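Note that on SLES 12 the tool is usually named grub2-install rather than grub-install. A minimal sketch, assuming the replacement disk comes back as /dev/sda and the command is run from the normally booted system (or from a chroot):

Code:
# Reinstall the boot loader into the MBR of the replacement disk
grub2-install /dev/sda
# Regenerate the configuration if anything changed
grub2-mkconfig -o /boot/grub2/grub.cfg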

Thanks in advance.

 
Old 01-10-2021, 04:37 AM   #5
Ser Olmy
Senior Member
 
Quote:
Originally Posted by LinuGeek View Post
You meant
Code:
mdadm --manage /dev/md2 --remove /dev/sda3
?
Indeed. A typo on my part.
Quote:
Originally Posted by LinuGeek View Post
And is it not the same as Step no. 2 in my comment? Or is the order not correct?
You should do it before powering down and adding the new drive, that's all.

Quote:
Originally Posted by LinuGeek View Post
Since I have to reboot the system a couple of times while one of the system disks (sda) is not in place, will that create any problem when booting the OS?
Good point. GRUB may or may not handle booting from a mirror set, but the BIOS will default to booting from the first hard drive regardless.

If GRUB has duplicated the boot sector to both drives (and that's a big "if"), you could possibly boot directly from the second drive via the server's boot menu, or by booting from removable media with a boot loader that allows booting from other drives.

Otherwise, it will be necessary to manually install GRUB to the new drive before you can boot. And you will definitely have to re-run the GRUB installer afterwards anyway.

(BTW, all this would have been unnecessary if the drives had been mounted in hotplug trays, or if this had been a hardware RAID setup or even a so-called "fakeRAID" volume.)
 
Old 01-10-2021, 04:54 AM   #6
LinuGeek
Member
 
Original Poster
Another thing, regarding swap:

RAID device md1 is actually swap.

Quote:
#cat /proc/swaps

Filename Type Size Used Priority
/dev/md1 partition 18876284 146068 -1
So before step 1, I should be doing:


Quote:
#swapoff /dev/md1
And then proceed with step 1 of marking the RAID partitions as failed, etc.

Later, after the last step, when the rebuild is complete, swap should be enabled on the md1 device once again:

Quote:
#mkswap /dev/md1
#swapon -a
Is this okay?
 
Old 01-11-2021, 02:59 AM   #7
Ser Olmy
Senior Member
 
/dev/md1 isn't going to disappear during this procedure, so I see no reason to deactivate the swap partition.
 
Old 01-11-2021, 04:11 AM   #8
LinuGeek
Member
 
Original Poster
Thanks for the reply Ser Olmy.

A couple of questions come to my mind:

1. Sometimes, when there are two disks, let's say sda and sdb, and one of them is non-functional (sda in this case), after a reboot the disk that was sdb can be identified as sda in the absence of the original sda. This will definitely affect the set of commands I have prepared.

2. As said in the first post, there is an LVM layer sitting on top of the software RAID.


The filesystem layout (per fstab) looks like this:

Quote:
Filesystem Mounted on
devtmpfs /dev
tmpfs /dev/shm
tmpfs /run
tmpfs /sys/fs/cgroup

/dev/md0 /boot
/dev/mapper/system-root /
/dev/mapper/system-usr /usr
/dev/mapper/system-var /var
/dev/mapper/system-opt /opt
So "system" is one of the VGs present on sdb and the faulty disk sda. Does that make any difference to the set of commands?
After the RAID is rebuilt, the LVM volumes should be back as expected?

Your thoughts on these questions please.

Thanks in advance.
 
Old 01-11-2021, 04:30 AM   #9
Ser Olmy
Senior Member
 
Quote:
Originally Posted by LinuGeek View Post
1. Sometimes, when there are two disks, let's say sda and sdb, and one of them is non-functional (sda in this case), after a reboot the disk that was sdb can be identified as sda in the absence of the original sda. This will definitely affect the set of commands I have prepared.
Yes, that is absolutely the case if you boot the server without a replacement disk.

The order in which the Linux kernel enumerates drives (and devices in general) is determined by when the driver for the controller is loaded and how that driver then accesses the devices attached to said controller.

For SATA/SAS drives, the ports are always enumerated in the same order (which may or may not be the same order used by the motherboard BIOS), and the first drive found is assigned /dev/sda. So yes, if you simply remove the first drive and then boot the server, what used to be /dev/sdb is likely to appear as /dev/sda.

However, if you install a replacement drive and connect it to the same port, that drive will become the new /dev/sda.

To make absolutely sure you're operating on the right drive, check the make, model, and serial number with smartctl.
Quote:
Originally Posted by LinuGeek View Post
2. As said in the first post, there is an LVM layer sitting on top of the software RAID.
That's actually an advantage.

LVM volumes are identified by metadata on the LVM partitions. Drives may appear with different device node names, but as long as pvscan finds the physical volumes at boot, everything will be identified properly.

The same goes for md devices, at least as long as /etc/mdadm.conf is accessible and hasn't been edited to contain hardcoded references to device nodes.
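For example, a quick post-rebuild sanity check of both layers could look like this (read-only commands, just a sketch):

Code:
# md layer: assembled arrays and their states
cat /proc/mdstat
mdadm --detail --scan
# LVM layer: physical volumes, volume groups and logical volumes
pvs
vgs
lvs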
 
Old 01-11-2021, 04:34 AM   #10
LinuGeek
Member
 
Original Poster
Very well explained, thanks.

/etc/mdadm.conf is:


DEVICE containers partitions
ARRAY /dev/md0 UUID=531cd341:e2c7d5a7:1c542ad0:4d9ea589
ARRAY /dev/md1 UUID=fa4682d2:61901280:67e70eb9:c0335a53
ARRAY /dev/md2 UUID=885a178f:328855d9:beb12cf1:193904d1
ARRAY /dev/md3 UUID=a443e415:28f75b00:2fd4779f:44fdd524

I guess it should be OK then?
 
Old 01-11-2021, 04:38 AM   #11
Ser Olmy
Senior Member
 
Quote:
Originally Posted by LinuGeek View Post
/etc/mdadm.conf is:


DEVICE containers partitions
ARRAY /dev/md0 UUID=531cd341:e2c7d5a7:1c542ad0:4d9ea589
ARRAY /dev/md1 UUID=fa4682d2:61901280:67e70eb9:c0335a53
ARRAY /dev/md2 UUID=885a178f:328855d9:beb12cf1:193904d1
ARRAY /dev/md3 UUID=a443e415:28f75b00:2fd4779f:44fdd524

I guess, it should be ok then?
Definitely. No matter what the devices or partitions may end up being called, the UUIDs in the RAID metadata will remain the same.
 
Old 01-11-2021, 04:40 AM   #12
LinuGeek
Member
 
Original Poster
Thanks a ton. Will get back for further questions if needed.
 
Old 01-11-2021, 06:23 AM   #13
LinuGeek
Member
 
Original Poster
Just a small query: what happens if we restart the server without doing anything? Will it come up without any problems?
 
Old 01-11-2021, 07:09 AM   #14
Ser Olmy
Senior Member
 
Unlikely, as the defective drive is almost certainly the boot device.

If the drive is totally dead, the next drive on the controller (currently seen as /dev/sdb by the OS) will become the boot device. It probably lacks the GRUB bootloader, so the boot process will fail or hang.

If the drive has multiple bad sectors but is still running, the server will attempt to boot from it. If, by pure coincidence, none of the sectors holding the GRUB loader are bad, you may be able to successfully boot. But I wouldn't count on it.
 
Old 01-11-2021, 08:02 AM   #15
LinuGeek
Member
 
Original Poster
Once again good point.

Quote:
GRUB may or may not handle booting from a mirror set, but the BIOS will default to booting from the first hard drive regardless.
If we consider this, then for the system to boot I first need to recreate the boot loader on the new disk. In that case the order should be like this:

1. Note the serial numbers of the disks using smartctl.
2. Mark the faulty disk's partitions as failed and remove them from the mdadm RAID.
3. Shut down the server.
4. Take out the faulty disk and replace it with the new disk.
5. Boot the server? This will fail, as the new disk does not have any boot information on it, so I guess I will have to use a boot CD to get into recovery mode.
The obvious question is: will I be able to execute the next set of commands in recovery mode, or will a chroot be required (see the rough sketch after this list)? It's getting complicated now, I guess.
6. Once the RAID has been reconstructed, I would boot the system normally in the hope that it boots.

Am I missing something here?
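(A rough sketch of the rescue/chroot route, assuming the rescue system can assemble the arrays and activate the VGs, and that the new blank disk shows up as /dev/sda while /dev/sdb is the surviving mirror; device and LV names are taken from the layout above but this is only an illustration:)

Code:
# From the rescue shell: assemble the arrays and activate the volume groups
mdadm --assemble --scan
vgchange -ay
# Mount the root LV and /boot, bind-mount the pseudo filesystems, then chroot
mount /dev/mapper/system-root /mnt
mount /dev/md0 /mnt/boot
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt
# Inside the chroot: partition the new disk, re-add it to the arrays, reinstall GRUB
sfdisk -d /dev/sdb | sfdisk /dev/sda
mdadm --manage /dev/md0 --add /dev/sda1
mdadm --manage /dev/md1 --add /dev/sda2
mdadm --manage /dev/md2 --add /dev/sda3
grub2-install /dev/sda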
 
  

