
Vanyel 08-28-2007 03:23 PM

Software RAID issue with a RHEL 4 server
Under RHEL 4, I'm confused over a software RAID issue, but I'll need to give a little detail first.

I have two servers, Larry and Moe.

Each one has 2 disks in a RAID 1 (mirror) configuration.

Someone else was tasked with making Larry and Moe identical, as Moe is just a spare machine. They messed up and Moe was an incomplete copy. It would boot, but didn't function right and things were missing. Then this became my task.

I removed one drive from Larry and one drive from Moe, placed the Moe drive in Larry, and then used Ghost for Linux to clone Larry's good drive to the drive from Moe.

So Larry is fine as ever, and Moe functions perfectly too, on one drive, which thinks it's part of a broken RAID 1. We'll call this Drive A.

MY QUESTION IS - if I put back Moe's other drive (Drive B), which is a member of the previous RAID with the bad installation, how do I make sure Drive A is dominant and rebuilds itself onto Drive B, wiping out what's there? I don't want Drive B to come up on boot and then rebuild its damaged self onto the good Drive A! I haven't done much with software RAID before, and in the past I was always adding a blank drive into the mix, never one that already has system software and could be a potential "competitor".

Can someone give me some advice on getting this RAID back functioning again?


ajg 08-28-2007 05:56 PM

What does


cat /proc/mdstat
show on Larry and Moe?

Vanyel 08-29-2007 03:47 PM


Personalities : [raid1]
md2 : active raid1 sdb5[1] sda5[0]
2048192 blocks [2/2] [UU]

md1 : active raid1 sda6[0]
237633344 blocks [2/1] [U_]

md0 : active raid1 sdb3[1] sda3[0]
200704 blocks [2/2] [UU]

Hey! I hadn't noticed that failure on Larry's md1. Not sure what that's about. But let's concentrate on Moe.
Drive A is present; Drive B is disconnected.

Personalities : [raid1]
md2 : active raid1 sda5[1]
2048192 blocks [2/1] [_U]

md1 : active raid1 sda6[1]
237633344 blocks [2/1] [_U]

md0 : active raid1 sda3[1]
200704 blocks [2/1] [_U]

unused devices: <none>

- Van
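
A quick way to spot degraded arrays in output like the above is to look for an underscore in the status brackets. The following is only a minimal sketch, assuming the classic /proc/mdstat layout shown in this thread; the function name degraded_arrays is made up for illustration.

```shell
# Read mdstat-format text on stdin and print each array whose status
# brackets contain "_" (a missing mirror half), e.g. "md1 [U_]".
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the array name
        /\[[U_]+\]/ && name != "" {            # the "blocks [2/1] [U_]" line
            status = $NF
            if (index(status, "_") > 0) print name, status
            name = ""
        }
    '
}

# Run against the live system only when the file is readable.
if [ -r /proc/mdstat ]; then
    degraded_arrays < /proc/mdstat
fi
```

Fed Larry's output above, this would print only md1 [U_]; fed Moe's output with just Drive A attached, it would list all three arrays.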

ajg 08-29-2007 04:18 PM

OK, that failed partition on Larry is interesting, but we can come to that later.

Moe has 3 RAID partitions on /dev/sda. Is it SATA or SCSI? It looks like SATA, and this can make a difference in the drive ordering - if you removed what was /dev/sda, then what was /dev/sdb is now /dev/sda. If you put the old drive back in, that will now be /dev/sda, and the drive you want to keep will be /dev/sdb - this gets really confusing. :D

I see sda3, sda5 and sda6 as part of the mirror sets - are sda1, sda2 and sda4 unmirrored or something else?

I really want to be sure of where I am before I give you any advice and instructions. A copy of the partition table from fdisk would be handy!


fdisk /dev/sda

You need to be root to see the device.

Vanyel 08-30-2007 09:30 AM

No problem. I'm root.

These are SATA drives, btw.

You can see sda1, sda2 and sda4 in the fdisk output, below.

THANK YOU for your help so far!

[van@<machine> ~]$ sudo fdisk /dev/sda

The number of cylinders for this disk is set to 30394.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs

Command (m for help): p

Disk /dev/sda: 250.0 GB, 250000000000 bytes
255 heads, 63 sectors/track, 30394 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 * 1 7 56196 de Dell Utility
/dev/sda2 8 530 4200997+ c W95 FAT32 (LBA)
/dev/sda3 531 555 200812+ fd Linux raid autodetect
/dev/sda4 556 30394 239681767+ 5 Extended
/dev/sda5 556 810 2048256 fd Linux raid autodetect
/dev/sda6 811 30394 237633448+ fd Linux raid autodetect
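
The mirror members are the partitions with Id fd (Linux raid autodetect). As a rough sketch, they can be picked out of `fdisk -l` style text like this. It assumes the old column layout shown above, and the function name raid_partitions is hypothetical.

```shell
# Read a partition table listing on stdin and print the devices whose
# Id column is "fd" (Linux raid autodetect).
raid_partitions() {
    awk '
        $1 ~ /^\/dev\// {
            # A "*" in the Boot column shifts every later column by one.
            id = ($2 == "*") ? $6 : $5
            if (id == "fd") print $1
        }
    '
}

# Typical use (needs root): fdisk -l /dev/sda | raid_partitions
```

For the table above this would print /dev/sda3, /dev/sda5 and /dev/sda6, one per line.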

ajg 08-31-2007 06:41 AM

Right. Do you know which SATA port the drive you have running is installed on? It needs to be on the first port or we'll end up getting confused when we add the other drive back in.

I would have preferred to wipe the drive we're putting back in completely, but I guess given that it's full of Dell system partitions (this makes me suspect it was on port 1 of the SATA controller initially) that's not an option, so we'll have to hope that everything goes by the book.

So ... what I would do:

1) Take a backup. There is a small chance that this process could go horribly, catastrophically wrong.

2) Make sure the existing drive is on the first SATA port in the system.

3) If you have to change it over, boot the system and do a

cat /proc/mdstat
to make sure it all looks good (nothing should change from when you last looked at it).

4) Install the second drive on the second SATA port. For this process to work following my instructions, Linux has to see it as /dev/sdb. Things will go horribly wrong if it isn't.

5) Boot the system and do a

cat /proc/mdstat
If things are going by-the-book, it should show that all the /dev/sdaX volumes are up, and the /dev/sdbX are still down, so we need to add them back into the array. It may figure it out and try to remirror things by itself - the mdstat will tell you remirroring progress, but this has never happened in my experience. If it does, you'll have to wait for it to finish, then verify your data. If there's anything wrong, go for your backup. If by some miracle it remirrors automatically with no problems, then you're done. I strongly suspect this won't be the case, and you'll have to tell it to remirror though.

6) So ... if all is looking good, do:

mdadm /dev/md0 --add /dev/sdb3
mdadm /dev/md2 --add /dev/sdb5
mdadm /dev/md1 --add /dev/sdb6

keep checking

cat /proc/mdstat
to verify progress of the remirroring. You can also do this on Larry with the /dev/md1 to try and mirror that back up too:

mdadm /dev/md1 --add /dev/sdb6
If you're not sure about anything, or something is unclear, come back to me before leaping in with this! I cannot stress enough how horribly things can go wrong when mucking around with RAID sets!
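
The check in step 5 (every array up on /dev/sdaX only, nothing from /dev/sdb active yet) can be sketched as a small guard to run before any --add. This is an illustration under assumptions, not a tested procedure: the function name only_running_on is made up, and it assumes sdX-style device names as used in this thread.

```shell
# Succeed only if every array in the mdstat text on stdin is running
# solely on partitions of the disk named in $1 (for example: sda).
only_running_on() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            arrays++
            for (i = 1; i <= NF; i++) {
                if ($i !~ /\[[0-9]+\]/) continue   # skip non-member fields
                member = $i
                sub(/\[.*$/, "", member)           # drop the "[1]" slot suffix
                sub(/[0-9]+$/, "", member)         # drop the partition number
                if (member != disk) {
                    bad = 1
                    print $1 ": unexpected member " $i
                }
            }
        }
        END { exit ((bad || arrays == 0) ? 1 : 0) }
    '
}
```

One way to use it: run `only_running_on sda < /proc/mdstat` and issue the three mdadm --add commands from step 6 only if it succeeds. If it reports an unexpected member, stop and look at /proc/mdstat by hand first.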

strick1226 09-04-2007 12:33 PM

Great advice. Only thing I can add is the following:

watch -n x cat /proc/mdstat
(where x = the number of seconds between updates)

If you're sitting at a terminal and plan to watch it finish, this is the way to go.

Good luck!
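
If only the percentage matters, the progress line can be filtered out of the mdstat text. A minimal sketch follows. It assumes the usual "recovery = NN.N% (...)" progress line that the kernel prints during a rebuild, which does not appear anywhere in this thread, and rebuild_progress is a made-up name.

```shell
# Read mdstat-format text on stdin and print "<array> <percent>" for
# every array that is currently rebuilding or resyncing.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i
        }
    '
}

# Typical use: rebuild_progress < /proc/mdstat
```

It can be combined with the watch idea above by saving the function in a script and running that script under watch.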

Vanyel 09-04-2007 03:02 PM

Strick - thanks for the Watch command! I'd never heard of it. Good tool!

AJG - Thanks for ALL your help so far!!! So here's how it went -

After getting some hardware advice from Dell on how to tell which drive should be dominant on reboot (which turned out to be WRONG!) I finally got sick of it and just plugged in Moe B. In the end, Moe A/B is only a copy of Larry A/B anyway, so I could always go back to the source.

No matter WHICH hardware SATA connection the drives were plugged into, Moe B (the Bad drive) was always dominant! It was however, more messed up than I remembered and never really booted, so Moe A didn't get harmed.

I then remembered SATA *is* hot-pluggable, so I booted up with power and SATA connected to Moe A and only power connected to Moe B. The good drive came up as sda. Then, once logged in, I plugged in Moe B's SATA cable and it became sdb.

From there ajg, I just followed your instructions and the remirroring seems to be coming along fine! I'll let you know how it finishes!

- Van

Vanyel 09-04-2007 04:42 PM

Hmmm ... It's done and everything seems fine, except

mdadm /dev/md0 --add /dev/sdb3

doesn't stick. After issuing the command I see a quick recovery process, but after I reboot, I get

[van@<machine> ~]$ cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb5[0] sda5[1]
2048192 blocks [2/2] [UU]

md1 : active raid1 sdb6[0] sda6[1]
237633344 blocks [2/2] [UU]

md0 : active raid1 sda3[1]
200704 blocks [2/1] [_U]

unused devices: <none>

Why does one half of md0 not come back after reboot?

ajg 09-07-2007 04:00 AM

A good question, and one that I've never been able to get to the bottom of. It may be something to do with failed blocks on the drive you are trying to mirror to - it's possible that it no longer has enough good blocks to mirror the whole data set. I have one like this, but it's not a production server so I've never bothered to find out why. Could be worth having a look with mdadm to see if this is the case.
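
For the "have a look with mdadm" part, a starting point might be to compare what the running array reports with what is stored in the superblock on the partition that keeps dropping out. The sketch below only wraps two standard mdadm query modes (--detail and --examine). The wrapper name inspect_member is hypothetical, and it does not diagnose anything by itself.

```shell
# Print what the kernel and the on-disk superblock say about one member.
# Usage (needs root): inspect_member /dev/md0 /dev/sdb3
inspect_member() {
    array=$1
    member=$2
    if [ ! -b "$array" ] || [ ! -b "$member" ]; then
        echo "inspect_member: $array or $member is not a block device" >&2
        return 1
    fi
    mdadm --detail "$array"      # state of the running array
    mdadm --examine "$member"    # RAID superblock stored on the partition
}
```

Running it once for /dev/sda3 and once for /dev/sdb3 right after a reboot would show whether the two halves still agree about which array they belong to.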

All times are GMT -5.