Raid1 problem

davejones · 08-08-2016, 07:31 AM

I have recently bought a dedicated server. I could not afford a managed server and I'm willing to learn, but this problem has come too quickly for me, so excuse my ignorance.

Within a few weeks of getting the server it's got a faulty drive in a 2-drive RAID. After what seems like days of reading I've managed to a) remove the wrong drive from the array and b) re-add it :-) I've now got part way to removing the faulty drive but this won't work. Here is the printout; can anyone give me any help on what's wrong here? I need to remove drive sda as that's the faulty drive.

Filesystem Size Used Avail Use% Mounted on
/dev/md2 1008G 20G 937G 3% /
tmpfs 16G 0 16G 0% /dev/shm
/dev/md1 496M 35M 436M 8% /boot
/dev/md3 1.7T 5.3G 1.6T 1% /home

[root@svr1 ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
16777088 blocks super 1.0 [2/1] [_U]

md3 : active raid1 sdb5[1]
1839216960 blocks super 1.0 [2/1] [_U]
bitmap: 11/14 pages [44KB], 65536KB chunk

md2 : active raid1 sdb3[1]
1073741632 blocks super 1.0 [2/1] [_U]
bitmap: 6/8 pages [24KB], 65536KB chunk

md1 : active raid1 sda2[0]
524224 blocks [2/1] [U_]

unused devices: <none>
[root@svr1 ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set device faulty failed for /dev/sda2: Device or resource busy
[root@svr1 ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot remove failed for /dev/sda2: Device or resource busy
[root@svr1 ~]# mdadm --manage /dev/md1 --stop
mdadm: Cannot get exclusive access to /dev/md1:Perhaps a running process, mounted filesystem or active volume group?

jpollard · 08-08-2016, 08:02 AM

None of your devices has more than one disk... you can't fail the last disk of a raid1. You have to install a new disk, partition it appropriately, and then add the appropriate partition to the raid1. Once you have two disks in a raid1 you can then fail one of the two out. If in the past you DID have more than two disks installed, the system has already removed the faulty one (though I thought that was a manual operation and not automatic; but I can see the possibility that it happened during a boot - I didn't test for that).

/dev/sda2 is in use by md1 - and there are no other disks in use.

md0, md2, and md3 are all on the SAME disk (/dev/sdb), so you have no redundancy anywhere. Lose that disk and you lose all three filesystems, and with no recovery possible.
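
A minimal sketch for spotting this at a glance, assuming the usual /proc/mdstat layout shown above (an "mdN : active raid1 ..." line followed by a "blocks" line ending in a status such as [_U]). The function name degraded_arrays is made up for illustration; it only reads the file and changes nothing.

```shell
# Print each degraded md array found in an mdstat-format file
# (default: /proc/mdstat). An array is treated as degraded when its
# status field, e.g. [_U], contains an underscore (a missing member).
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdb1[2]"
        /^md[0-9]+ :/ {
            name = $1
            members = ""
            for (i = 5; i <= NF; i++) members = members " " $i
        }
        # Status line, e.g. "16777088 blocks super 1.0 [2/1] [_U]"
        /blocks/ && /\[[U_]+\]/ {
            if ($NF ~ /_/) print name ":" members " " $NF
        }
    ' "${1:-/proc/mdstat}"
}
```

Run as "degraded_arrays" on the server, or pass a saved copy of the mdstat output as the first argument. Each line printed is an array running on fewer members than it was created with.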

davejones · 08-08-2016, 08:31 AM

I really don't understand RAID at all, do I :-)

This is how the raid was before I attempted to remove the faulty drive.

Software RAID:
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
16777088 blocks super 1.0 [2/2] [UU]

md3 : active raid1 sdb5[1]
1839216960 blocks super 1.0 [2/1] [_U]
bitmap: 11/14 pages [44KB], 65536KB chunk

md2 : active raid1 sdb3[1]
1073741632 blocks super 1.0 [2/1] [_U]
bitmap: 6/8 pages [24KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
524224 blocks [2/2] [UU]

unused devices: <none>
Partition info:
Filesystem Size Used Avail Use% Mounted on
/dev/md2 1008G 20G 937G 3% /
tmpfs 16G 0 16G 0% /dev/shm
/dev/md1 496M 35M 436M 8% /boot
/dev/md3 1.7T 5.3G 1.6T 1% /home

From there I attempted to remove sdb by mistake.
I then added it back and it did rebuild.
The last thing I tried to do was remove sda and add grub to sdb.
I don't really know where I am now. Are you saying I can now get the host to physically replace the damaged drive sda?

Bear in mind that, for other reasons, I've not slept for 36 hours now :-)

suicidaleggroll · 08-08-2016, 09:56 AM

Even in that output you have two "raid" arrays with only one partition in each, md2 and md3. Who set up this system originally? Is it possible sda had already ejected itself from those two arrays before you got that output? Do you have the mdadm status from when everything was working correctly?

If you want to remove sda, you'll need to add sdb2 back into md1, let it rebuild and sync, and then you can remove sda2 from md1. At that point, sda will not be in use by any arrays and can be removed from the system. If you were to remove sda now, you would lose md1 which contains your /boot partition.
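
The order of operations above can be sketched as a dry run. This is only an illustration under assumptions: the helper name print_swap_plan is hypothetical, it prints the commands rather than running them, and it adds an explicit --fail step because mdadm expects a member to be failed (or a spare) before it can be removed. "mdadm --wait" blocks until any resync or recovery on the array has finished.

```shell
# Print (do not run) the commands for bringing one member back into an
# array, waiting for the rebuild, and only then retiring the other member.
# Arguments: array name, partition to add back, partition to retire.
print_swap_plan() {
    array=$1
    keep=$2
    retire=$3
    echo "mdadm --manage /dev/$array --add /dev/$keep"      # start the rebuild
    echo "mdadm --wait /dev/$array"                         # block until sync is done
    echo "mdadm --detail /dev/$array"                       # confirm both members are in sync
    echo "mdadm --manage /dev/$array --fail /dev/$retire"   # mark the old member faulty
    echo "mdadm --manage /dev/$array --remove /dev/$retire" # then take it out
}
```

For this thread the call would be "print_swap_plan md1 sdb2 sda2". Review the printed lines and run them one at a time as root rather than piping them straight into a shell.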

davejones · 08-08-2016, 10:05 AM

As I said, this is an unmanaged server, my first, so I installed the OS from the image provided in the host admin area.

I don't have the details of the system before the HD went faulty, unless this is it:

Software RAID:
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
16777088 blocks super 1.0 [2/2] [UU]

md3 : active raid1 sdb5[1]
1839216960 blocks super 1.0 [2/1] [_U]
bitmap: 11/14 pages [44KB], 65536KB chunk

md2 : active raid1 sdb3[1]
1073741632 blocks super 1.0 [2/1] [_U]
bitmap: 6/8 pages [24KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
524224 blocks [2/2] [UU]

unused devices: <none>
Partition info:
Filesystem Size Used Avail Use% Mounted on
/dev/md2 1008G 20G 937G 3% /
tmpfs 16G 0 16G 0% /dev/shm
/dev/md1 496M 35M 436M 8% /boot
/dev/md3 1.7T 5.3G 1.6T 1% /home

It's possible that I messed it up in my efforts.

I'll do as you suggest, this help is very much appreciated. Once I have this sorted I think I will set up some old hardware I have and really nail the understanding of RAID. Then perhaps set this server up correctly.

suicidaleggroll · 08-08-2016, 10:08 AM

No need for dedicated hardware for testing, just use virtual machines. Give your VM two disks, and then inside the VM you can partition and RAID them however you like. If you screw something up, just restore from a backup or snapshot.

davejones · 08-08-2016, 11:16 AM

Still can not remove sda2

[root@svr1 ~]# mdadm --manage /dev/md1 --add /dev/sdb2

[root@svr1 ~]# cat /proc/mdstat

Personalities : [raid1]
md0 : active raid1 sdb1[2]
16777088 blocks super 1.0 [2/1] [_U]

md3 : active raid1 sdb5[1]
1839216960 blocks super 1.0 [2/1] [_U]
bitmap: 11/14 pages [44KB], 65536KB chunk

md2 : active raid1 sdb3[1]
1073741632 blocks super 1.0 [2/1] [_U]
bitmap: 6/8 pages [24KB], 65536KB chunk

md1 : active raid1 sdb2[1] sda2[0]
524224 blocks [2/2] [UU]

unused devices: <none>
[root@svr1 ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot remove failed for /dev/sda2: Device or resource busy
[root@svr1 ~]#

suicidaleggroll · 08-08-2016, 11:52 AM

Use "mdadm --detail /dev/md1" to see the current status of md1. It's likely using sda2 to rebuild sdb2. As I mentioned in my steps before:

Quote:

you'll need to add sdb2 back into md1, let it rebuild and sync, and then you can remove sda2 from md1

This is the output from a sync'd and clean raid 1:

Code:

# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Oct 17 10:00:28 2014
     Raid Level : raid1
     Array Size : 1953381184 (1862.89 GiB 2000.26 GB)
  Used Dev Size : 1953381184 (1862.89 GiB 2000.26 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Aug  8 10:52:28 2016
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : gauss:0  (local to host gauss)
           UUID : 7b746352:9db52f98:c2b0b38e:fb041063
         Events : 5416

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
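
A rough way to make that check scriptable, assuming the "Raid Devices", "Active Devices" and "State" lines appear as in the output above. The function name array_in_sync is made up for this sketch; it reads "mdadm --detail" output on stdin and only sets an exit status.

```shell
# Read `mdadm --detail` output on stdin. Exit 0 if the number of active
# devices equals the number of raid devices and the State line mentions no
# degraded/recovering/resyncing condition; exit 1 otherwise.
array_in_sync() {
    awk -F' : ' '
        /Raid Devices/   { raid = $2 + 0 }
        /Active Devices/ { active = $2 + 0 }
        /^ *State :/     { state = $2 }
        END {
            ok = (raid > 0 && raid == active && state !~ /degraded|recover|resync/)
            exit (ok ? 0 : 1)
        }
    '
}
```

Usage would be along the lines of "mdadm --detail /dev/md1 | array_in_sync && echo in sync". It is deliberately conservative: anything it cannot parse counts as not in sync.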

davejones · 08-08-2016, 11:59 AM

Filesystem Size Used Avail Use% Mounted on
/dev/md2 1008G 20G 937G 3% /
tmpfs 16G 0 16G 0% /dev/shm
/dev/md1 496M 35M 436M 8% /boot
/dev/md3 1.7T 5.3G 1.6T 1% /home

[root@svr1 ~]# mdadm --detail /dev/md1
/dev/md1:
Version : 0.90
Creation Time : Tue Jul 5 14:29:48 2016
Raid Level : raid1
Array Size : 524224 (511.94 MiB 536.81 MB)
Used Dev Size : 524224 (511.94 MiB 536.81 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Mon Aug 8 16:58:09 2016
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

UUID : c3cfb83a:5b7936f1:776c2c25:004bd7b2
Events : 0.90

Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
[root@svr1 ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot remove failed for /dev/sda2: Device or resource busy
[root@svr1 ~]#

suicidaleggroll · 08-08-2016, 12:01 PM

How about if you add the fail flag:

Code:

mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
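
As far as I know from the mdadm man page, --remove only accepts members that are already failed or are spares, which would explain the "Device or resource busy" on an active member. Below is a sketch that prints the combined fail-and-remove command for every array still using a given disk. Assumptions: the arrays are active, /proc/mdstat has the layout shown earlier in the thread, and the helper name print_fail_remove is hypothetical. It prints commands only and never runs mdadm itself.

```shell
# For every active array in an mdstat-format file that has a member
# partition on the given disk, print (do not run) a fail-and-remove
# command. If that partition is the array's only listed member, print a
# warning comment instead, since the last member cannot be failed out.
# Arguments: disk name (e.g. sda), optional mdstat file.
print_fail_remove() {
    disk=$1
    awk -v disk="$disk" '
        /^md[0-9]+ :/ {
            n = NF - 4                      # members start at field 5
            for (i = 5; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)       # "sda2[0]" -> "sda2"
                if (part ~ ("^" disk "[0-9]+$")) {
                    if (n < 2)
                        printf "# %s: /dev/%s is the only member, add and sync another first\n", $1, part
                    else
                        printf "mdadm /dev/%s --fail /dev/%s --remove /dev/%s\n", $1, part, part
                }
            }
        }
    ' "${2:-/proc/mdstat}"
}
```

With the mdstat output from the 11:16 AM post, "print_fail_remove sda" would print a single line for md1, which matches the command above.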

davejones · 08-08-2016, 12:12 PM

Ha ha! That's got it, thanks.

[root@svr1 ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
16777088 blocks super 1.0 [2/1] [_U]

md3 : active raid1 sdb5[1]
1839216960 blocks super 1.0 [2/1] [_U]
bitmap: 11/14 pages [44KB], 65536KB chunk

md2 : active raid1 sdb3[1]
1073741632 blocks super 1.0 [2/1] [_U]
bitmap: 6/8 pages [24KB], 65536KB chunk

md1 : active raid1 sdb2[1]
524224 blocks [2/1] [_U]

unused devices: <none>

Now I need to contact the host to replace the drive. Is there anything else I should do first?

When they hand it back I THINK I have to do this, is this ok?

sfdisk -d /dev/sdb | sfdisk --force /dev/sda

then add parts

mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
mdadm /dev/md3 -a /dev/sda5

grub-install /dev/sda

jpollard · 08-08-2016, 03:08 PM

Yes, but make sure the partitioning is correct first (as in verify the resulting partitions). The reason for verifying is that SOMETIMES (due to physical disk geometry) sizes will not match. Other things can affect it too, such as a disk with 4 KiB sectors vs one with 512-byte sectors. A new disk with 4K sectors will not perform very well when it is treated as a 512-byte-sector device. It should work, but will be much slower than a real 512-byte-sector device. What happens is that, for each small write, the disk has to read the 4K block, update the appropriate 512-byte section, then write the entire 4K block back.

For a RAID 1 (mirroring), the new partitions have to be at least the same size as the active partition.
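
One way to do that size check, sketched under the assumption that both disks have been dumped with "sfdisk -d" into files and that each partition line carries a "size=" field in sectors. The function name new_disk_fits and the file names in the usage note are invented for illustration.

```shell
# Compare two `sfdisk -d` dumps. Exit 0 only if every partition in the
# first (active) dump has a same-numbered partition in the second (new)
# dump that is at least as large; otherwise report the partition number
# and exit 1. Sizes are compared in sectors, as sfdisk reports them.
new_disk_fits() {
    awk '
        /size=/ {
            num = $1
            sub(/:$/, "", num)            # tolerate "/dev/sdb1:" as well
            sub(/^.*[^0-9]/, "", num)     # "/dev/sdb5" -> "5"
            size = $0
            sub(/^.*size= */, "", size)   # keep text from the number onwards
            size += 0                     # numeric prefix only
            if (FILENAME == ARGV[1])
                need[num] = size
            else
                have[num] = size
        }
        END {
            bad = 0
            for (num in need)
                if (have[num] + 0 < need[num]) {
                    print "partition " num " is too small on the new disk"
                    bad = 1
                }
            exit bad
        }
    ' "$1" "$2"
}
```

For example: "sfdisk -d /dev/sdb > active.dump; sfdisk -d /dev/sda > new.dump; new_disk_fits active.dump new.dump". The block-size concern is separate; on Linux it can be checked by reading /sys/block/sda/queue/physical_block_size and comparing it with the value for sdb.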

NORMALLY, for something like this I would have expected a single raid 1 device, which is then partitioned for use.

It makes it simpler when a disk fails - only one raid device has to be dealt with as all partitions would be processed simultaneously.

The way it is, each raid device has to be handled separately, which increases the possibility of error.

davejones · 08-08-2016, 03:25 PM

Got this to reference https://www.youtube.com/watch?v=jZp2IP27pcQ

Thanks for your help, I'll just wait for the drive to be replaced.