LinuxQuestions.org - RAID degraded, partition missing from md0

A resync/check could have a slight effect on performance, but nothing anyone will notice unless the system is under significant load.
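For anyone who wants to watch a resync or check while it runs, the kernel reports progress in /proc/mdstat. The following is a minimal sketch, assuming the usual mdstat layout (array name on one line, progress bar a couple of lines below it). `resync_progress` is a hypothetical helper name, not part of mdadm.

```shell
# Hypothetical helper: print "<array> <action> = <percent>" for every array
# that is currently resyncing, recovering, or being checked.
# Reads /proc/mdstat by default; a file can be passed instead (handy for testing).
resync_progress() {
    awk '
        /^md[0-9]+ *:/ { name = $1 }                 # remember the current array
        match($0, /(resync|recovery|check) *= *[0-9.]+%/) {
            print name, substr($0, RSTART, RLENGTH)  # e.g. "md0 resync = 12.6%"
        }
    ' "${1:-/proc/mdstat}"
}
```

If a resync does get in the way on a loaded system, the md speed limits in /proc/sys/dev/raid/ (speed_limit_min and speed_limit_max) can be lowered while it runs.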

I probably wouldn't bother with the smartctl reports, as smartd monitors the exact same parameters. Just make sure to run regular tests, as smartd sends notifications only when a parameter actually changes.

Quote:

Originally Posted by Ser Olmy (Post 5065333)

Ah okay - was wondering how a resync would react to files changing on the drive as it's trying to sync them (like a user saving a couple new documents on his homedir while a resync is active on that drive).
Thanks for the advice re smartd - I'll study it a bit and set it up.
Will let you know how the resync went and I'll probably shout for some more advice when the time comes to replace the drives, if you don't mind :)

Quote:

Originally Posted by Ser Olmy (Post 5065333)

Wait, sorry - do you mean I have to run this regularly:

Code:

smartctl -t long /dev/sda

In addition to actually checking what smartctl (or smartd) says? How "outdated" would the smartctl -a output be if I don't run a long test? Or does smartctl -a always show the latest information, whereas smartd only shows updated info after a test has been run?

How regularly would you recommend? Not sure how long it would take on a 3TB drive, want to see if I can work it into the nightly schedule.

Quote:

Originally Posted by reano (Post 5065368)

Wait, sorry - do you mean I have to run this regularly:

Code:

smartctl -t long /dev/sda

No, the "test" I was referring to is a test of smartd's ability to send mail. Since it only sends e-mails when there's actually something to report, a "silent failure" of the mail setup may go undetected.

I have a separate smartd configuration file with the "-M test" parameter (/etc/smartd-test.conf), and a cron job that runs smartd -q onecheck -c /etc/smartd-test.conf >/dev/null once a month.
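A setup along those lines could look roughly like the sketch below. The post does not show the actual file, so the `DEVICESCAN` directive, the `root` mail address, the cron schedule and the /usr/sbin/smartd path are all assumptions, and `write_smartd_test_conf` is a hypothetical helper name.

```shell
# Hypothetical helper: write a smartd config whose only job is to trigger a
# test mail ("-M test" needs a mail address given with "-m").
write_smartd_test_conf() {
    conf=$1      # e.g. /etc/smartd-test.conf
    mailto=$2    # e.g. root (assumed address)
    printf 'DEVICESCAN -m %s -M test\n' "$mailto" > "$conf"
}

# Example monthly crontab entry (assumed schedule: 03:00 on the 1st):
# 0 3 1 * * /usr/sbin/smartd -q onecheck -c /etc/smartd-test.conf >/dev/null
```

If the monthly mail stops arriving, that points at the notification path rather than at the disks.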

Hi again,

Ok, the replacement drives have arrived from our suppliers. I've read up on http://www.howtoforge.com/replacing_..._a_raid1_array

The way I *think* that I have to proceed now is:

1. Shut down the system
2. Insert the new hard drive (it will probably be sdf)
3. Copy the partition tables from sda to sdf, with:

Code:

sfdisk -d /dev/sda | sfdisk /dev/sdf

4. Add the sdf partitions to the md0, md1 and md2 arrays:

Code:

mdadm --manage /dev/md0 --add /dev/sdf1

mdadm --manage /dev/md1 --add /dev/sdf2

mdadm --manage /dev/md2 --add /dev/sdf3

Now a few questions:

a) How do I know the partition table copy will copy the structure in the right order? What I mean is, will the size of sda1 be equal to the size of sdf1, or might it mix it up and match sdf2 instead, and sda2 matches sdf1, etc? If you know what I mean?

b) sfdisk won't work, as these are GPT partition tables. What do I use in its stead?

c) Can I add all 3 partitions to the 3 arrays (as demonstrated in point 4 above) at the same time, or do I have to add them one by one and let each one sync first?

d) Anything I'm missing? Am I missing any steps in my list above? Am I correct in my assumptions in points 1 - 4?

e) Do I not have to format the sdf partitions after copying the partition structure from sda to sdf?

PS: I'm only replacing the faulty sdb (which will now probably be sdf) tonight. If it goes well I'll replace sdc later in the week.

If you plug the new drives into the same SATA ports as the old ones, they will probably be enumerated in the same order as the old disks. And then there's a chance udev will mess it all up and rename them to /dev/sdg or somesuch, but you'll see soon enough.

Don't copy the partition table from another disk! GPT stands for "GUID Partition Table" for a reason; there's a GUID in there, and under no circumstances do you want disks with duplicate GUIDs on your system.

parted handles GPT just fine. Look at the partition table of the mirror disk, and just create partitions of the same size.
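One way to do that by hand without typos is to read the mirror disk's layout in sectors and generate the matching mkpart commands. The sketch below assumes parted's machine-readable output (`parted -m`), where each partition row is `number:start:end:size:fs:name:flags;`. `mkpart_commands` is a hypothetical helper, /dev/sdf is an assumed device name, and on GPT the word `primary` simply ends up as the partition name.

```shell
# Hypothetical helper: turn the output of
#   parted -m /dev/sda unit s print
# (read from stdin) into mkpart commands for a target disk.
# It only PRINTS the commands; nothing is written to any disk.
mkpart_commands() {
    target=$1    # e.g. /dev/sdf (assumed name of the new disk)
    awk -F: -v dev="$target" '
        $1 ~ /^[0-9]+$/ {    # partition rows start with the partition number
            printf "parted -s %s mkpart primary %s %s\n", dev, $2, $3
        }
    '
}

# Usage (as root, after creating an empty GPT label on the new disk with
# "parted -s /dev/sdf mklabel gpt"):
#   parted -m /dev/sda unit s print | mkpart_commands /dev/sdf
```

Reviewing the printed commands before running them keeps the new disk's own GUIDs intact, since the label is created fresh rather than copied.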

Quote:

Originally Posted by Ser Olmy (Post 5066967)

What about:

Code:

sgdisk -R=/dev/sdb /dev/sda

sgdisk -G /dev/sdb

That supposedly copies the GPT partition table from sda to sdb, and the second line randomizes the GUIDs on sdb.

Any feedback on the other questions I got? :)
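Whichever route is taken, it is easy to confirm afterwards that the two disks did not end up with the same disk GUID. A minimal sketch, assuming the `Disk identifier (GUID):` line that `sgdisk -p` prints; `disk_guid` and `guids_differ` are hypothetical helper names. It checks only the disk GUID, not the per-partition GUIDs (those can be inspected with `sgdisk -i <partition number>`).

```shell
# Hypothetical helper: extract the disk GUID from "sgdisk -p" output on stdin.
disk_guid() {
    sed -n 's/^Disk identifier (GUID): //p'
}

# Hypothetical helper: succeed only if both GUIDs are non-empty and differ.
guids_differ() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Usage (as root):
#   a=$(sgdisk -p /dev/sda | disk_guid)
#   b=$(sgdisk -p /dev/sdb | disk_guid)
#   guids_differ "$a" "$b" && echo "disk GUIDs are unique"
```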

Sorry about the late reply. I suppose you've replaced the drives by now.

If you followed the procedure you outlined above, you should be up and running with fully functional RAID arrays.
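A quick way to confirm that is to look for arrays that still show a missing member in /proc/mdstat. This is a minimal sketch, assuming the usual member map notation (`[UU]` for a healthy two-disk mirror, `[U_]` when one member is missing); `degraded_arrays` is a hypothetical helper name.

```shell
# Hypothetical helper: list arrays whose member map (e.g. [U_]) shows a
# missing device. Reads /proc/mdstat by default; a file can be passed instead.
degraded_arrays() {
    awk '
        /^md[0-9]+ *:/        { name = $1 }    # remember the current array
        /\[[U_]*_[U_]*\]/     { print name }   # "_" marks a missing member
    ' "${1:-/proc/mdstat}"
}
```

No output means every array lists all of its members as up.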

Just an update on the situation.
The first problem has been resolved: the drives have been replaced, etc.
However, last night the md3 array failed. The problem is that md3 contains the /home partition (and nothing else). Luckily we do have a full daily home directory backup on a NAS drive; the challenge, however, is to get the array back up.

md3 consists of sdc1 and sdd1 (only one partition per drive, so the entire sdc and sdd drives were involved). sdc went down completely, with sdd still up but severely damaged. So my first step was to replace sdc with a new drive and attempt a resync/recovery by adding the new drive into md3. However, the resync kept on failing because of the read errors on sdd. Hours later, I was running out of ideas, and decided to get rid of sdd as well and start the md3 array fresh with no data on it, and then copy the info from the NAS drive.

The only way I managed to do this was to comment out the /home mount on md3 in fstab and reboot into recovery mode. This then enabled me to --stop the md3 array. (At this point, both the sdd and sdc drives were physically disconnected, and the new drive was in the machine. Ubuntu saw this new drive as sdc.) I then recreated md3 with RAID level 1, but with the number of raid devices also set to 1, and specified the device as sdc1 (I copied the partition structure from sdd before I removed it).

This worked well - md3 was up with only sdc (the new drive). I checked blkid on md3, and the UUID matched the old md3 UUID. Good so far. I then edited fstab to again mount /home on that UUID and rebooted into recovery mode again. Ubuntu detected tons of filesystem errors (not sure why?) and asked if it should repair. I said yes. After that, it continued the startup process and was up and running, with /home mounted. Except md3 had now changed into md127. Apart from that, everything seems fine.

The weird thing is that, whenever I plug sdd (the one faulty drive) back in (SATA port #4), it brings it up on a reboot as md3 with /home mounted on that, and brings up the NEW sdc drive as md127 but with nothing mounted on it (I don't think so anyway). Why on earth would this happen?

So my questions:

1) I probably did not follow the 100% correct procedure (but I didn't know what else to do, it was more an act of desperation) to get the new drive online with /home mounted on it as a new array (md127). But will it work? And I assume I can then "grow" that array once a second hard drive arrives from the suppliers to get 2 devices on that array again?

2) Why does it become md127 instead of md3 after a reboot?

3) Why does md3 come back with the old drive as a device on it (and /home mounted on THAT) whenever I plug the old faulty drive in again and reboot? Is it because of the SATA port, or because of the drive UUID? I'm now too "scared" to plug the second new drive (once it arrives from the suppliers) into that SATA port, for fear that it would try to bring that up as md3 and bring down my home directories (that are currently on md127) again. It's almost as if md3 is still in a configuration saved somewhere with the old drives in it or something. I don't know...
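The thread as captured here does not answer these questions. One commonly reported cause of an array showing up as md127 is that it is not listed in mdadm.conf (or in the copy stored in the initramfs), so it gets assembled under a fallback name; whether that applies to this system is an assumption. A minimal sketch for checking it follows. `array_in_conf` is a hypothetical helper, and /etc/mdadm/mdadm.conf is the assumed Debian/Ubuntu location.

```shell
# Hypothetical helper: succeed if an ARRAY line with the given array UUID
# exists in the given mdadm config file.
array_in_conf() {
    uuid=$1                               # array UUID as shown by mdadm --detail
    conf=${2:-/etc/mdadm/mdadm.conf}      # Debian/Ubuntu default location (assumed)
    grep -i '^ARRAY' "$conf" | grep -qiF "UUID=$uuid"
}

# Usage (as root):
#   mdadm --detail --scan            # prints ARRAY lines for the running arrays
#   array_in_conf <uuid> || echo "array is not listed in mdadm.conf"
# After adding the missing ARRAY line, the initramfs copy is usually refreshed:
#   update-initramfs -u
```

Note that the array UUID is not the same thing as the filesystem UUID reported by blkid for the md device. Running `mdadm --examine` on a member partition shows which array its md superblock claims to belong to, which may help explain why the old drive is still assembled as md3.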