LinuxQuestions.org


Ta-mater 04-11-2011 01:42 PM

RAID-5 Recovery problems after drive errors
 
I'm a bit of a Linux newbie, but I did manage to set up the following RAID-5 system:

1x 500GB system drive on ATA IDE
4x 1TB SATA drives in software RAID
Linux = Fedora 13

So here's what happened. I had set up the system to send me an email every time the mdadm stat file changed, so it would email me when it periodically ran a self-test. While I was away I noticed that a self-test was going incredibly slowly (it usually took 8 hours; this one was on course to take 16 weeks!). A colleague decided to just reboot the system.
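(For reference, the alerting was nothing fancy; a cron script along these lines, diffing /proc/mdstat and mailing it when it changes, is roughly what I had, though the exact details may have differed. The address and state-file path are just placeholders.)
Code:

#!/bin/sh
# Mail /proc/mdstat whenever it changes (run from cron every few minutes).
STATE=/var/tmp/mdstat.last
if ! cmp -s /proc/mdstat "$STATE" 2>/dev/null; then
    mail -s "mdstat changed on $(hostname)" admin@example.com < /proc/mdstat
    cp /proc/mdstat "$STATE"
fi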

Afterwards, the system would not boot and, while all 5 drives were connected, would stop at an endlessly scrolling error message of:
Quote:

ata4.01: exception Emask 0x0 SErr 0x0 action 0x0
ata4.01: BMDMA stat 0x64
ata4.01: failed command: READ DMA
ata4.01: (a bunch of hex numbers)
ata4.01: (a bunch of hex numbers, again)
ata4.01: status {DRDY ERR}
ata4.01: error: {UNC}
I worked out that a single drive was causing that error. With just the system drive and the other 3 RAID drives connected, it would get past that point but then stop at a filesystem error, drop me to a recovery shell, and go no further. When I tried to run an fsck scan, it kept reporting a bad superblock on the failed drive, and none of the suggestions it gave worked.
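(The suggestions were the usual alternate-superblock ones, something like the following; the device name here is just an example:)
Code:

# List where the backup superblocks would be (-n means "don't actually create anything")
mke2fs -n /dev/sdd1

# Then point fsck at one of the listed backups
e2fsck -b 32768 /dev/sdd1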

When attempting to boot to a Live-CD version of Fedora (and Ubuntu) with all 4 RAID drives attached, the same error occurs.

With only the other 3 drives attached, it boots into Live-CD Linux just fine.

In Palimpsest, it shows the 3 drives as healthy and as parts of a RAID array.

However, when I try to start the array through Palimpsest, it says there are not enough disks to start the array....even though 3 of the 4 are present, which is exactly the kind of failure RAID 5 is supposed to survive. (The drives contain backups of important research data.)

So, what do I do now? Do I need to have a 4th blank working drive in there to recover the data? SHOULD it be able to start the array with just the 3 drives, or does it need a blank new drive to rebuild the array? How do I do that?

Thanks,

Ta-mater

Sjonnie48 04-12-2011 07:27 AM

The first thing you can do is edit /etc/fstab. Comment out the line that describes /dev/md0 by placing a # in position 1 of that line, to prevent the system from mounting it.
After that your system should start without difficulties. Once the system is up and running you can find out what is wrong with the drive and repair your array with mdadm.
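For example (assuming the array is /dev/md0 mounted on something like /data, and the members are /dev/sdb1 through /dev/sde1; adjust to your own setup):
Code:

# /etc/fstab -- comment out the array so boot doesn't try to mount it
#/dev/md0   /data   ext4   defaults   1 2

# Once the system is up, see what md and the member disks think of themselves
cat /proc/mdstat
mdadm --detail /dev/md0
mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1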

Ta-mater 04-26-2011 01:12 PM

Alright, so I did the above suggestion and was able to boot into Fedora.

Now, I have sent the defective drive back to WD and received the new one today.

I have put the new drive in and the system boots up (with the /dev/md127 line commented out in fstab).

So, where do I go from here? I've been searching around for how to replace a drive and rebuild parity in Linux software RAID 5, but all I've found is confusion and other people's stories that don't match my situation.

There should be a simple process to replacing a failed drive and rebuilding the array in this instance. Can anyone point me in the right direction?

Ta-mater 04-26-2011 01:42 PM

I found this page describing what I need to do: http://wumple.com/blog/2007/01/23/re...-raid-5-array/

Specifically, he lists the steps as:

Quote:

1. I shut down the machine and removed the bad drive, because my SATA controller and the libata driver for it do not support drive hot-swap.
2. I installed the new drive in the same bay and SATA controller port.
3. I restarted the machine. The RAID array came up in the degraded condition since it was missing an active drive.
4. I used "/sbin/fdisk /dev/sdb" to look at the partition table of another disk in the array to know what to create on the new drive. Alternatively, if the old drive is still functional enough its partition table could be used as an example.
5. I used "/sbin/fdisk /dev/sdc" to create the RAID partition on the new drive:

‘n’ to create the new partition. I created primary partition #1 and sized it to use the whole disk.
‘t’ to change the partition id. I chose type fd (that is in hexadecimal, aka 0xfd), "Linux raid autodetect".
‘a’ to toggle the bootable flag since the Fedora install set all RAID partitions bootable on my disks during the original install/upgrade process.
‘w’ to write the new partition table and exit fdisk.

6. "/sbin/mdadm /dev/md0 -a /dev/sdc1" to add the new RAID partition on the new disk to the array.

The kernel then rebuilt the disk automatically over the next few hours. The progress of the rebuild can be checked by "cat /proc/mdstat" or continually via "watch -n .1 cat /proc/mdstat". "dmesg" will also display a message at the start and the completion of the rebuild.
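If I'm reading that right, it boils down to something like this (using the blog's device names, /dev/sdb for an existing member and /dev/sdc for the new drive; mine will differ):
Code:

# Check how an existing member is partitioned
fdisk -l /dev/sdb

# Partition the new drive the same way (interactive: n, t -> fd, a, w)
fdisk /dev/sdc

# Add the new partition to the degraded array and watch the rebuild
mdadm /dev/md0 --add /dev/sdc1
watch -n 1 cat /proc/mdstat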
My problem is that step 3 doesn't happen that way for me. When I start up the machine with the new drive installed (and /dev/md127 NOT commented out in fstab), the system still stops at an error during boot:

Checking Filesystems...

/dev/md127: The superblock could not be read or does not describe a correct ext2 filesystem. If the device is valid and it really contains an ext2 filesystem, then the superblock is corrupt, and you might try running e2fsck with an alternate superblock

*** An error occurred during the file system check.
*** Dropping you to a shell; the system will reboot
*** when you leave the shell.

And then the filesystem is Read-only.

Thing is, the RAID filesystem is ext4, so I'm not sure why it is saying ext2.

Do I need to stop this filesystem check from happening automatically??
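From what I can tell, the automatic check is driven by the last field of the fstab entry; would setting it to 0, roughly like this, be the right way to skip it until the array is healthy again? (The mount point and options here are just guesses at my entry.)
Code:

# /etc/fstab -- sixth field 0 = skip the boot-time fsck for this filesystem
/dev/md127   /data   ext4   defaults   0 0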

Ta-mater 04-26-2011 01:44 PM

Further, when I boot up Fedora with /dev/md127 commented out, the system does not "start" the array with the existing 3 drives....it says there are not enough components to start the array....which doesn't sound like what is supposed to happen. Shouldn't the system start it as degraded?

I can't add the new disk to the array as stated in the guide above because I can't "start" the array in this state.....bah
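Would forcing the assembly by hand be the right move here? Something like the following, assuming the three surviving partitions are /dev/sdb1, /dev/sdc1 and /dev/sdd1 (I haven't run it yet):
Code:

# Stop whatever half-assembled array the system created
mdadm --stop /dev/md127

# Assemble from the three surviving members and run it even though it's degraded
mdadm --assemble --run --force /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1

# Check the result, then add the new drive's partition back in
cat /proc/mdstat
mdadm --detail /dev/md127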

