[SOLVED] Salvaging a RAID 5 array with 2 failed drives
I'll try and be concise. Please don't fill the thread with "you should have better backups" messages - most of the important stuff is safe; it's the "I wasn't too bothered about this stuff until it broke" stuff I want to recover...
Anyway...
I have (had?) an mdraid RAID 5 array across four 3TB Seagate Barracuda drives.
The array kicked out a drive; its SMART data was scrambled and it throws read and I/O errors (we'll call this one drive 1).
After a reboot I could hear a drive clicking (fubar) and removed it - it wasn't the drive that had been kicked out... (we'll call this drive 4)
I now have this:
Drive 1 - Random read errors and I/O issues
Drives 2 & 3 - Array members, recently scrubbed, should be fine
Drive 4 - fubar
I am cloning what I can from drive 1 with dd and a small block size - the first read errors appeared about 1.5TB into the clone and were not particularly numerous.
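For anyone trying the same, the invocation was along these lines (a sketch, not the exact command I used):
Code:
# noerror: carry on past read errors; sync: zero-pad failed blocks so offsets stay aligned
dd if=/dev/sdX of=/dev/sdY bs=4096 conv=noerror,sync
(sdX being the failing drive, sdY the destination)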
So, before I ham-fistedly let mdadm have a go at rebuilding, are there any special command options I should use, or tricks to try..?
I do want to try and salvage the array, although I am also prepared to flatten it if worst comes to worst...
further thought: should I put in a blank spare straight away or try and initialise in a degraded state?
To my knowledge: if a RAID5 array with any number of drives loses one, no data is lost. It rebuilds the missing data on the fly from the parity information on the remaining drives.
If a RAID5 array loses two drives, ALL data is lost: the array has lost sync, and there is not adequate information to rebuild the lost data or even maintain the array.
RAID6 doubles up the parity data and allows you to run with two drives down, though with two down the performance drags significantly.
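To make the parity idea concrete, here is a toy example with single bytes (RAID5 does this XOR across every stripe; the byte values are made up):
Code:
# parity of three data bytes
printf 'parity: 0x%X\n' $(( 0xA5 ^ 0x3C ^ 0xF0 ))   # -> 0x69
# lose the drive holding 0x3C? XOR the survivors with the parity to get it back
printf 'rebuilt: 0x%X\n' $(( 0x69 ^ 0xA5 ^ 0xF0 ))  # -> 0x3C
Lose two bytes from the same stripe, though, and that one equation cannot recover both - hence the two-drive failure problem.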
I would do what you are doing, try to re-create the last drive that dropped (or was pulled for errors) and see if the array will rebuild. If it will, you are VERY lucky, but it is worth a shot.
If it will not rebuild (what I would expect) then the normal procedure would be to take the surviving good drives and some replacement drives and start clean. Build a new array, install new and restore data from backups, and drive on.
Frankly, trying to recover failed arrays is not something I have spent a lot of time on. If I have a good backup, it is hard to imagine that I would WANT to spend much time. Someone with more experience in that may chime in with better advice for your current case.
smartctl couldn't read past random points (I don't recall the exact error, but it was along the lines of "couldn't read past a point"), although it does report now.
Quote:
To my knowledge: if a RAID5 array with any number of drives loses one, no data is lost. It rebuilds the missing data on the fly from the parity information on the remaining drives. If a RAID5 array loses two drives, ALL data is lost.
Exactly this. RAID5 prevents downtime in case of one drive failing. With more drives failing you are pretty much doomed.
flangemonkey, you are on the right track attempting to image Drive 1, but I would suggest the GNU ddrescue command for the job, as it is designed to operate on failing drives.
I've worked on many similar RAID recoveries and can say a 99+ percent recovery is often achievable. Your level of success will depend on the quality of the image you make and the degree to which the failing drive is out of sync.
To assist further, post the output of mdadm --examine against the RAID member partitions so we can assess the state of your RAID5.
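Something like this, assuming your members are partition 1 on each disk:
Code:
mdadm --examine /dev/sd[abc]1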
Code:
# mdadm --assemble /dev/md1 --scan --force
mdadm: forcing event count in /dev/sda1(0) from 351397 upto 355232
mdadm: Marking array /dev/md1 as 'clean'
mdadm: /dev/md1 assembled from 3 drives - not enough to start the array.
[root@anu phill]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : inactive sda1[0](S) sdc1[2](S) sdb1[1](S)
8784093337 blocks super 1.2
unused devices: <none>
And the rest of the information I have found:
fdisk -l /dev/sd{a,b,c,d}
Code:
Disk /dev/sda: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 9C113EDE-CFE9-4F87-8D28-ED138A3DEA32
Device Start End Sectors Size Type
/dev/sda1 2048 5856326416 5856324369 2.7T Linux RAID
Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: B21FD950-DDF3-441A-AD3E-4E7C09253920
Device Start End Sectors Size Type
/dev/sdb1 2048 5856326416 5856324369 2.7T Linux RAID
Disk /dev/sdc: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 230BDB8E-FE1C-4392-BB06-1424686031DB
Device Start End Sectors Size Type
/dev/sdc1 2048 5856326416 5856324369 2.7T Linux RAID
Disk /dev/sdd: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
mdadm --examine /dev/sd{a,b,c}1
Code:
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 294e5cbd:82264ac6:4e11d1fd:9295556f
Name : archiso:md1
Creation Time : Wed Jul 3 00:10:15 2013
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5856062225 (2792.39 GiB 2998.30 GB)
Array Size : 8784092928 (8377.16 GiB 8994.91 GB)
Used Dev Size : 5856061952 (2792.39 GiB 2998.30 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=273 sectors
State : active
Device UUID : 6dade4db:030d150a:771eb5f1:2f50eec3
Update Time : Thu Aug 20 16:06:16 2015
Checksum : 3a255bb9 - correct
Events : 351397
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 0
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 294e5cbd:82264ac6:4e11d1fd:9295556f
Name : archiso:md1
Creation Time : Wed Jul 3 00:10:15 2013
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5856062225 (2792.39 GiB 2998.30 GB)
Array Size : 8784092928 (8377.16 GiB 8994.91 GB)
Used Dev Size : 5856061952 (2792.39 GiB 2998.30 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=273 sectors
State : clean
Device UUID : 0842eeef:9304704d:e9c84c39:f13e59d3
Update Time : Mon Aug 24 22:33:45 2015
Checksum : e89433cd - correct
Events : 355232
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 1
Array State : .AA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 294e5cbd:82264ac6:4e11d1fd:9295556f
Name : archiso:md1
Creation Time : Wed Jul 3 00:10:15 2013
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5856062225 (2792.39 GiB 2998.30 GB)
Array Size : 8784092928 (8377.16 GiB 8994.91 GB)
Used Dev Size : 5856061952 (2792.39 GiB 2998.30 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=273 sectors
State : clean
Device UUID : 858f9dfb:ae734ca7:216839e5:b53f24f8
Update Time : Mon Aug 24 22:33:45 2015
Checksum : 1ed78d6c - correct
Events : 355232
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 2
Array State : .AA. ('A' == active, '.' == missing, 'R' == replacing)
And finally;
Code:
# mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Raid Level : raid0
Total Devices : 3
Persistence : Superblock is persistent
State : inactive
Name : archiso:md1
UUID : 294e5cbd:82264ac6:4e11d1fd:9295556f
Events : 351397
Number Major Minor RaidDevice
- 8 1 - /dev/sda1
- 8 17 - /dev/sdb1
- 8 33 - /dev/sdc1
Does anyone know how I can assemble with a missing drive? (when I tried, I got this:)
Code:
mdadm --assemble /dev/md1 /dev/sd{a,b,c}1 missing
mdadm: cannot open device missing: No such file or directory
mdadm: missing has no superblock - assembly aborted
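(It turns out "missing" is only accepted by mdadm --create; to assemble degraded you list just the members you do have, and --force/--run tell mdadm to bring their event counts into agreement and start the array even though it's incomplete - roughly:)
Code:
mdadm --stop /dev/md1
mdadm --assemble --force --run /dev/md1 /dev/sd{a,b,c}1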
Brought up the array degraded and it's rebuilding now...
And as a percentage 90MB is nowt... There'll be a few things that won't work later though.
I invoked ddrescue (amateurishly) without the logfile option, and as the drive was putting out a lot of I/O errors I was happy to lose only 90MB - especially as I think another drive in the array sounds poorly... smartctl can't get power-on hours...
Disk 1: (Seagate) I/O errors; cloned with 90MB data loss
Disk 2: (Seagate) I/O errors, 1 URE (unrecoverable read error); cloned with 4096 bytes data loss
Disk 3: (Seagate) Should I worry about this one..?
Disk 4: (Seagate) Fails to spin up at all
I replaced drives 1 and 4 with like for like Seagate drives, and drive 2 with a Toshiba; I will be adding another Toshiba into the mix to make it RAID 6 at some point now...
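(For the curious, the usual raid5-to-raid6 conversion once the extra disk is in - a sketch, assuming the new disk is partitioned like the others; the backup-file path is just an example:)
Code:
# add the fifth disk as a spare, then reshape to raid6 across all five
mdadm --manage /dev/md1 --add /dev/sdX1
mdadm --grow /dev/md1 --level=6 --raid-devices=5 --backup-file=/root/md1-reshape.bak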
Now - to the present...
fsck says the filesystem is a mess - lots of shared inodes and several corrupt files... I suspect there will be some more data loss, I've already lost a few directories of files and suspect the fsck repair will be the point where I lose a few more...
For those who are in the same boat, here is how I did some of the stages:
Cloning drives:
Code:
# -f is needed to let ddrescue write to a block device; the logfile records what has been rescued
ddrescue -f /dev/sdX /dev/sdY logfile
(where sdX is the failing drive and sdY the new one)
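If the first pass leaves unread areas, that same logfile lets ddrescue resume and retry just the bad spots on a later run, e.g. with three retry passes:
Code:
ddrescue -f -r3 /dev/sdX /dev/sdY logfile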
Copying the partition table to the blank drive so its partition is the same size as the other members':
Code:
# sdM is the drive you want to copy from, sdL is the new (blank) drive
sfdisk -d /dev/sdM > partition.txt
sfdisk /dev/sdL < partition.txt
Assembling the array degraded and rebuilding onto the blank drive (where mdN is the ID of your array, {a,b,c,d} are the drives in your array, and assuming you are using partition 1): note, I assembled it first with a, b and c in a degraded state and then added sdd1 so it would rebuild onto the blank drive - see the sketch below.
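Spelled out, those two steps were roughly as follows (a reconstruction - adjust mdN and the device names to your setup):
Code:
# assemble degraded from the three surviving members
mdadm --assemble /dev/mdN /dev/sd{a,b,c}1 --force
# then add the freshly partitioned blank drive - the rebuild starts automatically
mdadm --manage /dev/mdN --add /dev/sdd1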
to check your files, mount the array:
Code:
mount /dev/mdN /mountpoint
and have a look around
to check your filesystem, it should not be mounted (very important!!!)
Code:
fsck /dev/mdN
To check without making changes, add -n after fsck; to have it automatically fix the errors it finds, add -y.
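For example:
Code:
fsck -n /dev/mdN   # look but don't touch - report problems only
fsck -y /dev/mdN   # answer yes to every repair prompt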
hope this helps someone - and don't forget to scrub your arrays - not that it helped me...