LinuxQuestions.org
Old 09-28-2014, 05:49 AM   #1
dragonfly-uk
Member
 
Registered: Feb 2013
Posts: 36

Rep: Reputation: Disabled
Rebuilding a Failed Raid10


Okay, I have an OpenMediaVault (Debian-based) install that's been running without issue for over a year. It has a decent processor, 32 GB RAM, 4 x 4 TB hard drives, 1 x SSD and a separate boot hard drive.

Recently, due to a broken fan, one of the hard disks shut down and dropped out of the RAID (RAID 10, in case it makes a difference). I've replaced the fan and got everything up and running again. However, when I add the hard drive back into the RAID, it starts the recovery, then at 21% it stops trying to add the drive and just marks it as removed. "mdadm -D /dev/md0" gives the following:

Code:
    root@fileserver:~# mdadm -D /dev/md0
    /dev/md0:
    Version : 1.2
    Creation Time : Fri Jun 14 20:06:24 2013
    Raid Level : raid10
    Array Size : 7814034432 (7452.04 GiB 8001.57 GB)
    Used Dev Size : 3907017216 (3726.02 GiB 4000.79 GB)
    Raid Devices : 4
    Total Devices : 3
    Persistence : Superblock is persistent
    Update Time : Tue Sep 16 10:56:15 2014
    State : clean, degraded
    Active Devices : 3
    Working Devices : 3
    Failed Devices : 0
    Spare Devices : 0
    Layout : near=2
    Chunk Size : 512K
    Name : fileserver:0 (local to host fileserver)
    UUID : 7e556cd4:f56c995e:68f72813:eeb2a61c
    Events : 5024563
    Number   Major   Minor   RaidDevice   State
       0       8       0         0        active sync   /dev/sda
       1       8      32         1        active sync   /dev/sdc
       2       0       0         2        removed
       4       8      48         3        active sync   /dev/sdd
Note the device is marked as removed, and not failed or spare.

Thinking the disk could have failed, I ran badblocks, which gave it a clean bill of health. I then ran fdisk to remove any partition information, so it should effectively be a clean disk, and tried again. I get exactly the same results.

Anybody got any ideas on repairing the RAID to full strength?
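
Editor's sketch (not from the original poster): the device table in the `mdadm -D` output above can be read programmatically to see which slot is missing and which slot holds the other copy of its data. The parsing regex and the helper names `parse_device_table`, `missing_slots` and `mirror_partner` are illustrative. The pairing rule assumes the usual md `near=2` layout with an even number of devices, where slots (0,1) and (2,3) mirror each other.

```python
import re

# Device table rows copied from the `mdadm -D /dev/md0` output in this post.
MDADM_TABLE = """\
0 8 0 0 active sync /dev/sda
1 8 32 1 active sync /dev/sdc
2 0 0 2 removed
4 8 48 3 active sync /dev/sdd
"""

# Number, Major, Minor, RaidDevice, then the state (and device path, if any).
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*?)\s*$")


def parse_device_table(text):
    """Return one dict per table row: number, slot, state, device."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        number, _major, _minor, slot, rest = m.groups()
        # Everything before the device path is the state text.
        state, _, device = rest.partition("/dev/")
        rows.append({
            "number": int(number),
            "slot": int(slot),
            "state": state.strip(),
            "device": "/dev/" + device if device else None,
        })
    return rows


def missing_slots(rows):
    """Slots that are not in the 'active sync' state."""
    return [r["slot"] for r in rows if r["state"] != "active sync"]


def mirror_partner(slot, copies=2):
    """Other slot(s) in the same mirror group.

    Assumption: near=N layout with the device count divisible by N,
    so slots are grouped as (0,1), (2,3), ... for N=2.
    """
    base = slot - slot % copies
    return [s for s in range(base, base + copies) if s != slot]


rows = parse_device_table(MDADM_TABLE)
by_slot = {r["slot"]: r for r in rows}
for slot in missing_slots(rows):
    partners = [by_slot[p]["device"] for p in mirror_partner(slot)]
    print(f"slot {slot} is missing; its data must be rebuilt from {partners}")
```

Under that assumption, the missing slot 2 can only be rebuilt from slot 3, so any read problem on that one disk would be enough to stop the recovery.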
 
Old 09-28-2014, 08:56 PM   #2
GaWdLy
Member
 
Registered: Feb 2013
Location: San Jose, CA
Distribution: RHEL/CentOS/Fedora
Posts: 457

Rep: Reputation: Disabled
mdadm sucks.

What is in your /var/log/messages when the RAID sync fails?

I've seen something similar when the SOURCE disk in a CCISS/mdadm config (RAID1) had a bad block. When mdadm kept hitting the bad block on the source, the sync failed.

The recovery process was a huge pain in the ass.
 
Old 09-29-2014, 03:36 AM   #3
dragonfly-uk
Member
 
Registered: Feb 2013
Posts: 36

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by GaWdLy View Post
mdadm sucks.

What is in your /var/log/messages when the RAID sync fails?

I've seen something similar when the SOURCE disk in a CCISS/mdadm config (RAID1) had a bad block. When mdadm kept hitting the bad block on the source, the sync failed.

The recovery process was a huge pain in the ass.
I'll try re-running the sync later today, and post any relevant messages.
 
Old 09-30-2014, 07:00 AM   #4
dragonfly-uk
Member
 
Registered: Feb 2013
Posts: 36

Original Poster
Rep: Reputation: Disabled
Right, I've looked at the logs in more detail, and it looks like there is also a problem on a different disk.

Code:
Sep 30 11:37:12 fileserver kernel: [1118701.495772] ata4.00: configured for UDMA/133
Sep 30 11:37:12 fileserver kernel: [1118701.495787] sd 3:0:0:0: [sdd] Unhandled sense code
Sep 30 11:37:12 fileserver kernel: [1118701.495791] sd 3:0:0:0: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 11:37:12 fileserver kernel: [1118701.495797] sd 3:0:0:0: [sdd]  Sense Key : Medium Error [current] [descriptor]
Sep 30 11:37:12 fileserver kernel: [1118701.495804] Descriptor sense data with sense descriptors (in hex):
Sep 30 11:37:12 fileserver kernel: [1118701.495807]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Sep 30 11:37:12 fileserver kernel: [1118701.495820]         64 f7 a7 48 
Sep 30 11:37:12 fileserver kernel: [1118701.495825] sd 3:0:0:0: [sdd]  Add. Sense: Unrecovered read error - auto reallocate failed
Sep 30 11:37:12 fileserver kernel: [1118701.495832] sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 64 f7 a7 48 00 00 08 00
Sep 30 11:37:12 fileserver kernel: [1118701.495872] ata4: EH complete
Sep 30 11:37:12 fileserver kernel: [1118701.495879] md/raid10:md0: recovery aborted due to read error
Sep 30 11:37:12 fileserver kernel: [1118701.691419] md: md0: recovery done.
I do have a spare disk, but no spare SATA connector on the board (although I could connect it via a USB enclosure if that helps at all).

So given that I now have a RAID 10 running on 3 out of 4 disks, and one of those has read errors, what are my options for recovery?
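
Editor's sketch (not from the original poster): the `CDB: Read(10)` line in the log above encodes the sector the kernel was trying to read when the error occurred. Decoding it and comparing it with the "Used Dev Size" from `mdadm -D` gives a rough idea of how far into the member disk the bad sector sits. This assumes 512-byte logical sectors and ignores the md data offset at the start of each member, so the percentage is approximate.

```python
# CDB bytes from the kernel log line:
#   "[sdd] CDB: Read(10): 28 00 64 f7 a7 48 00 00 08 00"
CDB_HEX = "28 00 64 f7 a7 48 00 00 08 00"

# "Used Dev Size" from `mdadm -D /dev/md0`; mdadm reports this in KiB.
USED_DEV_SIZE_KIB = 3907017216

# Assumption: the drive exposes 512-byte logical sectors.
SECTOR_BYTES = 512


def decode_read10(cdb_hex):
    """Return (starting LBA, number of blocks) from a SCSI READ(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


lba, blocks = decode_read10(CDB_HEX)
offset_kib = lba * SECTOR_BYTES // 1024
percent = 100 * offset_kib / USED_DEV_SIZE_KIB
print(f"LBA {lba}, {blocks} sectors, about {percent:.1f}% into the member device")
```

The failing read lands a little under 22% of the way into /dev/sdd, which lines up with the rebuild stopping at about 21%. That suggests the recovery is aborting at an unreadable sector on the source disk rather than because of a fault in the disk being re-added.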
 
Old 09-30-2014, 12:56 PM   #5
GaWdLy
Member
 
Registered: Feb 2013
Location: San Jose, CA
Distribution: RHEL/CentOS/Fedora
Posts: 457

Rep: Reputation: Disabled
/me not a storage guy!

It sounds like a similar issue, where one of the SOURCE disks is damaged and cannot be synced. This leaves the DEST disk with an incomplete copy of the data.

Here is what we constructed for the customer:

- Step 1: Construct a 1-legged mdadm
- Step 2: pvmove -n /phys/vol
- Step 3: add disks back to RAID

So in their case it was much less complicated: 2 disks, RAID1. It made it easy to make a copy of the data and put it on a new software RAID. pvmove will copy the physical extents over to the new disk, but be forewarned: if the damaged area on that disk is in a data area, you may never be able to get this to work.
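
Editor's sketch (not from the original poster): one possible reading of the three steps above, written as a dry run that only prints commands and executes nothing. It assumes the damaged array is an LVM physical volume, which is what the use of pvmove implies for the customer's RAID1 but is not confirmed for the RAID 10 in this thread. Every device, volume group and logical volume name below is a hypothetical placeholder, and the `pvcreate`/`vgextend` lines are an added assumption about how the new array becomes a pvmove target.

```python
# All names are hypothetical placeholders; none of them come from the thread.
OLD_PV = "/dev/md0"        # damaged array, assumed to be an LVM physical volume
NEW_MD = "/dev/md1"        # new RAID1 created with one leg missing
NEW_DISK = "/dev/sdX1"     # first disk for the new array
SECOND_DISK = "/dev/sdY1"  # disk added back once the data has moved
VG = "vg_data"             # volume group that contains OLD_PV
LV = "lv_data"             # logical volume to move


def plan():
    """Return the command lines for the three steps, without running them."""
    return [
        # Step 1: construct a one-legged RAID1 ("missing" stands in for leg 2).
        ["mdadm", "--create", NEW_MD, "--level=1", "--raid-devices=2",
         NEW_DISK, "missing"],
        # Assumed glue: make the new array a PV in the same volume group.
        ["pvcreate", NEW_MD],
        ["vgextend", VG, NEW_MD],
        # Step 2: move the logical volume's extents off the damaged PV.
        ["pvmove", "-n", LV, OLD_PV, NEW_MD],
        # Step 3: add the second disk so the new mirror can sync.
        ["mdadm", "--manage", NEW_MD, "--add", SECOND_DISK],
    ]


for cmd in plan():
    print(" ".join(cmd))
```

Printing the plan first makes it easy to review the device names before anything touches a disk. The pvmove step is the one that can fail if the unreadable sectors fall inside allocated extents.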
 
Old 09-30-2014, 03:01 PM   #6
deathsfriend99
Member
 
Registered: Nov 2007
Distribution: CentOS 6
Posts: 198

Rep: Reputation: 22
I had this happen with a JBOD case. Turned out the controller was just plain bad on the backplate of one particular drive bay. For me, the original drive probably never was bad. Not sure what sort of hardware you're using, but it's worth a check.
 
  



All times are GMT -5.
