Old 08-17-2014, 02:19 AM   #1
duffrecords
LQ Newbie
 
Registered: Nov 2009
Location: Los Angeles, CA
Posts: 29

Rep: Reputation: 0
software RAID failed -- not enough operational mirrors


I came home the other day and walked in just in time to see a kernel panic on the screen and a message that the system would reboot in 300 seconds, which it promptly did. After the reboot, the system was unable to start my 6-drive RAID 10 array because there were not enough operational mirrors. I should mention the array contains everything--root, boot, swap, etc. Suffice it to say the kernel is not the only thing panicking now.

My first guess was hardware failure (it's been very hot in here lately), but the BIOS detected all six disks. Moreover, I booted from a live DVD and GParted detected not only all six disks but also the partitions on each one. These are 3 TB disks and had to be partitioned with GPT; I can see the 1 MB BIOS boot partition as well as the massive ~3 TB partition on each disk.

If I try to boot in recovery mode, the RAID fails to start because three of the six disks are removed:
Code:
(initramfs) mdadm --detail /dev/md0
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
/dev/md0:
        Version : 1.2
  Creation Time : Tue Apr 23 08:34:34 2013
     Raid Level : raid10
  Used Dev Size : -1
   Raid Devices : 6
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Fri Aug 15 05:03:53 2014
          State : active, FAILED, Not Started
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : dtla:0
           UUID : dd800f45:01b7629e:fbae3456:9c7dbde1
         Events : 14232644

    Number   Major   Minor   RaidDevice State
       6       8       2         0      active sync   /dev/sda2
       1       8      18         1      active sync   /dev/sdb2
       2       0       0         2      removed
       3       0       0         3      removed
       4       8      66         4      active sync   /dev/sde2
       5       0       0         5      removed
When I examine each disk individually, there appears to be some discrepancy about the state of the array:
Code:
(initramfs) for i in a b c d e f; do mdadm -E /dev/sd${i}2 | grep State; done
          State : clean
   Array State : AAA.A. ('A' == active, '.' == missing)
          State : clean
   Array State : AAA.A. ('A' == active, '.' == missing)
          State : clean
   Array State : AAA.A. ('A' == active, '.' == missing)
          State : active
   Array State : AAAAA. ('A' == active, '.' == missing)
          State : clean
   Array State : AAA.A. ('A' == active, '.' == missing)
          State : active
   Array State : AAAAAA ('A' == active, '.' == missing)
Perhaps I'm reading that incorrectly. Does this necessarily mean the RAID is unrecoverable? I've seen some forum posts recommending running something like
Code:
mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2 /dev/sdf2
but I'm very cautious about running commands unless I know with 100% certainty that they are the correct course of action, especially when data loss is a possible outcome. Should I try forcing the array to assemble, or try adding the missing disks?
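Before forcing anything, it may be worth comparing the event counters recorded on all six members; members whose Events value is at (or very close to) the highest are generally the safe ones to force-assemble together. A read-only sketch, assuming the members are /dev/sd[a-f]2 as shown above:
Code:
# Read-only: print the event counter, update time and array state stored on each member
for d in /dev/sd[abcdef]2; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Events|Update Time|Array State'
done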
 
Old 09-06-2014, 04:43 AM   #2
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
I'd say check if you have good backups, then 'mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sde2'. If that works, then try a 'mdadm --add --force /dev/md0 <device>' for each of /dev/sdc2, /dev/sdd2 and /dev/sdf2.
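Also, if there is no current backup, imaging the members (or at least the most suspect ones) to spare storage first leaves a way back if the forced assembly makes things worse. A rough sketch with GNU ddrescue, assuming a large enough scratch disk mounted at /mnt/scratch (hypothetical path):
Code:
# Read the member into an image file; the map file lets ddrescue resume and retry bad areas later
ddrescue -n /dev/sdc2 /mnt/scratch/sdc2.img /mnt/scratch/sdc2.map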
 
Old 09-22-2014, 12:19 PM   #3
duffrecords
LQ Newbie
 
Registered: Nov 2009
Location: Los Angeles, CA
Posts: 29

Original Poster
Rep: Reputation: 0
I forced the three good disks and the one that was behind by two events to assemble:
Code:
mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sde2
Then I added the other two disks and let it sync overnight:
Code:
mdadm --add --force /dev/md0 /dev/sdd2
mdadm --add --force /dev/md0 /dev/sdf2

I rebooted the system in recovery mode and the root filesystem is back! However, / is read-only and my /srv partition, which is the largest and has most of my data, can't mount. When I try to examine the array, it says "no md superblock detected on /dev/md0." On top of the software RAID, I have four logical volumes. Here is the full LVM configuration:

http://pastebin.com/gzdZq5DL

How do I recover the superblock?
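If the "no md superblock detected on /dev/md0" message came from mdadm --examine, that part at least is expected: --examine reads the superblock stored on the member devices (/dev/sda2 and so on), while --detail queries the assembled array itself. A quick read-only sketch of the distinction, plus scanning for the LVM volumes stacked on top (volume group vg_raid10, as in the pastebin):
Code:
cat /proc/mdstat            # is the array assembled and running at all?
mdadm --detail /dev/md0     # state of the assembled array
mdadm --examine /dev/sda2   # superblock on an individual member
pvscan                      # /dev/md0 should show up as a physical volume
vgchange -ay vg_raid10      # activate the logical volumes on top of it
lvscan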
 
Old 09-23-2014, 12:09 PM   #4
duffrecords
LQ Newbie
 
Registered: Nov 2009
Location: Los Angeles, CA
Posts: 29

Original Poster
Rep: Reputation: 0
I booted from a live CD so I could use version 3.1.10 of xfs_repair (versions < 3.1.8 reportedly have a bug when using ag_stride), then ran the following command:
Code:
xfs_repair -P -o bhash=16384 -o ihash=16384 -o ag_stride=16 /dev/mapper/vg_raid10-srv
It stopped after a few seconds, saying:
Code:
xfs_repair: read failed: Input/output error
XFS: failed to find log head
zero_log: cannot find log head/tail (xlog_find_tail=5), zeroing it anyway
xfs_repair: libxfs_device_zero write failed: Input/output error
However, I was able to mount the volume after that and my data was still there!
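With the volume mountable again, it seems sensible to copy the data somewhere safe before touching the array any further. A minimal sketch, assuming a destination with enough space mounted at /mnt/backup (hypothetical path):
Code:
# Mount the repaired volume read-only and copy everything off
mkdir -p /mnt/srv
mount -o ro /dev/mapper/vg_raid10-srv /mnt/srv
rsync -aHAX --progress /mnt/srv/ /mnt/backup/srv/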
 
Old 10-04-2014, 11:16 PM   #5
duffrecords
LQ Newbie
 
Registered: Nov 2009
Location: Los Angeles, CA
Posts: 29

Original Poster
Rep: Reputation: 0
I received two replacement drives and added them to the array. When it finished rebuilding, I checked the status and found that the last disk (sdf) had been treated as a spare:
Code:
/dev/md127:
        Version : 1.2
  Creation Time : Tue Apr 23 04:34:34 2013
     Raid Level : raid10
     Array Size : 8790397440 (8383.18 GiB 9001.37 GB)
  Used Dev Size : 2930132480 (2794.39 GiB 3000.46 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Sat Oct  4 14:58:29 2014
          State : clean, degraded
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 512K

           Name : dtla:0
           UUID : dd800f45:01b7629e:fbae3456:9c7dbde1
         Events : 15050926

    Number   Major   Minor   RaidDevice State
       6       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2
       7       8       50        3      active sync   /dev/sdd2
       4       8       66        4      active sync   /dev/sde2
       5       0        0        5      removed

       8       8       82        -      spare   /dev/sdf2
Note the RAID device name is now md127 because I'm running from a live CD. I've been reading posts by people who have had a similar problem, and I tried two of the suggested solutions:

1. Stop the array and recreate it, using the --assume-clean option.
2. Grow the array one disk larger, causing mdadm to utilize the spare, then shrink the array back down to the correct number of devices.

Neither of these has worked for me, because:

1. I cannot stop the array, even while booted into a live CD environment or while in recovery mode (see the sketch after these two points).
Code:
mdadm: Cannot get exclusive access to /dev/md127:Perhaps a running process, mounted filesystem or active volume group?
2. Apparently I cannot grow the array by one disk because it is RAID 10. This makes sense, as RAID 10 requires an even number of disks.
Code:
mdadm: RAID10 can only be changed to RAID0
I definitely don't want to convert it to RAID 0.
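Regarding the first point, the "cannot get exclusive access" error usually just means something is still sitting on top of the array (the volume group, swap, or an automounted filesystem). A sketch of tracking that down and tearing the stack down from the top, using the md127 and vg_raid10 names from above:
Code:
# What is still holding the md device open?
cat /proc/mdstat
lsblk /dev/md127     # shows the LVs and mount points stacked on the array
swapon -s            # is swap active on one of the LVs?
dmsetup ls           # device-mapper targets created by LVM

# Tear the stack down from the top, then stop the array
swapoff -a
vgchange -an vg_raid10
mdadm --stop /dev/md127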

I read that one condition that can cause a new disk to become a spare instead of an active member is that the disk it needs to copy the data from has errors on it. I ran smartctl on sde and found this:
Code:
[root@localhost ~]# smartctl -t short /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.6.10-4.fc18.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sun Oct  5 00:09:42 2014

Use smartctl -X to abort test.
[root@localhost ~]# smartctl -l selftest /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.6.10-4.fc18.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     11822         1187144704
# 2  Short offline       Completed: read failure       90%     11814         1187144704
[root@localhost ~]# smartctl -H /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.6.10-4.fc18.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   069   033   045    Old_age   Always   In_the_past 31 (72 84 37 28 0)
It appears sde is on its way out as well. That's OK because it's still under warranty, but obviously I need to mirror sde to sdf before it fails, or there won't be enough operational mirrors again. Should I try to repair sde first, or try to mirror it somehow? How do I do either of those things? I should also mention there are LVM volumes on top of the RAID 10.
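One thing worth checking before deciding is whether sde is actually remapping sectors or just accumulating pending ones; a couple of raw SMART attributes tell that story (attribute names vary slightly between vendors). A read-only sketch:
Code:
# Raw counters for remapped and pending-remap sectors on sde
smartctl -A /dev/sde | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
A non-zero Current_Pending_Sector count would line up with the read failure the self-test logged at LBA 1187144704.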

Last edited by duffrecords; 10-04-2014 at 11:18 PM.
 
Old 10-14-2014, 08:47 PM   #6
duffrecords
LQ Newbie
 
Registered: Nov 2009
Location: Los Angeles, CA
Posts: 29

Original Poster
Rep: Reputation: 0
I was able to stop the array by deactivating the volume group that was on top of it.
Code:
[root@localhost ~]# lvchange -a n /dev/vg_raid10/
[root@localhost ~]# swapoff /dev/dm-3
[root@localhost ~]# lvchange -a n /dev/vg_raid10/swap
[root@localhost ~]# vgchange -a n vg_raid10
[root@localhost ~]# mdadm --stop /dev/md127
Then I force-assembled it and force-added the extra disk:
Code:
[root@localhost ~]# mdadm --force --assemble /dev/md127 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2
[root@localhost ~]# mdadm --force --assemble /dev/md127 /dev/sdf2
Unfortunately, it tries to add the extra disk as a spare again.
Code:
[root@localhost ~]# mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Tue Apr 23 04:34:34 2013
     Raid Level : raid10
     Array Size : 8790397440 (8383.18 GiB 9001.37 GB)
  Used Dev Size : 2930132480 (2794.39 GiB 3000.46 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Tue Oct 14 21:27:43 2014
          State : clean, degraded, recovering
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : dtla:0
           UUID : dd800f45:01b7629e:fbae3456:9c7dbde1
         Events : 15051863

    Number   Major   Minor   RaidDevice State
       6       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2
       7       8       50        3      active sync   /dev/sdd2
       4       8       66        4      active sync   /dev/sde2
       8       8       82        5      spare rebuilding   /dev/sdf2
It's because there are read errors on sde. I can't remove sde because it's the last mirror in its set (I'm assuming sda and sdb are mirrors, sdc and sdd are mirrors, and sde and sdf are mirrors, with the data striped across the three pairs). Is there a way to mark those sectors as bad and then assemble the array, or is there a way to force sdf to be an active member despite the read errors?
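If the mdadm on the live CD is 3.3 or newer, it can at least show whether md has already logged bad blocks for any member, since 1.2 metadata can carry a per-device bad-block list (arrays created with older mdadm, like this one, may simply report that no list is configured). A read-only sketch:
Code:
# Requires mdadm >= 3.3; lists any bad blocks recorded in each member's metadata
for d in /dev/sd[abcdef]2; do
    echo "== $d =="
    mdadm --examine-badblocks "$d"
done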
 
Old 10-15-2014, 03:07 PM   #7
duffrecords
LQ Newbie
 
Registered: Nov 2009
Location: Los Angeles, CA
Posts: 29

Original Poster
Rep: Reputation: 0
I read that the following command would cause md to discover the bad blocks and retire them:
Code:
echo 'check' > /sys/block/md0/md/sync_action
However, the check finished immediately and nothing was changed. I stopped the array and tried running this command instead:
Code:
badblocks -b 32768 -c 2048 -n -s /dev/sde2
Big mistake. After taking almost an entire day to run (and despite supposedly being non-destructive), it removed all partitions on /dev/sde. I tried copying the partitions from another disk but, now that the number of blocks has been reduced, the partitions won't fit on sde. testdisk doesn't even list sde anymore, let alone search for lost partitions on it. Is it safe to say this array is officially destroyed beyond any hope of repair?
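Possibly not: GPT keeps a backup copy of the partition table at the end of the disk, and badblocks -n was run against /dev/sde2, which does not cover the sectors where either GPT header lives, so the backup copy may well still be intact. A cautious sketch with gdisk; nothing is written to disk until 'w' is entered:
Code:
gdisk /dev/sde
#   p  - print what gdisk currently sees
#   r  - enter the recovery/transformation menu
#   b  - rebuild the main GPT header from the backup
#   p  - print the rebuilt table and sanity-check it
#   q  - quit WITHOUT writing if anything looks wrong ('w' only if it looks right)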
 
Old 10-16-2014, 01:42 PM   #8
duffrecords
LQ Newbie
 
Registered: Nov 2009
Location: Los Angeles, CA
Posts: 29

Original Poster
Rep: Reputation: 0
Out of frustration, I decided to scrap the whole thing and start over by reinstalling the operating system from scratch. When I booted from the Ubuntu CD, I got to the menu where I can select whether to install Ubuntu, test the CD, boot from the first hard drive, etc. I thought, "What the hell, why not?" and tried booting from the first hard drive. The system started booting and let me start the array in a degraded state. The partitions on /dev/sde have magically reappeared. /dev/sdf is a spare again, of course. Does anyone know how to transfer the data from sde to sdf and promote sdf so that I can fail sde and get a replacement for it? I can't imagine I'm the only person who has tried to do this.
Code:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Apr 23 01:34:34 2013
     Raid Level : raid10
     Array Size : 8790397440 (8383.18 GiB 9001.37 GB)
  Used Dev Size : 2930132480 (2794.39 GiB 3000.46 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Thu Oct 16 11:42:14 2014
          State : clean, degraded 
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 512K

           Name : dtla:0  (local to host dtla)
           UUID : dd800f45:01b7629e:fbae3456:9c7dbde1
         Events : 15091096

    Number   Major   Minor   RaidDevice State
       6       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2
       7       8       50        3      active sync   /dev/sdd2
       4       8       66        4      active sync   /dev/sde2
       5       0        0        5      removed

       8       8       82        -      spare   /dev/sdf2
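For reference, the mechanism designed for exactly this ("copy a failing member onto a spare while it is still readable, then drop the old one") is mdadm hot-replace, which needs mdadm 3.3+ and a kernel with md hot-replace support (3.3 or later). As I understand it, it copies sde2 onto sdf2 directly and only marks sde2 faulty once the copy completes, although it can still stumble over the same unreadable sectors, since this pair has no other mirror to fall back on. A sketch against the array as shown above:
Code:
# Requires mdadm >= 3.3 and a kernel with md hot-replace support
mdadm /dev/md0 --replace /dev/sde2 --with /dev/sdf2

# Watch the copy; sde2 is marked faulty automatically when it finishes
cat /proc/mdstat
mdadm --detail /dev/md0

# Afterwards, the old member can be removed and the drive sent off for replacement
mdadm /dev/md0 --remove /dev/sde2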
 
  

