Linux - Software: This forum is for Software issues. Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
08-17-2014, 02:19 AM | #1
LQ Newbie | Registered: Nov 2009 | Location: Los Angeles, CA | Posts: 29
software RAID failed -- not enough operational mirrors
I came home the other day and walked in just in time to see a kernel panic on the screen and a message that the system would reboot in 300 seconds, which it promptly did. After the reboot, the system was unable to start my 6-drive RAID 10 array because there were not enough operational mirrors. I should mention the array contains everything--root, boot, swap, etc. Suffice it to say the kernel is not the only thing panicking now.
My first guess was hardware failure (it's been very hot in here lately) but BIOS detected all six disks. Moreover, I booted from a live DVD and gparted detected not only all six disks but also the partitions of each disk. These are 3 TB disks and had to be partitioned with GPT; I can see the 1 MB BIOS boot partition as well as the massive ~3 TB partition on each disk.
If I try to boot in recovery mode, the RAID fails to start because three of the six disks are removed:
Code:
(initramfs) mdadm --detail /dev/md0
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
/dev/md0:
Version : 1.2
Creation Time : Tue Apr 23 08:34:34 2013
Raid Level : raid10
Used Dev Size : -1
Raid Devices : 6
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Fri Aug 15 05:03:53 2014
State : active, FAILED, Not Started
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Name : dtla:0
UUID : dd800f45:01b7629e:fbae3456:9c7dbde1
Events : 14232644
Number Major Minor RaidDevice State
6 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
2 0 0 2 removed
3 0 0 3 removed
4 8 66 4 active sync /dev/sde2
5 0 0 5 removed
When I examine each disk individually, there appears to be some discrepancy about the state of the array:
Code:
(initramfs) for i in a b c d e f; do mdadm -E /dev/sd${i}2 | grep State; done
State : clean
Array State : AAA.A. ('A' == active, '.' == missing)
State : clean
Array State : AAA.A. ('A' == active, '.' == missing)
State : clean
Array State : AAA.A. ('A' == active, '.' == missing)
State : active
Array State : AAAAA. ('A' == active, '.' == missing)
State : clean
Array State : AAA.A. ('A' == active, '.' == missing)
State : active
Array State : AAAAAA ('A' == active, '.' == missing)
Perhaps I'm reading that incorrectly. Does this necessarily mean the RAID is unrecoverable? I've seen some forum posts recommending running something like
Code:
mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2 /dev/sdf2
but I'm very cautious about running a command unless I know with 100% certainty that it's the correct course of action, especially when data loss is a possible outcome. Should I try forcing the array to assemble, or should I try adding the missing disks?
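For reference, one quick way to gauge how far out of sync the members are before forcing anything (a sketch only, assuming the same /dev/sd[a-f]2 member names as above) is to compare each member's event counter; members whose counts are close together are the usual candidates for a forced assembly:
Code:
# print each member's superblock event counter; small gaps between members
# usually mean a forced assembly has a good chance of succeeding
for i in a b c d e f; do
    echo -n "/dev/sd${i}2: "
    mdadm -E /dev/sd${i}2 | grep Events
done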
09-06-2014, 04:43 AM | #2
Moderator | Registered: May 2001 | Posts: 29,415
I'd say check that you have good backups first, then run 'mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sde2'. If that works, then try 'mdadm --add --force /dev/md0 <device>' for each of /dev/sdc2, /dev/sdd2, and /dev/sdf2.
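Spelled out as a rough sketch (device names taken from your output above; verify the backup first, and stop any half-assembled array before reassembling):
Code:
# stop the array if it is still listed in /proc/mdstat
mdadm --stop /dev/md0
# force-assemble from the three members that still agree with each other
mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sde2
# if that works, add the remaining members back one at a time
mdadm --add --force /dev/md0 /dev/sdc2
mdadm --add --force /dev/md0 /dev/sdd2
mdadm --add --force /dev/md0 /dev/sdf2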
09-22-2014, 12:19 PM | #3
LQ Newbie (Original Poster) | Registered: Nov 2009 | Location: Los Angeles, CA | Posts: 29
I forced the three good disks and the one that was behind by two events to assemble:
Code:
mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sde2
Then I added the other two disks and let it sync overnight:
Code:
mdadm --add --force /dev/md0 /dev/sdd2
mdadm --add --force /dev/md0 /dev/sdf2
I rebooted the system in recovery mode and the root filesystem is back! However, / is read-only and my /srv partition, which is the largest and has most of my data, can't mount. When I try to examine the array, it says "no md superblock detected on /dev/md0." On top of the software RAID, I have four logical volumes. Here is the full LVM configuration:
http://pastebin.com/gzdZq5DL
How do I recover the superblock?
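As an aside, the two mdadm views are easy to mix up: --examine reads the RAID superblock stored on a member partition, while --detail queries the assembled array, so pointing --examine at /dev/md0 itself reports no superblock even when the array is healthy, and the LVM volumes layered on top have to be activated separately. A sketch, using the device and volume group names above:
Code:
# query the running array (md superblocks live on the members, not on md0)
mdadm --detail /dev/md0
# examine one member's superblock directly
mdadm --examine /dev/sda2
# rescan for the volume group on top of the array and activate its volumes
vgscan
vgchange -ay vg_raid10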
09-23-2014, 12:09 PM | #4
LQ Newbie (Original Poster) | Registered: Nov 2009 | Location: Los Angeles, CA | Posts: 29
I booted from a live CD so I could use version 3.1.10 of xfs_repair (versions < 3.1.8 reportedly have a bug when using ag_stride), then ran the following command:
Code:
xfs_repair -P -o bhash=16384 -o ihash=16384 -o ag_stride=16 /dev/mapper/vg_raid10-srv
It stopped after a few seconds, saying:
Code:
xfs_repair: read failed: Input/output error
XFS: failed to find log head
zero_log: cannot find log head/tail (xlog_find_tail=5), zeroing it anyway
xfs_repair: libxfs_device_zero write failed: Input/output error
However, I was able to mount the volume after that and my data was still there!
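For reference, a follow-up pass can be run in no-modify mode once the volume is unmounted again (a sketch, assuming the same device path); -n only reports problems and writes nothing:
Code:
# read-only check: report remaining inconsistencies without changing anything
xfs_repair -n /dev/mapper/vg_raid10-srv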
10-04-2014, 11:16 PM | #5
LQ Newbie (Original Poster) | Registered: Nov 2009 | Location: Los Angeles, CA | Posts: 29
I received two replacement drives and added them to the array. When it finished rebuilding, I checked the status and found that the last disk (sdf) had been treated as a spare:
Code:
/dev/md127:
Version : 1.2
Creation Time : Tue Apr 23 04:34:34 2013
Raid Level : raid10
Array Size : 8790397440 (8383.18 GiB 9001.37 GB)
Used Dev Size : 2930132480 (2794.39 GiB 3000.46 GB)
Raid Devices : 6
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Sat Oct 4 14:58:29 2014
State : clean, degraded
Active Devices : 5
Working Devices : 6
Failed Devices : 0
Spare Devices : 1
Layout : near=2
Chunk Size : 512K
Name : dtla:0
UUID : dd800f45:01b7629e:fbae3456:9c7dbde1
Events : 15050926
Number Major Minor RaidDevice State
6 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
2 8 34 2 active sync /dev/sdc2
7 8 50 3 active sync /dev/sdd2
4 8 66 4 active sync /dev/sde2
5 0 0 5 removed
8 8 82 - spare /dev/sdf2
Note the RAID device name is now md127 because I'm running from a live CD. I've been reading posts by people who have had a similar problem and tried two solutions:
1. Stop the array and recreate it, using the --assume-clean option.
2. Grow the array one disk larger, causing mdadm to utilize the spare, then shrink the array back down to the correct number of devices.
Neither of these has worked for me because:
1. I cannot stop the array, even while booted into a live CD environment or while in recovery mode.
Code:
mdadm: Cannot get exclusive access to /dev/md127:Perhaps a running process, mounted filesystem or active volume group?
2. Apparently I cannot grow the array by one disk because it is RAID 10. This makes sense, as RAID 10 requires an even number of disks.
Code:
mdadm: RAID10 can only be changed to RAID0
I definitely don't want to convert it to RAID 0.
I read that one condition that can cause a new disk to become a spare instead of an active member is read errors on the disk it would be rebuilt from. I ran smartctl on sde, the remaining disk in that mirror, and found this:
Code:
[root@localhost ~]# smartctl -t short /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.6.10-4.fc18.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sun Oct 5 00:09:42 2014
Use smartctl -X to abort test.
[root@localhost ~]# smartctl -l selftest /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.6.10-4.fc18.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 11822 1187144704
# 2 Short offline Completed: read failure 90% 11814 1187144704
[root@localhost ~]# smartctl -H /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.6.10-4.fc18.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 069 033 045 Old_age Always In_the_past 31 (72 84 37 28 0)
It appears sde is on its way out as well. That's OK because it's still under warranty but obviously I need to mirror sde to sdf before it fails or there won't be enough operational mirrors again. Should I try to repair sde first or try to mirror it somehow? How do I do either of those things? I should also mention there are LVM volumes on top of the RAID 10.
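As an aside, a couple of SMART attributes beyond the self-test log are worth watching here (a sketch, using the same device; the mdadm call assumes a build of 3.3 or later):
Code:
# pending and reallocated sector counts are the usual early-warning signs
smartctl -A /dev/sde | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
# list any bad blocks already recorded in the member's RAID metadata
mdadm --examine-badblocks /dev/sde2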
Last edited by duffrecords; 10-04-2014 at 11:18 PM.
10-14-2014, 08:47 PM | #6
LQ Newbie (Original Poster) | Registered: Nov 2009 | Location: Los Angeles, CA | Posts: 29
I was able to stop the array by deactivating the volume group that was on top of it.
Code:
[root@localhost ~]# lvchange -a n /dev/vg_raid10/
[root@localhost ~]# swapoff /dev/dm-3
[root@localhost ~]# lvchange -a n /dev/vg_raid10/swap
[root@localhost ~]# vgchange -a n vg_raid10
[root@localhost ~]# mdadm --stop /dev/md127
Then I force-assembled it and force-added the extra disk:
Code:
[root@localhost ~]# mdadm --force --assemble /dev/md127 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2
[root@localhost ~]# mdadm --add --force /dev/md127 /dev/sdf2
Unfortunately, it tries to add the extra disk as a spare again.
Code:
[root@localhost ~]# mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Tue Apr 23 04:34:34 2013
Raid Level : raid10
Array Size : 8790397440 (8383.18 GiB 9001.37 GB)
Used Dev Size : 2930132480 (2794.39 GiB 3000.46 GB)
Raid Devices : 6
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Tue Oct 14 21:27:43 2014
State : clean, degraded, recovering
Active Devices : 5
Working Devices : 6
Failed Devices : 0
Spare Devices : 1
Layout : near=2
Chunk Size : 512K
Rebuild Status : 0% complete
Name : dtla:0
UUID : dd800f45:01b7629e:fbae3456:9c7dbde1
Events : 15051863
Number Major Minor RaidDevice State
6 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
2 8 34 2 active sync /dev/sdc2
7 8 50 3 active sync /dev/sdd2
4 8 66 4 active sync /dev/sde2
8 8 82 5 spare rebuilding /dev/sdf2
It's because there are read errors on sde. I can't remove sde because it's the last mirror in its set (I'm assuming sda and sdb are mirrors, sdc and sdd are mirrors, and sde and sdf are mirrors, and the data is striped across each set). Is there a way to mark these sectors as bad and then assemble the array or is there a way to force sdf to be an active member despite the read errors?
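As an aside, md exposes a scrub interface through sysfs; a 'repair' pass reads every stripe and rewrites anything unreadable or mismatched from the surviving copy, whereas 'check' only reads and counts. A sketch only, using the md127 name from above and assuming no rebuild is currently in progress:
Code:
# start a repair scrub on the running array
echo repair > /sys/block/md127/md/sync_action
# watch progress and the accumulated mismatch count
cat /proc/mdstat
cat /sys/block/md127/md/mismatch_cnt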
10-15-2014, 03:07 PM | #7
LQ Newbie (Original Poster) | Registered: Nov 2009 | Location: Los Angeles, CA | Posts: 29
I read that the following command would cause md to discover the bad blocks and retire them:
Code:
echo 'check' > /sys/block/md0/md/sync_action
However, the check finished immediately and nothing was changed. I stopped the array and tried running this command instead:
Code:
badblocks -b 32768 -c 2048 -n -s /dev/sde2
Big mistake. After taking almost an entire day to run (and despite supposedly being non-destructive), it removed all partitions on /dev/sde. I tried copying the partitions from another disk but, now that the number of blocks has been reduced, the partitions won't fit on sde. testdisk doesn't even list sde anymore, let alone search for lost partitions on it. Is it safe to say this array is officially destroyed beyond any hope of repair?
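As an aside, one way to check whether the drive's visible capacity has genuinely shrunk (a sketch, assuming hdparm and blockdev are available on the live CD) is to compare what the drive reports against what the kernel sees; a host protected area, for example, would show up as a current max sector count below the native maximum:
Code:
# the drive's current visible sector count versus its native maximum
hdparm -N /dev/sde
# the size (in 512-byte sectors) the kernel currently sees for the disk
blockdev --getsz /dev/sde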
10-16-2014, 01:42 PM | #8
LQ Newbie (Original Poster) | Registered: Nov 2009 | Location: Los Angeles, CA | Posts: 29
Out of frustration, I decided to scrap the whole thing and start over by reinstalling the operating system from scratch. When I booted from the Ubuntu CD, I got to the menu where I could select whether to install Ubuntu, test the CD, boot from the first hard drive, etc. I thought, "What the hell, why not?" and tried booting from the first hard drive. The system started booting and let me start the array in a degraded state. The partitions on /dev/sde have magically reappeared. /dev/sdf is a spare again, of course. Does anyone know how to transfer the data from sde to sdf and promote sdf, so that I can fail sde and get a replacement for it? I can't imagine I'm the only person who has tried to do this.
Code:
/dev/md0:
Version : 1.2
Creation Time : Tue Apr 23 01:34:34 2013
Raid Level : raid10
Array Size : 8790397440 (8383.18 GiB 9001.37 GB)
Used Dev Size : 2930132480 (2794.39 GiB 3000.46 GB)
Raid Devices : 6
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Thu Oct 16 11:42:14 2014
State : clean, degraded
Active Devices : 5
Working Devices : 6
Failed Devices : 0
Spare Devices : 1
Layout : near=2
Chunk Size : 512K
Name : dtla:0 (local to host dtla)
UUID : dd800f45:01b7629e:fbae3456:9c7dbde1
Events : 15091096
Number Major Minor RaidDevice State
6 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
2 8 34 2 active sync /dev/sdc2
7 8 50 3 active sync /dev/sdd2
4 8 66 4 active sync /dev/sde2
5 0 0 5 removed
8 8 82 - spare /dev/sdf2
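As an aside, sufficiently recent mdadm and kernel versions support a hot-replace operation that copies a member onto a spare and only fails the old member out once the copy completes, which is essentially what is being asked for here. A sketch only, using the device names from the output above and assuming the running versions support it:
Code:
# copy sde2's contents onto the spare sdf2, then mark sde2 faulty
mdadm /dev/md0 --replace /dev/sde2 --with /dev/sdf2
# watch the copy progress
cat /proc/mdstat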