LinuxQuestions.org
04-11-2013, 09:44 AM   #1
davediehose
LQ Newbie
 
Registered: Apr 2013
Location: Germany
Distribution: OpenSUSE, Fedora, Xubuntu
Posts: 4

Rep: Reputation: Disabled
Drives dropping out of mdadm RAID10 randomly on boot


Hello fellow Linuxers,

I have a problem with my mdadm RAID10 which I am running on a machine with OpenSUSE 12.3. It appeared today, apparently after a normal reboot.

On boot, I see behavior similar to this:

Code:
[    2.572122] md: md0 stopped.
[    2.588542] md: bind<sdb1>
[    2.603699] md: bind<sdd1>
[    2.624639] md: bind<sde1>
[    2.624665] md: could not open unknown-block(8,33).
[    2.624666] md: md_import_device returned -16
[    2.624692] md: kicking non-fresh sde1 from array!
[    2.624695] md: unbind<sde1>
[    2.635518] md: export_rdev(sde1)
[    2.635542] md: kicking non-fresh sdb1 from array!
[    2.635546] md: unbind<sdb1>
[    2.641204] md: export_rdev(sdb1)
[    2.642475] md: raid10 personality registered for level 10
[    2.642933] md/raid10:md0: not enough operational mirrors.
[    2.642947] md: pers->run() failed ...
I say "similar" because I have seen different drives, and even different numbers of drives, drop from the array. The drops seem unnecessary: I can re-add the missing drives and, most of the time, the array does not even rebuild (it did so only once):
Code:
[  304.380667] md: bind<sdb1>
[  304.407601] md: recovery of RAID array md0
[  304.407607] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  304.407609] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[  304.407615] md: using 128k window, over a total of 1610611456k.
[  305.313459] md: md0: recovery done.
[  307.552017] md: bind<sde1>
[  307.579897] md: recovery of RAID array md0
[  307.579903] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  307.579905] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[  307.579910] md: using 128k window, over a total of 1610611456k.
[  308.509127] md: md0: recovery done.
If one of the two mirrors is dropped completely, I have to stop the array and restart it; it then comes up with two disks (one in each mirror), and I can add the other two drives afterwards.
Once it is up and running, the array works, but the same problem reappears on every reboot.
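
Since different members drop out on different boots, it may help to tally which partitions md kicked on each boot. The following is only a minimal sketch: the function name `kicked_devices` is made up for illustration, and it assumes the kernel log has been saved as text (for example from `dmesg`) in the same format as the excerpt above.

```python
import re

# Matches kernel lines such as "md: kicking non-fresh sde1 from array!"
KICK_RE = re.compile(r"md: kicking non-fresh (\S+) from array!")

def kicked_devices(log_text):
    """Return the member partitions that md kicked, in order of appearance."""
    return KICK_RE.findall(log_text)

# Two lines copied from the boot log in this post.
sample = (
    "[    2.624692] md: kicking non-fresh sde1 from array!\n"
    "[    2.635542] md: kicking non-fresh sdb1 from array!\n"
)
print(kicked_devices(sample))  # prints ['sde1', 'sdb1']
```

Running this over the saved logs of several boots would show whether the same partitions keep dropping or whether it really is random.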

The RAID10 partition doesn't take up all of the space on the drives; I also run a RAID1 and a RAID0 on them, both of which work without problems on every boot. This leads me to assume that there is no actual drive failure, because even the RAID0 works, and it should be the most vulnerable to any hardware fault. When I fix the RAID10 by hand, all RAIDs look good on paper:
Code:
 cat /proc/mdstat
Personalities : [raid10] [raid0] [raid1]
md0 : active raid10 sde1[3] sdb1[0] sdc1[1] sdd1[2]
      3221222912 blocks super 1.0 256K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 0/24 pages [0KB], 65536KB chunk

md2 : active raid1 sdb3[0] sde3[3] sdd3[2] sdc3[1]
      157284224 blocks super 1.0 [4/4] [UUUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md1 : active raid0 sdb2[0] sde2[3] sdd2[2] sdc2[1]
      524295936 blocks super 1.0 64k chunks
BTW, I use GPT on all of the disks. My system is a UEFI one, but I boot the OS in compatibility mode with GRUB, not via EFI boot or EFI-GRUB or anything. I don't think this could cause any problems, or could it? It hasn't before this situation.

For lack of ideas, I have just started a bad-sector check of the filesystems on the LVM volumes on the RAID10, to find out whether there are actual bad blocks. However, I rather suspect that something is wrong at the mdadm/superblock level, and I am not that experienced there.

As I cannot point to any particular cause, let alone a fix, for this behavior, I would greatly appreciate your help. I will provide whatever additional information you need.

Regards,
dave

Last edited by davediehose; 04-11-2013 at 10:54 AM.
 
04-11-2013, 01:42 PM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: CentOS 6 (pre-systemd)
Posts: 1,773

Rep: Reputation: 454
Error -16 is -EBUSY. md_import_device fails when it tries to get exclusive ownership of a drive that is already in use by something else. Check the logs for anything that runs before the md errors appear and may be using the disks.
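
For anyone who wants to look up such return values themselves, here is a small sketch using Python's standard `errno` module. The helper name `describe_errno` is made up, and the lookup uses the errno table of the machine the script runs on, so it should be run on Linux to match the kernel's numbering.

```python
import errno
import os

def describe_errno(code):
    """Map a negative kernel return value such as -16 to its symbolic name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)

# The value from the "md_import_device returned -16" line in the boot log.
name, message = describe_errno(-16)
print(name, "-", message)
```

On a Linux box the name comes out as EBUSY, and the message is the same "Device or resource busy" text that mdadm itself prints.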
 
04-11-2013, 02:35 PM   #3
davediehose
LQ Newbie
 
Registered: Apr 2013
Location: Germany
Distribution: OpenSUSE, Fedora, Xubuntu
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thanks for the response. I looked into the logs and found matching errors, some from several days ago, so the problem has existed for some time without my noticing. Creepy.

It is most probably a race condition, or a process polling/accessing the drives earlier than it is supposed to. I have lines saying this:
Code:
2013-04-01T22:54:06.102933+02:00 davederserver boot.md[356]: Starting MD RAID mdadm: failed to add /dev/sdd1 to /dev/md/0: Device or resource busy
2013-04-01T22:54:06.102938+02:00 davederserver boot.md[356]: mdadm: failed to add /dev/sdb1 to /dev/md/0: Device or resource busy
2013-04-01T22:54:06.102942+02:00 davederserver boot.md[356]: mdadm: /dev/md/0 has been started with 2 drives (out of 4).
Nothing around those lines jumped out at me as something that would keep the drives busy, though. That doesn't mean there isn't anything, of course. Is there an elegant way to delay the md assembly on boot? It takes place right in the middle of so much seemingly unrelated stuff.
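
One thing that can be checked right after a bad boot is whether some other kernel device has already claimed a partition that failed to join. sysfs lists such claims under /sys/class/block/<partition>/holders. The sketch below only reads that directory; `device_holders` and its `sysfs_root` parameter are names invented for this example, and it does not show userspace processes that merely have the device open (tools like lsof or fuser cover that case).

```python
import os

def device_holders(partition, sysfs_root="/sys/class/block"):
    """Return kernel devices (e.g. md0, dm-0) that currently hold `partition`."""
    holders_dir = os.path.join(sysfs_root, partition, "holders")
    try:
        return sorted(os.listdir(holders_dir))
    except FileNotFoundError:
        # Partition does not exist on this machine.
        return []

# Partition names taken from the log lines above.
for part in ("sdb1", "sdd1"):
    print(part, device_holders(part))
```

If a partition that md rejected shows a holder other than the expected array (another md device or a device-mapper node, say), that holder is a candidate for whatever made it busy.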

When my fsck finishes, I will next try booting without md assemble at boot (kernel raid=noautomount should do this, I assume) to see how it goes when I do everything by hand from the start.
 
04-11-2013, 02:52 PM   #4
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: CentOS 6 (pre-systemd)
Posts: 1,773

Rep: Reputation: 454
The only thing that might conflict that early is udev. Check whether udev is doing something funny with the disks.

Also check /etc/mdadm.conf and make sure you aren't assembling the same arrays twice.
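
A quick way to spot the double-assembly case is to look for the same UUID on more than one ARRAY line. This is a simplified sketch, not a full mdadm.conf parser: `duplicate_array_uuids` is a hypothetical helper, continuation lines (those starting with whitespace) and abbreviated keywords are not handled, and the UUIDs in the sample are placeholders.

```python
from collections import Counter

def duplicate_array_uuids(conf_text):
    """Return UUIDs that appear on more than one ARRAY line of mdadm.conf-style text."""
    uuids = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        words = line.split()
        if not words or words[0].lower() != "array":
            continue
        for word in words[1:]:
            key, _, value = word.partition("=")
            if key.lower() == "uuid" and value:
                uuids.append(value.lower())
    counts = Counter(uuids)
    return sorted(u for u, n in counts.items() if n > 1)

# Placeholder config with the same array listed under two names.
sample = """ARRAY /dev/md0 UUID=aaaa:bbbb
ARRAY /dev/md/0 UUID=aaaa:bbbb
ARRAY /dev/md1 UUID=cccc:dddd
"""
print(duplicate_array_uuids(sample))  # prints ['aaaa:bbbb']
```

This only covers duplicates inside one file; an array assembled once from the initrd and again by a boot script would not show up here.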
 
04-11-2013, 03:23 PM   #5
davediehose
LQ Newbie
 
Registered: Apr 2013
Location: Germany
Distribution: OpenSUSE, Fedora, Xubuntu
Posts: 4

Original Poster
Rep: Reputation: Disabled
OK, the situation has changed somewhat. I have now seen three clean assemblies in three reboots, one of them a cold boot. The thing is, I can't figure out why it works.

The only thing fsck found was two cases of excessive directory depth on an inode. That could hardly have been the problem, I guess.

I disabled the startup of some services through chkconfig (nfsserver, libvirtd); maybe that helped. I also added raid=noautomount to the kernel options. This didn't disable md, as I had expected it to, but maybe it changed some significant detail of the boot that isn't visible to me (?).

Anyway, I now want to assemble the arrays manually later from a script. That would be a good way to work around this kind of problem in the future, and I have to mount LUKS manually anyway. After the kernel option didn't work as expected, I tried disabling boot.md to keep md from assembling my arrays, but they were assembled anyway. Could you help me out on the best way forward here?
 
04-11-2013, 05:06 PM   #6
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: CentOS 6 (pre-systemd)
Posts: 1,773

Rep: Reputation: 454
Put this in /etc/mdadm.conf:

Code:
AUTO -all
Then it should not assemble any arrays automatically.
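
With auto-assembly off, a later script can assemble the array from an explicit member list. Below is a minimal sketch of that idea, not a tested recipe: the function names are invented, the device names are the ones from the /proc/mdstat listing in post #1 (sdX names can move between boots), and it defaults to a dry run that only prints the command. As far as I know, arrays named explicitly on the mdadm command line are still assembled when AUTO -all is set, but verify that on your system.

```python
import subprocess

def assemble_command(array, members):
    """Build the argv for assembling `array` from an explicit list of member partitions."""
    return ["mdadm", "--assemble", array, *members]

def assemble(array, members, dry_run=True):
    """Print the mdadm command (dry run) or execute it and return its exit status."""
    cmd = assemble_command(array, members)
    if dry_run:
        print(" ".join(cmd))
        return 0
    # Needs root; mdadm exits non-zero if the array could not be started.
    return subprocess.run(cmd).returncode

# Dry run with the RAID10 members from post #1.
assemble("/dev/md0", ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1"])
```

Calling `assemble(..., dry_run=False)` as root would run the command for real; opening LUKS and mounting would follow in the same script.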
 
04-12-2013, 05:58 AM   #7
davediehose
LQ Newbie
 
Registered: Apr 2013
Location: Germany
Distribution: OpenSUSE, Fedora, Xubuntu
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thanks for the info. So a catch-"all" with a minus, and nothing else in the config, means do not assemble anything. Now the mdadm.conf manual makes sense to me.
 
  

