Hello fellow Linuxers,
I have a problem with my mdadm RAID10 array, running on a machine with openSUSE 12.3. It appeared today, apparently after a normal reboot.
On boot, I see behavior similar to this:
Code:
[ 2.572122] md: md0 stopped.
[ 2.588542] md: bind<sdb1>
[ 2.603699] md: bind<sdd1>
[ 2.624639] md: bind<sde1>
[ 2.624665] md: could not open unknown-block(8,33).
[ 2.624666] md: md_import_device returned -16
[ 2.624692] md: kicking non-fresh sde1 from array!
[ 2.624695] md: unbind<sde1>
[ 2.635518] md: export_rdev(sde1)
[ 2.635542] md: kicking non-fresh sdb1 from array!
[ 2.635546] md: unbind<sdb1>
[ 2.641204] md: export_rdev(sdb1)
[ 2.642475] md: raid10 personality registered for level 10
[ 2.642933] md/raid10:md0: not enough operational mirrors.
[ 2.642947] md: pers->run() failed ...
I say similar because the drives that get dropped, and even how many of them, vary from boot to boot. The dropping does not look like a real failure: I can re-add the missing drives to the array, and most of the time it doesn't even rebuild (it only did so once):
Code:
[ 304.380667] md: bind<sdb1>
[ 304.407601] md: recovery of RAID array md0
[ 304.407607] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 304.407609] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[ 304.407615] md: using 128k window, over a total of 1610611456k.
[ 305.313459] md: md0: recovery done.
[ 307.552017] md: bind<sde1>
[ 307.579897] md: recovery of RAID array md0
[ 307.579903] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 307.579905] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[ 307.579910] md: using 128k window, over a total of 1610611456k.
[ 308.509127] md: md0: recovery done.
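For reference, the re-add is basically just this (the device names are only examples from that boot; which drives are missing varies):
Code:
mdadm /dev/md0 --re-add /dev/sdb1
mdadm /dev/md0 --re-add /dev/sde1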
If one of the two mirror pairs is dropped completely, I have to stop the array and reassemble it; it then comes up with two disks (one from each mirror pair), and I can add the other two drives afterwards.
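Concretely, that looks roughly like this (device names are examples again):
Code:
mdadm --stop /dev/md0
mdadm --assemble /dev/md0          # comes up degraded with the two fresh members
mdadm /dev/md0 --add /dev/sdb1     # then add back whichever two were kicked
mdadm /dev/md0 --add /dev/sde1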
Once it is up and running, the array works fine, but the same problem reappears on every reboot.
The RAID10 partitions don't take up all of the space on the drives; I also run a RAID1 and a RAID0 on other partitions of the same disks, and both come up without problems on every boot. This leads me to assume that there is no actual drive failure, because even the RAID0 works, and it should be the most vulnerable to any hardware trouble. When I fix the RAID10 by hand, all arrays look good on paper:
Code:
cat /proc/mdstat
Personalities : [raid10] [raid0] [raid1]
md0 : active raid10 sde1[3] sdb1[0] sdc1[1] sdd1[2]
3221222912 blocks super 1.0 256K chunks 2 near-copies [4/4] [UUUU]
bitmap: 0/24 pages [0KB], 65536KB chunk
md2 : active raid1 sdb3[0] sde3[3] sdd3[2] sdc3[1]
157284224 blocks super 1.0 [4/4] [UUUU]
bitmap: 0/2 pages [0KB], 65536KB chunk
md1 : active raid0 sdb2[0] sde2[3] sdd2[2] sdc2[1]
524295936 blocks super 1.0 64k chunks
By the way, I use GPT on all of the disks and the system is a UEFI one, but I boot the OS in legacy/compatibility mode with GRUB, not via EFI boot or EFI GRUB. I don't think this could cause any problems, or could it? It hasn't before this situation.
For lack of better ideas, I have just started a bad-sector search on the filesystems of the LVM volumes that sit on the RAID10, to find out whether there are actual bad blocks. However, I rather suspect something is wrong at the mdadm/superblock level, but I am not very experienced there.
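For reference, the scan and the superblock check I have in mind are roughly this (assuming ext4; the LVM volume name is just an example):
Code:
# read test for bad blocks on one of the LVM volumes on md0 (unmounted)
e2fsck -fc /dev/mapper/vg0-data

# compare event counters and update times of the RAID10 members
mdadm --examine /dev/sd[bcde]1 | grep -E 'Update Time|Events|State'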
As I cannot point to any particular cause, let alone a fix, for this behavior, I would greatly appreciate your help. I will gladly provide whatever additional information you need.
Regards,
dave