Hello,
This happens on Slackware 14.1 with the stock 3.10.17 kernel (both x86_64 and i686) and mdadm 3.2.6 (also tested with 3.3.4).
(The issue was first noticed on a Xen machine, but was reproduced on a stock kernel as well as on another machine running 32-bit Slackware.)
Given a RAID1 array, a device fails and is hot-replaced. The rebuild starts normally. However, if the machine is rebooted before the rebuild finishes, the array no longer appears as degraded/recovering, and the data is corrupted (since one HDD is brand new and never finished syncing).
Searching only turned up a similar bug from 2012 in Fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=817039
The suggestion there was to update mdadm; however, the mdadm shipped with Slackware is newer than the one in that bug report. Besides, I don't think the problem lies with mdadm (a user-mode program) but rather with the md driver in the kernel. Just to be on the safe side, I downloaded and compiled mdadm 3.3.4; the problem persists.
Everything RAID-related was done using only Linux tools (mdadm), i.e. the motherboard BIOS was not configured for RAID. The same problem appears on two different systems (different CPU, motherboard, etc.), so a hardware or compatibility issue is unlikely.
Is there an option/flag/switch that I am missing, or is this a bug somewhere? Besides the kernel and mdadm, are there any other components involved in Linux RAID?
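For completeness, my understanding of the assembly path at boot: an array is brought up either by the kernel's in-kernel autodetect or by mdadm run from the initrd/startup scripts, and dmesg shows which path ran. A quick sketch of the check (the sample log lines below are illustrative, not from my machine; on a live system you would grep the real dmesg):

```shell
# Illustrative only: the md lines are a made-up sample.
# On a live system:  dmesg | grep -i 'md:'
printf '%s\n' \
    'md: Autodetecting RAID arrays.' \
    'md: autorun ...' |
grep -i 'autodetect'
```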
Simple steps to reproduce are below.
WARNING! /dev/sdb1 and /dev/sdc1 will be ERASED, don't try this unless you know what you are doing!
I used sdb1 and sdc1 as partitions of type 0xfd (Linux RAID autodetect).
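One note on the 0xfd type: as far as I know, the kernel's in-kernel autodetect only applies to 0.90 metadata, while mdadm -C defaults to 1.2 (which the --detail output below confirms), so these arrays are assembled in userspace regardless of the partition type. A small sketch for pulling the metadata version out of --examine output (the sample text below is assumed; on a live system pipe in the real command):

```shell
# Sample of what `mdadm --examine /dev/sdb1` prints (assumed layout);
# live: mdadm --examine /dev/sdb1 | awk -F': *' '/Version/ {print $2; exit}'
printf '%s\n' \
    '/dev/sdb1:' \
    '          Magic : a92b4efc' \
    '        Version : 1.2' |
awk -F': *' '/Version/ {print $2; exit}'
```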
Create the array:
Code:
root@nxen:~# mdadm -C /dev/md127 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
After the initial resync has finished, fail the device, remove it, and re-add it:
Code:
root@nxen:~# mdadm --manage /dev/md127 --fail /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md127
root@nxen:~# mdadm --manage /dev/md127 --remove failed
mdadm: hot removed 8:33 from /dev/md127
root@nxen:~# mdadm --manage /dev/md127 -a /dev/sdc1
mdadm: added /dev/sdc1
Check that the rebuild has begun:
Code:
root@nxen:~# mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Wed Sep 30 14:48:05 2015
     Raid Level : raid1
     Array Size : 20955136 (19.98 GiB 21.46 GB)
  Used Dev Size : 20955136 (19.98 GiB 21.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Sep 30 14:55:52 2015
          State : active, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 1% complete

           Name : nxen:127  (local to host nxen)
           UUID : 7589d8f8:0d8b5716:06e07bfa:28407522
         Events : 23

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       2       8       33        1      spare rebuilding   /dev/sdc1
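The rebuild can also be watched in /proc/mdstat. To grab just the recovery percentage I used something like the following (the sample mdstat content is assumed for illustration; on a live box just cat /proc/mdstat):

```shell
# Sample /proc/mdstat content (assumed); live: cat /proc/mdstat
mdstat='md127 : active raid1 sdc1[2] sdb1[0]
      20955136 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  1.0% (215168/20955136) finish=3.2min speed=107584K/sec'
printf '%s\n' "$mdstat" | grep -o 'recovery = *[0-9.]*%'
```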
Reboot the machine and check again:
Code:
root@nxen:~# mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Wed Sep 30 14:48:05 2015
     Raid Level : raid1
     Array Size : 20955136 (19.98 GiB 21.46 GB)
  Used Dev Size : 20955136 (19.98 GiB 21.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Sep 30 14:56:35 2015
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : nxen:127  (local to host nxen)
           UUID : 7589d8f8:0d8b5716:06e07bfa:28407522
         Events : 26

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       2       8       33        1      active sync   /dev/sdc1
It shows as clean instead of degraded/rebuilding. Also, in the first scenario, where an HDD was actually replaced, the data was corrupted, which is to be expected when the rebuild is considered done but never actually completed.
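One thing that might help with debugging: comparing the Events counter that mdadm --examine reports on each member after the reboot. If the kernel had noticed the interrupted rebuild, I would expect the freshly added disk to be behind. A small sketch of the comparison (the two sample counts are made up; on a live system extract them with --examine as in the comment):

```shell
# Made-up sample counts; live:
#   ev_sdb=$(mdadm --examine /dev/sdb1 | awk '/Events/ {print $3}')
#   ev_sdc=$(mdadm --examine /dev/sdc1 | awk '/Events/ {print $3}')
ev_sdb=26
ev_sdc=26
if [ "$ev_sdb" -eq "$ev_sdc" ]; then
    echo "event counters match: both members considered up to date"
else
    echo "event counters differ: $ev_sdb vs $ev_sdc"
fi
```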
Any ideas? Thanks in advance!