Old 09-30-2015, 07:41 AM   #1
gaitos
LQ Newbie
 
Registered: Oct 2014
Distribution: Slackware
Posts: 28

md *does not* continue rebuilding after reboot on Slackware 14.1


Hello,

This happens on Slackware 14.1, kernel 3.10.17 stock (x86_64 as well as i686), with mdadm 3.2.6 (also tested with 3.3.4).

(The issue was first noticed on a Xen machine but was reproduced with the stock kernel, as well as on another machine running 32-bit Slackware.)

Given a RAID1 array, a device fails and is hot-replaced. The rebuild starts normally. However, if the machine is rebooted before the rebuild finishes, the array no longer appears as degraded, recovering, and the data is corrupted (as expected, given that one HDD is brand new and never completed its resync).

Searching only provided a similar bug from 2012 in Fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=817039

The suggestion there was to update mdadm; however, the mdadm shipped with Slackware is already newer than the one in that bug report. Besides, I don't think the problem is with mdadm (a user-mode program) but rather with the md driver in the kernel. Still, just to be on the safe side, I downloaded and compiled mdadm 3.3.4. The problem persists.
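(For anyone who wants to repeat that test: a newer mdadm can be tried without replacing the installed one, roughly as below. The exact download URL is an assumption from memory; check kernel.org for the current location.)
Code:
# Fetch and build mdadm 3.3.4 from the upstream tarball (verify the URL)
wget https://www.kernel.org/pub/linux/utils/raid/mdadm/mdadm-3.3.4.tar.xz
tar xf mdadm-3.3.4.tar.xz && cd mdadm-3.3.4
make
./mdadm --version   # run the freshly built binary in place, without installing it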

Everything RAID-related was done using only Linux tools (mdadm), i.e. the motherboard BIOS was not configured for RAID. The same problem appears on two different systems (different CPU, motherboard etc.), so it's unlikely to be a hardware or compatibility issue.

Is there an option/flag/switch that I am missing, or is there a bug somewhere? Besides the kernel and mdadm, are there any other components involved in Linux RAID?

Simple steps to reproduce are below. WARNING! /dev/sdb1 and /dev/sdc1 will be ERASED; don't try this unless you know what you are doing!
I used sdb1 and sdc1 as RAID autodetect partitions (0xfd).
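(For completeness, preparing such members amounts to roughly the following; the fdisk keystrokes are shown as comments and the device names match the ones above:)
Code:
# Create one partition per disk and set its type to 0xfd (Linux raid autodetect)
fdisk /dev/sdb   # n (new partition), t -> fd, w
fdisk /dev/sdc   # n (new partition), t -> fd, w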

Create the array:
Code:
root@nxen:~# mdadm -C /dev/md127 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
After it has finished the initial resync:
Code:
root@nxen:~# mdadm --manage /dev/md127 --fail /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md127
root@nxen:~# mdadm --manage /dev/md127 --remove failed
mdadm: hot removed 8:33 from /dev/md127
root@nxen:~# mdadm --manage /dev/md127 -a /dev/sdc1
mdadm: added /dev/sdc1
Check that the rebuild has begun:
Code:
root@nxen:~# mdadm --detail /dev/md127
/dev/md127:
         Version : 1.2
   Creation Time : Wed Sep 30 14:48:05 2015
      Raid Level : raid1
      Array Size : 20955136 (19.98 GiB 21.46 GB)
   Used Dev Size : 20955136 (19.98 GiB 21.46 GB)
    Raid Devices : 2
   Total Devices : 2
     Persistence : Superblock is persistent

     Update Time : Wed Sep 30 14:55:52 2015
           State : active, degraded, recovering
  Active Devices : 1
 Working Devices : 2
  Failed Devices : 0
   Spare Devices : 1

  Rebuild Status : 1% complete

            Name : nxen:127  (local to host nxen)
            UUID : 7589d8f8:0d8b5716:06e07bfa:28407522
          Events : 23

     Number   Major   Minor   RaidDevice State
        0       8       17        0      active sync   /dev/sdb1
        2       8       33        1      spare rebuilding   /dev/sdc1
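(Rebuild progress can also be watched in /proc/mdstat while it runs; this is just the standard way to monitor md, not output from the session above:)
Code:
watch -n 5 cat /proc/mdstat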
Reboot the machine and check again:
Code:
root@nxen:~# mdadm --detail /dev/md127
/dev/md127:
         Version : 1.2
   Creation Time : Wed Sep 30 14:48:05 2015
      Raid Level : raid1
      Array Size : 20955136 (19.98 GiB 21.46 GB)
   Used Dev Size : 20955136 (19.98 GiB 21.46 GB)
    Raid Devices : 2
   Total Devices : 2
     Persistence : Superblock is persistent

     Update Time : Wed Sep 30 14:56:35 2015
           State : clean
  Active Devices : 2
 Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0

            Name : nxen:127  (local to host nxen)
            UUID : 7589d8f8:0d8b5716:06e07bfa:28407522
          Events : 26

     Number   Major   Minor   RaidDevice State
        0       8       17        0      active sync   /dev/sdb1
        2       8       33        1      active sync   /dev/sdc1
It shows as clean instead of rebuilding. Also, in the first scenario (where an HDD was actually replaced) the data was corrupted, which is to be expected when the rebuild is considered done but isn't.

Any ideas? Thanks in advance!
 
Old 10-04-2015, 01:51 AM   #2
wildwizard
Member
 
Registered: Apr 2009
Location: Oz
Distribution: slackware64-14.0
Posts: 875

1. Don't use 0xfd.
2. Does this only occur in your scenario of the old disk being added back into the array? I.e., does an actual new disk work OK?
 
Old 10-04-2015, 06:05 AM   #3
gaitos
LQ Newbie
 
Registered: Oct 2014
Distribution: Slackware
Posts: 28

Original Poster
Thank you for replying. Unfortunately, it took quite a while for my post to be approved, and the problem was solved in the meantime with help from another list.

Answers to your observations:
1. I only used 0xfd for testing (figuring it was the typical use scenario); the original problem manifested when using whole drives (/dev/sdb and /dev/sdc).

2. No, the problem first manifested when a brand new disk was introduced into the array. After the reboot (which happened before the rebuild was finished) the array was shown as healthy and the new disk showed "active sync". Of course, the array didn't contain the right data. In fact, this was quite scary for me: the machine I encountered the problem on was in testing, but I had a similar configuration in production. An HDD replacement followed by an eventual reboot before the rebuild completed could lead to a subtle alteration of data (the worst kind, since it can pass undetected for a while). Needless to say, I have reconfigured that server.


I have found two possible solutions:
1. Upgrade to kernel 3.19.8 (the last 3.x kernel), or
2. Create /etc/mdadm.conf using mdadm --examine --scan > /etc/mdadm.conf and pass raid=noautodetect on the kernel command line (sketched below).
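(For a typical Slackware/LILO setup, option 2 amounts to roughly the following; the lilo.conf snippet is illustrative, so adjust it to your own image stanza:)
Code:
# Record the existing arrays so mdadm assembles them instead of the kernel
mdadm --examine --scan > /etc/mdadm.conf

# In /etc/lilo.conf, add the boot parameter to the relevant image section:
#   append = "raid=noautodetect"
# then reinstall the boot loader
lilo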

I have chosen the 2nd solution (the one suggested on the rlug list, the Romanian LUG) because I didn't want to spend a lot of time configuring the newer kernel (e.g. compiled with its defaults it didn't recognize my network cards), and from what I've learned it's safer to have mdadm assemble the array than to rely on kernel autodetect.

As far as I understood from the explanations on rlug, the kernel's RAID autodetection (at least in 3.10.17) only supports version 0.90 of the md superblock. I have not yet checked the kernel changelog to verify this; however, I have empirically determined that the autodetect in the Slackware default huge-3.10.17 kernel will not continue a rebuild after reboot.

So, safest solution: create a /etc/mdadm.conf and avoid kernel autodetect with "raid=noautodetect".
 
Old 10-05-2015, 07:13 AM   #4
bassmadrigal
LQ Guru
 
Registered: Nov 2003
Location: West Jordan, UT, USA
Distribution: Slackware
Posts: 8,792

Quote:
Originally Posted by gaitos
I have chosen the 2nd solution (the one suggested on the rlug list, the Romanian LUG) because I didn't want to spend a lot of time configuring the newer kernel (e.g. compiled with its defaults it didn't recognize my network cards), and from what I've learned it's safer to have mdadm assemble the array than to rely on kernel autodetect.
Just as a side note, there shouldn't be any software issues with stock Slackware if you were to install a 4.x kernel (although it is possible, though not likely, that third-party software you've installed may have kernel limitations). Pat has a good .config for the 4.1.6 kernel in -current. It should support all the hardware your 3.10.17 does, plus newer stuff.
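(Roughly, reusing that config would look like the sketch below; the config file name and its location on the -current mirror are assumptions, so verify them against your local mirror before relying on this:)
Code:
# First fetch the generic 4.1.6 config (e.g. config-generic-4.1.6.x64) from the
# kernel source area of a slackware64-current mirror, then:
cp config-generic-4.1.6.x64 /usr/src/linux-4.1.6/.config
cd /usr/src/linux-4.1.6
make oldconfig
make bzImage modules && make modules_install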
 
Old 10-05-2015, 07:30 AM   #5
gaitos
LQ Newbie
 
Registered: Oct 2014
Distribution: Slackware
Posts: 28

Original Poster
@bassmadrigal
Thank you for the heads-up; I wasn't aware of that. However, this machine is acting as a Xen host (Dom0). I will test 4.1.6 (modify the base .config and compile it as a Xen Dom0 kernel) when I get some time, but I think the results would be slightly off-topic for this thread.
 
Old 10-05-2015, 03:44 PM   #6
wildwizard
Member
 
Registered: Apr 2009
Location: Oz
Distribution: slackware64-14.0
Posts: 875

Option 2 is the correct one, though if you hadn't used 0xfd you wouldn't need to pass raid=noautodetect to the kernel, as it would not detect the partitions in the first place.

I'll be linking back to this thread as proof that 0xfd may cause data loss, for the next person who mentions they use 0xfd for the partition type.

I've been warning people away from that for years now, but people seem to think they know better than the kernel RAID folks I've been quoting all that time.

See the following pages:
https://raid.wiki.kernel.org/index.php/Partition_Types
https://raid.wiki.kernel.org/index.php/RAID_Boot
 
Old 10-06-2015, 01:29 AM   #7
gaitos
LQ Newbie
 
Registered: Oct 2014
Distribution: Slackware
Posts: 28

Original Poster
Yes, that seems a sensible warning. However, be advised that (at least with huge-3.10.17) the kernel will autodetect whole disks as well (no 0xfd partition); I assume it wouldn't touch 0xda partitions. So IMHO the safe choice, no matter what kernel one is using, is to have a correct /etc/mdadm.conf and have the arrays assembled by mdadm (via an initrd if the root fs is on RAID), and additionally either pass "raid=noautodetect" to the kernel or avoid autodetection by other means (e.g. 0xda partitions).
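(For a root-on-RAID Slackware box that works out to something like the sketch below; the kernel version, filesystem and md device are just examples taken from this thread, and the mkinitrd options should be checked against your own setup:)
Code:
# Switch the member partitions to type 0xda (Non-FS data) so nothing autodetects them
fdisk /dev/sdb    # interactive: t -> da, w   (example device)
fdisk /dev/sdc    # interactive: t -> da, w

# Tell mdadm about the arrays
mdadm --examine --scan > /etc/mdadm.conf

# Build an initrd that assembles the RAID before mounting root
# (Slackware's mkinitrd; -R pulls in mdadm/RAID support)
mkinitrd -c -k 3.10.17 -f ext4 -r /dev/md127 -R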
 
  

