[SOLVED] Raid issues with the last kernel upgrade from 3.2.29 to 3.2.45 on Slackware 14.0

Slackovado · 07-16-2013, 02:30 PM

Quote:

Originally Posted by meetscott

Well, I figured I'd let everyone know I tried this... that is not specifying the root in the lilo.conf images section. I get the exact same results. So, I'm completely at a loss as to why I appear to be the only person who is seeing this behavior.

I give up on this one. I'm just going to be happy running the older kernel. It takes too much time to experiment around with this sort of thing and I have several other projects I need to attend to.

This issue has nothing to do with the Slackware 14 3.2.45 kernel.
I had run into this same problem back with Slackware 13.37.
I do remember spending several hours on resolving it but I don't remember the exact steps.
I vaguely remember that I would boot into huge kernel and then shut down (stop and remove) the wrongly "autodetected" raids and reassemble them again, then save the config into madadm.conf and build initrd.
I also remember that using the madadm switch that forces raid to not use partitions but whole drives didn't work and the kernel autodetection would still keep kicking in, so don't go on that path.
I think stopping the raid and reassembling it was the key to making it work.
Sorry can't be more specific as it's been over a year ago.
On another note I too have perfectly working raid 1 with the updated kernel 3.2.45
So don't give up on it too quickly and use madadm to fix it.
Do have a good backup of your data thought just in case you wipe out some partitions.

mRgOBLIN · 07-16-2013, 07:26 PM

The team have updated the ftp://ftp.slackware.com/pub/slackwar...EADME_RAID.TXT in current to try and cover some of the new raid autodetection stuff. Certainly helpful when starting from scratch and possibly something to be gleaned for those with existing arrays.

The thread given by Richard Cranium really does explain what's going on (and where we got a lot of our current understanding from) but due to the length of the thread and the complexity of the subject it can be hard to digest.

TracyTiger · 07-17-2013, 12:13 AM

Quote:

Originally Posted by mRgOBLIN

The team have updated the ftp://ftp.slackware.com/pub/slackwar...EADME_RAID.TXT in current to try and cover some of the new raid autodetection stuff. Certainly helpful when starting from scratch and possibly something to be gleaned for those with existing arrays.

I guess the term "new" is relative when one considers the age of the Slackware README document.

Or have new RAID auto detection features/functions/fixes been recently implemented?

I quickly read the updated Slackware document and noticed it includes ...

Quote:

Ensure that the partition type is Linux RAID Autodetect (type FD).

The following post by wildwizard from last December suggests that it may not be best to use fd partitions.

Quote:

This is a good time to remind people that the kernel autodetect routine is considered deprecated and may be removed without notice from the kernel and you should not be using that method for RAID assembly, mdadm is the way of the future.

Others have posted (not LQ) that some scenarios where array components are separated and used in other computers can cause serious problems when Linux auto detects the array. An easily preventable situation however.

Understanding that "fd" partitions were going out of style I stopped using them for RAID and now use type "da". I am able to get RAID to work with both partition types. I use mdadm.conf in both cases and do not specifically "turn off" auto detection with any boot parameters.

@mRgOBLIN Is it advisable to continue to use fd type partitions or is it better to use da type partitions? Or was auto detection the simplest way for the Slackware document to continue to describe the steps to getting RAID working? (Although the README doc does describe how to set up mdadm.conf.) Is wildwizard correct and auto detection is deprecated?

Richard Cranium · 07-17-2013, 12:36 AM

In my case...

Code:

# fdisk -l /dev/sda

Disk /dev/sda: 640.1 GB, 640135028736 bytes
255 heads, 63 sectors/track, 77825 cylinders, total 1250263728 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x09f624fd

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *          63      498014      248976   da  Non-FS data
/dev/sda2          498015  1250258624   624880305   fd  Linux raid autodetect
# fdisk -l /dev/sdc

Disk /dev/sdc: 640.1 GB, 640135028736 bytes
255 heads, 63 sectors/track, 77825 cylinders, total 1250263728 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x09f624fc

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *          63      498014      248976   83  Linux
/dev/sdc2          498015  1250258624   624880305   fd  Linux raid autodetect
#

...my sda/sdc partitions are able to auto-assemble with the 3.2.45 kernel. I cannot speak about LILO, since this machine uses GRUB2.

mRgOBLIN · 07-17-2013, 01:37 AM

@Tracy Tiger Yes wildwizard is correct in that fd and kernel auto-detect is not to be relied upon but I've seen nothing to indicate when it will be removed. The README_RAID may well have to be updated if those instructions no longer work with the kernel that current is released with. Certainly sounds like we may need to do some more testing.

It may well be the OP's problem that auto-detect has been deprecated (I haven't actually checked) but as a suggestion... try repacking your initrd.gz with your own mdadm.conf and see if that helps.

meetscott · 07-17-2013, 09:49 AM

Quote:

Originally Posted by mRgOBLIN

@Tracy Tiger Yes wildwizard is correct in that fd and kernel auto-detect is not to be relied upon but I've seen nothing to indicate when it will be removed. The README_RAID may well have to be updated if those instructions no longer work with the kernel that current is released with. Certainly sounds like we may need to do some more testing.

It may well be the OP's problem that auto-detect has been deprecated (I haven't actually checked) but as a suggestion... try repacking your initrd.gz with your own mdadm.conf and see if that helps.

I have regenerated my initrd.gz with different options (who knows? maybe I still missed something), many times :-( I can confirm that my partitions are type fd, they have been for years.

I have made a few notes on the README_RAID doc for myself. My procedure is slightly modified because I'm using LVM. The main difference is when I'm booting from the Slackware DVD for recovery. I load the volume groups, activate them and then bind /proc with /mnt/proc *before* I chroot. I'll also remind people, I'm using 4 disks, RAID 10. I can lay out my steps exactly if it will help on the documentation. It's not that far off. Although, for this portion of it, I'm fine. I can easily get my stuff up and running off the DVD. It's during the boot of the kernel, that we've been discussing here, where the problem lies.

That's gonna suck if I have to backup and rebuild from scratch to switch partition types. I'd do it if that was the right thing to do. But this is the first I've heard of it. I have 4 active systems right now that I use for work. So I loathe to brick anything and impact my clients or my ongoing development. I keep 2, identically configured RAID 10 systems, that sync each other daily. It's not impossible for me to intentionally bring one down for a while. But I'm actively developing several projects for 4 or 5 different organizations right now and my time is at a premium to deliver.

That being said, I'll help where ever I can as time allows.

mRgOBLIN · 07-17-2013, 05:31 PM

Quote:

Originally Posted by meetscott

I have regenerated my initrd.gz with different options (who knows? maybe I still missed something), many times :-(

What I'm suggesting is to copy your (correctly configured) mdadm.conf to /boot/initrd-tree/etc/ and then run mkinitrd again without the -c (or with CLEAR_TREE="0" if you use mkinitrd.conf). This should install your mdadm.conf into your initrd.gz and should (in theory) assemble your array with the correct name.

Quote:

Originally Posted by meetscott

That being said, I'll help where ever I can as time allows.

Much appreciated.

meetscott · 07-17-2013, 11:55 PM

Quote:

Originally Posted by mRgOBLIN

What I'm suggesting is to copy your (correctly configured) mdadm.conf to /boot/initrd-tree/etc/ and then run mkinitrd again without the -c (or with CLEAR_TREE="0" if you use mkinitrd.conf). This should install your mdadm.conf into your initrd.gz and should (in theory) assemble your array with the correct name.

Much appreciated.

I'll give it a shot. I remember having to hack the tree a few years ago in a similar way. It will likely be a week or two before I can get to it.

meetscott · 11-17-2013, 04:10 PM

I wanted to post an update to this tread. My upgrade to the 3.10.17 (because of Slackware 14.1) went flawlessly. No issues at all. I have not looked closely at the dmesg output. I'm grateful everything works the way I would expect. Much thanks to whoever works on this stuff and keeps it generally awesome... kernel guys, Slackware guys, etc.

meetscott · 11-26-2013, 12:20 PM

Another update. What's funny, is I have 2 servers, exactly the same configuration and hardware. Totally identical. The first one upgraded easily, the way I previously described. The second one is a brick. I can't get it to boot for anything. I've rebuilt the initrd.gz image, checked all the RAID settings (mdadm.conf, mdadm -D /dev/md1), partitions, dev directory, lilo settings. All are exactly the same. I don't understand it. The devices are getting identified on the bad machine as md126, md127, etc. instead of md0, and md1.

I'm down to checking to make sure the bios settings are the same. Because software wise, it's exactly the same. It would be great to get to the bottom of why this is happening.

By the way, I checked mdadm.conf on both machines and the UUIDs of the RAID arrays are exactly what they should be for each machine.

meetscott · 11-26-2013, 11:09 PM

The one difference I could find between the 2 identical systems was the BIOS version on the motherboard. So I went ahead and flashed an update on the system which wouldn't boot. It had a slightly older version than the one that would boot. No dice.

I was finally able to boot my server with manual intervention during boot time. When the kernel panics because it can't mount the root file system it asks if you want to fix it. Busybox is running in the initrd image and has enough tools available to do some poking around. It turns out the RAID array is not getting started. Don't have a clue why, but it's not.

I manually started the RAID array, mounted the root partition and exited to let the kernel continue to try to find init on the root partition.

Code:

mdadm -Es /etc/mdadm.conf
mdadm -S -s
mdadm -As

The first command is for good measure. I think what's in the initrd tree is fine for mdadm.conf. I don't think it is necessary to re-examine the array. The second command stops the array, again for good measure. It's already stopped. The third command assembles the array.

Since I'm also using LVM I have to fire those up next because the devices are not getting created in the initrd /dev directory.

Code:

vgscan --mknodes
vgchange -ay
mount /dev/vg2/root /mnt

So I manually create the device nodes. Then I mount root.

Once all this nonsense is done, I'm able to continue to boot normally. I'm thinking I might have to hack on the init scripts in the initrd tree to see if I can get the RAID array to fire up. I believe once it does, the logical volumes will ready themselves and init on the real root partition will be able to be read.

Richard Cranium · 11-27-2013, 11:56 PM

Did you use the /usr/share/mkinitrd/mkinitrd_command_generator.sh script to create your initrd?

meetscott · 11-29-2013, 05:44 PM

Quote:

Originally Posted by Richard Cranium

Did you use the /usr/share/mkinitrd/mkinitrd_command_generator.sh script to create your initrd?

Yes. That script is awesome and it worked fine for the other identical system.

Richard Cranium · 11-30-2013, 09:09 AM

Well, on the non-working system, try manually running the commands that the init script does...

Code:

  if [ -x /sbin/mdadm ]; then
    /sbin/mdadm -E -s >/etc/mdadm.conf
    /sbin/mdadm -S -s
    /sbin/mdadm -A -s
    # This seems to make the kernel see partitions more reliably:
    fdisk -l /dev/md* 1> /dev/null 2> /dev/null
  fi

Richard Cranium · 11-30-2013, 01:47 PM

I'll add that under Slackware 14.1, my raid devices are ignoring their old names and are initializing as /dev/md125, /dev/md126, and /dev/md127. LVM comes up OK, nonetheless.