Raid issues with the last kernel upgrade from 3.2.29 to 3.2.45 on Slackware 14.0
Been a long time since I've asked a question here.
The last kernel release for Slackware hosed my RAID installation. Basically, I had to roll back. 3.2.45 seems to be recognizing my RAID arrays as md127 and md126 when the kernel is loading.

A little info on my system:
- 4 disks
- RAID 1 on the boot partition: /dev/md0, holding the logical volume /dev/vg1/boot (md metadata version 0.90)
- RAID 10 on the rest: /dev/md1, holding /dev/vg2/root, /dev/vg2/swap, and /dev/vg2/home (md metadata version 1.2)
- I'm using the generic kernel.

After the upgrade, mkinitrd, and lilo reinstall I get this message:
Code:
mount: mounting /dev/vg2/root on /mnt failed: No such device
Any thoughts, or have others run into the same problems? I've searched around quite a bit and tried quite a few things, but it looks like this kernel upgrade is a "no go" for software RAID devices being recognized and used in the same way. |
Hmm, I had a similar issue with one of my RAID partitions not showing up in -current, and I had assumed it was related to the mkinitrd changes that went in.
I did however resolve the problem by ensuring that all RAID partitions are listed in /etc/mdadm.conf before creating the initrd. That may or may not help as I don't know if the 3.2 series has been getting the same RAID code updates as the 3.9 series. |
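For reference, a rough sketch of what "listing all RAID partitions in /etc/mdadm.conf before creating the initrd" can look like; review the generated lines by hand before trusting them:
Code:
# Append ARRAY lines for every array mdadm can find, then inspect the result
mdadm -E -s >> /etc/mdadm.conf
cat /etc/mdadm.conf
# Then rebuild the initrd so the config is included; the helper script below
# prints a mkinitrd command line suited to the running system
/usr/share/mkinitrd/mkinitrd_command_generator.sh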
Just a point of information ....
I'm successfully running Slack64 14.0 with the 3.2.45 kernel with a fully encrypted (except /boot) RAID1/RAID10 setup very similar to yours. However, I'm not using LVM.

I use UUIDs in /etc/fstab, and like wildwizard, /etc/mdadm.conf defines the arrays, again with (different) UUIDs. I've had RAID component identification problems in the past when I didn't use UUIDs, so now I always build RAID systems using UUIDs for configuration information. It boots up as expected without difficulty.

The challenging part was getting the UUIDs correct. Every query-type command seems to produce different UUIDs. Through trial and error I figured out which ones to use. You may want to look carefully at mkinitrd, lilo, fstab, and mdadm.conf before giving up on the 3.2.45 kernel.

EDIT: ... and get rid of "root=" in the lilo configuration image section. |
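On the "every query-type command seems to produce different UUIDs" point: the mdadm array UUID and the filesystem UUID are different things, which is a common source of that confusion. A minimal sketch, with /dev/md0 as an example device:
Code:
# mdadm's array UUID -- this is the one that belongs in /etc/mdadm.conf
mdadm --detail /dev/md0 | grep UUID
mdadm -E -s                          # prints ready-made ARRAY lines

# the filesystem UUID -- this is the one that belongs in /etc/fstab (UUID=...)
blkid /dev/md0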
Quote:
Code:
ARRAY /dev/md0 UUID=994ea4ee:2e64f4d5:208cdb8d:9e23b04b |
Quote:
I'm a little confused on the last thing. "Get rid of 'root=' in the lilo configuration image section"??? How on earth will it know which partition to use for root? I have 4 partitions that could be used... I'm assuming it doesn't know that swap is swap.

Here's my lilo configuration without the commented-out parts. Keep in mind that this is my configuration with 3.2.29. The configuration is the same for 3.2.45 with those particular values changed. Code:
append=" vt.default_utf8=0" |
Quote:
Quote:
Quote:
Troubleshooting based on my ignorance follows ... You may want to force a failure by changing a UUID in mdadm.conf just to see that the information there is actually being utilized and that the UUIDs there are correct when the new kernel is running. |
Quote:
Code:
ARRAY /dev/md1 metadata=0.90 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx |
Quote:
Regarding your previous post... I've never tried *not* specifying the root device in my lilo.conf. I've been running this way for years and never had a problem. It is also specified in the Slackware documentation Alien Bob wrote. That doesn't make it right, and perhaps it is worth trying. I don't know why this is suddenly becoming an issue.

I don't reinstall from scratch unless I must for some new system. I always go through the upgrade process. These LVM RAID 10 configurations have been flawless through these upgrades. I even upgrade one machine remotely as it is colocated. This has also gone well for the last 7 years and I don't know how many upgrades :-)

Incidentally, I have a laptop which is *not* RAID but uses LVM and encryption. The upgrade was okay there. Given the variety of configurations of systems (5 at the moment) I have running Slackware, I'm left with the impression that this is only an issue with RAID and the new 3.2.45 kernel. |
Quote:
Quote:
Quote:
Perhaps other LQ members have better insight into your issue than I, and would like to respond. |
Everything running fine here.
Code:
[root@nestor:~] # uname -r |
Quote:
|
I had no issues upgrading from 3.2.29 to 3.2.45.
My boot partition is on /dev/md0. I do use grub2 instead of lilo and all of my raid arrays auto-assemble instead of being explicitly defined in /etc/mdadm.conf. Code:
root@darkstar:~# cat /proc/mdstat |
Well, I figured I'd let everyone know I tried this... that is, not specifying the root in the lilo.conf image section. I get the exact same results. So, I'm completely at a loss as to why I appear to be the only person who is seeing this behavior.
I give up on this one. I'm just going to be happy running the older kernel. It takes too much time to experiment around with this sort of thing and I have several other projects I need to attend to. |
Quote:
Code:
[ 4.664805] udevd[1056]: starting version 182 |
Richard Cranium, thanks for taking the time to put that output together so nicely. I saw the same things, only mine doesn't figure it out later. I guess they've put this auto-detection into the kernel now. I would imagine I'm going to have to address it some day. I have a few other projects I'm working on at the moment, so I don't really have time to burn on figuring this out for now.
Just be grateful it is working :-) |
Quote:
I had run into this same problem back with Slackware 13.37. I do remember spending several hours on resolving it, but I don't remember the exact steps. I vaguely remember that I would boot into the huge kernel and then shut down (stop and remove) the wrongly "autodetected" RAIDs and reassemble them again, then save the config into mdadm.conf and build the initrd.

I also remember that using the mdadm switch that forces RAID to use whole drives instead of partitions didn't work, and the kernel autodetection would still keep kicking in, so don't go down that path. I think stopping the RAID and reassembling it was the key to making it work. Sorry I can't be more specific, as it's been over a year.

On another note, I too have a perfectly working RAID 1 with the updated kernel 3.2.45. So don't give up on it too quickly, and use mdadm to fix it. Do have a good backup of your data though, just in case you wipe out some partitions. |
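A rough sketch of the stop/reassemble/save sequence described above; all device names are examples, and you should double-check which disks belong to which array before running anything like this:
Code:
# Stop the wrongly auto-detected arrays
mdadm --stop /dev/md126
mdadm --stop /dev/md127

# Reassemble them under the names you actually want
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm --assemble /dev/md1 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2

# Save the working layout, then rebuild the initrd and reinstall lilo
mdadm -E -s > /etc/mdadm.conf
/usr/share/mkinitrd/mkinitrd_command_generator.sh   # prints a suitable mkinitrd command
lilo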
The team have updated the ftp://ftp.slackware.com/pub/slackwar...EADME_RAID.TXT in current to try and cover some of the new raid autodetection stuff. Certainly helpful when starting from scratch and possibly something to be gleaned for those with existing arrays.
The thread given by Richard Cranium really does explain what's going on (and where we got a lot of our current understanding from) but due to the length of the thread and the complexity of the subject it can be hard to digest. |
Quote:
I quickly read the updated Slackware document and noticed it includes ... Quote:
Quote:
Understanding that "fd" partitions were going out of style I stopped using them for RAID and now use type "da". I am able to get RAID to work with both partition types. I use mdadm.conf in both cases and do not specifically "turn off" auto detection with any boot parameters. @mRgOBLIN Is it advisable to continue to use fd type partitions or is it better to use da type partitions? Or was auto detection the simplest way for the Slackware document to continue to describe the steps to getting RAID working? (Although the README doc does describe how to set up mdadm.conf.) Is wildwizard correct and auto detection is deprecated? |
In my case...
Code:
# fdisk -l /dev/sda |
@Tracy Tiger Yes wildwizard is correct in that fd and kernel auto-detect is not to be relied upon but I've seen nothing to indicate when it will be removed. The README_RAID may well have to be updated if those instructions no longer work with the kernel that current is released with. Certainly sounds like we may need to do some more testing.
It may well be the OP's problem that auto-detect has been deprecated (I haven't actually checked) but as a suggestion... try repacking your initrd.gz with your own mdadm.conf and see if that helps. |
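A minimal sketch of repacking the initrd with your own mdadm.conf, assuming the default Slackware initrd tree under /boot/initrd-tree:
Code:
mdadm -E -s > /etc/mdadm.conf               # or hand-edit the ARRAY lines yourself
cp /etc/mdadm.conf /boot/initrd-tree/etc/   # put your config inside the initrd tree
mkinitrd                                     # no options: repack the existing tree into /boot/initrd.gz
lilo                                         # reinstall the boot loader so it picks up the new image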
Quote:
I have made a few notes on the README_RAID doc for myself. My procedure is slightly modified because I'm using LVM. The main difference is when I'm booting from the Slackware DVD for recovery: I load the volume groups, activate them, and then bind /proc with /mnt/proc *before* I chroot. I'll also remind people, I'm using 4 disks, RAID 10. I can lay out my steps exactly if it will help with the documentation; it's not that far off. Although, for this portion of it, I'm fine. I can easily get my stuff up and running off the DVD. It's during the boot of the kernel, which we've been discussing here, where the problem lies.

That's gonna suck if I have to back up and rebuild from scratch to switch partition types. I'd do it if that were the right thing to do, but this is the first I've heard of it. I have 4 active systems right now that I use for work, so I'm loath to brick anything and impact my clients or my ongoing development. I keep 2 identically configured RAID 10 systems that sync with each other daily. It's not impossible for me to intentionally bring one down for a while, but I'm actively developing several projects for 4 or 5 different organizations right now and my time is at a premium to deliver. That being said, I'll help wherever I can as time allows. |
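Since the recovery-from-DVD procedure keeps coming up, here is a sketch of the LVM variant described above; the device and volume group names follow the ones from earlier in the thread and may differ on your system:
Code:
# After booting the Slackware install DVD:
mdadm -E -s > /etc/mdadm.conf    # let mdadm find the arrays
mdadm -A -s                      # assemble them
vgscan --mknodes                 # create the LVM device nodes
vgchange -ay                     # activate the volume groups
mount /dev/vg2/root /mnt
mount /dev/vg1/boot /mnt/boot
mount -o bind /proc /mnt/proc    # bind /proc *before* the chroot, as noted above
chroot /mnt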
Quote:
Quote:
|
Quote:
|
I wanted to post an update to this thread. My upgrade to 3.10.17 (because of Slackware 14.1) went flawlessly. No issues at all. I have not looked closely at the dmesg output. I'm grateful everything works the way I would expect. Much thanks to whoever works on this stuff and keeps it generally awesome... kernel guys, Slackware guys, etc.
|
Another update. What's funny is that I have 2 servers with exactly the same configuration and hardware. Totally identical. The first one upgraded easily, the way I previously described. The second one is a brick. I can't get it to boot for anything. I've rebuilt the initrd.gz image, checked all the RAID settings (mdadm.conf, mdadm -D /dev/md1), partitions, dev directory, lilo settings. All are exactly the same. I don't understand it. The devices are getting identified on the bad machine as md126, md127, etc. instead of md0 and md1.
I'm down to checking to make sure the BIOS settings are the same, because software-wise it's exactly the same. It would be great to get to the bottom of why this is happening. By the way, I checked mdadm.conf on both machines and the UUIDs of the RAID arrays are exactly what they should be for each machine. |
The one difference I could find between the 2 identical systems was the BIOS version on the motherboard. So I went ahead and flashed an update on the system which wouldn't boot. It had a slightly older version than the one that would boot. No dice.
I was finally able to boot my server with manual intervention during boot time. When the kernel panics because it can't mount the root file system it asks if you want to fix it. Busybox is running in the initrd image and has enough tools available to do some poking around. It turns out the RAID array is not getting started. Don't have a clue why, but it's not. I manually started the RAID array, mounted the root partition and exited to let the kernel continue to try to find init on the root partition. Code:
mdadm -Es /etc/mdadm.conf
Since I'm also using LVM, I have to fire those up next because the devices are not getting created in the initrd /dev directory. Code:
vgscan --mknodes
Once all this nonsense is done, I'm able to continue to boot normally. I'm thinking I might have to hack on the init scripts in the initrd tree to see if I can get the RAID array to fire up. I believe once it does, the logical volumes will ready themselves and init on the real root partition will be able to be read. |
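The commands above are truncated in this post; here is a fuller sketch of the kind of manual intervention described, assuming you land in the BusyBox shell inside the Slackware initrd:
Code:
mdadm -E -s > /etc/mdadm.conf   # scan for array components and write a config
mdadm -A -s                     # assemble all arrays from that config
vgscan --mknodes                # create the LVM device nodes under /dev
vgchange -ay                    # activate the volume groups
mount /dev/vg2/root /mnt        # mount the real root filesystem
exit                            # hand control back so the boot can continue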
Did you use the /usr/share/mkinitrd/mkinitrd_command_generator.sh script to create your initrd?
|
Quote:
|
Well, on the non-working system, try manually running the commands that the init script does...
Code:
if [ -x /sbin/mdadm ]; then |
I'll add that under Slackware 14.1, my raid devices are ignoring their old names and are initializing as /dev/md125, /dev/md126, and /dev/md127. LVM comes up OK, nonetheless.
|
Quote:
I will continue to troubleshoot it, but I'm completely baffled as to why it won't come up on its own. As a drastic measure, I may try rebuilding everything from scratch. I hate to do this because it's so time consuming, but that may be the only way to get it out of this weird state it seems to be stuck in. |
Quote:
|
Quote:
In Slackware 14.1, the same arrays initialize as the high numbers but never change their names to the names I had used to create them. |
Try (as root)...
Code:
rm /boot/initrd-tree/etc/mdadm.conf
Why? Well, on my machine, the initial tree created by /usr/share/mkinitrd/mkinitrd_command_generator.sh copies over the default /etc/mdadm.conf file into the initrd-tree. The default mdadm.conf contains only comments, but the init code in the initrd contains... Code:
if [ -x /sbin/mdadm ]; then |
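The init snippet above is truncated; based on what the thread describes, the relevant logic in the initrd's init script is roughly the following -- treat it as a paraphrase rather than the exact Slackware code:
Code:
if [ -x /sbin/mdadm ]; then
  # If the initrd ships an /etc/mdadm.conf, it is treated as authoritative.
  # If not, generate one on the fly and assemble from it:
  if [ ! -r /etc/mdadm.conf ]; then
    /sbin/mdadm -E -s > /etc/mdadm.conf   # scan components, write ARRAY lines
    /sbin/mdadm -S -s                     # stop anything partially assembled
    /sbin/mdadm -A -s                     # assemble everything from the new config
  fi
fi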
Quote:
I'm almost to the point of modifying init so it does what I want it to do. |
When you say "commenting out the /etc/mdadm.conf in the initrd-tree", did you mean "remove the file etc/mdadm.conf" from the initrd-tree?
Hmm. When you get this broken system running, what is the output of Code:
pvs -v |
Quote:
Output of pvs -v: Code:
Scanning for physical volume names |
I'm too much of a n00b to be in this conversation (and don't use initrd or LVM), but I might throw this in here. One of my lowly setups is a 2-disc setup that has a plain JFS-formatted /boot partition. Therefore, the kernels are loaded from the plain partition, the kernel auto-detects my RAID-0 /dev/md0, and everything else is read off of /dev/md0. I have something in my kernel cmdline like "md=0,/dev/sda1,/dev/sdb1 root=/dev/md0" or something to that effect. I know no better, so I went completely off of a document like Documentation/md.txt, but from my particular kernel source.
What it seems like is that for the non-LVM md partitions I use, the partition type should be fd00 if you want them autodetected. Despite the mention that autodetection is for DOS/MBR-style partitions only, they work with GPT partitions as well. If you don't want them autodetected, don't mark them as fd00, and let mdadm take care of it.

As for the numbers, that took some jiggling. For v0 metadata, you can somehow assemble the RAID as "0" and pass the flag --update=super-minor to mdadm so that the preferred minor defaults to 0. This trick does not work with v1 arrays. To see the current preferred minor, use `mdadm --detail assembled_raid`. I've forgotten what I did to get the v1 arrays to budge the minor. I either assembled or rebuilt them as "18" and "19", respectively, instead of their normal names, then have it set up like this:

ARRAY pretty_name_for_dev_md UUID=1337:f00:ba4:babab0033

Again, I'm learning this for only the second time (the first time wasn't much fun) on Linux, and I'm doing this for new installs that I could reinstall or restore from backup. I also didn't get the feeling that the --auto=md{x} flag worked all the time. Zero confidence, but I got my particular setup up and running. YMMV. |
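A sketch of the super-minor trick mentioned above, with example device and component names; it only applies to arrays with v0.90 metadata:
Code:
mdadm --stop /dev/md127                                              # stop the wrongly numbered array
mdadm --assemble /dev/md0 --update=super-minor /dev/sda1 /dev/sdb1   # rewrite the preferred minor while assembling
mdadm --examine /dev/sda1 | grep -i "preferred minor"                # confirm the superblock now records minor 0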
Quote:
Please try with either removing or renaming the mdadm.conf file in the initrd. That should force the init script to run the commands... Code:
/sbin/mdadm -E -s >/etc/mdadm.conf |
Quote:
Thanks! scott |
Quote:
scott |
This is a miracle. I renamed mdadm.conf to mdadm.conf.bak in the initrd-tree. Then I re-ran mkinitrd with no parameters. Re-ran lilo. Then I rebooted.
Voila! Everything came up. This is the first time I haven't had to manually intervene during boot to get this machine to come up since my upgrade from Slackware 14.0 (with the 3.2.29 kernel) to Slackware 14.1. Richard Cranium, you are the man! Thanks! |
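For anyone landing here from a search, the fix described above boils down to roughly this, assuming the default /boot/initrd-tree location:
Code:
mv /boot/initrd-tree/etc/mdadm.conf /boot/initrd-tree/etc/mdadm.conf.bak
mkinitrd     # no parameters: repack the existing /boot/initrd-tree into /boot/initrd.gz
lilo         # reinstall the boot loader so it uses the new initrd
reboot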
Thank you all for this valuable information; it helped me fix my problem as well. I just removed the mdadm.conf, rebuilt the initrd.gz, and bam, I was up and running on my new RAID setup, and it is smoking fast!
|
I'll just mention this bit:
The safest way to ensure that your software RAID arrays are set up correctly would be to run the command Code:
/sbin/mdadm -E -s >/etc/mdadm.conf
So, the two approaches have pros and cons:
|