
meetscott 07-05-2013 02:42 AM

Raid issues with the last kernel upgrade from 3.2.29 to 3.2.45 on Slackware 14.0
 
Been a long time since I've asked a question here.

The last kernel release for Slackware hosed my Raid installation. Basically, I had to roll back. 3.2.45 seems to be recognizing my arrays as md127 and md126 when the kernel is loading.

A little info on my system...
4 disks

Raid 1 on the boot partition, /dev/md0 on a logical volume /dev/vg1/boot
md version 0.90

Raid 10 on the rest /dev/md1, /dev/vg2/root, /dev/vg2/swap, /dev/vg2/home
md version 1.2

I'm using the generic kernel. After the upgrade, rerunning mkinitrd, and reinstalling lilo, I get this message:

Code:

mount: mounting /dev/vg2/root on /mnt failed: No such device
ERROR: No /sbin/init found on rootdev (or not mounted). Trouble ahead. 
You can try to fix it. Type 'exit' when things are done.

At this point nothing brings it alive. I've tried booting both the huge and generic kernels. I have to boot from the Slackware install DVD, remove all kernel patch packages for 3.2.45 and install the 3.2.29 packages again. Rerun mkinitrd and reinstall lilo and I have a working system again.

Any thoughts, or have others run into the same problems? I've searched around quite a bit and tried quite a few things, but it looks like this kernel upgrade is a "no go" for software raid devices being recognized and used in the same way.

wildwizard 07-05-2013 03:14 AM

Hmm I had a similar issue with one of my RAID partitions not showing up in -current and I had assumed it was related to the mkinitrd changes that went in.

I did however resolve the problem by ensuring that all RAID partitions are listed in /etc/mdadm.conf before creating the initrd.

That may or may not help as I don't know if the 3.2 series has been getting the same RAID code updates as the 3.9 series.

TracyTiger 07-05-2013 04:58 AM

Just a point of information ....

I'm successfully running Slack64 14.0 with the 3.2.45 kernel with a fully encrypted (except /boot) RAID1/RAID10 setup very similar to yours. However I'm not using LVM.

I use UUIDs in /etc/fstab, and like wildwizard, /etc/mdadm.conf defines the arrays, again with (different) UUIDs. I've had RAID component identification problems in the past when I didn't use UUID so now I always build RAID systems using UUID for configuration information.

It boots up as expected without difficulty. The challenging part was getting the UUIDs correct. Every query-type command seems to produce different UUIDs. Through trial and error I figured out which ones to use.
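
For what it's worth, a rough sketch of which UUID comes from where (device names here are just examples):

Code:

# Array UUID -- the one that belongs in /etc/mdadm.conf:
mdadm --detail /dev/md0 | grep UUID
# or let mdadm write the ARRAY lines for you:
mdadm --detail --scan

# Filesystem UUID -- the one that belongs in /etc/fstab:
blkid /dev/md0

The md superblock UUID and the filesystem UUID are different values, which is part of why every query seems to return something different.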

You may want to look carefully at mkinitrd, lilo, fstab, & mdadm.conf before giving up on the 3.2.45 kernel.

EDIT: ... and get rid of "root=" in the lilo configuration image section.

meetscott 07-05-2013 11:21 AM

Quote:

Originally Posted by wildwizard (Post 4984558)
Hmm I had a similar issue with one of my RAID partitions not showing up in -current and I had assumed it was related to the mkinitrd changes that went in.

I did however resolve the problem by ensuring that all RAID partitions are listed in /etc/mdadm.conf before creating the initrd.

That may or may not help as I don't know if the 3.2 series has been getting the same RAID code updates as the 3.9 series.

Yes, I have those listed in my mdadm.conf. I did check that. I forgot to say so.

Code:

ARRAY /dev/md0 UUID=994ea4ee:2e64f4d5:208cdb8d:9e23b04b
ARRAY /dev/md/1 UUID=d79b38ac:2b0c654d:a16d0a19:babaf044

I've tried a few settings in there but have gotten nowhere. The device is showing up as /dev/md/1. I've tried that and /dev/md1, which is what it originally was.

meetscott 07-05-2013 12:03 PM

Quote:

Originally Posted by Tracy Tiger (Post 4984596)
Just a point of information ....

I'm successfully running Slack64 14.0 with the 3.2.45 kernel with a fully encrypted (except /boot) RAID1/RAID10 setup very similar to yours. However I'm not using LVM.

I use UUIDs in /etc/fstab, and like wildwizard, /etc/mdadm.conf defines the arrays, again with (different) UUIDs. I've had RAID component identification problems in the past when I didn't use UUID so now I always build RAID systems using UUID for configuration information.

It boots up as expected without difficulty. The challenging part was getting the UUIDs correct. Every query-type command seems to produce different UUIDs. Through trial and error I figured out which ones to use.

You may want to look carefully at mkinitrd, lilo, fstab, & mdadm.conf before giving up on the 3.2.45 kernel.

EDIT: ... and get rid of "root=" in the lilo configuration image section.

Interesting. You are not using LVM and you are encrypting. I encrypt my laptop drive and I use LVM on that. The upgrade went okay on that one. Weird that the UUIDs need to be tweaked now. I saw them being used and didn't think anything of it. They match; what more could the system be looking for?

I'm a little confused on the last thing. "Get rid of 'root=' in the lilo configuration image section"??? How on earth will it know which partition to use for root? I have 4 partitions that could be used... I'm assuming it doesn't know that swap is swap.

Here's my lilo configuration without the commented-out parts. Keep in mind that this is my configuration with 3.2.29. The configuration is the same for 3.2.45 with those particular values changed.
Code:

append=" vt.default_utf8=0"
boot = /dev/md0
raid-extra-boot = mbr-only

bitmap = /boot/slack.bmp
bmp-colors = 255,0,255,0,255,0
bmp-table = 60,6,1,16
bmp-timer = 65,27,0,255
prompt
timeout = 100
change-rules
reset

vga = 773

image = /boot/vmlinuz-generic-3.2.29
  initrd = /boot/initrd.gz
  root = /dev/vg2/root
  label = 3.2.29
  read-only


TracyTiger 07-05-2013 01:31 PM

Quote:

Originally Posted by meetscott (Post 4984797)
Interesting. You are not using LVM and you are encrypting.

I've used RAID/Encryption both with and without LVM. Both worked. I don't have any current systems running LVM for me to check at the moment.

Quote:

I'm a little confused on the last thing. "Get rid of 'root=' in the lilo configuration image section"??? How on earth will it know which partition to use for root? I have 4 partitions that could be used... I'm assuming it doesn't know that swap is swap.
I believe "root=" isn't needed with initrd because initrd already has the information about which partition to use for root. See the thread here https://www.linuxquestions.org/quest...6/#post4801795 for information on how using "root=" in the lilo image section causes problems.

Quote:

Here's my lilo configuration without the the commented out parts. Keep in mind that this is my configuration with 3.2.29. The configuration is the same for 3.2.45 with those particular values changed.
My particular problems in the linked post occurred when I upgraded a running system. I don't know why an upgrade causes issues.

Troubleshooting based on my ignorance follows ...
You may want to force a failure by changing a UUID in mdadm.conf just to see that the information there is actually being utilized and that the UUIDs there are correct when the new kernel is running.
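
A quick way to cross-check (using your array names from above; under 3.2.45 they may be showing up as /dev/md126 and /dev/md127 instead):

Code:

# What the running arrays actually report:
mdadm --detail /dev/md0 | grep -i uuid
mdadm --detail /dev/md1 | grep -i uuid
# What the config file claims:
grep ^ARRAY /etc/mdadm.conf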

TracyTiger 07-05-2013 03:21 PM

Quote:

Originally Posted by meetscott (Post 4984772)
Yes, I have those listed in my mdadm.conf. I did check that. I forgot to say so.

Code:

ARRAY /dev/md0 UUID=994ea4ee:2e64f4d5:208cdb8d:9e23b04b
ARRAY /dev/md/1 UUID=d79b38ac:2b0c654d:a16d0a19:babaf044

I've tried a few settings in there but have gotten nowhere. The device is showing up as /dev/md/1. I've tried that and /dev/md1, which is what it originally was.

Note that my mdadm.conf file looks more like this:

Code:

ARRAY /dev/md1 metadata=0.90 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md2 metadata=1.2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md3 metadata=1.2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md5 metadata=1.2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md6 metadata=1.2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx

Maybe the missing metadata is important to the new kernel? Perhaps it defaults to version 1.2 so version 0.90 needs to be made explicit?
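
If it helps, mdadm can generate those ARRAY lines itself, usually with the metadata= field included, so you can compare them against what is currently in your file:

Code:

# Print ARRAY lines for every array found on the system:
mdadm --examine --scan
# same thing, short form:
mdadm -E -s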

meetscott 07-07-2013 02:41 AM

Quote:

Originally Posted by Tracy Tiger (Post 4984889)
Note that my mdadm.conf file looks more like this:

Code:

ARRAY /dev/md1 metadata=0.90 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md2 metadata=1.2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md3 metadata=1.2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md5 metadata=1.2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md6 metadata=1.2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx

Maybe the missing metadata is important to the new kernel? Perhaps it defaults to version 1.2 so version 0.90 needs to be made explicit?

Thanks for the reply, I've tried both, with and without the metadata.

Regarding your previous post...
I've never tried *not* specifying the root device in my lilo.conf. I've been running this way for years and never had a problem. It is also specified in the Slackware documentation Alien Bob wrote. That doesn't make it right and perhaps it is worth trying.

I don't know why this is suddenly becoming an issue. I don't reinstall from scratch unless I must for some new system. I always go through the upgrade process. These LVM Raid 10 configurations have been flawless through these upgrades. I even upgrade one machine remotely as it is colocated. This has also gone well for the last 7 years and I don't know how many upgrades :-)

Incidentally, I have a laptop, which is *not* raid but uses LVM and encryption. The upgrade was okay there. Given the variety of configurations of systems (5 at the moment) I have running Slackware, I'm left with the impression that this is only an issue with Raid and the new 3.2.45 kernel.

TracyTiger 07-07-2013 01:34 PM

Quote:

Originally Posted by meetscott (Post 4984546)
3.2.45 seems to be recognizing my arrays as md127 and md126 when the kernel is loading.

Whenever I don't use default values and I see default values appearing on the screen and in logs, I usually suspect that my configuration setup isn't working (/etc/xxxx.conf) or isn't being referenced as I intended.

Quote:

I've never tried *not* specifying the root device in my lilo.conf. I've been running this way for years and never had a problem. It is also specified in the Slackware documentation Alien Bob wrote. That doesn't make it right and perhaps it is worth trying.
As you probably read in the link to the previous LQ thread, it was Alien Bob who suggested I drop specifying root in the lilo.conf image section.

Quote:

I don't know why this is suddenly becoming an issue.
RAID using initrd and specifying root in lilo worked well for me for a long time also....until it didn't. :)

Perhaps other LQ members have better insight into your issue than I, and would like to respond.

kikinovak 07-07-2013 01:59 PM

Everything running fine here.

Code:

[root@nestor:~] # uname -r
3.2.45
[root@nestor:~] # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md3 : active raid5 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      729317376 blocks level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
     
md2 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      995904 blocks [4/4] [UUUU]
     
md1 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      96256 blocks [4/4] [UUUU]


meetscott 07-07-2013 02:51 PM

Quote:

Originally Posted by Tracy Tiger (Post 4985774)
Whenever I don't use default values and I see default values appearing on the screen and in logs, I usually suspect that my configuration setup isn't working (/etc/xxxx.conf) or isn't being referenced as I intended.



As you probably read in the link to the previous LQ thread, it was Alien Bob who suggested I drop specifying root in the lilo.conf image section.



RAID using initrd and specifying root in lilo worked well for me for a long time also....until it didn't. :)

Perhaps other LQ members have better insight into your issue than I, and would like to respond.

I didn't read the link before, but I have now. I'll have to give it a try. It seems that might be the key.

Richard Cranium 07-07-2013 06:33 PM

I had no issues upgrading from 3.2.29 to 3.2.45.

My boot partition is on /dev/md0. I do use grub2 instead of lilo and all of my raid arrays auto-assemble instead of being explicitly defined in /etc/mdadm.conf.

Code:

root@darkstar:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md1 : active raid1 sde3[0] sdf3[1]
      142716800 blocks super 1.2 [2/2] [UU]
     
md0 : active raid1 sde2[0] sdf2[1]
      523968 blocks super 1.2 [2/2] [UU]
     
md3 : active raid1 sdc2[0] sda2[1]
      624880192 blocks [2/2] [UU]
     
unused devices: <none>
# pvs
  PV        VG      Fmt  Attr PSize  PFree 
  /dev/md1  mdgroup lvm2 a--  136.09g 136.09g
  /dev/md3  mdgroup lvm2 a--  595.91g  86.62g
  /dev/sdd  testvg  lvm2 a--  111.79g  11.79g
root@darkstar:~#


meetscott 07-14-2013 07:17 PM

Well, I figured I'd let everyone know I tried this... that is, not specifying the root in the lilo.conf image section. I get the exact same results. So, I'm completely at a loss as to why I appear to be the only person who is seeing this behavior.

I give up on this one. I'm just going to be happy running the older kernel. It takes too much time to experiment around with this sort of thing and I have several other projects I need to attend to.

Richard Cranium 07-15-2013 10:19 PM

Quote:

Originally Posted by meetscott (Post 4984546)
Been a long time since I've asked a question here.

The last kernel release for Slackware hosed my Raid installation. Basically, I had to roll back. 3.2.45 seems to be recognizing my arrays as md127 and md126 when the kernel is loading.

I bothered to look at my dmesg output; it appears that my system also starts using md125, md126 and md127 but figures out later that isn't correct (some messages removed for clarity)...

Code:

[    4.664805] udevd[1056]: starting version 182
[    4.876982] md: bind<sda2>
[    4.882388] md: bind<sdc2>
[    4.883399] bio: create slab <bio-1> at 1
[    4.883557] md/raid1:md127: active with 2 out of 2 mirrors
[    4.883674] md127: detected capacity change from 0 to 639877316608
[    4.890199]  md127: unknown partition table

[  23.896644]  sde: sde1 sde2 sde3
[  23.902268] sd 8:0:2:0: [sde] Attached SCSI disk
[  24.036540] md: bind<sde2>
[  24.039801] md: bind<sde3>
[  24.126990]  sdf: sdf1 sdf2 sdf3
[  24.132618] sd 8:0:3:0: [sdf] Attached SCSI disk

[  24.264754] md: bind<sdf3>

[  24.266127] md/raid1:md125: active with 2 out of 2 mirrors
[  24.266242] md125: detected capacity change from 0 to 146142003200

[  24.274335] md: bind<sdf2>
[  24.275479] md/raid1:md126: active with 2 out of 2 mirrors
[  24.275593] md126: detected capacity change from 0 to 536543232
[  24.288237]  md126: unknown partition table
[  24.293884]  md125: unknown partition table

[  25.575361] md125: detected capacity change from 146142003200 to 0
[  25.575466] md: md125 stopped.
[  25.575566] md: unbind<sdf3>
[  25.589043] md: export_rdev(sdf3)
[  25.589161] md: unbind<sde3>
[  25.605083] md: export_rdev(sde3)
[  25.605667] md126: detected capacity change from 536543232 to 0
[  25.605771] md: md126 stopped.
[  25.605871] md: unbind<sdf2>
[  25.610029] md: export_rdev(sdf2)
[  25.610136] md: unbind<sde2>
[  25.615016] md: export_rdev(sde2)
[  25.615537] md127: detected capacity change from 639877316608 to 0
[  25.615641] md: md127 stopped.
[  25.615741] md: unbind<sdc2>
[  25.620051] md: export_rdev(sdc2)
[  25.620156] md: unbind<sda2>
[  25.624083] md: export_rdev(sda2)
[  25.772347] md: md3 stopped.
[  25.772979] md: bind<sda2>
[  25.773190] md: bind<sdc2>
[  25.774071] md/raid1:md3: active with 2 out of 2 mirrors
[  25.774188] md3: detected capacity change from 0 to 639877316608
[  25.781286]  md3: unknown partition table
[  25.794571] md: md0 stopped.
[  25.795353] md: bind<sdf2>
[  25.795611] md: bind<sde2>
[  25.796365] md/raid1:md0: active with 2 out of 2 mirrors
[  25.796482] md0: detected capacity change from 0 to 536543232
[  25.808178]  md0: unknown partition table
[  26.014044] md: md1 stopped.
[  26.020403] md: bind<sdf3>
[  26.020649] md: bind<sde3>
[  26.021428] md/raid1:md1: active with 2 out of 2 mirrors
[  26.021544] md1: detected capacity change from 0 to 146142003200
[  26.071258]  md1: unknown partition table

I doubt any of that helps, but if you ever get around to looking at this again, you might want to wade through https://bugzilla.redhat.com/show_bug.cgi?id=606481 which contained more than I ever wanted to know about the subject. (Hell, now I'm not sure why my setup works! :confused: )

meetscott 07-16-2013 10:34 AM

Richard Cranium, thanks for taking the time to put that output together so nicely. I saw the same things, only mine doesn't figure it out later. I guess they've put this auto-detection into the kernel now. I would imagine I'm going to have to address it some day. I have a few other projects I'm working on at the moment, so I don't really have time to burn on figuring this out for now.

Just be grateful it is working :-)

Slackovado 07-16-2013 02:30 PM

Quote:

Originally Posted by meetscott (Post 4990320)
Well, I figured I'd let everyone know I tried this... that is, not specifying the root in the lilo.conf image section. I get the exact same results. So, I'm completely at a loss as to why I appear to be the only person who is seeing this behavior.

I give up on this one. I'm just going to be happy running the older kernel. It takes too much time to experiment around with this sort of thing and I have several other projects I need to attend to.

This issue has nothing to do with the Slackware 14 3.2.45 kernel.
I had run into this same problem back with Slackware 13.37.
I do remember spending several hours on resolving it but I don't remember the exact steps.
I vaguely remember that I would boot into the huge kernel and then shut down (stop and remove) the wrongly "autodetected" raids and reassemble them again, then save the config into mdadm.conf and build the initrd.
I also remember that using the mdadm switch that forces raid to not use partitions but whole drives didn't work and the kernel autodetection would still keep kicking in, so don't go down that path.
I think stopping the raid and reassembling it was the key to making it work.
Sorry I can't be more specific as it's been over a year ago.
On another note, I too have a perfectly working raid 1 with the updated kernel 3.2.45.
So don't give up on it too quickly; use mdadm to fix it.
Do have a good backup of your data though, just in case you wipe out some partitions.

mRgOBLIN 07-16-2013 07:26 PM

The team have updated the ftp://ftp.slackware.com/pub/slackwar...EADME_RAID.TXT in current to try and cover some of the new raid autodetection stuff. Certainly helpful when starting from scratch and possibly something to be gleaned for those with existing arrays.

The thread given by Richard Cranium really does explain what's going on (and where we got a lot of our current understanding from) but due to the length of the thread and the complexity of the subject it can be hard to digest.

TracyTiger 07-17-2013 12:13 AM

Quote:

Originally Posted by mRgOBLIN (Post 4991598)
The team have updated the ftp://ftp.slackware.com/pub/slackwar...EADME_RAID.TXT in current to try and cover some of the new raid autodetection stuff. Certainly helpful when starting from scratch and possibly something to be gleaned for those with existing arrays.

I guess the term "new" is relative when one considers the age of the Slackware README document. :) Or have new RAID auto detection features/functions/fixes been recently implemented?

I quickly read the updated Slackware document and noticed it includes ...
Quote:

Ensure that the partition type is Linux RAID Autodetect (type FD).
The following post by wildwizard from last December suggests that it may not be best to use fd partitions.

Quote:

This is a good time to remind people that the kernel autodetect routine is considered deprecated and may be removed without notice from the kernel and you should not be using that method for RAID assembly, mdadm is the way of the future.
Others have posted (not LQ) that some scenarios where array components are separated and used in other computers can cause serious problems when Linux auto detects the array. An easily preventable situation however.

Understanding that "fd" partitions were going out of style I stopped using them for RAID and now use type "da". I am able to get RAID to work with both partition types. I use mdadm.conf in both cases and do not specifically "turn off" auto detection with any boot parameters.

@mRgOBLIN Is it advisable to continue to use fd type partitions or is it better to use da type partitions? Or was auto detection the simplest way for the Slackware document to continue to describe the steps to getting RAID working? (Although the README doc does describe how to set up mdadm.conf.) Is wildwizard correct and auto detection is deprecated?

Richard Cranium 07-17-2013 12:36 AM

In my case...
Code:

# fdisk -l /dev/sda

Disk /dev/sda: 640.1 GB, 640135028736 bytes
255 heads, 63 sectors/track, 77825 cylinders, total 1250263728 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x09f624fd

  Device Boot      Start        End      Blocks  Id  System
/dev/sda1  *          63      498014      248976  da  Non-FS data
/dev/sda2          498015  1250258624  624880305  fd  Linux raid autodetect
# fdisk -l /dev/sdc

Disk /dev/sdc: 640.1 GB, 640135028736 bytes
255 heads, 63 sectors/track, 77825 cylinders, total 1250263728 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x09f624fc

  Device Boot      Start        End      Blocks  Id  System
/dev/sdc1  *          63      498014      248976  83  Linux
/dev/sdc2          498015  1250258624  624880305  fd  Linux raid autodetect
#

...my sda/sdc partitions are able to auto-assemble with the 3.2.45 kernel. I cannot speak about LILO, since this machine uses GRUB2.

mRgOBLIN 07-17-2013 01:37 AM

@Tracy Tiger Yes wildwizard is correct in that fd and kernel auto-detect is not to be relied upon but I've seen nothing to indicate when it will be removed. The README_RAID may well have to be updated if those instructions no longer work with the kernel that current is released with. Certainly sounds like we may need to do some more testing.

It may well be the OP's problem that auto-detect has been deprecated (I haven't actually checked) but as a suggestion... try repacking your initrd.gz with your own mdadm.conf and see if that helps.

meetscott 07-17-2013 09:49 AM

Quote:

Originally Posted by mRgOBLIN (Post 4991737)
@Tracy Tiger Yes wildwizard is correct in that fd and kernel auto-detect is not to be relied upon but I've seen nothing to indicate when it will be removed. The README_RAID may well have to be updated if those instructions no longer work with the kernel that current is released with. Certainly sounds like we may need to do some more testing.

It may well be the OP's problem that auto-detect has been deprecated (I haven't actually checked) but as a suggestion... try repacking your initrd.gz with your own mdadm.conf and see if that helps.

I have regenerated my initrd.gz with different options (who knows? maybe I still missed something), many times :-( I can confirm that my partitions are type fd, they have been for years.

I have made a few notes on the README_RAID doc for myself. My procedure is slightly modified because I'm using LVM. The main difference is when I'm booting from the Slackware DVD for recovery. I load the volume groups, activate them and then bind /proc with /mnt/proc *before* I chroot. I'll also remind people, I'm using 4 disks, RAID 10. I can lay out my steps exactly if it will help on the documentation. It's not that far off. Although, for this portion of it, I'm fine. I can easily get my stuff up and running off the DVD. It's during the boot of the kernel, that we've been discussing here, where the problem lies.
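
Roughly, the recovery steps I'm describing look like this once the install DVD is booted (my device names; adjust to your own layout):

Code:

# Assemble the arrays, then bring up LVM:
mdadm -A -s
vgscan --mknodes
vgchange -ay

# Mount the system and bind /proc before chrooting:
mount /dev/vg2/root /mnt
mount /dev/vg1/boot /mnt/boot
mount --bind /proc /mnt/proc
chroot /mnt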

That's gonna suck if I have to back up and rebuild from scratch to switch partition types. I'd do it if that was the right thing to do. But this is the first I've heard of it. I have 4 active systems right now that I use for work. So I'm loath to brick anything and impact my clients or my ongoing development. I keep 2 identically configured RAID 10 systems that sync with each other daily. It's not impossible for me to intentionally bring one down for a while. But I'm actively developing several projects for 4 or 5 different organizations right now and my time is at a premium to deliver.

That being said, I'll help wherever I can as time allows.

mRgOBLIN 07-17-2013 05:31 PM

Quote:

Originally Posted by meetscott (Post 4991970)
I have regenerated my initrd.gz with different options (who knows? maybe I still missed something), many times :-(

What I'm suggesting is to copy your (correctly configured) mdadm.conf to /boot/initrd-tree/etc/ and then run mkinitrd again without the -c (or with CLEAR_TREE="0" if you use mkinitrd.conf). This should install your mdadm.conf into your initrd.gz and should (in theory) assemble your array with the correct name.
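
Something along these lines, assuming the stock /boot/initrd-tree location:

Code:

# Drop the known-good config into the existing tree, then rebuild
# without clearing it (no -c), and reinstall lilo so it sees the new image:
cp /etc/mdadm.conf /boot/initrd-tree/etc/mdadm.conf
mkinitrd
lilo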

Quote:

Originally Posted by meetscott (Post 4991970)
That being said, I'll help wherever I can as time allows.

Much appreciated.

meetscott 07-17-2013 11:55 PM

Quote:

Originally Posted by mRgOBLIN (Post 4992197)
What I'm suggesting is to copy your (correctly configured) mdadm.conf to /boot/initrd-tree/etc/ and then run mkinitrd again without the -c (or with CLEAR_TREE="0" if you use mkinitrd.conf). This should install your mdadm.conf into your initrd.gz and should (in theory) assemble your array with the correct name.

Much appreciated.

I'll give it a shot. I remember having to hack the tree a few years ago in a similar way. It will likely be a week or two before I can get to it.

meetscott 11-17-2013 04:10 PM

I wanted to post an update to this thread. My upgrade to the 3.10.17 kernel (because of Slackware 14.1) went flawlessly. No issues at all. I have not looked closely at the dmesg output. I'm grateful everything works the way I would expect. Much thanks to whoever works on this stuff and keeps it generally awesome... kernel guys, Slackware guys, etc.

meetscott 11-26-2013 12:20 PM

Another update. What's funny is that I have 2 servers with exactly the same configuration and hardware. Totally identical. The first one upgraded easily, the way I previously described. The second one is a brick. I can't get it to boot for anything. I've rebuilt the initrd.gz image, checked all the RAID settings (mdadm.conf, mdadm -D /dev/md1), partitions, dev directory, lilo settings. All are exactly the same. I don't understand it. The devices are getting identified on the bad machine as md126, md127, etc. instead of md0 and md1.

I'm down to checking to make sure the BIOS settings are the same, because software-wise it's exactly the same. It would be great to get to the bottom of why this is happening.

By the way, I checked mdadm.conf on both machines and the UUIDs of the RAID arrays are exactly what they should be for each machine.

meetscott 11-26-2013 11:09 PM

The one difference I could find between the 2 identical systems was the BIOS version on the motherboard. So I went ahead and flashed an update on the system which wouldn't boot. It had a slightly older version than the one that would boot. No dice.

I was finally able to boot my server with manual intervention during boot time. When the kernel panics because it can't mount the root file system it asks if you want to fix it. Busybox is running in the initrd image and has enough tools available to do some poking around. It turns out the RAID array is not getting started. Don't have a clue why, but it's not.

I manually started the RAID array, mounted the root partition and exited to let the kernel continue to try to find init on the root partition.
Code:

mdadm -Es > /etc/mdadm.conf
mdadm -S -s
mdadm -As

The first command is for good measure. I think what's in the initrd tree is fine for mdadm.conf. I don't think it is necessary to re-examine the array. The second command stops the array, again for good measure. It's already stopped. The third command assembles the array.

Since I'm also using LVM I have to fire those up next because the devices are not getting created in the initrd /dev directory.
Code:

vgscan --mknodes
vgchange -ay
mount /dev/vg2/root /mnt

So I manually create the device nodes. Then I mount root.

Once all this nonsense is done, I'm able to continue to boot normally. I'm thinking I might have to hack on the init scripts in the initrd tree to see if I can get the RAID array to fire up. I believe once it does, the logical volumes will ready themselves and init on the real root partition will be able to be read.

Richard Cranium 11-27-2013 11:56 PM

Did you use the /usr/share/mkinitrd/mkinitrd_command_generator.sh script to create your initrd?

meetscott 11-29-2013 05:44 PM

Quote:

Originally Posted by Richard Cranium (Post 5071811)
Did you use the /usr/share/mkinitrd/mkinitrd_command_generator.sh script to create your initrd?

Yes. That script is awesome and it worked fine for the other identical system.

Richard Cranium 11-30-2013 09:09 AM

Well, on the non-working system, try manually running the commands that the init script does...
Code:

  if [ -x /sbin/mdadm ]; then
    /sbin/mdadm -E -s >/etc/mdadm.conf
    /sbin/mdadm -S -s
    /sbin/mdadm -A -s
    # This seems to make the kernel see partitions more reliably:
    fdisk -l /dev/md* 1> /dev/null 2> /dev/null
  fi


Richard Cranium 11-30-2013 01:47 PM

I'll add that under Slackware 14.1, my raid devices are ignoring their old names and are initializing as /dev/md125, /dev/md126, and /dev/md127. LVM comes up OK, nonetheless.

meetscott 12-01-2013 12:03 AM

Quote:

Originally Posted by Richard Cranium (Post 5072968)
Well, on the non-working system, try manually running the commands that the init script does...
Code:

  if [ -x /sbin/mdadm ]; then
    /sbin/mdadm -E -s >/etc/mdadm.conf
    /sbin/mdadm -S -s
    /sbin/mdadm -A -s
    # This seems to make the kernel see partitions more reliably:
    fdisk -l /dev/md* 1> /dev/null 2> /dev/null
  fi


Yep, I saw that code when I was analyzing the init scripts. That's part of where I got the idea of running the code I posted previously. Doing this, I was able to get it up and running. But it still will not come up on its own. It's miserable having to start the server with cryptic, manual intervention, but at least I was able to get it up. I just can't reboot it unless I'm there to manually bring up the raid devices, the logical volumes and mount the root volume.

I will continue to troubleshoot it, but I'm completely baffled as to why it won't come up on its own. As a drastic measure, I may try rebuilding everything from scratch. I hate to do this because it's so time consuming, but that may be the only way to get it out of this weird state it seems to be stuck in.

meetscott 12-01-2013 12:12 AM

Quote:

Originally Posted by Richard Cranium (Post 5073054)
I'll add that under Slackware 14.1, my raid devices are ignoring their old names and are initializing as /dev/md125, /dev/md126, and /dev/md127. LVM comes up OK, nonetheless.

I remember you mentioning this before. Neither of my systems, that is the working one and the one that requires manual intervention at start up, is initializing with /dev/md125, /dev/md126, and /dev/md127. Mine are both doing the right things, the right things being /dev/md0 and /dev/md1 in my case.

Richard Cranium 12-01-2013 03:11 AM

Quote:

Originally Posted by meetscott (Post 5073256)
I remember you mentioning this before. Neither of my systems, that is the working one and the one that requires manual intervention at start up, is initializing with /dev/md125, /dev/md126, and /dev/md127. Mine are both doing the right things, the right things being /dev/md0 and /dev/md1 in my case.

Well, in Slackware 14.0 after the 3.2.45 kernel upgrade, my arrays would initialize as the high numbers but would reset themselves to the names that I had used to create them.

In Slackware 14.1, the same arrays initialize as the high numbers but never change their names to the names I had used to create them.

Richard Cranium 12-01-2013 03:41 AM

Try (as root)...
Code:

rm /boot/initrd-tree/etc/mdadm.conf
mkinitrd -o /boot/initrd-test.gz

...then add a stanza to lilo.conf to use initrd-test.gz instead of initrd.gz.

Why?

Well, on my machine, the initial tree created by /usr/share/mkinitrd/mkinitrd_command_generator.sh copies over the default /etc/mdadm.conf file into the initrd-tree. The default mdadm.conf contains only comments, but the init code in initrd contains...
Code:

  if [ -x /sbin/mdadm ]; then
    # If /etc/mdadm.conf is present, udev should DTRT on its own;
    # If not, we'll make one and go from there:
    if [ ! -r /etc/mdadm.conf ]; then
      /sbin/mdadm -E -s >/etc/mdadm.conf
      /sbin/mdadm -S -s
      /sbin/mdadm -A -s
      # This seems to make the kernel see partitions more reliably:
      fdisk -l /dev/md* 1> /dev/null 2> /dev/null
    fi
  fi

...which meant that the only raid assembly that happens is whatever the kernel auto-assembles for you. I tried this on my machine, and now I've got my proper raid array names back.

meetscott 12-01-2013 09:29 PM

Quote:

Originally Posted by Richard Cranium (Post 5073304)
Try (as root)...
Code:

rm /boot/initrd-tree/etc/mdadm.conf
mkinitrd -o /boot/initrd-test.gz

...then add a stanza to lilo.conf to use initrd-test.gz instead of initrd.gz.

Why?

Well, on my machine, the initial tree created by /usr/share/mkinitrd/mkinitrd_command_generator.sh copies over the default /etc/mdadm.conf file into the initrd-tree. The default mdadm.conf contains only comments, but the init code in initrd contains...
Code:

  if [ -x /sbin/mdadm ]; then
    # If /etc/mdadm.conf is present, udev should DTRT on its own;
    # If not, we'll make one and go from there:
    if [ ! -r /etc/mdadm.conf ]; then
      /sbin/mdadm -E -s >/etc/mdadm.conf
      /sbin/mdadm -S -s
      /sbin/mdadm -A -s
      # This seems to make the kernel see partitions more reliably:
      fdisk -l /dev/md* 1> /dev/null 2> /dev/null
    fi
  fi

...which meant that the only raid assembly that happens is whatever the kernel auto-assembles for you. I tried this on my machine, and now I've got my proper raid array names back.

I tried your suggestion. It was a good one, but it still didn't work. By commenting out the /etc/mdadm.conf in the initrd-tree it did force the kernel to pick up new names. So it fired up md126 and md127. After this it still panics out because it can't load the LVM with root.

I'm almost to the point of modifying init so it does what I want it to do.

Richard Cranium 12-01-2013 11:29 PM

When you say "commenting out the /etc/mdadm.conf in the initrd-tree", did you mean "remove the file etc/mdadm.conf" from the initrd-tree?

Hmm. When you get this broken system running, what is the output of
Code:

pvs -v
when run as root? If the raid arrays are running (even with screwed up names), the lvm tools should find the physical volume information from the UUIDs in the metadata.

meetscott 12-02-2013 12:52 AM

Quote:

Originally Posted by Richard Cranium (Post 5073704)
When you say "commenting out the /etc/mdadm.conf in the initrd-tree", did you mean "remove the file etc/mdadm.conf" from the initrd-tree?

Hmm. When you get this broken system running, what is the output of
Code:

pvs -v
when run as root? If the raid arrays are running (even with screwed up names), the lvm tools should find the physical volume information from the UUIDs in the metadata.

I left the mdadm.conf file there with everything commented out inside it.

Output of pvs -v:
Code:

    Scanning for physical volume names
  PV        VG  Fmt  Attr PSize  PFree DevSize PV UUID                             
  /dev/md0  vg1  lvm2 a--  508.00m    0  509.75m 5xHv5N-0Jv3-CU67-C6Iz-KErk-QLdr-65bnlJ
  /dev/md1  vg2  lvm2 a--    1.82t    0    1.82t DLfHqe-gfvV-m3H5-qlzm-sK9k-C4bB-LbyHvL


mlslk31 12-02-2013 01:05 AM

I'm too much of a n00b to be in this conversation (and don't use initrd or LVM), but I might throw this in here. One of my lowly setups is a 2-disk setup that has a plain JFS-formatted /boot partition. Therefore, the kernels are loaded from the plain partition, the kernel auto-detects my RAID-0 /dev/md0, and everything else is read off of /dev/md0. I have something in my kernel cmdline like "md=0,/dev/sda1,/dev/sdb1 root=/dev/md0" or something to that effect. I know no better, so I went completely off of a document like Documentation/md.txt, but from my particular kernel source.

What it seems like is that for the non-LVM md partitions I use, the partition type should be fd00 if you want them autodetected. Despite the mention that autodetection is for DOS/MBR-style partitions only, they work with GPT partitions as well. If you don't want them autodetected, don't mark them as fd00, and let mdadm take care of it.

As for the numbers, that took some jiggling. For v0 metadata, you can somehow assemble the raid as "0" and pass the flag --update=super-minor to mdadm so that the preferred-minor defaults to 0. This trick does not work with v1 arrays. To see the current preferred minor, use `mdadm --detail assembled_raid`. I've forgotten what I did to get the v1 arrays to budge the minor. I either assembled or re-built them as "18" and "19", respectively, instead of their normal names, then have it set up like this:

ARRAY pretty_name_for_dev_md UUID=1337:f00:ba4:babab0033
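
If I remember the syntax right, the v0.90 trick looks roughly like this (device names are only examples, and it only applies to 0.90-metadata arrays):

Code:

# Stop the misnamed array, then reassemble it under the minor you want;
# --update=super-minor rewrites the preferred minor in the 0.90 superblock:
mdadm --stop /dev/md127
mdadm --assemble /dev/md0 --update=super-minor /dev/sda1 /dev/sdb1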

Again, I'm learning this for only the second time (first time wasn't much fun) on Linux, and I'm doing this for new installs that I could reinstall or restore from backup. I also didn't get the feeling that the --auto=md{x} flag worked all the time. Zero confidence, but I got my particular setup up and running. YMMV.

Richard Cranium 12-02-2013 02:17 AM

Quote:

Originally Posted by meetscott (Post 5073730)
I left the mdadm.conf file there with everything commented out inside it.

Output of pvs -v:
Code:

    Scanning for physical volume names
  PV        VG  Fmt  Attr PSize  PFree DevSize PV UUID                             
  /dev/md0  vg1  lvm2 a--  508.00m    0  509.75m 5xHv5N-0Jv3-CU67-C6Iz-KErk-QLdr-65bnlJ
  /dev/md1  vg2  lvm2 a--    1.82t    0    1.82t DLfHqe-gfvV-m3H5-qlzm-sK9k-C4bB-LbyHvL


Ok. I thought that you might have had some PVs that had been around a while and were in lvm1 format.

Please try either removing or renaming the mdadm.conf file in the initrd. That should force the init script to run the commands...
Code:

      /sbin/mdadm -E -s >/etc/mdadm.conf
      /sbin/mdadm -S -s
      /sbin/mdadm -A -s
      # This seems to make the kernel see partitions more reliably:
      fdisk -l /dev/md* 1> /dev/null 2> /dev/null

...which should result in what you want.

meetscott 12-07-2013 11:45 AM

Quote:

Originally Posted by mlslk31 (Post 5073733)
I'm too much of a n00b to be in this conversation (and don't use initrd or LVM), but I might throw this in here.

I've been really busy this last week. So I haven't looked at this stuff again. But I appreciate the input and ideas. You may consider yourself a n00b, but you might not be giving yourself enough credit ;-)

Thanks!
scott

meetscott 12-07-2013 11:47 AM

Quote:

Originally Posted by Richard Cranium (Post 5073756)
Ok. I thought that you might have had some PVs that had been around a while and were in lvm1 format.

Please try either removing or renaming the mdadm.conf file in the initrd. That should force the init script to run the commands...
Code:

      /sbin/mdadm -E -s >/etc/mdadm.conf
      /sbin/mdadm -S -s
      /sbin/mdadm -A -s
      # This seems to make the kernel see partitions more reliably:
      fdisk -l /dev/md* 1> /dev/null 2> /dev/null

...which should result in what you want.

I'll give it a shot and let you know. Thanks!
scott

meetscott 12-07-2013 03:30 PM

This is a miracle. I renamed mdadm.conf to mdadm.conf.bak in the initrd-tree. Then I re-ran mkinitrd with no parameters. Re-ran lilo. Then I rebooted.

Voila! Everything came up. This is the first time I haven't had to manually intervene during boot time to get this machine to come up since my upgrade from Slackware 14.0 (with the 3.2.29 kernel) to Slackware 14.1.

Richard Cranium, you are the man! Thanks!

rmathes 05-01-2014 10:48 AM

Thank you all for this valuable information; it helped me fix my problem as well. I just removed the mdadm.conf, rebuilt the initrd.gz, and bam, I was up and running on my new raid setup, and it is smoking fast!

Richard Cranium 05-02-2014 07:50 PM

I'll just mention this bit:

The safest way to ensure that your software RAID arrays are set up correctly would be to run the command
Code:

/sbin/mdadm -E -s >/etc/mdadm.conf
as root prior to running the mkinitrd command (or the outstanding command generator script). udevd will use that information, if present, to correctly auto-assemble your arrays in the first place. Otherwise, udevd will create them with the wrong names and the initrd init script will re-assemble them. That slows your boot down a slight bit.

So, the two approaches have pros and cons:
  • Remove /etc/mdadm.conf from your initrd:
    • Pro: You will never have to re-generate your initrd if you add a new RAID array or delete/change an existing one.
    • Con: You will assemble the arrays twice during the boot process, slowing it down slightly.
  • Ensure that your /etc/mdadm.conf in the initrd is correct:
    • Pro: You will only assemble your arrays once during boot, speeding things up some amount.
    • Con: You have to remember to re-create your initrd as you add/delete/change your RAID configuration.
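
A minimal sketch of the second approach with the stock Slackware tools (run as root, and repeat it whenever you add or change an array):

Code:

# Regenerate the config from the live arrays, rebuild the initrd,
# and reinstall the boot loader:
/sbin/mdadm -E -s > /etc/mdadm.conf
/usr/share/mkinitrd/mkinitrd_command_generator.sh   # prints a suggested mkinitrd command
# run the mkinitrd command it prints, then:
lilo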

