Old 01-28-2010, 06:06 AM   #1
pukeko
LQ Newbie
 
Registered: Jan 2010
Posts: 6

Rep: Reputation: 0
logical volumes not detected during boot


I have a system where the logical volumes are not being detected on boot and would like some guidance as to how to cure this.

The box is a Proliant DL385 G5p with a pair of 146 GB mirrored disks. The mirroring is done in hardware by an HP Smart Array P400 controller.

The mirrored disk (/dev/cciss/c0d0) has 4 partitions: boot, root, swap and an LVM physical volume in one volume group with several logical volumes, including /var, /home and /opt.

The OS is a 64-bit RHEL 5.3 basic install with a kernel upgrade to 2.6.18-164.6.1.el5.x86_64 (to cure a problem with bonded NICs) plus quite a few extras for stuff like Oracle, EMC PowerPath and HP's Proliant Support Pack.

The basic install is fine and the box still reboots cleanly after the kernel upgrade. However, once the additional software goes on, it fails to reboot.

The problem is that the boot fails during the file system check of the logical volume file systems, but the underlying failure is that the volumes themselves are not found. Specifically, the boot goes through the following steps:

Code:
Red Hat nash version 5.1.19.6 starting

setting clock
starting udev
loading default keymap (us)
setting hostname
No devices found          <--- suspicious?
Setting up Logical Volume Management:

The fsck.ext3 checks then fail with messages like:
No such file or directory while trying to open /dev/<volume group>/<logical volume>
There are also messages about being unable to find the superblock, but this is clearly because the device itself is not found.

If I boot from a rescue CD, all of the logical volumes are present with the correct sizes; dmsetup shows them all to be active and I can access the files within. fdisk also shows all the partitions to be present and of the right type. I am therefore very sure that there is nothing wrong with the disk or the logical volumes.

If I skip the fsck step, either with fastboot on the grub kernel line or by editing fstab, the system will boot, but it is clearly broken because the filesystems are missing.
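For illustration, a rough sketch of both ways of skipping the check (the kernel path, root= value and fstab entry below are placeholders, not my actual layout):
Code:
# Option 1: one-off skip -- append "fastboot" to the kernel line in grub.conf
kernel /vmlinuz-2.6.18-164.6.1.el5 ro root=/dev/cciss/c0d0p2 fastboot

# Option 2: persistent skip -- set the sixth fstab field (fs_passno) to 0
# so rc.sysinit never fscks that filesystem (placeholder device/mount point)
/dev/VolGroup00/lv_var   /var   ext3   defaults   1 0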

If I poke around inside the manual fsck correction utility in the failed boot I can see that the devices /dev/<volume group>/<logical volume> AND /dev/mapper/<volume group>-<logical volume> are NOT present. dmsetup therefore also shows nothing present.
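For completeness, the sort of checks this involved from the rescue CD (a sketch; run them either before or after chrooting into /mnt/sysimage):
Code:
lvm pvscan               # are the physical volumes seen?
lvm vgscan --mknodes     # find volume groups and recreate the /dev nodes
lvm vgchange -ay         # activate all logical volumes
lvm lvs                  # list the LVs and their sizes
dmsetup ls               # device-mapper targets now active
ls -l /dev/mapper        # the <vg>-<lv> nodes should appear here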

My suspicion is that the logical volume detection is failing because it is too slow. I base this on an online article claiming to cure a similar problem by editing init (which I presume means recompiling it), and on the observation that my failures are not entirely consistent. In some cases the first logical volume is detected but the rest aren't, and sometimes none are detected. (I have never seen more than one detected.)

I've tried booting with rootdelay but this didn't help.

Booting with debug didn't seem to provide any clues.

It is not clear to me where in the boot process the problem is occurring, nor what may be causing it. I wonder whether some of the monitoring features in the Proliant Support Pack may be interfering; I can't think what else may be getting in the road. (Isolating the problem add-on by trial builds and reboots is possible in principle, but would take days because of the number of extras, and wouldn't necessarily point to a cure.)

I do note that the problem appears to be hardware-sensitive, as I have an identical build on some DL585 boxes which seem fine.

Is there a way to coax the logical volumes into being recognized, or to better pinpoint where the problem is coming from? What is linux rescue doing that the normal boot does not?

Thanks
 
Old 01-28-2010, 06:25 AM   #2
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
Try running "vgmknodes" (or "vgscan --mknodes") to create the volume group device nodes.

Is the system mounting your / partition at this point, or are you still in the initrd? You may need to chroot in that case. Be sure to run "/bin/bash -l" to set up root's environment if chroot'ing.
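Roughly something like this from the rescue environment (assuming the rescue CD has mounted your installed system under /mnt/sysimage, which is what RHEL's "linux rescue" does):
Code:
# if "linux rescue" mounted the system for you:
chroot /mnt/sysimage /bin/bash -l

# if you mounted the root filesystem by hand, bind-mount these first:
mount --bind /proc /mnt/sysimage/proc
mount --bind /sys  /mnt/sysimage/sys
mount --bind /dev  /mnt/sysimage/dev
chroot /mnt/sysimage /bin/bash -l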

Run "depmod -a". Maybe that failed when installing a new kernel. You may also need to run "mkinitrd" if your initrd file is missing kernel modules you need.

I'm not using LVM on my SuSE system, but their boot.lvm script may be similar to yours.
Code:
#! /bin/sh
#
# Copyright (c) 2001 SuSE GmbH Nuernberg, Germany.  All rights reserved.
#
# /etc/init.d/boot.lvm
#
### BEGIN INIT INFO
# Provides:          boot.lvm
# Required-Start:    boot.device-mapper boot.udev boot.rootfsck
# Should-Start:      boot.multipath boot.md boot.dmraid
# Required-Stop:     $null
# Should-Stop:       $null
# Default-Start:     B
# Default-Stop:
# Description:       start logical volumes
### END INIT INFO

. /etc/rc.status
. /etc/sysconfig/lvm

# udev interaction
if [ -x /sbin/udevadm ] ; then
    [ -z "$LVM_DEVICE_TIMEOUT" ] && LVM_DEVICE_TIMEOUT=60
else
    LVM_DEVICE_TIMEOUT=0
fi
Also check the scripts in your initrd file. You may need to make changes there, or maybe insert a sleep command before the system starts the service.
Code:
case "$1" in
  start)
        #
        # Find and activate volume groups (HM 1/1/1999)
        #
        if test -d /etc/lvm -a -x /sbin/vgscan -a -x /sbin/vgchange ; then
            # Waiting for udev to settle
            if [ "$LVM_DEVICE_TIMEOUT" -gt 0 ] ; then
                echo "Waiting for udev to settle..."
                /sbin/udevadm settle --timeout=$LVM_DEVICE_TIMEOUT
            fi
            echo "Scanning for LVM volume groups..."
            /sbin/vgscan --mknodes
            echo "Activating LVM volume groups..."
            /sbin/vgchange -a y $LVM_VGS_ACTIVATED_ON_BOOT

            rc_status -v -r
        fi
        ;;
This should be easier than recompiling init.
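On RHEL 5 the init inside the initrd is a nash script rather than a full shell, but the general idea of adding a delay before LVM activation would look something like this (untested sketch; the new image name and the 10-second value are arbitrary):
Code:
# unpack the initrd -- on RHEL 5 it is a gzipped cpio archive
mkdir /tmp/initrd && cd /tmp/initrd
zcat /boot/initrd-2.6.18-164.6.1.el5.img | cpio -idmv

# edit ./init and insert a delay just before the "lvm vgscan" /
# "lvm vgchange" lines, e.g. a line reading:  sleep 10

# repack and point a test grub.conf entry at the new image
find . | cpio -o -H newc | gzip -9 > /boot/initrd-lvmdelay.img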
 
Old 01-28-2010, 09:35 AM   #3
DrLove73
Senior Member
 
Registered: Sep 2009
Location: Srbobran, Serbia
Distribution: CentOS 5.5 i386 & x86_64
Posts: 1,118
Blog Entries: 1

Rep: Reputation: 129
My guess would be that the RAID partitions are not brought up properly or in time; that is the first thing to check. I cannot check on my own PC right now whether that is where booting stops, but see if any of the additional RAID controller software is causing the problem. I think there is a kernel option to make booting more verbose.

You should also contact both HP and Red Hat and see if they have any clues.

Last edited by DrLove73; 01-28-2010 at 09:37 AM.
 
Old 02-05-2010, 02:32 PM   #4
pukeko
LQ Newbie
 
Registered: Jan 2010
Posts: 6

Original Poster
Rep: Reputation: 0
I've solved this problem.

A key to discovering the cause was noticing that the standard RHEL5 build puts "quiet" on the kernel line in grub.conf. Removing this provided a more comprehensive log of the boot.
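For anyone else hitting this, the change is just to the kernel line in /boot/grub/grub.conf; a sketch (the root= value and device paths are placeholders for whatever your entry already contains):
Code:
# kernel line with "rhgb quiet" removed so all boot messages reach the console
title Red Hat Enterprise Linux Server (2.6.18-164.6.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-164.6.1.el5 ro root=LABEL=/
        initrd /initrd-2.6.18-164.6.1.el5.img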

This showed that there was an out-of-memory condition, which caused oom-killer to kill the lvm.static (and depmod) processes during boot. Clearly, this was why the logical volume devices were not being created. (Indeed, when I fiddled with the available memory by reducing it with mem=... in grub.conf, it even killed off init!)
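A quick way to confirm OOM kills after a (partial) boot, for anyone checking their own logs (a sketch; the patterns match what a 2.6.18 kernel typically logs):
Code:
dmesg | grep -iE 'out of memory|oom-killer'
grep -iE 'out of memory|oom-killer|killed process' /var/log/messages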

Oom-killer has a bit of a reputation for doing disruptive things, and for it to kick in during the boot phase on a 64-bit system with 20GB of RAM is clearly unusual. (In its defense, I would point out that without it I probably would not have been able to get at the logs to find out what the problem was.)

I checked, and the low memory figure was indeed the full 20GB, as expected on a 64-bit system.
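(The check itself is trivial; a sketch, assuming procps' free and the usual /proc/meminfo fields on this kernel:)
Code:
free -lm                                    # -l adds the Low/High rows; Low should equal the total
grep -iE 'memtotal|lowtotal' /proc/meminfo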

As mentioned in my original post, the problem only arose after I added some extras to the system. Some hard thinking about what could be consuming so much memory during the boot phase led me to some kernel settings applied for Oracle.

This box was built as a test platform for some production systems that had over 3 times the available memory. The build details had been copied identically -- what else does one do for a test platform?

It turned out that a vm.nr_hugepages setting was responsible. The value chosen was OK for the production systems but inappropriate for this one. Removing this from sysctl.conf cured the problem.
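To make the arithmetic concrete with purely illustrative numbers (not my actual values): with the usual 2 MB huge page size, vm.nr_hugepages = 25000 asks the kernel to pin roughly 50 GB at boot, which is fine on the larger production boxes but immediately exhausts a 20 GB machine. Checking and clearing it looks something like this:
Code:
sysctl vm.nr_hugepages             # what is currently requested?
grep -i huge /proc/meminfo         # HugePages_Total x Hugepagesize = RAM reserved

# comment out or resize the vm.nr_hugepages line in /etc/sysctl.conf,
# then release the reservation without a reboot:
sysctl -w vm.nr_hugepages=0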

Some learnings to take from this include
  • The RH logging defaults are for convenience on a working system and not helpful for debugging
  • oom-killer can cause boot problems
  • Altering kernel defaults is risky
  • Saving money by choosing cheaper hardware for a test system can be a false economy!
 
1 member found this post helpful.
  

