RHEL 5, libdl.so.2 and kernel panic at boot

mek1 · 09-04-2008, 09:37 AM

My RHEL5 server locked up overnight. When I restart it, I get the following:

Code:

/sbin/init: error while loading shared libraries: /lib/libdl.so.2: invalid ELF header

kernel panic - not syncing: attempted to kill init!

It locks up and gets no further. Keep in mind this is hosted on an ESX server; I mention it because I saw other kernel panics attributed to hardware.

So I booted into linux rescue and can get to the /lib/ folder and see libdl.so.2, but I'm not sure what I should try to correct this error.

Any suggestions?
Thanks
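
An "invalid ELF header" message usually means the file's first bytes are not the ELF magic number (0x7f followed by "ELF"). A minimal sketch for checking that from a rescue shell follows; the function name is_elf is hypothetical, and the /mnt/sysimage path in the usage comment is an assumption about where the rescue environment mounts the installed system.

```shell
#!/bin/sh
# Succeed (exit status 0) if the file starts with the 4-byte ELF magic:
# 0x7f 'E' 'L' 'F'. Symlinks are followed, so the link target is what gets read.
is_elf() {
    # Dump the first four bytes as hex and strip od's whitespace.
    magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \t\n')
    [ "$magic" = "7f454c46" ]
}

# Example use (path is an assumption, adjust to your mount point):
#   is_elf /mnt/sysimage/lib/libdl.so.2 && echo "header looks fine" || echo "header is damaged"
```

This only inspects the magic number, so a file can pass this check and still be corrupted further in.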

unSpawn · 09-04-2008, 04:41 PM

If you run 'rpm -qf /lib/libdl.so.2' you'll see the file belongs to the glibc package, which you could use 'rpm --verify' on to start with. Before you reinstall glibc it would be good to look at other things. Are there, or have there been, any other logged errors leading up to this?

mek1 · 09-04-2008, 05:33 PM

Unfortunately, from the linux rescue shell:
'rpm -qf /lib/libdl.so.2' returns 'file /lib/libdl.so.2 is not owned by any package'

'rpm --verify Glibc' returns 'package glibc is not installed'.

Very odd. Any other suggestions?
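
One possible explanation for the "not owned by any package" result is that, inside a rescue shell, rpm consults the rescue environment's own package database rather than the installed system's. A hedged sketch using rpm's --root option follows. The helper name rpm_target is hypothetical, and /mnt/sysimage is assumed to be where the installed system is mounted.

```shell
#!/bin/sh
# Assumption: the damaged installation is mounted here by the rescue environment.
SYSROOT=${SYSROOT:-/mnt/sysimage}

# Run rpm against the installed system's database instead of the rescue image's.
rpm_target() {
    rpm --root "$SYSROOT" "$@"
}

# Example use (not run automatically):
#   rpm_target -qf /lib/libdl.so.2   # which package owns the file
#   rpm_target --verify glibc        # compare installed files with the package database
```

Using --root rather than chroot avoids having to run binaries from the damaged system, which may themselves depend on the broken library.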

Valery Reznic · 09-05-2008, 01:30 AM

It should be 'rpm --verify glibc', not 'rpm --verify Glibc'.

Anyway, instead of chasing what's wrong on your system, I think it's much simpler to just reinstall it.

unSpawn · 09-06-2008, 10:07 AM

Quote:

Originally Posted by Valery Reznic

Anyway, instead of chasing what's wrong on your system, I think it's much simpler to just reinstall it.

I disagree strongly. About the only time errors like ELF header corruption can happen on GNU/Linux systems is when package contents are written to (as in an update). The rest of the time the library file is accessed but not modified. If this was not due to an update, then not knowing the source of the corruption means it can occur again. Running GNU/Linux is all about performance, protecting assets and providing services in a continuous, stable and secure way, so you should not deliberately neglect signals like that. Besides that, the "re-install and all will be fine" mantra is reminiscent of working with products from this particular vendor founded to develop and sell BASIC interpreters for the Altair 8800, and it doesn't solve anything. Work on the cause, not the symptoms.

Valery Reznic · 09-07-2008, 01:30 AM

The question was (as I see it) about making the system work again, not about investigation. There can be any number of reasons why it happened, from hardware failure to a rootkit to an occasional bug in some process running as root. While it's interesting HOW the system got into this state, it's not always possible to find out. Sure, that has nothing to do with making the system work again, but it can help to avoid the next breakage.

unSpawn · 09-07-2008, 04:25 AM

Quote:

Originally Posted by Valery Reznic

While it's interesting HOW the system got into this state, it's not always possible to find out.

Sure, but there's a difference between actually trying to find the root cause and saying "oh, well, just reinstall whatever it is that's b0rken". If this was a production environment where an informed management decision was made (weighing all risks, consequences et cetera) to trade in diagnosis for uptime, then I would agree. Business requirements just bring a different type of "clarity" to things. But otherwise it is a perfect example of human nature to seek the path of least resistance (like by just reinstalling software). The point is there's nothing to be learnt from that approach and it does not solve anything.

Valery Reznic · 09-07-2008, 08:26 AM

I see two different problems: 1) Repair the damaged system. 2) Understand what caused the damage.
My initial post was about the first problem: if a system is damaged, I usually find it simpler to just re-install it than to fix problem after problem.
And how one proceeds with the repair is quite unrelated to the second problem, understanding what caused the damage. If investigating the problem is important, then the re-installation can be done on a different hard drive (or a different computer).

mek1 · 09-07-2008, 04:36 PM

Well, this has certainly turned into an interesting discussion.

Regarding the initial question, as a new guy to RHEL I really have no idea what caused it. Luckily this was a testing system rather than the actual production environment we are moving towards. I had applied updates that RH deemed worthwhile just prior to the crash. While I did bring the system back to life by rolling back the virtual server to an earlier instance, I still have nothing on the problem. Luckily, again, this was a testing environment (I needed to test badly enough to roll back to the pre-error state).

In the meantime I'm going to look into reading more documentation on RH in general, so that if this were to happen to our physical server I've got some direction on how to correct it.

thanks

RootAround · 10-23-2010, 11:16 PM

For what it’s worth, here are my fix procedures using the CentOS5.5 Live CD.
The CD works nicely with my cable modem.

WHAT WORKED.

1. Loaded linux from the Live CD. Logged in as root (a must).

2. Edited the Live CD’s /etc/fstab file. Located the /dev/hdaN entry associated
with the hard drive of the damaged linux home. Changed the ro (read-only)
parameter to rw (read-write).

Note: N is an integer.

3. Invoked yum at the Live CD's command prompt:
yum --installroot=/mnt/disc/hdaN reinstall glibc
Note: N is the same integer as above.

I got a message complaining about libXmuu.so.1 being unable to link.

4. Rebooted linux from hard drive. There was no kernel panic, but a libXmuu.so.1 message reappeared:

/sbin/ldconfig: can’t link /usr/lib/libXmuu.so.1 to libXmuu.so.1.0.0

I replied -yes- to a deletion request.

5. I got into the desktop, but many of the menu items and desktop icons were unusable, so I repeated steps 1-3, using libXmuu.so.1 instead of glibc in the yum command. Yum found the appropriate package (libXmuu.so.1 is a link) and installed it. I was back up.
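
The steps above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name reinstall_into is hypothetical, and it remounts the target read-write with mount -o remount,rw instead of editing /etc/fstab as in step 2. The mount point is whatever path the Live CD used for the damaged system.

```shell
#!/bin/sh
# Reinstall packages into a damaged system mounted under a live/rescue CD.
# $1: mount point of the damaged root filesystem
# remaining arguments: package names to reinstall
reinstall_into() {
    target=$1
    shift
    # Assumption: a remount replaces the /etc/fstab edit from step 2.
    mount -o remount,rw "$target" || return 1
    # Same yum invocation as step 3, with the mount point as a parameter.
    yum --installroot="$target" reinstall "$@"
}

# Example use (must be run as root, not run automatically):
#   reinstall_into /mnt/disc/hdaN glibc
```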

WHAT DIDN’T WORK.

1. Manually repairing libdl.so.2. It’s a link to /lib/libdl-2.5.so in the same directory.
Copying/recreating the link didn’t work for me.
Note: I found its location by entering:

locate libdl.so.2

at the command prompt.
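
One possible reason recreating the link did not help: the loader follows the link, so if the target file itself is damaged, a fresh link points at the same bad contents. A minimal sketch for seeing what a library path actually is follows; describe_lib is a hypothetical helper name.

```shell
#!/bin/sh
# Print whether a path is a symlink (and where it points), some other
# existing file, or missing.
describe_lib() {
    lib=$1
    if [ -L "$lib" ]; then
        echo "symlink -> $(readlink "$lib")"
    elif [ -e "$lib" ]; then
        echo "not a symlink"
    else
        echo "missing"
    fi
}

# Example use (not run automatically):
#   describe_lib /lib/libdl.so.2
```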

2. Booting from GRUB in emergency mode. Followed section 26.4, "Booting into Emergency Mode", of the CentOS manual to append emergency to the kernel line. Got the same kernel panic.