Server crashing every 2 weeks

andyccn · 09-15-2007, 04:52 AM

Hi,

I have a dedicated server which, when provided to me, was running Fedora Core 4. That's now end-of-life'd, so the first thing I did was upgrade to Fedora 7 and compile a custom 2.6.22.3 kernel (as the default F7 kernel didn't include the SATA drivers and Netfilter support that I needed).

As you'd expect the server is monitored, and I have access to it via SSH and a serial console.

It was all up and running perfectly happily for just over 2 weeks, then around 1.30am one morning I got a text message saying it was down. When I had a look, I couldn't get to it via SSH, so I logged on to the serial console and was presented with the message:

Hangcheck: hangcheck value past margin!

I couldn't do anything, it was completely locked, so I forced a reboot from my provider's control panel. It came back up as if nothing had happened, and there was absolutely nothing of interest in the log files, except a complaint that it couldn't contact DHCP to renew its IP address - that was my mistake, I hadn't opened the firewall to DHCP.

So after opening the firewall, I left it on its merry way, and again this morning (nearly 2 weeks after the last incident), at 2.20am I got another text saying it was down.

Logged on this morning - same thing, same message. The log files were a little different this time, though: there was still nothing of interest, but logging had stopped at around 8am yesterday morning. That's strange, because the DHCP client constantly checks the address and always reports how long is left before it renews, so there's something being logged every minute of every day. (I hadn't noticed this last time.) Actually, I noticed that syslog was still writing to another logfile until 2.10, when the server went down, so perhaps only the DHCP client stopped running? I don't know.
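To pin down exactly when a log went quiet, a small script can scan the timestamps for gaps. The following is only a rough sketch: it assumes classic syslog lines that start with a 15-character timestamp such as "Sep 22 07:59:01" and carry no year (so the year is passed in), and the sample lines are invented for illustration, not taken from the server's logs. The function name find_log_gaps is hypothetical.

```python
from datetime import datetime, timedelta

def find_log_gaps(lines, max_gap=timedelta(minutes=5), year=2007):
    """Return (previous, current) timestamp pairs separated by more than max_gap."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # Classic syslog lines start with e.g. "Sep 22 07:59:01" (no year),
            # so the assumed year is prepended before parsing.
            current = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps

# Invented sample lines, for illustration only.
sample = [
    "Sep 22 07:58:01 host dhclient: example message",
    "Sep 22 07:59:01 host dhclient: example message",
    "Sep 23 02:10:00 host syslogd: example message",
]
gaps = find_log_gaps(sample)
for start, end in gaps:
    print(f"no entries between {start} and {end}")
```

Running it over each logfile separately would show whether every log stopped at the same time or only the one the DHCP client writes to.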

As this is a live server hosting some websites, a backup mail server, and secondary DNS server, I can't afford for it to keep going down every 2 weeks (and it seems to do so early in the morning so it's always down for at least 6 hours when it crashes.)

After doing some Googling this morning, I found out that hangcheck is a sort of watchdog that can, when triggered, reboot the machine it's running on. I found quite a few posts from other people basically saying "if you don't know what it is, why did you enable it?" When I compiled the kernel I accepted all the defaults except when it came to SATA and Netfilter, so I figure it must have been a default option.
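As a rough illustration of the watchdog idea, here is a minimal sketch of the check as I understand the hangcheck-timer module to work: a timer is set to fire after a fixed tick, and if it fires later than expected by more than a margin, the module either logs a message or restarts the machine, depending on a reboot option. The 180-second tick and 60-second margin are what I believe the module's defaults to be, so treat them as assumptions; the function name and return strings are made up for this example.

```python
TICK = 180    # seconds between checks (assumed module default)
MARGIN = 60   # tolerated lateness in seconds (assumed module default)

def hangcheck(expected_at, fired_at, margin=MARGIN, reboot=False):
    """Classify one timer firing the way a hangcheck-style watchdog would."""
    late_by = fired_at - expected_at
    if late_by <= margin:
        return "ok"
    # Past the margin: restart only if the reboot option was enabled,
    # otherwise just report that the system stalled.
    return "restart" if reboot else "log: value past margin"

print(hangcheck(expected_at=TICK, fired_at=TICK + 5))
print(hangcheck(expected_at=TICK, fired_at=TICK + 120))
print(hangcheck(expected_at=TICK, fired_at=TICK + 120, reboot=True))
```

If that reading is right, the message on the console reports that the system had already stalled for longer than the margin, rather than being the cause of the stall.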

I went and got the latest kernel sources, 2.6.22.6, and disabled hangcheck support, and have compiled and started the server up on this new kernel, in the hope that it was hangcheck causing the problem.

Basically I'm asking if anyone else has had any sort of trouble like this with hangcheck, what exactly it does, and how I can go about gathering more details about why the server crashes when it does? (e.g. debug info, panic statements, etc.)

Thanks,

Andy.

LaughingBoy · 09-15-2007, 11:57 PM

I'm very surprised that the default kernel didn't have the SATA drivers that you needed. Using anything out of the defaults can cause problems down the track. If you upgrade with the latest patches, etc, you are likely to get a new kernel. When you next reboot, all the customisations you had will be lost.

What SATA chipset are you using that the kernel doesn't have ?

andyccn · 09-17-2007, 12:41 PM

Yeah that's what I thought. As it's a dedicated machine they don't give me that level of specs, but looking through the boot messages, I can see it's an NVIDIA nForce chipset with a SATA2 controller (perhaps that's why, the fact that it's SATA2?)

As for kernel upgrades, I'll be updating the system myself on an as-needed basis, and (providing I can get it stable) I'll only re-compile a new kernel if an OS upgrade (like FC4->F7) requires it, and then I can use my previous kernel build file (.config) to compile the same modules/support etc.

LaughingBoy · 09-18-2007, 02:06 AM

Quote:

Originally Posted by andyccn

Yeah that's what I thought. As it's a dedicated machine they don't give me that level of specs, but looking through the boot messages, I can see it's an NVIDIA nForce chipset with a SATA2 controller (perhaps that's why, the fact that it's SATA2?)

I've got a couple of boards with NVidia SATA-II chips in them, and they have been working under Fedora for a while now.

Are you sure the default kernel doesn't support the NVidia SATA chipset?

The sata_nv module covers pretty much all of their chips. I bought a motherboard with a CPU that isn't properly recognised, and it just comes up as "AMD Processor model unknown", but it still works. All chipsets on the board appear to work fine though.

What happens to make you think it's the kernel not finding the HDD?

andyccn · 09-18-2007, 06:41 AM

Quote:

Originally Posted by LaughingBoy

I've got a couple of boards with NVidia SATA-II chips in them, and they have been working under Fedora for a while now.

Are you sure the default kernel doesn't support the NVidia SATA chipset?

The sata_nv module covers pretty much all of their chips. I bought a motherboard with a CPU that isn't properly recognised, and it just comes up as "AMD Processor model unknown", but it still works. All chipsets on the board appear to work fine though.

What happens to make you think it's the kernel not finding the HDD?

Purely the fact it wouldn't boot from it! I got as far as lilo, then lilo was reporting it couldn't find the HDD partition that root= specified. Booted into the previous kernel, worked fine, no changes to lilo required.

Also inside the kernel package I downloaded from kernel.org, all the SATA drivers were set to default "no", I had to specifically say "y" to include them when I ran through the config.

How did you install Fedora? Perhaps the Fedora installer enables all the required options for you, as this one was an upgrade to FC4 through Yum.

LaughingBoy · 09-18-2007, 11:37 PM

Quote:

Originally Posted by andyccn

Purely the fact it wouldn't boot from it! I got as far as lilo, then lilo was reporting it couldn't find the HDD partition that root= specified. Booted into the previous kernel, worked fine, no changes to lilo required.

Also inside the kernel package I downloaded from kernel.org, all the SATA drivers were set to default "no", I had to specifically say "y" to include them when I ran through the config.

How did you install Fedora? Perhaps the Fedora installer enables all the required options for you, as this one was an upgrade to FC4 through Yum.

Lilo??? Wow. Try using the default boot loader : Grub. Are you able to post your lilo.conf if you insist on using Lilo ?

When you say you boot to the previous kernel and it's fine - is that the FC4 kernel, or a previous F7 kernel?

Secondly, I believe you are able to download the configuration files that the Fedora team used to compile the kernel. It's been a while since I've had to do that, but I believe you can. They include their .config file with the package, so you can copy it into your kernel source tree as .config and run "make oldconfig" (or the equivalent).

Lastly, how did I install Fedora? From scratch. I have an FC4 system which I am rebuilding from scratch to support F7. I'll migrate the backups across and recover from those. The differences were almost too great. Less headaches too.

I recall that Fedora 7 had a number of kernel differences from previous releases of Fedora Core. Modules were loaded differently, among other changes. I recall there being an article on how you couldn't go from FC6 to F7 through yum for that reason. I'm guessing that FC4 to F7 is going to suffer the same fate.

My recommendation? Backup all essential services the FC4 system provided, and install F7 from scratch. Then restore those backups. Things like databases, compilers, etc have all undergone significant upgrades in that time frame.

andyccn · 09-19-2007, 06:48 AM

Lilo - I'm much more comfortable with it, plus it was the standard boot loader installed on the machine when it was handed over to me. As it's a dedicated machine leased to me, I'm limited in what I can do - i.e. I can't install a new boot loader because, if I mess the configuration up, I won't be able to get access to the serial console again. That's why I had to upgrade FC4 to F7 instead of doing a clean install.

The required kernel upgrade is between FC4 and FC5, as some packages break if you keep with the old kernel.
The upgrade itself (apart from the failed kernel boot) was pretty smooth: see here

Yes, the old kernel was the FC4 one.
The machine was a minimal FC4 install, so pretty much everything on it I installed from scratch. I never rely on packages to install critical services, such as databases, etc. I compile everything from source so I know what's going on.

Here's my lilo.conf, working fine on my re-compiled kernel and on the old one, but not on the yum-upgraded kernel.

Code:

boot=/dev/sda
root=/dev/md0

install=/boot/boot.b
vga=normal
timeout=36
prompt
lba32

read-only

default=lxser

serial=0,57600n8
append="console=ttyS0,57600 console=tty0 panic=30"

image=/boot/vmlinuz
        label=lxser
        append="console=tty0 console=ttyS0,57600 panic=30"

image=/boot/vmlinuz
        label=lx

LaughingBoy · 09-20-2007, 05:47 AM

Let me preface this reply with "I've not used Lilo in a loooooong time!" Since grub handled multi-booting with so much more ease, and adding new kernels was simple... the graphical nature. Everything was simpler. I was an easy convert from Lilo, but I've not played with it for over 5... maybe 7 years.

That being said, I only see two images in the conf file... and they point to the same file. Surely if you want to select another kernel, you provide a different image name? I might be mistaken.

LaughingBoy · 09-20-2007, 05:53 AM

Quote:

Originally Posted by andyccn

As it's a dedicated machine leased to me, I'm limited in what I can do - i.e. I can't install a new boot loader because, if I mess the configuration up, I won't be able to get access to the serial console again.

Can't you request a local admin / support person to insert a DVD of F7 into the drive and reboot?

That being said, I'm unsure how to install the OS remotely. Beyond my field of knowledge.

Sorry.

andyccn · 09-20-2007, 06:41 AM

Quote:

Originally Posted by LaughingBoy

Can't you request a local admin / support person to insert a DVD of F7 into the drive and reboot?

That being said, I'm unsure how to install the OS remotely. Beyond my field of knowledge.

Sorry.

Haha no I've tried that before - "we have a wide variety of OS images available to re-image your server; we do not support installing your own OS" blah-di-blah-di-blah. That and the fact the imaging process (and emergency boot) takes place over the network, I'm fairly certain the server doesn't even have a CD/DVD drive.

That's the other great reason for me using Lilo: its interface is very minimal, which is ideal over a serial console. I could never understand grub in text-only mode.

Yes, there is only one image file, with two entries: one with the serial console enabled, the other without (again, the standard distribution from my server provider.) While testing the other kernels, they were listed here as well. I took the entries out but left the images intact in case I do need them again.

LaughingBoy · 09-20-2007, 11:59 PM

So, how does one tell the F7 and FC4 images apart?

andyccn · 09-21-2007, 02:35 AM

Easy! The FC4 kernel was 2.6.14 (upgraded to 2.6.16 successfully by FC5), the F7 kernel is 2.6.22. My compiled kernel is 2.6.22.6.

I can't remember the exact names for the images, but it's something like /boot/system-2.6.<whatever>.img.

/boot/vmlinuz is a symlink to the correct system.*.img file.
So if I want to add in one of the other kernels, I just add a new section in lilo.conf with an image=/boot/system-2.6.<whatever>.img setting, run 'lilo', reboot, and that's another option in Lilo to boot from.

andyccn · 09-23-2007, 12:45 AM

OK so the server crashed again about an hour ago, and all I got on the console was "ip_tables (c) Core Netfilter team" so it looked like it was half-way through a reboot (as that's a line from the boot-up messages.)

I forced a reboot from its control panel, and it came back up OK. I decided to install the standard kernel through yum (currently 2.6.22.5-76.fc7), and when rebooting, all I got was:

Quote:

List of all partitions:
No filesystem could mount root, tried: iso9660
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(9,0)

So it appears the SATA drivers are loaded, it's the file-system that isn't (the root fs is ext3.)
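The numbers in "unknown-block(9,0)" are the major and minor device numbers of the root device the kernel was asked to mount. A short sketch to decode them is below; the lookup table is only a two-entry excerpt of the Linux block major assignments (8 for SCSI/SATA disks, 9 for software RAID), and the function name decode_unknown_block is made up for this example.

```python
import re

# Small excerpt of Linux block-device major numbers, not a complete table.
KNOWN_MAJORS = {8: "sd (SCSI/SATA disk)", 9: "md (software RAID)"}

def decode_unknown_block(message):
    """Extract (major, minor, description) from a VFS panic line, or None."""
    match = re.search(r"unknown-block\((\d+),(\d+)\)", message)
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    return major, minor, KNOWN_MAJORS.get(major, "unknown major")

panic = "Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(9,0)"
print(decode_unknown_block(panic))  # (9, 0, 'md (software RAID)')
```

Major 9, minor 0 corresponds to /dev/md0, which matches the root=/dev/md0 line in the lilo.conf posted earlier. This only identifies which device the kernel was looking for; it does not by itself say why the device could not be mounted.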

Looking inside the config file for this kernel (/boot/config-2.6.22.5-76.fc7), I see "CONFIG_EXT3_FS=m" - not sure what the "m" means, but in my config, it's CONFIG_EXT3_FS=y.

So I'm going to recompile the 2.6.22.6 kernel using the Fedora project's config file, with the added EXT3 file-system and see if this recurring crashing goes away.

LaughingBoy · 09-23-2007, 02:03 AM

Quote:

Originally Posted by andyccn

Looking inside the config file for this kernel (/boot/config-2.6.22.5-76.fc7), I see "CONFIG_EXT3_FS=m" - not sure what the "m" means, but in my config, it's CONFIG_EXT3_FS=y.

So I'm going to recompile the 2.6.22.6 kernel using the Fedora project's config file, with the added EXT3 file-system and see if this recurring crashing goes away.

The "=m" means build as a loadable module. Usually, essential core kernel software that's not going to be unloaded whilst running should be built in completely - not modularised. Most kernel options that should be modularised are (for example) USB device support for systems that only load USB Mass Storage controllers every now and then. The kernel can then release that memory back to the system if it's not being used. That's (I believe) the idea behind it.
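To see every option that differs in this way between two kernel configs, a small comparison script can help. This is a minimal sketch: the function names are hypothetical, and the sample strings are illustrative (only the two CONFIG_EXT3_FS values come from this thread).

```python
def parse_kernel_config(text):
    """Map CONFIG_* option names to their values ('y', 'm', or something else)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" start with '#' and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options

def modular_here_builtin_there(distro_cfg, custom_cfg):
    """Options that are 'm' in distro_cfg but 'y' in custom_cfg."""
    return sorted(
        name for name, value in distro_cfg.items()
        if value == "m" and custom_cfg.get(name) == "y"
    )

# Illustrative config fragments, not the real files.
distro = parse_kernel_config(
    "CONFIG_EXT3_FS=m\nCONFIG_EXT2_FS=y\n# CONFIG_XFS_FS is not set\n"
)
custom = parse_kernel_config("CONFIG_EXT3_FS=y\nCONFIG_EXT2_FS=y\n")
print(modular_here_builtin_there(distro, custom))  # ['CONFIG_EXT3_FS']
```

In practice the two inputs would be read from files, for example /boot/config-2.6.22.5-76.fc7 and the .config used for the custom build.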

I'm also a bit puzzled why it only appeared to try ISO9660 as the file system type. If I'm reading that correctly, it didn't know about any other file system - or didn't try. What does your FSTAB file say the partition should be mounted as? What does fdisk say the partition type is?

andyccn · 09-25-2007, 03:39 PM

Like I said, in the default kernel config file installed by yum, all file-systems except iso9660 and ext2 are set to be built as modules. The root file-system is on a software-RAID partition, the underlying partition is ext3, which is what /etc/fstab says it is as well.

I can't understand why the Fedora project would load all file-systems as modules - what if you add your own partition with a different partition type to when you installed it (e.g. xfs)?

After the crash the other morning, I decided to re-image the machine with Ubuntu, and (at least for the moment) I can't upgrade the kernel past 2.6.16 (I've got to try and find an RPM, as apt-get doesn't recognise any newer kernel packages.) I did this simply because I can't afford for the server to keep crashing every 1-2 weeks. If it still crashes now, I can go back to my server provider and say "I'm using one of your standard images and the server is crashing, please swap it out with another one." If I went to them with the customised/upgraded Fedora, they may well turn round and say they couldn't support it, so I have to cover all bases!