LinuxQuestions.org > Forums > Enterprise Linux Forums > Linux - Enterprise
Old 09-25-2006, 02:06 PM   #1
cambie
Member
 
Registered: Jul 2004
Posts: 90

Rep: Reputation: 15
reboots happen? in the enterprise


I work for a LARGE company as a Linux sys admin, and we are currently pushing applications hard toward Linux and away from AIX, HP-UX, and, to a lesser extent, Solaris. We currently admin about 500 boxes from two different manufacturers; most are HP, and the rest are IBM. Though we have our preferences in terms of remote management and other software accessories, both manufacturers seem to see their fair share of random reboots for no apparent reason. Sometimes there are clues in /var/log/messages or other logs that give some vague explanation, but this is not satisfactory to some new additions to our team who've come from other platforms. They don't like that the server doesn't take a dump that we can send off to Red Hat or some vendor who can explain what the heck happened. There are often reboots that we just write off as "well, that happens sometimes." With the emphasis our company is putting on Linux, I don't want to get up to 1200 servers and mission-critical apps and then have a crash that I have to explain to the CEO.

Is that just the way it is running Linux on an Intel chip? If we want stability, are we going to have to look for different hardware, or is there something we are missing in terms of logging data that could help us diagnose these problems? We've tried setting up netdump, but that seems to be all but worthless; we've only actually seen it dump data once in all the times it's been set up.

To be clear, I'm concerned about random reboots, hangs, and other problems on Red Hat 2.1, 3, and 4 systems, on recent-generation IBM and HP hardware.

My boss doesn't like that we sometimes don't have a good explanation, and I want to be able to provide an answer.

Last edited by cambie; 09-25-2006 at 02:07 PM.
 
Old 09-25-2006, 02:54 PM   #2
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 344
The single most common explanation for "random reboots" is that it's not a software issue. The two most common causes I've seen are power fluctuations and heat.

One data center swore that their power was clean, but it turned out the cleaning staff was plugging the floor cleaner into the wrong outlet, causing power spikes that generated reboots. Another site plugged a high-current laser printer into the same circuit as the server (occasional reboot when the printer woke up from power-saving mode). Overloaded power supplies are another common culprit: a power supply rated at 750W cannot really supply 750W of stable power, because that leaves no headroom for conditioning.

Failed fans or dust-clogged heatsinks cause various components to shut down spontaneously when they exceed a temperature threshold. At one place I consulted for, the building manager decided to shut down the AC overnight to save power, raising the room temperature above the maximum ambient spec; the IT staff had no idea, because they were not in the building when it happened. In another case, a well-cooled room had hot spots caused by poor circulation around the racks (no raised floor).

All of the above scenarios prevent the software from reporting the problem, because the situation is not under the software's control. Linux can provide dumps for software problems, but no OS can report a problem if the CPU halts.
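
If you want hard evidence either way, log the environmental readings out-of-band so the data survives a crash. A rough sketch, assuming the boxes have a BMC reachable with ipmitool (the log host and log directory below are just examples):

Code:
#!/bin/bash
# Sketch: run from cron every few minutes; appends BMC sensor
# readings to a remote host so the data survives a local crash.
# /var/log/env must already exist on the log host.
LOGHOST=loghost.example.com          # hypothetical central log server
{
    date
    ipmitool sdr type Temperature    # chassis/CPU temperatures
    ipmitool sdr type Voltage        # supply rail voltages
} | ssh "$LOGHOST" "cat >> /var/log/env/$(hostname).log"

After a crash, "ipmitool sel list" on the affected box will often show the thermal or power event the OS never got a chance to see.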
 
Old 09-25-2006, 05:00 PM   #3
hob
Senior Member
 
Registered: Mar 2004
Location: Wales, UK
Distribution: Debian, Ubuntu
Posts: 1,075

Rep: Reputation: 45
That's a very good point. I would add that Intel-based machines tend to be built to a lower standard than the old RS/6000 and Sun machines (i.e. they are built to price points), so even with server hardware from reputable suppliers I would expect failures when you deploy in large numbers, and more general flakiness than with proprietary UNIX boxes.

From the software side of things:

- The version of RHEL may make a difference. I believe that RHEL 4 and the 2.6 kernel were the first versions developed in significant collaboration with the hardware vendors.

- Red Hat are developing software in conjunction with some hardware vendors to address the issues with debugging (frysk and systemtap). These are in Fedora, and may be in RHEL 5 - I don't know about the latter.

- Generalization: for big deployments of Intel boxes you can use virtualization and clustering to reduce the dependency on individual machines.
 
Old 09-26-2006, 11:10 AM   #4
cambie
Member
 
Registered: Jul 2004
Posts: 90

Original Poster
Rep: Reputation: 15
Very good points, and certainly possible. But our data center is a fortress: palm scan to get into the secured computer room, the A/C is NEVER turned off, and the power supplies are redundant and fed by our own power system. Obviously I've got no control over that, so I can't rule out unclean power or something like that, but I suspect that's not the issue. And I really doubt heat is the problem: it happens in multiple data centers across the country, and all of them are cooled properly.

I've sort of resigned myself to thinking this is just what happens with x86 hardware, but something makes me not want to let it die like that. I mean, everything in the dang thing is redundant... shouldn't that prevent a lot of hardware failures? I guess if it's something on the motherboard, then it doesn't matter whether all of your memory and hard drives and power supplies are redundant.

Just to note, we are only now starting to deploy 4.0, so all of our boxes are either RHEL 2.1 or 3.0, about half and half really. They are DP DL380s, DL580s, and some DL740s; from IBM, we use x346s, x366s, and x460s with their MXE counterparts chained together as 8-ways. It happens across both platforms, although we've got ASR turned on for the HPs, so those usually come back up on their own. The IBMs usually just hang.
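
For reference, we check ASR and the hardware event log on the ProLiants with hpasmcli from the hpasm package; roughly this, if your install matches ours:

Code:
# Sketch: query a ProLiant's health via the hpasm package.
hpasmcli -s "show asr"     # ASR state and timeout
hpasmcli -s "show iml"     # Integrated Management Log: thermal/power events
hpasmcli -s "show temp"    # current temperatures vs. thresholds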
 
Old 09-26-2006, 11:31 AM   #5
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 344
Quote:
I've sort of resigned myself to thinking this is just what happens with x86 hardware.
If that were really just how x86 hardware behaves, do you think Linux on x86 would be as popular as it is?

If you think your data center is a fortress, I assume each site has multiple environmental recorders (power, temp, humidity; even vibration monitors are not uncommon). You can provide the tapes to the vendor as proof the environment is not the cause and that the hardware is unstable.

Quote:
Palm scan to get into the secured computer room
Who cleans the place? The CIO?

One other thought, since you mentioned using your own power system: are you using DC power? Nothing screws up a data center faster than long runs of DC power, ground loops, or floating grounds.
 
Old 09-26-2006, 11:32 AM   #6
paulgnyc
LQ Newbie
 
Registered: Feb 2006
Posts: 10

Rep: Reputation: 0
You might be able to get some more out of this...

I had a system that was constantly (and randomly) crashing with a custom kernel and some kernel mods, and we couldn't track down the issue. We ended up doing console redirection out of the server's serial port, via a command-line argument at kernel boot, and then logging all of the output with minicom on the server it was attached to.

Unfortunately, I don't remember all of the steps off the top of my head, but you can check the docs on /proc/sys/kernel/printk and add console=ttyS0 to the kernel line in /etc/grub.conf. The fact that it's actually rebooting suggests you've got a number greater than 0 in /proc/sys/kernel/panic, which makes the system reboot automatically N seconds after a panic occurs. If that's set to 0 on your systems, then I would definitely investigate further the power situation the previous post suggested.
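
I don't have the exact files anymore, so treat this as a sketch; the kernel version and root device below are just examples:

Code:
# /etc/grub.conf -- kernel line with the serial console added (sketch):
kernel /vmlinuz-2.4.21-47.EL ro root=/dev/sda2 console=tty0 console=ttyS0,9600n8

# Let every kernel message through to the console:
echo "8 4 1 7" > /proc/sys/kernel/printk

# See whether the box auto-reboots after a panic:
cat /proc/sys/kernel/panic    # 0 = hang on panic, N = reboot after N seconds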


Another doc I would suggest checking out is the one for nmi_watchdog:

http://www.mjmwired.net/kernel/Docum...i_watchdog.txt
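
Enabling it is just a boot parameter, if I remember right:

Code:
# Sketch: append to the kernel line in /etc/grub.conf:
#   nmi_watchdog=1    (IO-APIC mode; nmi_watchdog=2 uses the local APIC)
# Then confirm the NMI count is ticking up while the box is busy:
grep NMI /proc/interrupts; sleep 10; grep NMI /proc/interrupts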

By the way, it turns out that my server was crashing because of a bug in the ips driver for the RAID controller. It was an IBM x345 with a ServeRAID 5i card.
 
Old 09-26-2006, 08:09 PM   #7
cambie
Member
 
Registered: Jul 2004
Posts: 90

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by paulgnyc
You might be able to get some more out of this...

That's the thing. We do redirect output to the serial port and log all of its data to a central location; we do this for every system in our midrange department. We use it for monitoring, and echo the output of the date command to the console every ten minutes so that it gets logged. But when we crash, there's usually nothing on the console. Our HP and IBM management software tells us nothing about what kind of hardware problem may have caused the crash, and the iLOs and RSAs have nothing in their event logs.
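
The heartbeat is nothing fancy, just a cron entry along these lines:

Code:
# /etc/crontab -- console heartbeat, every ten minutes (sketch):
*/10 * * * * root date > /dev/console 2>/dev/null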

The IBMs we have all have ServeRAID 8i cards in them, and we already worked through our driver problems with SCSI hangs, so that's not it. And as I said, it happens on both vendors' servers.

The data center manager would have access to all the power monitoring applications and the details of how current is provided. All I truly know is that there's a 20-ton generator on the roof of the data center I work in, and that's just the newest one added. The power generation and backup system is impressive; that's about all I can say for sure.

We've started setting up netdump on all of our systems in an attempt to get some sort of crash data when it does happen, but I've heard from other SAs that it doesn't usually give any useful information. I'm curious whether anyone else agrees or disagrees.
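
In case it helps anyone following along, the setup we're rolling out looks roughly like this (the collector IP is just an example, and the names are from the Red Hat docs as I remember them):

Code:
# On the collector box (netdump-server package) -- sketch:
passwd netdump                  # set a password for the netdump user
chkconfig netdump-server on
service netdump-server start

# On each client (netdump package):
echo 'NETDUMPADDR=10.0.0.5' >> /etc/sysconfig/netdump   # example collector IP
service netdump propagate       # pushes keys; prompts for the netdump password
chkconfig netdump on
service netdump start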
 
  

