
felbvts 09-04-2012 11:41 AM

RHEL5.8 server rebooting on its own - Why?

I am running RHEL 5.8 (2.6.18-308. The server has rebooted itself twice in the past 3 days. No errors are seen before the reboots and no one is logged in during the reboots. (I reviewed most log files in /var/log.)

I'm thinking it's possibly a missing patch, but unless I can figure out which one I won't be granted an outage. I have searched the RH knowledge base but I haven't found anything.

Any ideas where I should check to find the RCA for the reboots?

MensaWater 09-04-2012 11:56 AM

Is it possible the server hardware or power to it blinked? Do you have any tools/logs for the hardware that might show it? (e.g. for Dell systems one can run Dell OpenManage and it keeps hardware and alert logs that might give a clue.)

felbvts 09-04-2012 01:13 PM

Thanks for the reply. No power outage reported. Plus it rebooted 9/2 and 9/4 at different times of day. Logs don't look like it was a hard reset either.
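For anyone wanting to double-check the "not a hard reset" reading, here is a rough Python sketch that walks a syslog file and flags each boot that was not preceded by a clean syslogd stop. The two marker strings are assumptions about what sysklogd and the kernel write on a box like this ("exiting on signal 15" at a clean shutdown, "kernel: Linux version" at boot), so verify them against your own /var/log/messages first. The function name `classify_boots` is made up for illustration.

```python
import re

# Assumed markers; check them against your own /var/log/messages.
CLEAN_STOP = re.compile(r"exiting on signal 15")      # syslogd stopped by a normal shutdown
BOOT_START = re.compile(r"kernel: Linux version ")    # first kernel line of a new boot


def classify_boots(lines):
    """Return a list of (boot_line, was_clean) pairs, oldest first.

    was_clean is True only if a clean-stop marker was seen since the
    previous boot marker. The first boot in a rotated log file may be
    reported as unclean simply because the shutdown lines were rotated away.
    """
    results = []
    saw_clean_stop = False
    for line in lines:
        if CLEAN_STOP.search(line):
            saw_clean_stop = True
        elif BOOT_START.search(line):
            results.append((line.rstrip("\n"), saw_clean_stop))
            saw_clean_stop = False
    return results
```

To try it, open the log file and pass the file object in, e.g. `classify_boots(open("/var/log/messages"))`. A boot flagged as not clean points toward a crash, panic, or power/hardware reset rather than someone or something calling shutdown.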

I went through the SAR logs and I don't see any CPU or memory spikes.
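One more thing the SAR data can give is the reboot time itself, since sar prints a "LINUX RESTART" line at each boot. A small sketch, assuming you have saved the text output of something like `sar -u -f /var/log/sa/saDD` to a file and that the restart lines look like "02:14:33 PM       LINUX RESTART" (the helper name `find_restarts` is hypothetical):

```python
def find_restarts(sar_lines):
    """Return (last_sample_line, restart_stamp) for each LINUX RESTART line.

    last_sample_line is the most recent non-blank, non-Average line seen
    before the restart, or None if the restart is the first thing in the
    output. Comparing its timestamp with the restart stamp shows how long
    the box was silent before it came back.
    """
    restarts = []
    last_sample = None
    for raw in sar_lines:
        line = raw.strip()
        if not line or line.startswith("Average"):
            continue
        if "LINUX RESTART" in line:
            stamp = line.split("LINUX RESTART")[0].strip()
            restarts.append((last_sample, stamp))
            last_sample = None
        else:
            last_sample = line
    return restarts
```

If the last sample before a restart looks normal, that fits with the "no spikes" observation and makes a slow resource exhaustion less likely, though a 10-minute sampling interval can easily miss a short spike.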

Right now I am thinking it's a bug - that I need a patch.
I am putting in a support call with Red Hat - Will let you know what I find out.

Any additional comments are welcome! :)

btmiller 09-04-2012 03:03 PM

I'd suggest running mcelog to see if any machine check events were logged. I'd also suggest running memtest86 on the system to make sure that the RAM is good.

It could also be a heating problem. Do you have a way to monitor the CPU and motherboard temperatures (either lm-sensors or using something like IPMI)?

felbvts 09-05-2012 03:04 PM

These were great ideas - thank you!

I've ruled out the temperature issue as none of the other servers in the rack are having any issues.

mcelog is not showing anything. /proc/cpuinfo & meminfo are not showing anything significant.

MensaWater 09-05-2012 03:27 PM

Temperature can affect one server in a rack worse than others. A few years back we had a rack dead center of our data center that had a DELL PERC (RAID) controller in it. Even though the system itself was not showing any temperature issues based on its internal sensors I was able to demonstrate that the PERC itself (which has no temperature sensor of its own) was being affected by heat and causing the system to lock up periodically. (Of course one other server in the rack also experienced the issue but both were the same class and both had multiple disks.) The other servers in the rack however never had any apparent issues.

I demonstrated the issue simply by opening the rack door. This gave very little extra air but was just enough to avoid the issue. Whenever I closed it, after a short while I'd see the issue come back. What was maddening was that DELL denied the PERC was susceptible to heat until I'd proven it by doing these tests. Back then they had me run full diags on the system, which really annoyed me because the diag for the PERC only tests whether the battery is there and charged - it did no actual component test of the board itself.

All times are GMT -5.