I am running RHEL 5.8 (2.6.18-308.1.1.0.1.el5), and the server has rebooted itself twice in the past 3 days. No errors appear before the reboots, and no one was logged in at the time. (I have reviewed most of the log files in /var/log.)
I'm thinking it may be a missing patch, but unless I can figure out which one, I won't be granted an outage. I have searched the RH knowledge base but haven't found anything.
Any ideas where I should check to find the RCA for the reboots?
Jennifer
Is it possible the server hardware or power to it blinked? Do you have any tools/logs for the hardware that might show it? (e.g. for Dell systems one can run Dell OpenManage and it keeps hardware and alert logs that might give a clue.)
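One way to tell a power blink or hard reset from an orderly reboot is the wtmp history: an orderly reboot normally leaves a shutdown record before the next boot record, while an abrupt loss of power usually does not. Here is a minimal Python sketch of that check, assuming the text comes from `last -x` (newest entry first) and that the relevant lines start with `reboot` or `shutdown`. The function name and the sample lines are made up for illustration and are not from the server in question.

```python
def classify_boots(last_x_output):
    """Label each boot record in `last -x` text as clean, unclean or unknown.

    Assumes newest-first ordering, as `last` prints it. A boot is 'clean'
    only if the record just before it in time is a shutdown record. The
    oldest boot in the file has no predecessor, so it is 'unknown'.
    """
    # Keep only boot/shutdown records, then walk them oldest-first.
    events = [line for line in last_x_output.splitlines()
              if line.startswith(("reboot", "shutdown"))]
    events.reverse()

    results = []
    previous = None
    for line in events:
        kind = line.split()[0]
        if kind == "reboot":
            if previous is None:
                label = "unknown"
            elif previous == "shutdown":
                label = "clean"
            else:
                label = "unclean"
            results.append((label, line.strip()))
        previous = kind
    return results


# Hypothetical sample data for illustration only.
SAMPLE = """\
reboot   system boot  2.6.18-308.1.1.0 Wed Mar  7 02:14          (05:01)
reboot   system boot  2.6.18-308.1.1.0 Mon Mar  5 23:40         (1+02:33)
shutdown system down  2.6.18-308.1.1.0 Mon Mar  5 23:37 - 23:40  (00:02)
reboot   system boot  2.6.18-308.1.1.0 Thu Mar  1 09:00         (4+14:37)
"""

if __name__ == "__main__":
    for label, line in classify_boots(SAMPLE):
        print(label, "|", line)
```

A boot labelled unclean only says no shutdown record was written first; it points toward power, hardware or a panic, but it does not say which.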
I'd suggest running mcelog to see if any machine check events were logged. I'd also suggest running memtest86 on the system to make sure that the RAM is good.
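Alongside mcelog, it can help to search the syslog for the kernel's note that machine check events were recorded. A rough Python sketch follows, assuming the kernel message text is "Machine check events logged" and that the log is /var/log/messages. The function name and the sample lines are hypothetical.

```python
def find_machine_checks(lines, needle="Machine check events logged"):
    """Return (line_number, text) for every log line containing `needle`.

    The default needle is an assumption about the kernel's wording;
    pass a different string if your kernel logs something else.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if needle in line:
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    "Mar  5 23:10:01 host1 crond[2211]: (root) CMD (run-parts /etc/cron.hourly)\n",
    "Mar  5 23:31:44 host1 kernel: Machine check events logged\n",
    "Mar  5 23:40:12 host1 syslogd 1.4.1: restart.\n",
]

if __name__ == "__main__":
    for number, text in find_machine_checks(SAMPLE_LOG):
        print(number, text)
```

To run it against the real log, pass an open file instead of the sample list, e.g. `find_machine_checks(open("/var/log/messages"))`. No hits does not rule out a hardware fault, since a machine that resets abruptly may never get the message to disk.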
It could also be a heating problem. Do you have a way to monitor the CPU and motherboard temperatures (either lm-sensors or using something like IPMI)?
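If IPMI is available, the temperature readings can be collected periodically and checked against a limit. Here is a minimal Python sketch, assuming the input is the pipe-separated text that `ipmitool sdr type Temperature` prints, with the reading in the last column as "NN degrees C". The sensor names, sample values and the 45 C limit are made up, so check your own hardware's output and thresholds.

```python
import re

# Matches readings such as "48 degrees C" in the last column.
READING = re.compile(r"(-?\d+(?:\.\d+)?)\s+degrees C")


def parse_temperatures(sdr_text):
    """Map sensor name -> temperature in Celsius; skip sensors with no reading."""
    readings = {}
    for line in sdr_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not a sensor row in the assumed format
        match = READING.search(fields[-1])
        if match:
            readings[fields[0]] = float(match.group(1))
    return readings


def too_hot(readings, limit_c):
    """Return the sorted names of sensors at or above limit_c."""
    return sorted(name for name, temp in readings.items() if temp >= limit_c)


# Hypothetical sample output for illustration only.
SAMPLE_SDR = """\
Ambient Temp     | 0Eh | ok  |  7.1 | 23 degrees C
CPU1 Temp        | 01h | ok  |  3.1 | 48 degrees C
Planar Temp      | 08h | ns  |  7.1 | No Reading
"""

if __name__ == "__main__":
    temps = parse_temperatures(SAMPLE_SDR)
    print(temps)
    print(too_hot(temps, limit_c=45.0))  # 45 C is an arbitrary example limit
```

Logging these readings from cron with a timestamp would let you compare the temperatures just before each reboot with the normal range.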
Temperature can affect one server in a rack worse than others. A few years back we had a rack dead center of our data center with a server that had a DELL PERC (RAID) controller in it. Even though the system itself was not showing any temperature issues based on its internal sensors, I was able to demonstrate that the PERC itself (which has no temperature sensor of its own) was being affected by heat and causing the system to lock up periodically. (One other server in the rack also experienced the issue, but both were the same class and both had multiple disks.) The other servers in the rack, however, never had any apparent issues.
I demonstrated the issue simply by opening the rack door. This gave very little extra air but was just enough to avoid the issue. Whenever I closed it, after a short while I'd see the issue come back. What was maddening was that DELL denied the PERC was susceptible to heat until I'd proven it by doing these tests. Back then they had me run full diags on the system, which really annoyed me because the diag for the PERC only tests whether the battery is there and charged; it did no actual component test of the board itself.