Random Reboots

n9066r · 12-21-2008, 10:29 PM

I have a pair of ProLiant 5500s running CentOS 5.2 that both randomly reboot. APIC as well as ASR are off. I'm thinking it has to be a software problem rather than a hardware problem, since both are experiencing the same problem. Also, one is a mail/web server running Apache and Zimbra and the other is a Samba server.

Dec 21 06:06:48 mail -- MARK --
Dec 21 06:26:48 mail -- MARK --
Dec 21 06:46:48 mail -- MARK --
Dec 21 07:06:48 mail -- MARK --
Dec 21 07:26:48 mail -- MARK --
Dec 21 07:46:48 mail -- MARK --
Dec4.1: restart.
Dec 21 15:05:56 mail audispd: af_unix plugin initialized
Dec 21 15:05:56 mail audispd: audispd initialized with q_depth=64 and 1 active plugins
Dec 21 15:05:56 mail kernel: klogd 1.4.1, log source = /proc/kmsg started.
Dec 21 15:05:56 mail kernel: Linux version 2.6.18-92.1.18.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Wed Nov 12 09:30:27 EST 2008
Dec 21 15:05:56 mail kernel: BIOS-provided physical RAM map:
Dec 21 15:05:56 mail kernel: BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
Dec 21 15:05:56 mail kernel: BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
Dec 21 15:05:56 mail kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
Dec 21 15:05:56 mail kernel: BIOS-e820: 0000000000100000 - 000000009fffc000 (usable)
Dec 21 15:05:56 mail kernel: BIOS-e820: 000000009fffc000 - 00000000a0000000 (ACPI data)
Dec 21 15:05:56 mail kernel: BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
Dec 21 15:05:56 mail kernel: BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
Dec 21 15:05:56 mail kernel: BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
Dec 21 15:05:56 mail kernel: 1663MB HIGHMEM available.
Dec 21 15:05:56 mail kernel: 896MB LOWMEM available.
Dec 21 15:05:56 mail kernel: found SMP MP-table at 000f4fd0

anomie · 12-22-2008, 05:11 PM

Weird. You have 'mark' timestamps showing up regularly, then they stop, then hours later the server restarts. I wonder if something is killing syslogd??

Given that your logs aren't telling you much, do you have some spare RAM that you could swap into one of the two servers? (I think this is worth exploring on the chance that your RAM + motherboard combination are just not playing nice.)
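
For anyone wanting to spot where the 'mark' lines stop without reading the whole file, here is a minimal Python sketch. It assumes classic syslog timestamps ("Dec 21 07:46:48") in an English locale and a 20-minute mark interval as in the log above; the function name, the `year` argument (syslog lines carry no year) and the 25-minute threshold are my own choices, not anything from syslogd itself.

```python
from datetime import datetime, timedelta

def find_gaps(lines, year, max_gap=timedelta(minutes=25)):
    """Yield (previous, current) timestamps that are more than max_gap apart."""
    previous = None
    for line in lines:
        try:
            # The first 15 characters of a classic syslog line hold the timestamp.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous > max_gap:
            yield previous, stamp
        previous = stamp

sample = [
    "Dec 21 07:26:48 mail -- MARK --",
    "Dec 21 07:46:48 mail -- MARK --",
    "Dec 21 15:05:56 mail audispd: af_unix plugin initialized",
]
for before, after in find_gaps(sample, 2008):
    print(f"silent from {before} until {after}")
```

Run against /var/log/messages, a gap much longer than the mark interval shows when logging went quiet, which is not necessarily the moment the box actually rebooted.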

rweaver · 12-22-2008, 05:29 PM

Quote:

Originally Posted by n9066r

I have a pair of ProLiant 5500s running CentOS 5.2 that both randomly reboot. APIC as well as ASR are off. I'm thinking it has to be a software problem rather than a hardware problem, since both are experiencing the same problem. Also, one is a mail/web server running Apache and Zimbra and the other ...<SNIP>... 1663MB HIGHMEM available.
Dec 21 15:05:56 mail kernel: 896MB LOWMEM available.
Dec 21 15:05:56 mail kernel: found SMP MP-table at 000f4fd0

Run rootkit hunter and chkrootkit on the system. That looks strange, and having two servers doing it may be a coincidence, but it's highly doubtful.

trickykid · 12-23-2008, 09:43 AM

What type of cron jobs do you have running? Do you have any backups running? Anything in particular running you know of between the time stamps of MARK above and the reboot time stamps? I would imagine some process is killing these machines, preventing the logging or the like.
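
One rough way to answer that question is to list the daily jobs whose start time falls between the last MARK and the first boot message. A minimal Python sketch follows. The job names and times in `schedule` are placeholders I made up, since the thread never gives the real schedule, and the helper assumes both timestamps are on the same calendar day.

```python
from datetime import datetime, time

def jobs_in_window(start, end, schedule):
    """Return names of daily jobs whose time of day lies between start and end."""
    # Assumption: start and end fall on the same calendar day.
    return sorted(name for name, at in schedule.items()
                  if start.time() <= at <= end.time())

# Hypothetical schedule, for illustration only.
schedule = {
    "example_job_a": time(8, 0),
    "example_job_b": time(23, 30),
}

last_mark = datetime(2008, 12, 21, 7, 46, 48)   # last MARK in the posted log
first_boot = datetime(2008, 12, 21, 15, 5, 56)  # first line after the restart
print(jobs_in_window(last_mark, first_boot, schedule))
```

The real times would come from `crontab -l` for each user and from /etc/crontab.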

n9066r · 12-23-2008, 10:39 AM

Here is what I have running:

root Yes /etc/cron.daily/makewhatis.cron
/etc/cron.daily/rpm
/etc/cron.daily/mlocate.cron
/etc/cron.daily/0logwatch
/etc/cron.daily/prelink
/etc/cron.daily/0anacron
/etc/cron.daily/tmpwatch
root Yes /etc/cron.weekly/makewhatis.cron
/etc/cron.weekly/logrotate
/etc/cron.weekly/0anacron
root Yes /etc/cron.monthly/0anacron
root Yes /etc/webmin/cron/tempdelete.pl
zimbra Yes find /opt/zimbra/log/ -type f -name \*.log\* -mtime +8 -exec rm {} \; > /dev/nul ...
zimbra Yes find /opt/zimbra/log/ -type f -name \*.out.???????????? -mtime +8 -exec rm {} \; ...
zimbra Yes /opt/zimbra/libexec/zmstatuslog
zimbra Yes /opt/zimbra/libexec/zmdisklog
zimbra Yes find /opt/zimbra/mailboxd/logs/ -type f -name \*log\* -mtime +8 -exec rm {} \; > ...
zimbra Yes /opt/zimbra/libexec/zmmaintaintables >> /dev/null 2>&1
zimbra Yes /opt/zimbra/libexec/zmdbintegrityreport -m
zimbra Yes /opt/zimbra/libexec/zmcheckduplicatemysqld -e > /dev/null 2>&1
zimbra Yes /opt/zimbra/libexec/zmlogprocess > /tmp/logprocess.out 2>&1
zimbra Yes /opt/zimbra/libexec/zmgengraphs >> /tmp/gengraphs.out 2>&1
zimbra Yes /opt/zimbra/libexec/zmdailyreport -m
zimbra Yes /opt/zimbra/libexec/zmqueuelog
zimbra Yes /opt/zimbra/bin/zmtrainsa >> /opt/zimbra/log/spamtrain.log 2>&1
zimbra Yes /opt/zimbra/bin/zmtrainsa --cleanup >> /opt/zimbra/log/spamtrain.log 2>&1
zimbra No find /opt/zimbra/dspam/var/dspam/data/z/i/zimbra/zimbra.sig/ -type f -name \*sig ...
zimbra No /opt/zimbra/dspam/bin/dspam_logrotate -a 60 /opt/zimbra/dspam/var/dspam/system.l ...
zimbra No /opt/zimbra/dspam/bin/dspam_logrotate -a 60 /opt/zimbra/dspam/var/dspam/data/z/i ...
zimbra Yes /opt/zimbra/libexec/sa-learn -p /opt/zimbra/conf/salocal.cf --dbpath /opt/zimbra ...
zimbra Yes find /opt/zimbra/data/amavisd/tmp -maxdepth 1 -type d -name 'amavis-*' -mtime +1 ...
zimbra Yes find /opt/zimbra/data/amavisd/quarantine -type f -mtime +7 -exec rm -f {} \; > / ...

trickykid · 12-23-2008, 10:45 AM

What time are those zimbra crons running? Anything close to the time the machine is getting rebooted? One of those could be the culprit.

momolin · 12-23-2008, 05:14 PM

The servers both rebooted at the same time, so I think you should check the power.

n9066r · 12-23-2008, 05:51 PM

Quote:

Originally Posted by rweaver

Run rootkit hunter and chkrootkit on the system. That looks strange, and having two servers doing it may be a coincidence, but it's highly doubtful.

I'll run them and see what I come up with.

Thanks

n9066r · 12-23-2008, 06:02 PM

Quote:

Originally Posted by trickykid

What time are those zimbra crons running? Anything close to the time the machine is getting rebooted? One of those could be the culprit.

None of them are close to the reboot times, and the reboots are completely random. Load also seems to be no factor, as a reboot is just as likely during off hours as during business hours.

trickykid · 12-24-2008, 08:50 AM

Quote:

Originally Posted by momolin

The servers both rebooted at the same time, so I think you should check the power.

I'd have to agree here as well. If it's not a cron job, it happens off business hours, and both servers rebooted at almost exactly the same time, I'd check power as well.

n9066r · 12-24-2008, 09:35 AM

I checked for rootkits last night and both servers were clean. The servers don't reboot at the same time but there could still be an issue with the UPS.

I tried:

kernel /vmlinuz-2.6.18-92.1.18.el5 ro root=/dev/VolGroup00/LogVol00 debug apm=off acpi=off ide=nodma nousb nopsmcia noapic nofb

and so far the server has been up 23 hours, which it hasn't done before. Hopefully this will be the answer.

trickykid · 12-24-2008, 09:55 AM

Quote:

Originally Posted by n9066r

I checked for rootkits last night and both servers were clean. The servers don't reboot at the same time but there could still be an issue with the UPS.

Never rule it out of the equation.

Do these servers have dual power supplies? If so, do you have more than one UPS? If there's more than one, you should split the power between them and/or try bypassing the UPS to see if the problem recurs. Also, most UPSes have a management port; you could probably set it up to monitor them and see if the UPS is the actual culprit.

n9066r · 12-24-2008, 10:05 AM

They do have dual power supplies. I have another UPS at home that I'll bring and try next week. The UPS does have a management port so I'll set it up and see what it shows.

alexhwest · 12-24-2008, 10:48 AM

Probably not relevant to the problem, but you used nopsmcia where it should be nopcmcia.

n9066r · 12-24-2008, 10:57 AM

Thanks for catching the typo. I'll change it in grub.conf.
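
A misspelled boot option like this one is easy to miss, because the kernel usually ignores names it does not recognise instead of complaining. A small Python sketch of a sanity check, using the standard `difflib` module: the `EXPECTED` list is only the set of names from the boot line in this thread plus the corrected spelling, so treat it as an assumption and not as an authoritative list of boot options. The function name and the 0.8 cutoff are likewise my own choices.

```python
import difflib

# Assumption: names taken from this thread's boot line plus the corrected
# spelling; this is not an authoritative list of boot options.
EXPECTED = ["ro", "root", "debug", "apm", "acpi", "ide",
            "nousb", "nopcmcia", "noapic", "nofb"]

def suspicious_params(cmdline, expected=EXPECTED):
    """Return (given, suggestion) pairs for option names not in `expected`."""
    findings = []
    for token in cmdline.split():
        name = token.split("=", 1)[0]  # drop any "=value" part
        if name in expected:
            continue
        close = difflib.get_close_matches(name, expected, n=1, cutoff=0.8)
        findings.append((name, close[0] if close else None))
    return findings

boot_line = ("ro root=/dev/VolGroup00/LogVol00 debug apm=off acpi=off "
             "ide=nodma nousb nopsmcia noapic nofb")
for given, suggestion in suspicious_params(boot_line):
    print(f"{given}: did you mean {suggestion}?")
```

On a running system the same check could be fed the contents of /proc/cmdline.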