Red Hat EL5: intermittent server restarts (syslogd 1.4.1: restart)

elmgrep · 05-14-2011, 12:51 AM

My HP ProLiant DL360 G6 server (Red Hat EL5, kernel 2.6.18-194.el5, 64-bit) is failing with intermittent restarts.

Initially we had some cooling issues which meant the server restarted outside business hours - but not during business hours. A/C has now been upgraded.

Instead I'm now having "random" server restarts.
The HP management console does not log any H/W errors.
The I/O and CPU load is low.
I am not sure what is at fault based on the log snippets below.

Any ideas for additional logging that could reveal the real issue, or clues based on the log content below, would be most appreciated.
(There are a few minutes between the last log entry and the syslogd restart, so I don't know if the logged events are related to the server restart.)
The fax logging is from HylaFAX.

Suggestions, please.

Thanks much

/var/log/messages
May 9 08:26:41 risqube vcagentd: Type: 4 - Event ID: 1073741884
May 9 08:26:43 risqube kernel: mtrr: type mismatch for e8000000,4000000 old: uncachable new: write-combining
May 9 08:34:26 risqube syslogd 1.4.1: restart.

May 9 09:32:03 risqube kernel: mtrr: type mismatch for e8000000,4000000 old: uncachable new: write-combining
May 9 09:37:04 risqube syslogd 1.4.1: restart.

May 10 15:23:54 risqube FaxGetty[8496]: LOCKWAIT
May 10 15:23:54 risqube FaxQueuer[6988]: NOTIFY exit status: 0 (3677)
May 10 15:28:08 risqube syslogd 1.4.1: restart.

May 10 19:23:36 risqube FaxGetty[8265]: MODEM USR U.S. Robotics 56K FAX
May 10 19:29:14 risqube syslogd 1.4.1: restart.

May 10 19:35:13 risqube FaxGetty[10528]: MODEM USR U.S. Robotics 56K FAX
May 10 20:56:47 risqube syslogd 1.4.1: restart.

May 10 22:15:46 risqube vcagentd: Type: 4 - Event ID: 1073741884
May 10 22:15:49 risqube kernel: mtrr: type mismatch for e8000000,4000000 old: uncachable new: write-combining
May 10 22:16:43 risqube hpasmxld[5775]: hpDeferSPDThread: End of Collecting DIMM SPD data.
May 10 22:23:32 risqube syslogd 1.4.1: restart.

May 12 14:49:27 risqube FaxGetty[10387]: LOCKWAIT
May 12 14:49:27 risqube FaxQueuer[6985]: NOTIFY exit status: 0 (25282)
May 12 14:53:33 risqube syslogd 1.4.1: restart.

May 13 07:41:49 risqube syslogd 1.4.1: restart.

May 13 10:18:49 risqube syslogd 1.4.1: restart.

May 13 10:35:13 risqube kernel: nfsd: last server has exited
May 13 10:35:13 risqube kernel: nfsd: unexporting all filesystems
May 13 10:35:14 risqube xinetd[5260]: Exiting...
May 13 10:35:20 risqube rpc.statd[4063]: Caught signal 15, un-registering and exiting.
May 13 10:35:20 risqube auditd[3746]: The audit daemon is exiting.
May 13 10:35:20 risqube kernel: audit(1305297320.464:131): audit_pid=0 old=3746 by auid=4294967295
May 13 10:35:20 risqube kernel: Kernel logging (proc) stopped.
May 13 10:35:20 risqube kernel: Kernel log daemon terminating.
May 13 10:35:21 risqube exiting on signal 15
May 13 10:38:31 risqube syslogd 1.4.1: restart.

May 13 17:30:06 risqube FaxSend[3251]: SEND FAX: JOB 33259
May 13 17:34:42 risqube syslogd 1.4.1: restart.

/var/log/cron

May 10 19:25:01 risqube crond[24969]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
May 10 19:30:18 risqube crond[5541]: (CRON) STARTUP (V5.0)
May 10 19:30:25 risqube anacron[6964]: Anacron 2.3 started on 2011-05-10
May 10 19:30:25 risqube anacron[6964]: Normal exit (0 jobs run)

May 10 20:50:01 risqube crond[23039]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
May 10 20:51:01 risqube crond[23654]: (root) CMD (/usr/sbin/faxqclean)
May 10 20:52:01 risqube crond[24240]: (root) CMD (/usr/sbin/faxqclean)
May 10 20:53:01 risqube crond[24842]: (root) CMD (/usr/sbin/faxqclean)
May 10 20:57:52 risqube crond[5550]: (CRON) STARTUP (V5.0)
May 10 20:58:00 risqube anacron[6972]: Anacron 2.3 started on 2011-05-10
May 10 20:58:00 risqube anacron[6972]: Normal exit (0 jobs run)

May 13 07:35:01 risqube crond[18803]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
May 13 07:42:56 risqube crond[5559]: (CRON) STARTUP (V5.0)
May 13 07:43:02 risqube anacron[6918]: Anacron 2.3 started on 2011-05-13
May 13 07:43:02 risqube anacron[6918]: Normal exit (0 jobs run)

more /etc/syslog.conf
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none /var/log/messages

# The authpriv file has restricted access.
authpriv.* /var/log/secure

# Log all the mail messages in one place.
mail.* -/var/log/maillog

# Log cron stuff
cron.* /var/log/cron

# Everybody gets emergency messages
*.emerg *

# Save news errors of level crit and higher in a special file.
uucp,news.crit /var/log/spooler

# Save boot messages also to boot.log
local7.* /var/log/boot.log

xeleema · 05-14-2011, 03:26 AM

Greetingz!

Are you sure the server itself is rebooting? You're checking the output of the "uptime" command, correct?
If it's just the syslog daemon that's restarting, then it's probably logrotate cycling the logs.

On a side note, double-check your "Automatic Server Recovery" settings. Perhaps the watchdog is getting too frisky (try setting the timer to 30 minutes).
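
To help tell an orderly shutdown from an abrupt one, the log excerpts can be scanned for each "syslogd 1.4.1: restart." line and the entry just before it. This is only a rough sketch: it assumes the classic syslog timestamp format shown in the excerpts, it assumes that "exiting on signal 15" marks a clean shutdown (as in the May 13 10:35 sequence), and the function name `classify_restarts` and the `year` parameter are made up for illustration. It needs the full log, not trimmed snippets, and it does not by itself distinguish a reboot from logrotate signalling syslogd.

```python
import re
from datetime import datetime

# "May 13 10:35:21 risqube exiting on signal 15" -> timestamp, host, message
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")


def parse(line, year=2011):
    """Return (timestamp, message) for a syslog line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    # Classic syslog lines carry no year, so one has to be assumed.
    ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return ts, m.group(3)


def classify_restarts(lines, year=2011):
    """For each syslogd restart line, report what preceded it and how long before."""
    results = []
    prev = None
    for line in lines:
        entry = parse(line, year)
        if entry is None:
            continue
        ts, msg = entry
        if msg.startswith("syslogd") and msg.endswith("restart."):
            if prev is None:
                results.append((ts, "unknown", None))
            else:
                prev_ts, prev_msg = prev
                # Assumption: syslogd logs this line only when it is stopped cleanly.
                kind = "clean" if "exiting on signal 15" in prev_msg else "abrupt"
                results.append((ts, kind, ts - prev_ts))
        prev = (ts, msg)
    return results


sample = [
    "May 9 08:26:43 risqube kernel: mtrr: type mismatch for e8000000,4000000 old: uncachable new: write-combining",
    "May 9 08:34:26 risqube syslogd 1.4.1: restart.",
    "May 13 10:35:21 risqube exiting on signal 15",
    "May 13 10:38:31 risqube syslogd 1.4.1: restart.",
]
for ts, kind, gap in classify_restarts(sample):
    print(ts, kind, gap)
```

Restarts flagged "abrupt" with a gap of several minutes are the ones worth comparing against the "(CRON) STARTUP" times in the cron log and against the ASR/iLO event log.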

elmgrep · 05-15-2011, 04:48 PM

Unfortunately, yes, the server really is rebooting this often.
I have changed the ASR timer to 30 min.
Thanks!

Maybe it's an MTRR issue?

dmesg |grep mtrr
mtrr: type mismatch for e8000000,4000000 old: uncachable new: write-combining
mtrr: type mismatch for e8000000,4000000 old: uncachable new: write-combining

lspci -v
01:03.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02) (prog-if 00 [VGA controller])
Subsystem: Hewlett-Packard Company Unknown device 31fb
Flags: bus master, stepping, medium devsel, latency 64, IRQ 7
Memory at e8000000 (32-bit, prefetchable) [size=128M]
I/O ports at 3000 [size=256]
Memory at f5ff0000 (32-bit, non-prefetchable) [size=64K]
[virtual] Expansion ROM at f5e00000 [disabled] [size=128K]
Capabilities: [50] Power Management version 2

cat /proc/mtrr

reg00: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1

Is this indicative of a problem?
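
For reference, the address arithmetic behind the message can be checked with a few lines. This is a sketch using only the numbers pasted above (reg00 from /proc/mtrr and the e8000000,4000000 range from dmesg); the helper names are made up for illustration. It shows that the range requested as write-combining lies entirely inside the existing uncachable reg00 entry, which would explain why the kernel reports a type mismatch. It does not show whether that has anything to do with the reboots.

```python
MB = 1024 * 1024


def range_end(base, size):
    """Exclusive end address of a range."""
    return base + size


def contains(outer_base, outer_size, inner_base, inner_size):
    """True if the inner range lies entirely within the outer range."""
    return (outer_base <= inner_base
            and range_end(inner_base, inner_size) <= range_end(outer_base, outer_size))


# reg00: base=0xe0000000 (3584MB), size=512MB: uncachable
reg00_base, reg00_size = 0xE0000000, 512 * MB

# mtrr: type mismatch for e8000000,4000000 (base, size in hex; 0x4000000 is 64MB)
req_base, req_size = 0xE8000000, 0x4000000

print(hex(range_end(reg00_base, reg00_size)))
print(hex(range_end(req_base, req_size)))
print(contains(reg00_base, reg00_size, req_base, req_size))
```

The base e8000000 is also the prefetchable memory region of the ATI ES1000 in the lspci output, so the request most likely concerns the video framebuffer.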

dita · 09-15-2011, 12:30 PM

Hi,
We have the same problem on RHEL 5.7 (2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux):

kernel: mtrr: type mismatch for d4000000,1000000 old: uncachable new: write-combining

Any ideas?
Thanks D.

kyjo · 06-04-2012, 11:48 AM

Hi elmgrep and dita,

Did you find a solution to your server rebooting issue? Can you share some info?

Thanks