Hello all,
I have an Oracle RAC installed on two Linux servers, each running OEL 4.
The servers work well all week until Sunday morning, when one of two things happens:
1. Either a SCSI error appears on both servers at 02:02am and one or both servers shut down,
2. or the same SCSI error occurs at the same time and the servers continue to work, but only until exactly 04:02am, when they restart. After the restart the servers behave abnormally: they run very slowly and sometimes do not respond to ping, so they need another restart.
Each Sunday morning I have to come to the office to manually restart the servers.
I've opened an SR with Oracle, but after a lot of investigation they said it's an operating system problem.
------------------------------------------
Here is the first situation, when one of the servers shut down:
From /var/log/messages:
server1:
Sep 14 02:03:05 rac1tiriac kernel: SCSI error : <2 0 1 1> return code = 0x20000
Sep 14 02:03:06 rac1tiriac kernel: SCSI error : <2 0 1 2> return code = 0x20000
Sep 14 02:03:17 rac1tiriac kernel: o2net: connection to node rac2tiriac (num 1) at 192.168.xx.xx:7777 has been idle for 10 seconds, shutting it down.
Sep 14 02:03:17 rac1tiriac kernel: (0,1):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1221346987.288928 now 1221346997.286720 dr 1221346987.288913 adv 1221346987.288930:1221346987.288931 func (2961896f:504) 1221343477.841972:1221343477.842018)
Sep 14 02:03:17 rac1tiriac kernel: o2net: no longer connected to node rac2tiriac (num 1) at 192.168.xx.xx:7777
Sep 14 04:02:20 rac1tiriac syslogd 1.4.1: restart.
Sep 14 04:02:19 rac1tiriac nmbd[7694]: Got SIGHUP dumping debug info.
server2:
(the one that shut down; it seems that the message from the first server shut it down: "connection to node rac2tiriac (num 1) at 192.168.xx.xx:7777 has been idle for 10 seconds, shutting it down")
- No specific message appeared after 02:01am until the next day, when it was manually started.
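For reference, the return code in the SCSI error lines above can be split into its component bytes. This is only a minimal sketch, assuming the usual Linux SCSI result layout (driver byte, host byte, message byte, status byte, from high to low); the function name and the partial host-byte table are hypothetical and for illustration only. Under that assumption, 0x20000 has host byte 0x02, which the kernel headers name DID_BUS_BUSY.

```python
# Host-byte names as defined in the Linux kernel's SCSI headers
# (partial list, included here only for illustration).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(code):
    """Split a 32-bit SCSI result into its four bytes (hypothetical helper)."""
    host_byte = (code >> 16) & 0xFF
    return {
        "driver_byte": (code >> 24) & 0xFF,
        "host_byte": host_byte,
        "host_name": HOST_BYTE_NAMES.get(host_byte, "unknown"),
        "msg_byte": (code >> 8) & 0xFF,
        "status_byte": code & 0xFF,
    }


# The value logged on both servers in the excerpts above.
result = decode_scsi_result(0x20000)
print(result["host_name"])
```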
-------------------------------------------------
Here is the second situation, when the SCSI error appeared at 02:02 and both servers continued to work until 04:02, when they restarted.
From /var/log/messages:
server1:
Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000
Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000
Sep 21 02:02:26 rac1tiriac kernel: SCSI error : <3 0 1 2> return code = 0x20000
Sep 21 04:02:07 rac1tiriac cups: cupsd shutdown succeeded
Sep 21 04:02:09 rac1tiriac cups: cupsd startup succeeded
Sep 21 04:02:09 rac1tiriac nmbd[7678]: [2008/09/21 04:02:09, 0] nmbd/nmbd.c:process(542)
Sep 21 04:02:09 rac1tiriac syslogd 1.4.1: restart.
Sep 21 04:02:09 rac1tiriac nmbd[7678]: Got SIGHUP dumping debug info.
server2:
Sep 21 04:02:06 rac2 syslogd 1.4.1: restart.
Sep 21 02:02:25 rac2tiriac kernel: SCSI error : <1 0 1 2> return code = 0x20000
Sep 21 02:02:25 rac2tiriac kernel: SCSI error : <1 0 1 1> return code = 0x20000
Sep 21 04:02:05 rac2tiriac cups: cupsd shutdown succeeded
Sep 21 04:02:06 rac2tiriac cups: cupsd startup succeeded
Sep 21 04:02:06 rac2tiriac syslogd 1.4.1: restart.
Sep 21 04:02:06 rac2tiriac nmbd[7709]: [2008/09/21 04:02:06, 0] nmbd/nmbd.c:process(542)
Sep 21 04:02:06 rac2tiriac nmbd[7709]: Got SIGHUP dumping debug info
--------------------------------
Here is part of /var/log/cron on each server around 04:02:
rac1:
Sep 21 04:01:01 rac1tiriac crond[23639]: (root) CMD (run-parts /etc/cron.hourly)
Sep 21 04:02:01 rac1tiriac crond[24089]: (root) CMD (run-parts /etc/cron.daily)
Sep 21 04:02:06 rac1tiriac anacron[24544]: Updated timestamp for job `cron.daily' to 2008-09-21
rac2:
Sep 21 04:00:01 rac2tiriac crond[22765]: (root) CMD (/script.sh)
Sep 21 04:01:01 rac2tiriac crond[23213]: (root) CMD (run-parts /etc/cron.hourly)
Sep 21 04:02:01 rac2tiriac crond[23681]: (root) CMD (run-parts /etc/cron.daily)
Sep 21 04:02:05 rac2tiriac anacron[24014]: Updated timestamp for job `cron.daily' to 2008-09-21
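To line up the kernel messages with the cron entries, something like the following could be used. It is only a sketch: the function name, the event labels, and the regular expression are hypothetical choices of mine, and the sample lines are copied from the excerpts above. It assumes the plain syslog line format shown in this post.

```python
import re

# "Mon DD HH:MM:SS host message" as seen in the excerpts above.
LINE_RE = re.compile(r"^(\w{3}\s+\d+)\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def extract_events(lines):
    """Return (date, time, host, kind, message) for the lines of interest."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        date, time, host, message = match.groups()
        if "SCSI error" in message:
            kind = "scsi"
        elif "CMD (" in message:
            kind = "cron"
        elif "syslogd" in message and "restart" in message:
            kind = "syslog-restart"
        else:
            continue  # ignore everything else (cups, nmbd, ...)
        events.append((date, time, host, kind, message))
    # Plain string sort: adequate for excerpts taken from a single month.
    events.sort(key=lambda event: (event[0], event[1]))
    return events


sample = [
    "Sep 21 04:02:01 rac1tiriac crond[24089]: (root) CMD (run-parts /etc/cron.daily)",
    "Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000",
    "Sep 21 04:02:09 rac1tiriac syslogd 1.4.1: restart.",
    "Sep 21 04:02:07 rac1tiriac cups: cupsd shutdown succeeded",
]

events = extract_events(sample)
for date, time, host, kind, _ in events:
    print(date, time, host, kind)
```

The same function could be fed the concatenated contents of /var/log/messages and /var/log/cron from both nodes to see which job, if any, starts just before each event.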
After a lot of searching I found that my problem is very similar to an old thread (2006) called "Why the sever goes down every weekend?", but it gave me no resolution.
Can anybody help? Any advice is highly appreciated!
Adrian.