My servers go down every weekend at a fixed hour. Why?

adrian.carciumaru · 09-23-2008, 03:23 AM

Hello all,

I have an Oracle RAC installed on two Linux servers, each running OEL 4.
The servers work well all week until Sunday morning, when one of two things happens:
1. A SCSI error appears on both servers at 02:02 AM and one or both servers shut down.
2. The same SCSI error occurs at the same time and the servers keep working, but only until exactly 04:02 AM, when they restart. After the restart the servers behave abnormally: they run very slowly and sometimes do not respond to ping, so they need another restart.

Each Sunday morning I have to come to the office to manually restart the servers.
I've opened an SR with Oracle, but after a lot of investigation they said it's an operating system problem.

------------------------------------------
Here is the first situation, when one of the servers shut down:
from /var/log/messages

server1:

Sep 14 02:03:05 rac1tiriac kernel: SCSI error : <2 0 1 1> return code = 0x20000
Sep 14 02:03:06 rac1tiriac kernel: SCSI error : <2 0 1 2> return code = 0x20000
Sep 14 02:03:17 rac1tiriac kernel: o2net: connection to node rac2tiriac (num 1) at 192.168.xx.xx:7777 has been idle for 10 seconds, shutting it down.
Sep 14 02:03:17 rac1tiriac kernel: (0,1):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1221346987.288928 now 1221346997.286720 dr 1221346987.288913 adv 1221346987.288930:1221346987.288931 func (2961896f:504) 1221343477.841972:1221343477.842018)
Sep 14 02:03:17 rac1tiriac kernel: o2net: no longer connected to node rac2tiriac (num 1) at 192.168.xx.xx:7777
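
A note on reading the o2net idle timer message above: the tmr and now values look like Unix epoch timestamps. That is an assumption, since the log itself does not say so, and the helper names below are made up for illustration. A minimal sketch of the arithmetic in Python, using only the two values from the log:

```python
from datetime import datetime, timezone

# Values copied from the o2net idle timer message. Treating them as
# Unix epoch seconds is an assumption, not something the log states.
tmr = 1221346987.288928  # assumed: when the idle timer was last armed
now = 1221346997.286720  # assumed: when the idle timer fired


def idle_seconds(timer_set: float, fired: float) -> float:
    """Seconds between arming the timer and the timer firing."""
    return fired - timer_set


def as_utc(epoch: float) -> datetime:
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


print(round(idle_seconds(tmr, now), 3))
print(as_utc(tmr).strftime("%Y-%m-%d %H:%M:%S"))
```

Under that assumption the gap is about 10 seconds, which agrees with "has been idle for 10 seconds", and tmr falls at 23:03:07 UTC on Sep 13. That lines up with the 02:03:17 log time if the servers run at UTC+3, which is also an assumption.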

Sep 14 04:02:20 rac1tiriac syslogd 1.4.1: restart.
Sep 14 04:02:19 rac1tiriac nmbd[7694]: Got SIGHUP dumping debug info.

server2:
(the one that shut down; it seems the message from the first server is what shut it down: "connection to node rac2tiriac (num 1) at 192.168.xx.xx:7777 has been idle for 10 seconds, shutting it down")
- no specific message appeared after 02:01 AM until the next day, when it was manually started.

-------------------------------------------------
Here is the second situation, when the SCSI error appeared at 02:02 and both servers continued to work until 04:02, when they restarted.
from /var/log/messages:

server1:

Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000
Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000
Sep 21 02:02:26 rac1tiriac kernel: SCSI error : <3 0 1 2> return code = 0x20000

Sep 21 04:02:07 rac1tiriac cups: cupsd shutdown succeeded
Sep 21 04:02:09 rac1tiriac cups: cupsd startup succeeded
Sep 21 04:02:09 rac1tiriac nmbd[7678]: [2008/09/21 04:02:09, 0] nmbd/nmbd.c:process(542)
Sep 21 04:02:09 rac1tiriac syslogd 1.4.1: restart.
Sep 21 04:02:09 rac1tiriac nmbd[7678]: Got SIGHUP dumping debug info.

server2
Sep 21 04:02:06 rac2 syslogd 1.4.1: restart.
Sep 21 02:02:25 rac2tiriac kernel: SCSI error : <1 0 1 2> return code = 0x20000
Sep 21 02:02:25 rac2tiriac kernel: SCSI error : <1 0 1 1> return code = 0x20000

Sep 21 04:02:05 rac2tiriac cups: cupsd shutdown succeeded
Sep 21 04:02:06 rac2tiriac cups: cupsd startup succeeded
Sep 21 04:02:06 rac2tiriac syslogd 1.4.1: restart.
Sep 21 04:02:06 rac2tiriac nmbd[7709]: [2008/09/21 04:02:06, 0] nmbd/nmbd.c:process(542)
Sep 21 04:02:06 rac2tiriac nmbd[7709]: Got SIGHUP dumping debug info
--------------------------------

Here is part of /var/log/cron on each server around 04:02:
rac1:
Sep 21 04:01:01 rac1tiriac crond[23639]: (root) CMD (run-parts /etc/cron.hourly)
Sep 21 04:02:01 rac1tiriac crond[24089]: (root) CMD (run-parts /etc/cron.daily)
Sep 21 04:02:06 rac1tiriac anacron[24544]: Updated timestamp for job `cron.daily' to 2008-09-21

rac2:
Sep 21 04:00:01 rac2tiriac crond[22765]: (root) CMD (/script.sh)
Sep 21 04:01:01 rac2tiriac crond[23213]: (root) CMD (run-parts /etc/cron.hourly)
Sep 21 04:02:01 rac2tiriac crond[23681]: (root) CMD (run-parts /etc/cron.daily)
Sep 21 04:02:05 rac2tiriac anacron[24014]: Updated timestamp for job `cron.daily' to 2008-09-21
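
To check whether the SCSI errors really cluster at the same minute every Sunday, the log excerpts above can be tallied by weekday and minute. This is a minimal sketch in Python. The sample lines are copied from the excerpts in this post, the year 2008 is assumed because syslog lines do not carry one, and the function name is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Sample lines copied from the /var/log/messages excerpts in this post.
SAMPLE = """\
Sep 14 02:03:05 rac1tiriac kernel: SCSI error : <2 0 1 1> return code = 0x20000
Sep 14 02:03:06 rac1tiriac kernel: SCSI error : <2 0 1 2> return code = 0x20000
Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000
Sep 21 02:02:26 rac1tiriac kernel: SCSI error : <3 0 1 2> return code = 0x20000
Sep 21 04:02:09 rac1tiriac syslogd 1.4.1: restart.
"""

# Captures "Mon DD HH:MM" and the hostname of kernel SCSI error lines only.
LINE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d):\d\d (\S+) kernel: SCSI error")


def scsi_error_minutes(text, year=2008):
    """Count SCSI errors per (weekday HH:MM, host).

    Syslog omits the year, so it is passed in (assumed 2008 here).
    """
    counts = Counter()
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M")
            counts[(stamp.strftime("%a %H:%M"), m.group(2))] += 1
    return counts


for key, n in sorted(scsi_error_minutes(SAMPLE).items()):
    print(key, n)
```

To run it against the real file, pass the contents of /var/log/messages to scsi_error_minutes instead of SAMPLE. If every key comes back as a Sunday at 02:02 or 02:03, that points at something scheduled rather than at a randomly failing disk.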

After a lot of searching I found that my problem is very similar to an old thread (2006) called "Why the sever goes down every weekend?", but it gave me no resolution.

Can anybody help? Any advice is highly appreciated!
Adrian.

billymayday · 09-23-2008, 03:37 AM

What jobs are you running at or just before that time? Have a look at the crontabs in /var/spool/cron

adrian.carciumaru · 09-23-2008, 03:45 AM

Hi,
I only have one file, "root", in /var/spool/cron, and it contains this line:
0 * * * 0,6 /script.sh
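
For reference, the five fields of that crontab line are minute, hour, day of month, month and day of week, so it means minute 0 of every hour on Sunday (0) and Saturday (6). A minimal sketch in Python that checks a timestamp against such a line; it handles only `*` and comma-separated numbers, is not a full cron parser, and the function names are made up for illustration:

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one cron field; supports only '*' and comma-separated numbers."""
    return field == "*" or value in {int(x) for x in field.split(",")}


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on a minute selected by `expr`."""
    minute, hour, dom, month, dow = expr.split()
    # Python counts Monday as 0; cron counts Sunday as 0.
    cron_dow = (when.weekday() + 1) % 7
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(dom, when.day)
            and field_matches(month, when.month)
            and field_matches(dow, cron_dow))


print(cron_matches("0 * * * 0,6", datetime(2008, 9, 21, 2, 0)))  # Sunday 02:00
print(cron_matches("0 * * * 0,6", datetime(2008, 9, 22, 2, 0)))  # Monday 02:00
```

So /script.sh runs at 02:00 and 04:00 on Sunday (the 04:00:01 run shows up in the rac2 cron log), and at every other full hour of the weekend too.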

I've searched the messages in /var/log, but nothing logged beforehand indicates what would cause the restart.


adrian.carciumaru · 09-24-2008, 04:17 AM

Can anybody help? Any advice is highly appreciated!
Thx

cam34 · 09-24-2008, 06:13 AM

nmbd is getting a SIGHUP at 4:02 AM.
Do you have Samba installed and running as well?
nmbd is the NetBIOS name server daemon that Samba uses for MS networking.

Do you need it running? Can you disable the service?

adrian.carciumaru · 09-24-2008, 06:43 AM

Yes, I have Samba installed and running as well.
As far as I know, smbd and nmbd are Samba processes.
I don't know if I can disable nmbd and Samba, but I will try to disable them on Saturday evening and see what happens.


racracracrac · 09-24-2008, 12:33 PM

It may be a silly question, but are you sure they are rebooting? Your logs only indicate things are being restarted, most likely from logrotate.

Run the command w(1) to see what the uptime on your systems is.
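
If you would rather compare an exact boot time against the 04:02 log entries, the same information is available in /proc/uptime on Linux, where the first field is the number of seconds since boot. A minimal sketch in Python; the function name is made up for illustration:

```python
from datetime import datetime, timedelta


def boot_time(uptime_text: str, now: datetime) -> datetime:
    """Derive the boot time from the contents of /proc/uptime.

    The first whitespace-separated field is the uptime in seconds.
    """
    seconds_up = float(uptime_text.split()[0])
    return now - timedelta(seconds=seconds_up)


# Example with a made-up uptime of exactly 3 days:
print(boot_time("259200.00 500000.00\n", datetime(2008, 9, 24, 12, 33)))
```

On a live system, pass open("/proc/uptime").read() and datetime.now(). A boot time earlier than Sunday 04:02 means the machine did not reboot at that point.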


adrian.carciumaru · 09-25-2008, 02:59 AM

You may be right. They have been up for 3 days (since Sunday morning) because I manually restarted them that morning.
Thx


adrian.carciumaru · 09-25-2008, 07:02 AM

But even if they don't reboot, they behave abnormally after 04:02, performing very slowly.


JMCraig · 09-25-2008, 10:30 AM

Adrian,

Based on a slightly similar problem I had with SCSI drives for a period of time, you may have a bad SCSI drive (or cable--those cables are such a pain to test). I had a failure that would occur periodically (only sometimes at a particular time) and it didn't always take the server down, but it did cause problems. The only way I was able to figure out what was wrong was to set up logging on the SCSI drives via the vendor's SCSI RAID controller management software. That allowed me to see that it was always a particular disk. After doing a number of low-level formats and then rebuilding the RAID setup on that drive, only to have it fail again shortly after that, I finally replaced the drive: problem solved, it seemed. Later, I had a similar problem and it was solved by replacing the cable.

Definitely get whatever administrative software is available for the drives you're using and enable whatever logging is available.

As to why it affects both computers, my guess is that Samba shares might be involved (since that is what's getting the SIGHUP). If you're using them, you might consider reconfiguring things so that you're not using them over one weekend and see what happens. In my case, I knew why it was sometimes happening at a certain time of day because a database was being backed up. Do you have anything set up to run from "inside" Oracle (such as backups)? (I'm not familiar with Oracle so I don't know if you can schedule it to do its own backups without running an external program. But if you can do that, maybe that's what's causing it to happen at particular times.)

HTH,

John

adrian.carciumaru · 09-30-2008, 03:35 AM

Hello all,

Thanks John for your advice.
The servers indeed do not restart at ~04:02 AM. racracracrac was right.

However, the problem is the shutdown at 02:02 AM after the SCSI error.
It has been 2 weeks since either server went down. Before that, though, there were 2 weeks in a row that went well, followed by a third in which one or both servers shut down, so I'm not sure the servers will not shut down again in the future. What causes the shutdown at exactly 02:02 AM on Sunday morning is the SCSI error. Sometimes the servers get past this error and continue to run, and sometimes the error occurs and one (or both) servers shut down. How can I get rid of the SCSI error?
