LinuxQuestions.org
09-23-2008, 04:23 AM   #1
adrian.carciumaru
LQ Newbie
 
Registered: Sep 2008
Posts: 8

Rep: Reputation: 0
My servers go down every weekend at a fixed hour. Why?


Hello all,

I have an Oracle RAC installed on two Linux servers, each running OEL 4.
The servers work well all week until Sunday morning, when one of two things happens:
1. A SCSI error appears on both servers at 02:02 AM and one or both servers shut down.
2. The same SCSI error occurs at the same time and the servers continue to work, but only until exactly 04:02 AM, when they restart. After the restart the servers behave abnormally: they run very slowly and sometimes do not respond to ping, so they have to be restarted again.

Every Sunday morning I have to come to the office to restart the servers manually.
I've opened an SR with Oracle, but after a lot of investigation they said it's an operating system problem.


------------------------------------------
Here is the first situation, where one of the servers shut down.
From /var/log/messages:

server1:

Sep 14 02:03:05 rac1tiriac kernel: SCSI error : <2 0 1 1> return code = 0x20000
Sep 14 02:03:06 rac1tiriac kernel: SCSI error : <2 0 1 2> return code = 0x20000
Sep 14 02:03:17 rac1tiriac kernel: o2net: connection to node rac2tiriac (num 1) at 192.168.xx.xx:7777 has been idle for 10 seconds, shutting it down.
Sep 14 02:03:17 rac1tiriac kernel: (0,1):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1221346987.288928 now 1221346997.286720 dr 1221346987.288913 adv 1221346987.288930:1221346987.288931 func (2961896f:504) 1221343477.841972:1221343477.842018)
Sep 14 02:03:17 rac1tiriac kernel: o2net: no longer connected to node rac2tiriac (num 1) at 192.168.xx.xx:7777

Sep 14 04:02:20 rac1tiriac syslogd 1.4.1: restart.
Sep 14 04:02:19 rac1tiriac nmbd[7694]: Got SIGHUP dumping debug info.

server2 (the one that shut down; it seems the message from the first server shut it down: "connection to node rac2tiriac (num 1) at 192.168.xx.xx:7777 has been idle for 10 seconds, shutting it down"):
No specific message appeared after 02:01 AM until the next day, when it was started manually.
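
As a cross-check on the o2net message in the server1 log, the raw epoch timestamps in its debug line can be subtracted. This is only a minimal Python sketch: it assumes `tmr` is when the idle timer was armed and `now` is when it fired (the log itself does not say so). `idle_seconds` is a hypothetical helper, and the sample line is shortened from the one above.

```python
import re

# Pulls the "tmr" and "now" epoch values out of the debug line.
O2NET_RE = re.compile(r"tmr (\d+\.\d+) now (\d+\.\d+)")


def idle_seconds(debug_line):
    """Return now - tmr from an o2net idle-timer debug line, in seconds."""
    match = O2NET_RE.search(debug_line)
    if match is None:
        raise ValueError("no tmr/now pair found")
    tmr, now = (float(value) for value in match.groups())
    return now - tmr


# Shortened copy of the debug line from the server1 log.
line = ("(tmr 1221346987.288928 now 1221346997.286720 "
        "dr 1221346987.288913 adv 1221346987.288930:1221346987.288931)")
print(round(idle_seconds(line), 3))
```

For the values logged above the difference is just under 10 seconds, in line with the "idle for 10 seconds" message.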


-------------------------------------------------
Here is the second situation, where the SCSI error appeared at 02:02 and both servers continued to work until 04:02, when they restarted.
From /var/log/messages:

server1:

Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000
Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000
Sep 21 02:02:26 rac1tiriac kernel: SCSI error : <3 0 1 2> return code = 0x20000


Sep 21 04:02:07 rac1tiriac cups: cupsd shutdown succeeded
Sep 21 04:02:09 rac1tiriac cups: cupsd startup succeeded
Sep 21 04:02:09 rac1tiriac nmbd[7678]: [2008/09/21 04:02:09, 0] nmbd/nmbd.c:process(542)
Sep 21 04:02:09 rac1tiriac syslogd 1.4.1: restart.
Sep 21 04:02:09 rac1tiriac nmbd[7678]: Got SIGHUP dumping debug info.



server2:
Sep 21 04:02:06 rac2 syslogd 1.4.1: restart.
Sep 21 02:02:25 rac2tiriac kernel: SCSI error : <1 0 1 2> return code = 0x20000
Sep 21 02:02:25 rac2tiriac kernel: SCSI error : <1 0 1 1> return code = 0x20000

Sep 21 04:02:05 rac2tiriac cups: cupsd shutdown succeeded
Sep 21 04:02:06 rac2tiriac cups: cupsd startup succeeded
Sep 21 04:02:06 rac2tiriac syslogd 1.4.1: restart.
Sep 21 04:02:06 rac2tiriac nmbd[7709]: [2008/09/21 04:02:06, 0] nmbd/nmbd.c:process(542)
Sep 21 04:02:06 rac2tiriac nmbd[7709]: Got SIGHUP dumping debug info
--------------------------------

Here is part of /var/log/cron on each server around 04:02:
rac1:
Sep 21 04:01:01 rac1tiriac crond[23639]: (root) CMD (run-parts /etc/cron.hourly)
Sep 21 04:02:01 rac1tiriac crond[24089]: (root) CMD (run-parts /etc/cron.daily)
Sep 21 04:02:06 rac1tiriac anacron[24544]: Updated timestamp for job `cron.daily' to 2008-09-21

rac2:
Sep 21 04:00:01 rac2tiriac crond[22765]: (root) CMD (/script.sh)
Sep 21 04:01:01 rac2tiriac crond[23213]: (root) CMD (run-parts /etc/cron.hourly)
Sep 21 04:02:01 rac2tiriac crond[23681]: (root) CMD (run-parts /etc/cron.daily)
Sep 21 04:02:05 rac2tiriac anacron[24014]: Updated timestamp for job `cron.daily' to 2008-09-21


After a lot of searching I found that my problem is very similar to an old thread (2006) called "Why the sever goes down every weekend?", but it gave me no resolution.

Can anybody help? Any advice is highly appreciated!
Adrian.
 
09-23-2008, 04:37 AM   #2
billymayday
LQ Guru
 
Registered: Mar 2006
Location: Sydney, Australia
Distribution: Fedora, CentOS, OpenSuse, Slack, Gentoo, Debian, Arch, PCBSD
Posts: 6,678

Rep: Reputation: 122
What jobs are you running at or just before that time? Have a look at the crontabs in /var/spool/cron
 
09-23-2008, 04:45 AM   #3
adrian.carciumaru
LQ Newbie
 
Registered: Sep 2008
Posts: 8

Original Poster
Rep: Reputation: 0
Hi,
I only have one file, "root", in /var/spool/cron, and it contains this line:
0 * * * 0,6 /script.sh

I've searched the messages in /var/log, but nothing logged beforehand indicates what causes the restart.
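
To see when that crontab entry fires, here is a minimal Python sketch. It assumes the standard five-field crontab order (minute, hour, day of month, month, day of week, with 0 meaning Sunday) and only handles `*`, single numbers, and comma-separated lists. `field_matches` and `cron_matches` are hypothetical helpers, not part of any library.

```python
from datetime import datetime


def field_matches(field, value):
    """Match one crontab field: '*' or a comma-separated list of numbers."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(schedule, when):
    """Return True if the five-field schedule fires at the given minute."""
    minute, hour, dom, month, dow = schedule.split()
    # cron counts Sunday as 0; isoweekday() counts Monday=1 .. Sunday=7.
    cron_dow = when.isoweekday() % 7
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(dom, when.day)
            and field_matches(month, when.month)
            and field_matches(dow, cron_dow))


schedule = "0 * * * 0,6"
print(cron_matches(schedule, datetime(2008, 9, 21, 2, 0)))  # Sunday 02:00
print(cron_matches(schedule, datetime(2008, 9, 21, 2, 2)))  # Sunday 02:02
print(cron_matches(schedule, datetime(2008, 9, 22, 2, 0)))  # Monday 02:00
```

Read this way, /script.sh runs at minute 0 of every hour on Saturdays and Sundays, including 02:00 on Sunday, and that matches the 04:00:01 entry in the rac2 cron log. The sketch does not cover ranges, step values, or 7 as an alias for Sunday.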

 
09-24-2008, 05:17 AM   #4
adrian.carciumaru
LQ Newbie
 
Registered: Sep 2008
Posts: 8

Original Poster
Rep: Reputation: 0
Can anybody help? Any advice is highly appreciated!
Thx
 
09-24-2008, 07:13 AM   #5
cam34
Member
 
Registered: Aug 2003
Distribution: Fedora 22, Debian 8, Centos 6/7 for servers
Posts: 101

Rep: Reputation: 16
nmbd is getting a SIGHUP at 4:02 AM.
Do you have Samba installed and running as well?
nmbd is the NetBIOS name daemon that Samba uses for MS networking.

Do you need it running? Can you disable the service?
 
09-24-2008, 07:43 AM   #6
adrian.carciumaru
LQ Newbie
 
Registered: Sep 2008
Posts: 8

Original Poster
Rep: Reputation: 0
Yes, I have Samba installed and running as well.
As far as I know, smbd and nmbd are Samba processes.
I don't know whether I can disable nmbd and Samba, but I will try disabling them on Saturday evening and see what happens.


 
09-24-2008, 01:33 PM   #7
racracracrac
Member
 
Registered: Sep 2008
Posts: 44

Rep: Reputation: 15
It may be a silly question, but are you sure they are rebooting? Your logs only show services being restarted, most likely by logrotate.
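
One way to check this against the posted logs is to list what was logged in the seconds after cron.daily started. This is a minimal Python sketch using a few lines from the first post as sample input. It assumes the classic syslog prefix (month, day, time, host) and that the year is 2008, which syslog itself does not record. `parse_line` and `events_after` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta

# Timestamp, host, and the rest of the line.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")


def parse_line(line, year=2008):
    """Split a syslog line into (timestamp, host, message)."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, host, message = match.groups()
    # Prepend the assumed year, since syslog timestamps omit it.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, message


def events_after(start, lines, window=timedelta(seconds=30)):
    """Return messages logged within `window` after `start`."""
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and start <= parsed[0] <= start + window:
            found.append(parsed[2])
    return found


cron_line = "Sep 21 04:02:01 rac1tiriac crond[24089]: (root) CMD (run-parts /etc/cron.daily)"
messages = [
    "Sep 21 02:02:25 rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000",
    "Sep 21 04:02:07 rac1tiriac cups: cupsd shutdown succeeded",
    "Sep 21 04:02:09 rac1tiriac syslogd 1.4.1: restart.",
    "Sep 21 04:02:09 rac1tiriac nmbd[7678]: Got SIGHUP dumping debug info.",
]

cron_start = parse_line(cron_line)[0]
for message in events_after(cron_start, messages):
    print(message)
```

On this sample the cupsd, syslogd, and nmbd messages all fall within seconds of cron.daily starting, while the 02:02 SCSI error does not.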

Run w(1) to see what the uptime of your systems is.
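
The same check can be scripted. Here is a minimal Python sketch, assuming a Linux system where the first field of /proc/uptime is the number of seconds since boot. `parse_uptime` is a hypothetical helper name and the sample string is made up for illustration.

```python
from datetime import timedelta


def parse_uptime(text):
    """Return time since boot from the contents of /proc/uptime."""
    # The file holds two numbers: seconds since boot and idle seconds.
    seconds_up = float(text.split()[0])
    return timedelta(seconds=seconds_up)


# Sample contents for illustration only (exactly 3 days, made-up idle time).
sample = "259200.00 250000.00\n"
print(parse_uptime(sample))  # 3 days, 0:00:00
```

On a live server, pass in the text read from /proc/uptime instead of the sample. If the result is longer than the time elapsed since Sunday 04:02, the machine did not reboot then.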

(link removed)

Last edited by Mara; 10-12-2008 at 04:54 PM. Reason: link removed, as it has nothing to do with this thread
 
09-25-2008, 03:59 AM   #8
adrian.carciumaru
LQ Newbie
 
Registered: Sep 2008
Posts: 8

Original Poster
Rep: Reputation: 0
You may be right. They have been up for 3 days (since Sunday morning) because I restarted them manually that morning.
Thanks.

 
09-25-2008, 08:02 AM   #9
adrian.carciumaru
LQ Newbie
 
Registered: Sep 2008
Posts: 8

Original Poster
Rep: Reputation: 0
But even if they don't reboot, they behave abnormally after 04:02, running very slowly.

 
09-25-2008, 11:30 AM   #10
JMCraig
Member
 
Registered: Feb 2003
Location: Utah, USA
Distribution: Red Hat EL/CentOS, Ubuntu/Debian
Posts: 113

Rep: Reputation: 15
Adrian,

Based on a slightly similar problem I had with SCSI drives for a period of time, you may have a bad SCSI drive (or cable--those cables are such a pain to test). I had a failure that would occur periodically (only sometimes at a particular time) and it didn't always take the server down, but it did cause problems. The only way I was able to figure out what was wrong was to set up logging on the SCSI drives via the vendor's SCSI RAID controller management software. That allowed me to see that it was always a particular disk. After doing a number of low-level formats and then rebuilding the RAID setup on that drive, only to have it fail again shortly after that, I finally replaced the drive: problem solved, it seemed. Later, I had a similar problem and it was solved by replacing the cable.

Definitely get whatever administrative software is available for the drives you're using and enable whatever logging is available.
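
As a first pass before any vendor tooling is set up, the kernel messages already quoted in this thread can be tallied per device. This is a minimal Python sketch, assuming the four numbers in angle brackets are host, channel, id, and lun. `count_scsi_errors` is a hypothetical helper, and the sample lines are shortened copies of the SCSI errors in the first post. The host number is left out of the key because it differs across the posted logs (1, 2, 3), which may just reflect adapter numbering; that is an assumption.

```python
import re
from collections import Counter

SCSI_RE = re.compile(
    r"SCSI error : <(\d+) (\d+) (\d+) (\d+)> return code = (0x[0-9a-fA-F]+)"
)


def count_scsi_errors(lines):
    """Count SCSI errors per (channel, id, lun), ignoring the host number."""
    counts = Counter()
    for line in lines:
        match = SCSI_RE.search(line)
        if match:
            _host, channel, target, lun, _code = match.groups()
            counts[(int(channel), int(target), int(lun))] += 1
    return counts


# The seven SCSI error lines from the first post, timestamps trimmed.
sample_lines = [
    "rac1tiriac kernel: SCSI error : <2 0 1 1> return code = 0x20000",
    "rac1tiriac kernel: SCSI error : <2 0 1 2> return code = 0x20000",
    "rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000",
    "rac1tiriac kernel: SCSI error : <3 0 1 1> return code = 0x20000",
    "rac1tiriac kernel: SCSI error : <3 0 1 2> return code = 0x20000",
    "rac2tiriac kernel: SCSI error : <1 0 1 2> return code = 0x20000",
    "rac2tiriac kernel: SCSI error : <1 0 1 1> return code = 0x20000",
]

for device, count in sorted(count_scsi_errors(sample_lines).items()):
    print(device, count)
```

On these seven lines every error is on channel 0, id 1, LUN 1 or 2, on both servers. Running the same tally over several weeks of /var/log/messages would show whether that pattern holds.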

As to why it affects both computers, my guess is that Samba shares might be involved (since that is what's getting the SIGHUP). If you're using them, you might consider reconfiguring things so that you're not using them over one weekend and see what happens. In my case, I knew why it was sometimes happening at a certain time of day because a database was being backed up. Do you have anything set up to run from "inside" Oracle (such as backups)? (I'm not familiar with Oracle so I don't know if you can schedule it to do its own backups without running an external program. But if you can do that, maybe that's what's causing it to happen at particular times.)

HTH,

John
 
09-30-2008, 04:35 AM   #11
adrian.carciumaru
LQ Newbie
 
Registered: Sep 2008
Posts: 8

Original Poster
Rep: Reputation: 0
Hello all,

Thanks, John, for your advice.
Indeed, the servers do not restart at ~04:02 AM. racracracrac was right.

However, the problem is the shutdown at 02:02 AM after the SCSI error.
It has been 2 weeks since either server went down. But before that there were also 2 good weeks in a row, followed by a third in which one or both servers shut down, so I'm not sure the servers won't shut down again in the future. What causes the servers to shut down at exactly 02:02 AM on Sunday morning is the SCSI error. Sometimes the servers get past this error and continue to run, and sometimes the error occurs and one (or both) servers shut down. How can I get rid of the SCSI error?
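
A first step toward answering that is to decode what the return code in the kernel message says. This is a minimal Python sketch, assuming the 2.6-era Linux SCSI result layout (status byte in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31) and the host-byte names from the kernel's scsi.h. `decode_scsi_result` is a hypothetical helper.

```python
# Host-byte codes as defined in the Linux kernel's include/scsi/scsi.h
# (2.6-era values; only a subset is listed here).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(code):
    """Split a 32-bit SCSI result into its four one-byte fields."""
    host = (code >> 16) & 0xFF
    return {
        "status_byte": code & 0xFF,
        "msg_byte": (code >> 8) & 0xFF,
        "host_byte": host,
        "driver_byte": (code >> 24) & 0xFF,
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


decoded = decode_scsi_result(0x20000)
print(decoded["host_name"])  # DID_BUS_BUSY
```

Under that layout, 0x20000 has host byte 0x02 (DID_BUS_BUSY) and zero in the other three fields. That would mean the error was reported at the host adapter or bus level, not as a status returned by the disk itself, which fits the earlier advice to check cabling and the controller's own logs.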



 
  





