LinuxQuestions.org
12-13-2006, 10:47 AM   #1
QianChen
LQ Newbie
 
Registered: Jul 2006
Posts: 18

Rep: Reputation: 0
Why does the server go down every weekend?


Hello, all,

I have a Linux server running Red Hat 9.

It was very stable before Thanksgiving. It backs up to tape early each morning (4:30 am) from Tuesday to Saturday. On Saturday, it backs up the whole system to tape. Early on Sunday and Monday mornings, it does not back up to tape but rearranges some folders.

I forgot to change the cron job over Thanksgiving. No one changed the tapes during the holiday, so the backups could not have run correctly.

After Thanksgiving, everything was OK until the first weekend, when the server went down. I restarted it the following Monday, and it worked well until the second weekend, when it went down again.

I checked /var/log/cron and found:

Dec 2 23:53:00 flulinux01 CROND[2472]: (root) CMD (/usr/lib/sa/sa2 -A)
Dec 3 00:00:00 flulinux01 CROND[2682]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 3 00:00:00 flulinux01 CROND[2684]: (flulims) CMD (/home/flulims/flowlims-application-2.1/parse_lims_data_update.pl)
Dec 3 00:01:00 flulinux01 CROND[2730]: (root) CMD (run-parts /etc/cron.hourly)
Dec 4 15:46:09 flulinux01 crond[1454]: (CRON) STARTUP (fork ok)



Dec 9 04:30:00 flulinux01 CROND[3213]: (root) CMD (/script/backup_part_tape)
Dec 9 04:40:00 flulinux01 CROND[3482]: (flulims) CMD (/home/flulims/flowlims-application-2.1/parse_lims_data_update.pl)
Dec 9 04:40:00 flulinux01 CROND[3481]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 11 15:19:36 flulinux01 crond[1448]: (CRON) STARTUP (fork ok)

/script/backup_part_tape is the script that backs up the whole system to tape.

I don't know how to check the system to find the cause. I can only guess that the backup_part_tape script is at fault, although I haven't edited it.

Can anybody help? Any advice is highly appreciated!
Qian

Last edited by QianChen; 12-13-2006 at 10:48 AM.
 
12-13-2006, 11:08 AM   #2
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora
Posts: 3,935
Blog Entries: 5

Rep: Reputation: Disabled
How about checking /var/log/messages to see if you can find any clues around the time the server goes down? (Do you mean it shuts down or reboots, btw?)

This could indicate a hardware problem, but there's not enough info to go on just yet.
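One way to look for clues is to print the last line logged before each syslogd restart, since a long silent gap before a restart points to a hang rather than a clean shutdown. The following is only a sketch: the function name before_restarts is made up for illustration, and the pattern assumes restart lines look like the "syslogd 1.4.1: restart." entries this syslogd writes.

```shell
#!/bin/sh
# Print the last line logged before every syslogd restart, followed by the
# restart line itself. A long gap between the two timestamps suggests the
# machine was hung (or powered off) rather than shut down cleanly.
# before_restarts is a hypothetical helper name, not a standard tool.
# Usage: before_restarts /var/log/messages
before_restarts() {
    awk '
        /syslogd .*restart/ {
            if (prev != "") print "last:    " prev
            print "restart: " $0
        }
        { prev = $0 }
    ' "$1"
}
```

The times it reports can be compared against the reboot and shutdown records that `last -x` reads from /var/log/wtmp.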
 
12-13-2006, 03:15 PM   #3
QianChen
LQ Newbie
 
Registered: Jul 2006
Posts: 18

Original Poster
Rep: Reputation: 0
Thank you.

When the server is down, I cannot log in and I cannot ping it, so I have to go to the office where it is physically located to reboot it.

I checked /var/log/messages and found that every Sunday, at the same time, it restarts automatically. I don't know why. There are no log entries around the times the server went down. I copied some of the entries below:


Nov 12 04:02:02 flulinux01 syslogd 1.4.1: restart.

Nov 19 04:02:02 flulinux01 syslogd 1.4.1: restart.

Nov 26 04:02:02 flulinux01 syslogd 1.4.1: restart.

Dec 1 07:59:58 flulinux01 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Dec 4 15:46:03 flulinux01 syslogd 1.4.1: restart.

Dec 8 04:53:10 flulinux01 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Dec 11 15:19:31 flulinux01 syslogd 1.4.1: restart.

According to the cron log, cron stopped running after 00:01:00 on Dec 3 and after 04:40:00 on Dec 9, but /var/log/messages has no entries for those periods.

What should I do next?

Thank you very much!

Last edited by QianChen; 12-13-2006 at 03:16 PM.
 
12-13-2006, 03:24 PM   #4
QianChen
LQ Newbie
 
Registered: Jul 2006
Posts: 18

Original Poster
Rep: Reputation: 0
I checked boot.log and found this:

Nov 19 04:02:01 flulinux01 cups: cupsd shutdown succeeded
Nov 19 04:02:02 flulinux01 cups: cupsd startup succeeded

Nov 26 04:02:01 flulinux01 cups: cupsd shutdown succeeded
Nov 26 04:02:02 flulinux01 cups: cupsd startup succeeded

Dec 4 15:46:03 flulinux01 syslog: syslogd startup succeeded
Dec 4 15:46:03 flulinux01 syslog: klogd startup succeeded
Dec 4 15:46:03 flulinux01 irqbalance: irqbalance startup succeeded
Dec 4 15:46:03 flulinux01 portmap: portmap startup succeeded
Dec 4 15:46:03 flulinux01 nfslock: rpc.statd startup succeeded
Dec 4 15:46:03 flulinux01 mdmonitor: mdadm startup succeeded
Dec 4 15:46:03 flulinux01 mdmonitor: mdadm succeeded
Dec 4 15:46:04 flulinux01 audit: auditd startup succeeded
Dec 4 15:46:04 flulinux01 audit: auditd startup succeeded
Dec 4 15:46:04 flulinux01 raidmon: MegaCtrl startup succeeded
Dec 4 15:46:04 flulinux01 random: Initializing random number generator: succeeded
Dec 4 15:46:04 flulinux01 rc: Starting pcmcia: succeeded
Dec 4 15:46:05 flulinux01 netfs: Mounting other filesystems: succeeded
Dec 4 15:46:05 flulinux01 oracleasm: succeeded
Dec 4 15:46:05 flulinux01 last message repeated 2 times
Dec 4 15:46:06 flulinux01 autofs: automount startup succeeded
Dec 4 15:46:08 flulinux01 cups: cupsd startup succeeded
Dec 4 15:46:09 flulinux01 sshd: succeeded
Dec 4 15:46:09 flulinux01 xinetd: xinetd startup succeeded
Dec 4 15:46:09 flulinux01 ntpd: succeeded
Dec 4 15:46:07 flulinux01 last message repeated 2 times
Dec 4 15:46:07 flulinux01 ntpd: ntpd startup succeeded
Dec 4 15:46:08 flulinux01 sendmail: sendmail startup succeeded
Dec 4 15:46:08 flulinux01 sendmail: sm-client startup succeeded
Dec 4 15:46:09 flulinux01 gpm: gpm startup succeeded
Dec 4 15:46:09 flulinux01 crond: crond startup succeeded
Dec 4 15:46:09 flulinux01 xfs: xfs startup succeeded
Dec 4 15:46:09 flulinux01 atd: atd startup succeeded
Dec 4 15:46:09 flulinux01 rhnsd: rhnsd startup succeeded

Dec 5 04:02:02 flulinux01 cups: cupsd shutdown succeeded
Dec 5 04:02:02 flulinux01 cups: cupsd startup succeeded

Dec 11 15:19:31 flulinux01 syslog: syslogd startup succeeded
Dec 11 15:19:31 flulinux01 syslog: klogd startup succeeded
Dec 11 15:19:31 flulinux01 irqbalance: irqbalance startup succeeded
Dec 11 15:19:31 flulinux01 portmap: portmap startup succeeded
Dec 11 15:19:31 flulinux01 nfslock: rpc.statd startup succeeded
Dec 11 15:19:31 flulinux01 mdmonitor: mdadm startup succeeded
Dec 11 15:19:31 flulinux01 mdmonitor: mdadm succeeded
Dec 11 15:19:32 flulinux01 audit: auditd startup succeeded
Dec 11 15:19:32 flulinux01 audit: auditd startup succeeded
Dec 11 15:19:32 flulinux01 raidmon: MegaCtrl startup succeeded
Dec 11 15:19:32 flulinux01 random: Initializing random number generator: succeeded
Dec 11 15:19:32 flulinux01 rc: Starting pcmcia: succeeded
Dec 11 15:19:32 flulinux01 netfs: Mounting other filesystems: succeeded
Dec 11 15:19:32 flulinux01 oracleasm: succeeded
Dec 11 15:19:32 flulinux01 last message repeated 2 times
Dec 11 15:19:34 flulinux01 autofs: automount startup succeeded
Dec 11 15:19:36 flulinux01 cups: cupsd startup succeeded
Dec 11 15:19:36 flulinux01 sshd: succeeded
Dec 11 15:19:36 flulinux01 xinetd: xinetd startup succeeded
Dec 11 15:19:36 flulinux01 ntpd: succeeded
Dec 11 15:19:34 flulinux01 last message repeated 2 times
Dec 11 15:19:35 flulinux01 ntpd: ntpd startup succeeded
Dec 11 15:19:35 flulinux01 sendmail: sendmail startup succeeded
Dec 11 15:19:35 flulinux01 sendmail: sm-client startup succeeded
Dec 11 15:19:36 flulinux01 gpm: gpm startup succeeded
Dec 11 15:19:36 flulinux01 crond: crond startup succeeded
Dec 11 15:19:36 flulinux01 xfs: xfs startup succeeded
Dec 11 15:19:36 flulinux01 atd: atd startup succeeded
Dec 11 15:19:36 flulinux01 rhnsd: rhnsd startup succeeded

Dec 12 04:02:02 flulinux01 cups: cupsd shutdown succeeded
Dec 12 04:02:02 flulinux01 cups: cupsd startup succeeded

I restarted the server myself on Dec 4 and Dec 11. The other restarts happened automatically, I think.

Does this information help?
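To tell a full boot from a single service restart, it can help to count how many "succeeded" lines boot.log records per day. In the excerpt above, Dec 4 and Dec 11 show a whole startup sequence, while the 04:02 entries show only cupsd stopping and starting. That pattern would be consistent with a service restart triggered by the 04:02 cron.daily run (for example during log rotation) rather than a reboot, although nothing in this thread confirms it. The sketch below uses a made-up function name, boot_activity, and assumes the date is in the first two fields as in the excerpt.

```shell
#!/bin/sh
# Count "succeeded" lines per day in a boot.log-style file, in file order.
# Days with dozens of lines are likely real boots; days with one or two
# lines are more likely individual service restarts.
# boot_activity is a hypothetical helper name.
# Usage: boot_activity /var/log/boot.log
boot_activity() {
    awk '
        /succeeded/ {
            d = $1 " " $2                      # e.g. "Dec 4"
            if (!(d in n)) order[++k] = d      # remember first-seen order
            n[d]++
        }
        END {
            for (i = 1; i <= k; i++) print order[i] ": " n[order[i]] " lines"
        }
    ' "$1"
}
```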
 
12-13-2006, 03:41 PM   #5
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora
Posts: 3,935
Blog Entries: 5

Rep: Reputation: Disabled
Quote:
When the server is down, I cannot log in and I cannot ping it, so I have to go to the office where it is physically located to reboot it.
This brings up a couple more questions. So when it runs into trouble you can't ssh to it or reboot it. Have you tried logging in to it directly on one of the VTs? If not, is the screen frozen? Or do you see dump / panic information or anything?

I'm trying to determine whether the problem is really with your NIC, or if the server is indeed crashing.
 
12-13-2006, 04:06 PM   #6
QianChen
LQ Newbie
 
Registered: Jul 2006
Posts: 18

Original Poster
Rep: Reputation: 0
Sorry, I meant that I cannot ssh to it or ping it remotely. I haven't tried to log in directly.

When I tried to ssh to it, I entered the host name, port number, user name (root) and password, and the session froze without any response. The first time, I could still ping it, but the second time I could not.

The screen was just frozen. I couldn't see any dump or panic information.

Why does it keep restarting every Sunday morning?
 
12-13-2006, 04:20 PM   #7
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora
Posts: 3,935
Blog Entries: 5

Rep: Reputation: Disabled
I'm not sure about the timing (Sunday mornings), but unattended lockups like that frequently mean hardware trouble.

A couple things to check out:
  • Run a memory test. I'd recommend using memtest86 or memtest86+ (I use the latter). Both have bootable image files you can burn to cd.
  • Get down to single-user mode, remount the / filesystem read-only, and then run fsck -A.

Pay attention to any output from both of these.
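The filesystem check in the second item could look roughly like the sketch below. It assumes you are already in single-user mode on the console as root. The run and check_filesystems names and the DRY_RUN variable are invented for this example; by default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch of the filesystem check described above. By default it only prints
# the commands; set DRY_RUN=0 and run as root from the console to execute.
# run, check_filesystems and DRY_RUN are hypothetical names for this sketch.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_filesystems() {
    run mount -o remount,ro /   # make / read-only before checking it
    run fsck -A                 # check every filesystem listed in /etc/fstab
}
```

Calling check_filesystems with the default settings prints the two commands without touching the disks.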

Last edited by anomie; 12-13-2006 at 04:21 PM.
 
12-13-2006, 04:53 PM   #8
Windchaser
LQ Newbie
 
Registered: Dec 2006
Location: Chicago, IL
Distribution: Fedora Core 5 and 6
Posts: 23

Rep: Reputation: 15
Is anything different happening over the weekends at your facility? At a previous company, we had a customer whose server crashed every few days, very late at night. It turned out to be a janitor plugging his vacuum into the rack the server was in. Is it possible that something at your site is interrupting power, or something similar, and triggering your crashes?
 
12-13-2006, 05:04 PM   #9
QianChen
LQ Newbie
 
Registered: Jul 2006
Posts: 18

Original Poster
Rep: Reputation: 0
The server is located in another office, so I have no idea whether anything happened there over the weekends. According to the log, it reboots at the same time each week, 04:02:02. It is hard to imagine an accident happening at exactly the same time every Sunday.

Thank you for your advice!
 
12-13-2006, 05:09 PM   #10
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Ubuntu, FreeBSD
Posts: 1,449

Rep: Reputation: 48
The fact that it's happening at exactly the same time (04:02:02 on Sunday morning) makes me initially rule out a hardware issue. What does cron start at (or around) 4am on Sunday?

It's probably that backup script (although it looks like that should start at 4:30)... can you run the script manually or does it reboot?

How much detail does the script provide logging-wise and can you watch it when it runs?

Do you know what part of the script could be causing the reboot?
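To answer the question of what cron starts around 4am, the crontab can be filtered by its hour field. This is a rough sketch only: cron_at_hour is a made-up name, and it handles plain numbers and "*" in the hour field but not ranges, lists or step values.

```shell
#!/bin/sh
# List crontab entries whose hour field is the given hour or "*".
# Limitation: ranges (1-5), lists (2,4) and steps (*/2) are not understood.
# cron_at_hour is a hypothetical helper name.
# Usage: cron_at_hour 4 /etc/crontab
cron_at_hour() {
    awk -v h="$1" '
        /^[ \t]*#/ || NF < 6 { next }            # skip comments, blanks, VAR=value
        $2 == "*" || $2 + 0 == h + 0 { print }   # field 2 is the hour
    ' "$2"
}
```

Per-user crontabs keep the hour in the same field, so the output of `crontab -l -u root` saved to a file can be checked the same way.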

Last edited by frob23; 12-13-2006 at 05:10 PM.
 
12-13-2006, 06:04 PM   #11
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora
Posts: 3,935
Blog Entries: 5

Rep: Reputation: Disabled
I'm a little confused. (Just re-read the thread to make sure I didn't miss something.) Is it freezing or rebooting?
 
12-13-2006, 06:17 PM   #12
QianChen
LQ Newbie
 
Registered: Jul 2006
Posts: 18

Original Poster
Rep: Reputation: 0
Yes, according to the boot log, it reboots on Sundays, which I didn't know before.

Both times it went down, on Dec 3 and Dec 9, it was just frozen.

The cron job on Sunday and Monday mornings (4:30 am) looks like this:

mkdir /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/data_* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/c-* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/arch* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
rm -f /rawdata/data_*
rm -f /rawdata/c-*
rm -f /rawdata/arch*

It just rearranges some files.

The script that runs on Saturday looks like this:
# Move database backups to /rawdata/oraclebackup
mkdir /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/data_* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/c-* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/arch* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
rm -f /rawdata/data_*
rm -f /rawdata/c-*
rm -f /rawdata/arch*
# End of moving

DIRECTORIES="/"
#DIRECTORIES="/etc /home /opt /root /var"
BACKUPTO=/dev/nst0
TAR=/bin/tar
PATH=/usr/local/bin:/usr/bin:/bin
START=`date +%s`

# Daily full backup
NEWER=""
echo "*****start time*****"
date
echo
# Back up only if the tape drive reports ONLINE
if mt -f $BACKUPTO status | grep "ONLINE"; then
    echo "*****finding sockets*****"
    # List sockets so that tar can skip them
    find $DIRECTORIES -type s > sockets
    echo
    #echo "*****setting compression on*****"
    #mt -f $BACKUPTO compression 1
    echo
    #echo "*****setting type to DLT 35 Compressed*****"
    #mt -f $BACKUPTO setdensity 0x85
    echo
    echo "*****archiving*****"
    $TAR $NEWER -cf $BACKUPTO $DIRECTORIES --exclude-from=sockets --absolute-names --totals
    echo
    echo "*****tape-drive status*****"
    mt -f $BACKUPTO status
    echo
    echo "*****ejecting tape*****"
    mt -f $BACKUPTO offline
    echo
    echo "*****end time*****"
    date
else
    echo "*****WARNING TAPE DRIVE IS OFFLINE, NO BACKUPS PERFORMED*****"
fi

# Report the total run time as hours, minutes and seconds
FINISH=`date +%s`
diff=$((FINISH - START))
echo -n "***** Total Run Time: "
HRS=`expr $diff / 3600`
MIN=`expr $diff % 3600 / 60`
SEC=`expr $diff % 3600 % 60`
if [ $HRS -gt 0 ]
then
    echo -n "$HRS hrs. "
fi
if [ $MIN -gt 0 ]
then
    echo -n "$MIN mins. "
fi
if [ $SEC -gt 0 ]
then
    if [ $MIN -gt 0 ] || [ $HRS -gt 0 ]
    then
        echo "and $SEC secs. "
    else
        echo "$SEC secs. "
    fi
else
    # End the line even when the seconds part is zero
    echo
fi

If there is no tape in the tape drive, the script reports an error but does not reboot the server. Over Thanksgiving, because no one changed the ejected tape, it would have reported errors for several days. Could that be the cause?

Thank you all!

Last edited by QianChen; 12-13-2006 at 06:18 PM.
 
12-13-2006, 06:23 PM   #13
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
If it's had a few hiccups and now freezes occasionally, it's possible that the disk purging has got out of sync and one or more partitions (temporarily) fill up.
Usually it's a good idea to compress backups, e.g. with gzip, but this can use a lot of temp space while gzip is running.
If it's also re-booting on a regular basis, that sounds like a separate issue.
Could be worth looking at all cron files, esp /etc/crontab.
If the re-boots are not at the exact time, the janitor/cleaner thing could be happening; it's happened to me, it's not just an IT myth. sigh...
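A quick way to check whether any partition is close to filling up is to filter df output by its usage column. The sketch below is illustrative only: full_filesystems is a made-up name and the 80 percent threshold in the usage line is an arbitrary choice. It expects the POSIX-format output of `df -P` on standard input.

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold (in percent).
# Reads "df -P" output on stdin; pseudo-filesystems that report "-" for
# usage are skipped. full_filesystems is a hypothetical helper name.
# Usage: df -P | full_filesystems 80
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {                       # skip the header line
            use = $5
            sub(/%/, "", use)
            if (use ~ /^[0-9]+$/ && use + 0 >= limit + 0) print $6, $5
        }
    '
}
```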
 
12-14-2006, 12:21 PM   #14
QianChen
LQ Newbie
 
Registered: Jul 2006
Posts: 18

Original Poster
Rep: Reputation: 0
Maybe you are right, because /dev/sda6 is nearly full. Here is the output of df:

Filesystem   Size   Used  Avail  Use%  Mounted on
/dev/sda7    4.0G   1.1G  2.7G   29%   /
none         0      0     0      -     /proc
none         0      0     0      -     /dev/pts
usbdevfs     0      0     0      -     /proc/bus/usb
/dev/sda3    190M   15M   166M   9%    /boot
/dev/sda8    2.0G   806M  1.1G   43%   /home
/dev/sda6    4.0G   3.1G  741M   81%   /opt/oracle
none         2.0G   0     2.0G   0%    /dev/shm
/dev/sda12   1012M  33M   928M   4%    /tmp
/dev/sda5    5.0G   2.4G  2.4G   50%   /usr
/dev/sdd1    135G   20G   109G   16%   /rawdata
oracleasmfs  0      0     0      -     /dev/oracleasm

An Oracle database came preinstalled on our server, and we are now using it. Maybe that is the reason, but I am not sure.

Regarding the reboots, I agree that they are a separate issue. I checked /etc/crontab, which looks like this:

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly

and the cron.hourly is like:
*** cron.hourly: directory ***

the cron.daily is like:
*** cron.daily: directory ***

the cron.weekly is like:
*** cron.weekly: directory ***

the cron.monthly is like:
*** cron.monthly: directory ***

It is the original version and I haven't changed it.

Thank you!
 
12-14-2006, 05:25 PM   #15
QianChen
LQ Newbie
 
Registered: Jul 2006
Posts: 18

Original Poster
Rep: Reputation: 0
OK.
I have now rearranged some of the Oracle files, and the system looks like this:
[root@flulinux01 udump]# df -k
Filesystem  1K-blocks  Used      Available  Use%  Mounted on
/dev/sda7   4127076    1110460   2806972    29%   /
/dev/sda3   194449     15120     169289     9%    /boot
/dev/sda8   2063504    825032    1133652    43%   /home
/dev/sda6   4127076    2679672   1237760    69%   /opt/oracle
none        2047704    0         2047704    0%    /dev/shm
/dev/sda12  1035660    33352     949700     4%    /tmp
/dev/sda5   5162796    2430800   2469740    50%   /usr
/dev/sdd1   141003764  20712152  113129036  16%   /rawdata

/opt/oracle went from 81% used to 69% used.

I will wait and see what will happen this weekend.
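While waiting, it may be worth recording disk usage at regular intervals, so that a partition filling up just before a freeze would show in a log afterwards. This is only a sketch: log_disk_usage and the log path in the usage line are names chosen for illustration.

```shell
#!/bin/sh
# Append a timestamped snapshot of disk usage to a log file.
# log_disk_usage and /var/log/df-history.log are hypothetical names.
# Usage: log_disk_usage /var/log/df-history.log
log_disk_usage() {
    {
        date '+%Y-%m-%d %H:%M:%S'   # when the snapshot was taken
        df -P                       # one line per mounted filesystem
        echo                        # blank line between snapshots
    } >> "$1"
}
```

Run from cron every few minutes over the weekend, the last snapshots before a hang would show whether /opt/oracle or another partition was running out of space at the time.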

Thank you all!
 
  


All times are GMT -5.
