LinuxQuestions.org - Why does the server go down every weekend?


Hello, all,

I have a Linux server running Red Hat 9.

It ran stably until Thanksgiving. It backs up to tape early each morning (4:30 am) from Tuesday through Saturday. On Saturday, it backs up the whole system to tape. Early on Sunday and Monday mornings, it does not back up to tape but only rearranges some folders.

I forgot to change the cron job over Thanksgiving, and no one changed the tapes during that time, so the backups cannot have run correctly.

After Thanksgiving, everything was OK until the first weekend, when the server went down. I restarted it the following Monday, and it worked well until the second weekend, when it went down again.

I checked /var/log/cron and found:

Dec 2 23:53:00 flulinux01 CROND[2472]: (root) CMD (/usr/lib/sa/sa2 -A)
Dec 3 00:00:00 flulinux01 CROND[2682]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 3 00:00:00 flulinux01 CROND[2684]: (flulims) CMD (/home/flulims/flowlims-application-2.1/parse_lims_data_update.pl)
Dec 3 00:01:00 flulinux01 CROND[2730]: (root) CMD (run-parts /etc/cron.hourly)
Dec 4 15:46:09 flulinux01 crond[1454]: (CRON) STARTUP (fork ok)

Dec 9 04:30:00 flulinux01 CROND[3213]: (root) CMD (/script/backup_part_tape)
Dec 9 04:40:00 flulinux01 CROND[3482]: (flulims) CMD (/home/flulims/flowlims-application-2.1/parse_lims_data_update.pl)
Dec 9 04:40:00 flulinux01 CROND[3481]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 11 15:19:36 flulinux01 crond[1448]: (CRON) STARTUP (fork ok)

/script/backup_part_tape is the backup script that backs up the whole system to tape.

I don't know how to check the system to find the cause. I can only guess that the backup_part_tape script is at fault, although I have not edited it.

Can anybody help? Any advice is highly appreciated!
Qian

How about checking /var/log/messages to see if you can find any clues around the time the server goes down? (Do you mean it shuts down or reboots, btw?)

This could indicate a hardware problem, but there is not enough info to go on just yet.

Thank you.

When the server is down, I cannot log in and I cannot ping it, so I have to go to the office where it is physically located to reboot it.

I checked /var/log/messages and found that every Sunday, at the same time, it restarts automatically. I don't know why. There are no log entries around the times the server went down. I have copied some of the entries below:

Nov 12 04:02:02 flulinux01 syslogd 1.4.1: restart.

Nov 19 04:02:02 flulinux01 syslogd 1.4.1: restart.

Nov 26 04:02:02 flulinux01 syslogd 1.4.1: restart.

Dec 1 07:59:58 flulinux01 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Dec 4 15:46:03 flulinux01 syslogd 1.4.1: restart.

Dec 8 04:53:10 flulinux01 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Dec 11 15:19:31 flulinux01 syslogd 1.4.1: restart.

According to the cron log, the server stopped working after 00:01:00 on Dec 3 and after 04:40:00 on Dec 9, but /var/log/messages has no entries for those periods.

What should I do next?

Thank you very much!
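
Since the interesting part is the gap before each restart, it may help to pull out the last line logged before every syslogd restart. This is only a rough sketch; the function name last_before_restart is made up for illustration, and it assumes the log format shown above.

```shell
# Print each "syslogd ... restart" line together with the line logged just
# before it, so the gap between the two timestamps is easy to see.
# Reads a syslog-style file on stdin.
last_before_restart() {
    awk '
        /syslogd .*restart/ {
            if (prev != "") print "last entry: " prev
            print "restart:    " $0
        }
        { prev = $0 }
    '
}
```

It could be run as `last_before_restart < /var/log/messages`. A long gap between the two lines suggests the machine was hung or off; a gap of a second or two suggests that only syslogd itself was restarted.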

I checked boot.log and found the following:

Nov 19 04:02:01 flulinux01 cups: cupsd shutdown succeeded
Nov 19 04:02:02 flulinux01 cups: cupsd startup succeeded

Nov 26 04:02:01 flulinux01 cups: cupsd shutdown succeeded
Nov 26 04:02:02 flulinux01 cups: cupsd startup succeeded

Dec 4 15:46:03 flulinux01 syslog: syslogd startup succeeded
Dec 4 15:46:03 flulinux01 syslog: klogd startup succeeded
Dec 4 15:46:03 flulinux01 irqbalance: irqbalance startup succeeded
Dec 4 15:46:03 flulinux01 portmap: portmap startup succeeded
Dec 4 15:46:03 flulinux01 nfslock: rpc.statd startup succeeded
Dec 4 15:46:03 flulinux01 mdmonitor: mdadm startup succeeded
Dec 4 15:46:03 flulinux01 mdmonitor: mdadm succeeded
Dec 4 15:46:04 flulinux01 audit: auditd startup succeeded
Dec 4 15:46:04 flulinux01 audit: auditd startup succeeded
Dec 4 15:46:04 flulinux01 raidmon: MegaCtrl startup succeeded
Dec 4 15:46:04 flulinux01 random: Initializing random number generator: succeeded
Dec 4 15:46:04 flulinux01 rc: Starting pcmcia: succeeded
Dec 4 15:46:05 flulinux01 netfs: Mounting other filesystems: succeeded
Dec 4 15:46:05 flulinux01 oracleasm: succeeded
Dec 4 15:46:05 flulinux01 last message repeated 2 times
Dec 4 15:46:06 flulinux01 autofs: automount startup succeeded
Dec 4 15:46:08 flulinux01 cups: cupsd startup succeeded
Dec 4 15:46:09 flulinux01 sshd: succeeded
Dec 4 15:46:09 flulinux01 xinetd: xinetd startup succeeded
Dec 4 15:46:09 flulinux01 ntpd: succeeded
Dec 4 15:46:07 flulinux01 last message repeated 2 times
Dec 4 15:46:07 flulinux01 ntpd: ntpd startup succeeded
Dec 4 15:46:08 flulinux01 sendmail: sendmail startup succeeded
Dec 4 15:46:08 flulinux01 sendmail: sm-client startup succeeded
Dec 4 15:46:09 flulinux01 gpm: gpm startup succeeded
Dec 4 15:46:09 flulinux01 crond: crond startup succeeded
Dec 4 15:46:09 flulinux01 xfs: xfs startup succeeded
Dec 4 15:46:09 flulinux01 atd: atd startup succeeded
Dec 4 15:46:09 flulinux01 rhnsd: rhnsd startup succeeded

Dec 5 04:02:02 flulinux01 cups: cupsd shutdown succeeded
Dec 5 04:02:02 flulinux01 cups: cupsd startup succeeded

Dec 11 15:19:31 flulinux01 syslog: syslogd startup succeeded
Dec 11 15:19:31 flulinux01 syslog: klogd startup succeeded
Dec 11 15:19:31 flulinux01 irqbalance: irqbalance startup succeeded
Dec 11 15:19:31 flulinux01 portmap: portmap startup succeeded
Dec 11 15:19:31 flulinux01 nfslock: rpc.statd startup succeeded
Dec 11 15:19:31 flulinux01 mdmonitor: mdadm startup succeeded
Dec 11 15:19:31 flulinux01 mdmonitor: mdadm succeeded
Dec 11 15:19:32 flulinux01 audit: auditd startup succeeded
Dec 11 15:19:32 flulinux01 audit: auditd startup succeeded
Dec 11 15:19:32 flulinux01 raidmon: MegaCtrl startup succeeded
Dec 11 15:19:32 flulinux01 random: Initializing random number generator: succeeded
Dec 11 15:19:32 flulinux01 rc: Starting pcmcia: succeeded
Dec 11 15:19:32 flulinux01 netfs: Mounting other filesystems: succeeded
Dec 11 15:19:32 flulinux01 oracleasm: succeeded
Dec 11 15:19:32 flulinux01 last message repeated 2 times
Dec 11 15:19:34 flulinux01 autofs: automount startup succeeded
Dec 11 15:19:36 flulinux01 cups: cupsd startup succeeded
Dec 11 15:19:36 flulinux01 sshd: succeeded
Dec 11 15:19:36 flulinux01 xinetd: xinetd startup succeeded
Dec 11 15:19:36 flulinux01 ntpd: succeeded
Dec 11 15:19:34 flulinux01 last message repeated 2 times
Dec 11 15:19:35 flulinux01 ntpd: ntpd startup succeeded
Dec 11 15:19:35 flulinux01 sendmail: sendmail startup succeeded
Dec 11 15:19:35 flulinux01 sendmail: sm-client startup succeeded
Dec 11 15:19:36 flulinux01 gpm: gpm startup succeeded
Dec 11 15:19:36 flulinux01 crond: crond startup succeeded
Dec 11 15:19:36 flulinux01 xfs: xfs startup succeeded
Dec 11 15:19:36 flulinux01 atd: atd startup succeeded
Dec 11 15:19:36 flulinux01 rhnsd: rhnsd startup succeeded

Dec 12 04:02:02 flulinux01 cups: cupsd shutdown succeeded
Dec 12 04:02:02 flulinux01 cups: cupsd startup succeeded

On Dec 4 and Dec 11, I restarted the server myself. The other restarts happened automatically, I think.

Does this information help?
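
One way to read the boot.log excerpt above is to count how many different services start on each day: Dec 4 and Dec 11 show a whole list of services, while the 04:02 entries show only cups. The sketch below does that count. The function name classify_boot_days and the threshold of 5 services are assumptions for illustration, not anything from the system itself.

```shell
# Count distinct service tags per day in a boot.log-style file read from stdin.
# A day on which many different services start suggests a full boot; a day
# with only one or two suggests a single daemon being restarted.
# The threshold of 5 is an arbitrary assumption.
classify_boot_days() {
    awk -v min=5 '
        NF < 5 { next }             # skip fragments of wrapped lines
        {
            day = $1 " " $2
            tag = $5                # e.g. "cups:" or "syslog:"
            if (!((day, tag) in seen)) {
                seen[day, tag] = 1
                if (!(day in count)) order[++n] = day
                count[day]++
            }
        }
        END {
            for (i = 1; i <= n; i++) {
                d = order[i]
                kind = (count[d] >= min) ? "full boot" : "service restart only"
                print d ": " count[d] " services, " kind
            }
        }
    '
}
```

For example, `classify_boot_days < /var/log/boot.log`. Days reported as "service restart only" are probably not reboots of the machine.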

Quote:

When the server is down, I cannot log in and I cannot ping it, so I have to go to the office where it is physically located to reboot it.

This brings up a couple more questions. So when it runs into trouble you can't ssh to it or reboot it. Have you tried logging in to it directly on one of the VTs? If not, is the screen frozen? Or do you see dump / panic information or anything?

I'm trying to determine whether the problem is really with your NIC, or if the server is indeed crashing.

Sorry. I can neither ssh to it nor ping it remotely. I haven't tried to log in directly.

When I tried to ssh to it, I entered the host name, port number, user name (root) and password, and the screen froze without any response. The first time, I could still ping it, but the second time, I could not.

The screen was just frozen, and I couldn't see any dump/panic information. I saw nothing.

Why does it keep restarting every Sunday morning?

I'm not sure about the timing (Sunday mornings), but unattended lockups like that frequently mean hardware trouble.

A couple things to check out:

Run a memory test. I'd recommend using memtest86 or memtest86+ (I use the latter). Both have bootable image files you can burn to cd.
Get down to single-user mode, remount the / filesystem read-only, and then run fsck -A.

Pay attention to any output from both of these.

Is anything different happening over the weekends at your facility? At a previous company, we had a customer whose server crashed every few days, very late at night. It turned out to be a janitor plugging his vacuum into the rack the server was in. Is it possible that something at your site is interrupting power, or something similar, and triggering your crashes?

The server is located in another office. I don't know of any incidents there during weekends. According to the log, it restarts at the same time, 04:02:02. It is hard to imagine an accident happening at exactly the same time every Sunday.

Thank you for your advice!

The fact that it's happening at exactly the same time (04:02:02 on Sunday morning) makes me initially rule out a hardware issue. What does cron start at (or around) 4am on Sunday?
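
To answer that from the log itself, something like the following could list what cron actually launched in a given hour. It is a minimal sketch that assumes the /var/log/cron format quoted earlier in the thread; the function name cron_jobs_in_hour is made up for illustration.

```shell
# Print the CMD entries from a cron log (on stdin) whose time field falls
# in the given two-digit hour, e.g. "04".
cron_jobs_in_hour() {
    hour=$1
    awk -v h="$hour" '$3 ~ ("^" h ":") && /CMD/ { print }'
}
```

Usage would be `cron_jobs_in_hour 04 < /var/log/cron`.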

It's probably that backup script (although it looks like that should start at 4:30)... can you run the script manually or does it reboot?

How much detail does the script provide logging-wise and can you watch it when it runs?

Do you know what part of the script could be causing the reboot?
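
If the script's output is not being kept anywhere, a small wrapper can capture it with timestamps so there is something to read after a freeze. This is just a sketch: run_logged is a hypothetical helper, and the log file path in the usage line is an arbitrary choice.

```shell
# Run a command, appending its output plus start/end timestamps and the
# exit status to a log file. Returns the command's exit status.
run_logged() {
    logfile=$1
    shift
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') start: $*"
        "$@"
        status=$?
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') end: exit status $status"
    } >> "$logfile" 2>&1
    return "$status"
}
```

The cron job could then call, for example, `run_logged /var/log/backup_part_tape.log /script/backup_part_tape`. If the machine hangs mid-run, the log will have a start line with no matching end line.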

I'm a little confused. (Just re-read the thread to make sure I didn't miss something.) Is it freezing or rebooting?

Yes, according to the boot log, it reboots on Sundays, which I didn't know before.

Both times it went down, on Dec 3 and Dec 9, it was just frozen.

The cron job on Sunday and Monday mornings (4:30 am) is like this:

mkdir /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/data_* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/c-* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/arch* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
rm -f /rawdata/data_*
rm -f /rawdata/c-*
rm -f /rawdata/arch*

I just rearrange some files.
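
As written, the rm lines run even if a cp failed (for example because the destination path is mistyped or the disk is full), which would lose files. A more defensive variant is sketched below; archive_files is a hypothetical helper, and the directories and patterns are passed in rather than hard-coded.

```shell
# Copy files matching the given glob patterns from src into a dated
# subdirectory of destroot, deleting each original only after its copy
# succeeded. Stops at the first failure.
archive_files() {
    src=$1
    destroot=$2
    shift 2
    dest="$destroot/$(date '+%d-%B-%Y')"
    mkdir -p "$dest" || return 1
    for pattern in "$@"; do
        for f in "$src"/$pattern; do
            [ -f "$f" ] || continue      # pattern matched nothing
            cp -p "$f" "$dest/" && rm -f "$f" || {
                echo "failed to archive $f" >&2
                return 1
            }
        done
    done
    return 0
}
```

For the job above this would be called as `archive_files /rawdata /rawdata/oraclebackup 'data_*' 'c-*' 'arch*'`, with the patterns quoted so that the function expands them, not the calling shell.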

The script run on Saturday is like this:
# move database backups to /rawdata/oraclebackup
mkdir /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/data_* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/c-* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/arch* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
rm -f /rawdata/data_*
rm -f /rawdata/c-*
rm -f /rawdata/arch*
# end of moving

DIRECTORIES="/"
#DIRECTORIES="/etc /home /opt /root /var"
BACKUPTO=/dev/nst0
TAR=/bin/tar
PATH=/usr/local/bin:/usr/bin:/bin
START=`date +%s`

# Daily full backup
NEWER=""
echo "*****start time*****"
date
echo
# only back up if the tape drive reports ONLINE
if mt -f $BACKUPTO status | grep "ONLINE"; then
    echo "***** finding sockets*****"
    find $DIRECTORIES -type s > sockets
    echo
    #echo "*****setting compression on*****"
    #mt -f $BACKUPTO compression 1
    echo
    #echo "*****setting type to DLT 35 Compressed*****"
    #mt -f $BACKUPTO setdensity 0x85
    echo
    echo "*****archiving*****:)"
    $TAR $NEWER -cf $BACKUPTO $DIRECTORIES --exclude-from=sockets --absolute-names --totals
    echo
    echo "*****tape-drive status*****"
    mt -f $BACKUPTO status
    echo
    echo "*****ejecting tape*****"
    mt -f $BACKUPTO offline
    echo
    echo "*****end time*****"
    date
else
    echo "*****WARNING TAPE DRIVE IS OFFLINE, NO BACKUPS PERFORMED*****"
fi

# report total run time as hours, minutes and seconds
FINISH=`date +%s`
diff=$((FINISH - START))
echo -n "***** Total Run Time: "
HRS=`expr $diff / 3600`
MIN=`expr $diff % 3600 / 60`
SEC=`expr $diff % 3600 % 60`
if [ $HRS -gt 0 ]
then
    echo -n "$HRS hrs. "
fi
if [ $MIN -gt 0 ]
then
    echo -n "$MIN mins. "
fi
if [ $SEC -gt 0 ]
then
    if [ $MIN -gt 0 ]
    then
        echo "and $SEC secs. "
    elif [ $HRS -gt 0 ]
    then
        echo "and $SEC secs. "
    else
        echo "$SEC secs. "
    fi
fi

If there is no tape in the tape drive, the script reports an error but does not reboot. Over Thanksgiving, because no one changed the ejected tape, it would have reported errors for several days. Could that be the cause?

Thank you all!

If it's had a few hiccups and now freezes occasionally, it's possible that the disk purging has got out of sync and one or more partitions (temporarily) fill up.
Usually it's a good idea to compress backups eg gzip, but this can use a lot of temp space whilst gzip is in progress.
If it's also re-booting on a regular basis, that sounds like a separate issue.
Could be worth looking at all cron files, esp /etc/crontab.
If the re-boots are not at the exact time, the janitor/cleaner thing could be happening; it's happened to me, it's not just an IT myth. sigh...
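
A quick way to check for partitions that are filling up is to filter df output by usage. The sketch below reads `df -P` output on stdin; the function name full_filesystems and the default threshold of 80 percent are assumptions for illustration.

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` output on stdin, so it also works on saved output.
# Mount points containing spaces are not handled.
full_filesystems() {
    threshold=${1:-80}
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)
            # skip pseudo-filesystems that report "-" instead of a number
            if (use ~ /^[0-9]+$/ && use + 0 >= limit + 0) print $6 " " $5
        }
    '
}
```

It could be run as `df -P | full_filesystems 80`, ideally from cron shortly before the backup so the result ends up in the job's mail or log.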

Maybe you are right: /dev/sda6 is getting close to full. Here is the output of df:

Filesystem    Size   Used  Avail  Use%  Mounted on
/dev/sda7     4.0G   1.1G   2.7G   29%  /
none             0      0      0     -  /proc
none             0      0      0     -  /dev/pts
usbdevfs         0      0      0     -  /proc/bus/usb
/dev/sda3     190M    15M   166M    9%  /boot
/dev/sda8     2.0G   806M   1.1G   43%  /home
/dev/sda6     4.0G   3.1G   741M   81%  /opt/oracle
none          2.0G      0   2.0G    0%  /dev/shm
/dev/sda12   1012M    33M   928M    4%  /tmp
/dev/sda5     5.0G   2.4G   2.4G   50%  /usr
/dev/sdd1     135G    20G   109G   16%  /rawdata
oracleasmfs      0      0      0     -  /dev/oracleasm

The server came with an Oracle database preinstalled, and we are now using it. That may be the reason, but I am not sure.

Regarding the reboot, I agree with you that it is a separate issue. I checked /etc/crontab, which looks like this:

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly

and the cron.hourly is like:
*** cron.hourly: directory ***

the cron.daily is like:
*** cron.daily: directory ***

the cron.weekly is like:
*** cron.weekly: directory ***

the cron.monthly is like:
*** cron.monthly: directory ***

It is the original version and I haven't changed it.
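
The cron.daily line above runs at 04:02, the same minute as the restarts in the logs, so it may be worth seeing what is inside those directories instead of only the directories themselves. A rough sketch follows; find_restart_jobs is a made-up name and the keyword list is only a guess at how a job might restart a daemon.

```shell
# List files under the given directories that mention restarting or
# signalling a daemon. The keyword list is a rough guess, not exhaustive.
# Uses grep -r, which GNU grep supports.
find_restart_jobs() {
    grep -rlE 'restart|reload|killall|kill -HUP' "$@" 2>/dev/null
}
```

For example, `find_restart_jobs /etc/cron.daily /etc/cron.weekly`. If the system rotates logs with logrotate, adding /etc/logrotate.d to the list would be reasonable; that is an assumption about this machine, not something shown in the thread.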

Thank you!

OK.
I have now moved some of the Oracle files. The system now looks like this:
[root@flulinux01 udump]# df -k
Filesystem   1K-blocks      Used  Available  Use%  Mounted on
/dev/sda7      4127076   1110460    2806972   29%  /
/dev/sda3       194449     15120     169289    9%  /boot
/dev/sda8      2063504    825032    1133652   43%  /home
/dev/sda6      4127076   2679672    1237760   69%  /opt/oracle
none           2047704         0    2047704    0%  /dev/shm
/dev/sda12     1035660     33352     949700    4%  /tmp
/dev/sda5      5162796   2430800    2469740   50%  /usr
/dev/sdd1    141003764  20712152  113129036   16%  /rawdata

/opt/oracle went from 81% used to 69% used.

I will wait and see what happens this weekend.

Thank you all!