Why does the server go down every weekend?
Hello, all,
I have a Linux server running RH 9. It was very stable before Thanksgiving. It backs up to tape early every morning (4:30 am) from Tuesday to Saturday. On Saturday it backs up the whole system to tape. Early on Sunday and Monday mornings it doesn't back up to tape, but rearranges some folders.

I forgot to change the cron job over Thanksgiving, and no one changed the tapes, so the backups cannot have run correctly. After Thanksgiving everything was OK except that the server went down on the first weekend. I restarted it the following Monday, and it worked well until the second weekend, when it went down again.

I checked /var/log/cron and found:

Dec 2 23:53:00 flulinux01 CROND[2472]: (root) CMD (/usr/lib/sa/sa2 -A)
Dec 3 00:00:00 flulinux01 CROND[2682]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 3 00:00:00 flulinux01 CROND[2684]: (flulims) CMD (/home/flulims/flowlims-application-2.1/parse_lims_data_update.pl)
Dec 3 00:01:00 flulinux01 CROND[2730]: (root) CMD (run-parts /etc/cron.hourly)
Dec 4 15:46:09 flulinux01 crond[1454]: (CRON) STARTUP (fork ok)
Dec 9 04:30:00 flulinux01 CROND[3213]: (root) CMD (/script/backup_part_tape)
Dec 9 04:40:00 flulinux01 CROND[3482]: (flulims) CMD (/home/flulims/flowlims-application-2.1/parse_lims_data_update.pl)
Dec 9 04:40:00 flulinux01 CROND[3481]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 11 15:19:36 flulinux01 crond[1448]: (CRON) STARTUP (fork ok)

/script/backup_part_tape is the backup script, which backs up the whole system to tape. I don't know how to check the system to find the cause. I can only guess that the backup_part_tape script is at fault, although I haven't edited it. Can anybody help? Any advice is highly appreciated! Qian |
How about checking /var/log/messages to see if you can find any clues around the time the server goes down. (Do you mean it shuts down or reboots, btw?)
This could indicate a hardware problem, but there's not enough info to go on just yet. |
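A minimal sketch of one way to do that check, assuming GNU grep (for the -B option) and the stock syslog format. The function name show_before_restart is made up for illustration. It prints the lines logged just before each syslogd restart, which is often where the last sign of life shows up.

```shell
#!/bin/sh
# show_before_restart FILE [N]
# Print each "syslogd ... restart" line together with the N lines logged
# just before it (default 5). The helper name is hypothetical.
show_before_restart() {
    file=$1
    count=${2:-5}
    grep -B "$count" 'syslogd.*restart' "$file"
}

# Example (run as root, since /var/log/messages is usually root-only):
# show_before_restart /var/log/messages 10
```

A long gap between the last ordinary entry and the restart line would point at a freeze rather than a clean shutdown.
|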
Thank you.
When the server is down, I cannot log in and I cannot ping it, so I have to go to the office where it is physically located to reboot it. I checked /var/log/messages and found that every Sunday, at the same time, it restarts automatically. I don't know why. There are no log entries around the times the server went down. I copied some of them below:

Nov 12 04:02:02 flulinux01 syslogd 1.4.1: restart.
Nov 19 04:02:02 flulinux01 syslogd 1.4.1: restart.
Nov 26 04:02:02 flulinux01 syslogd 1.4.1: restart.
Dec 1 07:59:58 flulinux01 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Dec 4 15:46:03 flulinux01 syslogd 1.4.1: restart.
Dec 8 04:53:10 flulinux01 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Dec 11 15:19:31 flulinux01 syslogd 1.4.1: restart.

According to the cron log, it stopped working around 00:01:00 on Dec 3 and around 04:40:00 on Dec 9, but there are no entries in the messages log for those periods. What should I do next? Thank you very much! |
I checked the boot.log and found these:
Nov 19 04:02:01 flulinux01 cups: cupsd shutdown succeeded
Nov 19 04:02:02 flulinux01 cups: cupsd startup succeeded
Nov 26 04:02:01 flulinux01 cups: cupsd shutdown succeeded
Nov 26 04:02:02 flulinux01 cups: cupsd startup succeeded
Dec 4 15:46:03 flulinux01 syslog: syslogd startup succeeded
Dec 4 15:46:03 flulinux01 syslog: klogd startup succeeded
Dec 4 15:46:03 flulinux01 irqbalance: irqbalance startup succeeded
Dec 4 15:46:03 flulinux01 portmap: portmap startup succeeded
Dec 4 15:46:03 flulinux01 nfslock: rpc.statd startup succeeded
Dec 4 15:46:03 flulinux01 mdmonitor: mdadm startup succeeded
Dec 4 15:46:03 flulinux01 mdmonitor: mdadm succeeded
Dec 4 15:46:04 flulinux01 audit: auditd startup succeeded
Dec 4 15:46:04 flulinux01 audit: auditd startup succeeded
Dec 4 15:46:04 flulinux01 raidmon: MegaCtrl startup succeeded
Dec 4 15:46:04 flulinux01 random: Initializing random number generator: succeeded
Dec 4 15:46:04 flulinux01 rc: Starting pcmcia: succeeded
Dec 4 15:46:05 flulinux01 netfs: Mounting other filesystems: succeeded
Dec 4 15:46:05 flulinux01 oracleasm: succeeded
Dec 4 15:46:05 flulinux01 last message repeated 2 times
Dec 4 15:46:06 flulinux01 autofs: automount startup succeeded
Dec 4 15:46:08 flulinux01 cups: cupsd startup succeeded
Dec 4 15:46:09 flulinux01 sshd: succeeded
Dec 4 15:46:09 flulinux01 xinetd: xinetd startup succeeded
Dec 4 15:46:09 flulinux01 ntpd: succeeded
Dec 4 15:46:07 flulinux01 last message repeated 2 times
Dec 4 15:46:07 flulinux01 ntpd: ntpd startup succeeded
Dec 4 15:46:08 flulinux01 sendmail: sendmail startup succeeded
Dec 4 15:46:08 flulinux01 sendmail: sm-client startup succeeded
Dec 4 15:46:09 flulinux01 gpm: gpm startup succeeded
Dec 4 15:46:09 flulinux01 crond: crond startup succeeded
Dec 4 15:46:09 flulinux01 xfs: xfs startup succeeded
Dec 4 15:46:09 flulinux01 atd: atd startup succeeded
Dec 4 15:46:09 flulinux01 rhnsd: rhnsd startup succeeded
Dec 5 04:02:02 flulinux01 cups: cupsd shutdown succeeded
Dec 5 04:02:02 flulinux01 cups: cupsd startup succeeded
Dec 11 15:19:31 flulinux01 syslog: syslogd startup succeeded
Dec 11 15:19:31 flulinux01 syslog: klogd startup succeeded
Dec 11 15:19:31 flulinux01 irqbalance: irqbalance startup succeeded
Dec 11 15:19:31 flulinux01 portmap: portmap startup succeeded
Dec 11 15:19:31 flulinux01 nfslock: rpc.statd startup succeeded
Dec 11 15:19:31 flulinux01 mdmonitor: mdadm startup succeeded
Dec 11 15:19:31 flulinux01 mdmonitor: mdadm succeeded
Dec 11 15:19:32 flulinux01 audit: auditd startup succeeded
Dec 11 15:19:32 flulinux01 audit: auditd startup succeeded
Dec 11 15:19:32 flulinux01 raidmon: MegaCtrl startup succeeded
Dec 11 15:19:32 flulinux01 random: Initializing random number generator: succeeded
Dec 11 15:19:32 flulinux01 rc: Starting pcmcia: succeeded
Dec 11 15:19:32 flulinux01 netfs: Mounting other filesystems: succeeded
Dec 11 15:19:32 flulinux01 oracleasm: succeeded
Dec 11 15:19:32 flulinux01 last message repeated 2 times
Dec 11 15:19:34 flulinux01 autofs: automount startup succeeded
Dec 11 15:19:36 flulinux01 cups: cupsd startup succeeded
Dec 11 15:19:36 flulinux01 sshd: succeeded
Dec 11 15:19:36 flulinux01 xinetd: xinetd startup succeeded
Dec 11 15:19:36 flulinux01 ntpd: succeeded
Dec 11 15:19:34 flulinux01 last message repeated 2 times
Dec 11 15:19:35 flulinux01 ntpd: ntpd startup succeeded
Dec 11 15:19:35 flulinux01 sendmail: sendmail startup succeeded
Dec 11 15:19:35 flulinux01 sendmail: sm-client startup succeeded
Dec 11 15:19:36 flulinux01 gpm: gpm startup succeeded
Dec 11 15:19:36 flulinux01 crond: crond startup succeeded
Dec 11 15:19:36 flulinux01 xfs: xfs startup succeeded
Dec 11 15:19:36 flulinux01 atd: atd startup succeeded
Dec 11 15:19:36 flulinux01 rhnsd: rhnsd startup succeeded
Dec 12 04:02:02 flulinux01 cups: cupsd shutdown succeeded
Dec 12 04:02:02 flulinux01 cups: cupsd startup succeeded

On Dec 4 and Dec 11, I restarted the server. The others happened automatically, I think. Does this information help? |
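One way to read a boot.log like this is to separate full boots from single-service restarts. A rough sketch, assuming (as the excerpt above suggests) that a real boot writes "syslog: syslogd startup succeeded" while a lone cupsd shutdown/startup pair does not mean the machine booted. The function name list_boots is hypothetical.

```shell
#!/bin/sh
# list_boots FILE
# Print the timestamp of every line that records syslogd starting during
# boot. Assumption: a full boot always writes this line to boot.log.
list_boots() {
    awk '/syslog: syslogd startup succeeded/ { print $1, $2, $3 }' "$1"
}

# Example:
# list_boots /var/log/boot.log
```

On the excerpt above this would list only Dec 4 15:46:03 and Dec 11 15:19:31, the two manual restarts, which suggests the 04:02 entries are service restarts rather than reboots.
|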
Quote:
I'm trying to determine whether the problem is really with your NIC, or if the server is indeed crashing. |
Sorry. I just cannot ssh to it or ping it remotely. I haven't tried to log in directly.
When I tried to ssh to it, I entered the host name, port number, user name (root) and password, and the screen froze without any response. The first time I could still ping it, but the second time I could not. The screen was just frozen; I couldn't see any dump or panic information. No, I saw nothing. Why does it keep restarting every Sunday morning? |
I'm not sure about the timing (Sunday mornings), but unattended lockups like that frequently mean hardware trouble.
A couple things to check out:
Pay attention to any output from both of these. |
Is anything different happening over the weekends at your facility? At a previous company, we had a customer whose server crashed every few days, very late at night. It turned out to be a janitor who was plugging his vacuum into the rack that the server was in. Is it possible that something at your site is interrupting power, or something similar, and triggering your crashes?
|
The server is located in another office. I have no idea whether any accidents happen there on weekends. According to the log, it reboots at the same time, 04:02:02. It is hard to imagine an accident happening at exactly the same time every Sunday.
Thank you for your advice! |
The fact that it's happening at exactly the same time (04:02:02 on Sunday morning) makes me initially rule out a hardware issue. What does cron start at (or around) 4am on Sunday?
It's probably that backup script (although it looks like that should start at 4:30)... can you run the script manually or does it reboot? How much detail does the script provide logging-wise and can you watch it when it runs? Do you know what part of the script could be causing the reboot? |
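If the script logs very little, a small wrapper can capture what it prints and how it exits. This is only a sketch: run_logged is a made-up helper, and the log path in the example is an assumption, not something from this thread.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Append a start stamp, the command's output (stdout and stderr), and its
# exit status to LOGFILE, then return that exit status.
run_logged() {
    logfile=$1
    shift
    echo "=== $(date) START: $*" >> "$logfile"
    sync    # push the start stamp to disk before the command runs
    status=0
    "$@" >> "$logfile" 2>&1 || status=$?
    echo "=== $(date) END: exit status $status" >> "$logfile"
    sync    # push everything written so far to disk
    return "$status"
}

# Example (hypothetical log path):
# run_logged /var/log/backup_part_tape.log /script/backup_part_tape
```

A START line with no matching END line after a freeze would show that the machine locked up while the script was running.
|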
I'm a little confused. (Just re-read the thread to make sure I didn't miss something.) Is it freezing or rebooting?
|
Yes, according to the boot log, it reboots on Sunday, which I didn't know before.
Both times it went down, on Dec 3 and Dec 9, it was just frozen. The cron job on Sunday and Monday mornings (4:30 am) is like this:

mkdir /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/data_* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/c-* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/arch* /rawdate/oraclebackup/`date '+%d-%B-%Y'`
rm -f /rawdata/data_*
rm -f /rawdata/c-*
rm -f /rawdata/arch*

It just rearranges some files. The script run on Saturday is like this:

#move database backups to /raw/oraclebackup
mkdir /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/data_* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/c-* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
cp /rawdata/arch* /rawdata/oraclebackup/`date '+%d-%B-%Y'`
rm -f /rawdata/data_*
rm -f /rawdata/c-*
rm -f /rawdata/arch*
#end of moving
DIRECTORIES="/"
#DIRECTORIES="/etc /home /opt /root /var"
BACKUPTO=/dev/nst0
TAR=/bin/tar
PATH=/usr/local/bin:/usr/bin:/bin
START=`date +%s`
#Daily full backup
NEWER=""
echo "*****start time*****"
date
echo
if mt -f /$BACKUPTO status | grep "ONLINE"; then
    echo "***** finding sockets*****"
    find $DIRECTORIES -type s > sockets
    echo
    #echo "*****setting compression on*****"
    #mt -f /$BACKUP compression 1
    echo
    #echo "*****setting type to DLT 35 Compressed*****"
    #mt -f /$BACKUPTO setdensity 0x85
    echo
    echo "*****archiving*****:)"
    $TAR $NEWER -cf $BACKUPTO $DIRECTORIES --exclude-from=sockets --absolute-names --totals
    echo
    echo "*****tape-drive status*****"
    mt -f /$BACKUPTO status
    echo
    echo "*****ejection tape*****"
    mt -f /$BACKUPTO offline
    echo
    echo "*****end time*****"
    date
else
    echo "*****WARNING TAPE DRIVE IS OFFLINE, NO BACKUPS PERFORMED*****"
fi
FINISH=`date +%s`
diff=$((FINISH - START))
echo -n "***** Total Run Time: "
HRS=`expr $diff / 3600`
MIN=`expr $diff % 3600 / 60`
SEC=`expr $diff % 3600 % 60`
if [ $HRS -gt 0 ]
then
    echo -n "$HRS hrs. "
fi
if [ $MIN -gt 0 ]
then
    echo -n "$MIN mins. "
fi
if [ $SEC -gt 0 ]
then
    if [ $MIN -gt 0 ]
    then
        echo "and $SEC secs. "
    elif [ $HRS -gt 0 ]
    then
        echo "and $SEC secs. "
    else
        echo "$SEC secs. "
    fi
fi

If there is no tape in the tape drive, it gives an error but won't reboot. During Thanksgiving, because no one changed the ejected tape, it gave errors for several days. Could that be the reason? Thank you all! |
If it's had a few hiccups and now freezes occasionally, it's possible that the disk purging has got out of sync and one or more partitions (temporarily) fill up.
Usually it's a good idea to compress backups, e.g. with gzip, but this can use a lot of temp space while gzip is in progress. If it's also rebooting on a regular basis, that sounds like a separate issue. It could be worth looking at all the cron files, especially /etc/crontab. If the reboots are not at the exact same time, the janitor/cleaner thing could be happening; it's happened to me, it's not just an IT myth. sigh... |
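One way the purge could get out of step: in a copy-then-delete sequence like the one in the posted backup script, the rm -f lines run whether or not the cp lines worked, so a failed copy (full disk, mistyped destination) would still delete the originals. The Sunday/Monday listing above also has one destination spelled /rawdate, which may just be a typo in the post. A hedged sketch of a stricter version follows. archive_files is a hypothetical name, and the file patterns are the ones from the posted script.

```shell
#!/bin/sh
# archive_files SRC_DIR DEST_ROOT
# Copy data_*, c-* and arch* files from SRC_DIR into DEST_ROOT/<date>,
# deleting each source file only after its own copy succeeded.
archive_files() {
    src=$1
    dest="$2/$(date '+%d-%B-%Y')"    # evaluate the date once, not per line
    mkdir -p "$dest" || return 1
    for f in "$src"/data_* "$src"/c-* "$src"/arch*; do
        [ -e "$f" ] || continue       # pattern matched nothing
        cp "$f" "$dest"/ || return 1  # stop here; this file stays in place
        rm -f "$f"
    done
}

# Example, using the paths from the posted script:
# archive_files /rawdata /rawdata/oraclebackup
```

A non-zero return means something was left behind in the source directory, which is easier to notice than silently lost files.
|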
Maybe you are right, because /dev/sda6 is nearly full. Here is the output of df:

Filesystem   Size   Used  Avail  Use%  Mounted on
/dev/sda7    4.0G   1.1G   2.7G   29%  /
none            0      0      0     -  /proc
none            0      0      0     -  /dev/pts
usbdevfs        0      0      0     -  /proc/bus/usb
/dev/sda3    190M    15M   166M    9%  /boot
/dev/sda8    2.0G   806M   1.1G   43%  /home
/dev/sda6    4.0G   3.1G   741M   81%  /opt/oracle
none         2.0G      0   2.0G    0%  /dev/shm
/dev/sda12  1012M    33M   928M    4%  /tmp
/dev/sda5    5.0G   2.4G   2.4G   50%  /usr
/dev/sdd1    135G    20G   109G   16%  /rawdata
oracleasmfs     0      0      0     -  /dev/oracleasm

We have an Oracle database preinstalled on our server, and we are using it now. Maybe that is the reason, but I am not sure. Regarding the reboot, I agree with you that it is a separate issue. I checked /etc/crontab, which looks like this:

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly

and cron.hourly is like:
*** cron.hourly: directory ***
cron.daily is like:
*** cron.daily: directory ***
cron.weekly is like:
*** cron.weekly: directory ***
cron.monthly is like:
*** cron.monthly: directory ***

It is the original version and I haven't changed it. Thank you! |
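The "*** cron.daily: directory ***" messages mean those paths are directories, so their contents still need listing. The crontab above runs cron.daily at 04:02, the same minute as the weekly syslogd and cupsd restarts in the logs, so a job in there may explain them (log rotation is one candidate, though that is an assumption and not confirmed in this thread). A small sketch for listing what is in such a directory, with a made-up function name:

```shell
#!/bin/sh
# list_cron_parts DIR
# Print the regular, executable files in DIR, one per line. This only
# approximates what run-parts would execute; run-parts applies further
# name filters that are not reproduced here.
list_cron_parts() {
    for f in "$1"/*; do
        if [ -f "$f" ] && [ -x "$f" ]; then
            echo "$f"
        fi
    done
}

# Example:
# list_cron_parts /etc/cron.daily
```

Reading each listed script should show which one touches syslogd or cupsd.
|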
OK.
I have now rearranged some of the Oracle files. The system now looks like this:

[root@flulinux01 udump]# df -k
Filesystem  1K-blocks      Used  Available  Use%  Mounted on
/dev/sda7     4127076   1110460    2806972   29%  /
/dev/sda3      194449     15120     169289    9%  /boot
/dev/sda8     2063504    825032    1133652   43%  /home
/dev/sda6     4127076   2679672    1237760   69%  /opt/oracle
none          2047704         0    2047704    0%  /dev/shm
/dev/sda12    1035660     33352     949700    4%  /tmp
/dev/sda5     5162796   2430800    2469740   50%  /usr
/dev/sdd1   141003764  20712152  113129036   16%  /rawdata

/opt/oracle went from 81% used to 69% used. I will wait and see what happens this weekend. Thank you all! |
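To keep an eye on this over the weekend, a small check could flag any filesystem above a chosen usage level. A minimal sketch, assuming POSIX-style df -P output (one line per filesystem, usage in column 5, mount point in column 6). The function name and the 80 percent figure in the example are arbitrary choices.

```shell
#!/bin/sh
# over_threshold PERCENT
# Read `df -P` output on stdin and print the mount point and usage of every
# filesystem at or above PERCENT. Lines without a numeric usage are skipped.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use ~ /^[0-9]+$/ && use + 0 >= limit + 0) print $6, $5
    }'
}

# Example:
# df -P | over_threshold 80
```

Run from cron every few minutes with the output appended to a file, this would show whether a partition fills up shortly before a freeze.
|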