server hangs at same time every day

Avatar · 03-06-2007, 09:58 AM

This post is a continuation of this thread: http://www.linuxquestions.org/questi...d.php?t=517683

But due to new developments, it's not the problem I originally thought so I believe it is appropriate to start a new thread.

The problem is: my old, Mandrake Multi Network Firewall-based server (well, the server isn't old, but the OS is) crashes every single night just after 4:00 AM. We come in to find the server hung, and have to do a hard reset to get it back up.

/var/log/messages log looks like this:

Code:

Mar  6 04:02:01 MDKSERV CROND[12928]: (root) CMD (nice -n 19 run-parts /etc/cron.daily)
Mar  6 04:02:01 MDKSERV CROND[12929]: (root) CMD (   /usr/share/msec/promisc_check.sh)
Mar  6 04:02:01 MDKSERV anacron[12939]: Updated timestamp for job `cron.daily' to 2007-03-06
Mar  6 04:03:00 MDKSERV CROND[12996]: (root) CMD (   /usr/share/msec/promisc_check.sh)
Mar  6 04:04:00 MDKSERV CROND[13048]: (root) CMD (   /usr/share/msec/promisc_check.sh)
Mar  6 08:09:23 MDKSERV syslogd 1.4.1: restart.
Mar  6 08:09:23 MDKSERV kernel: klogd 1.4.1, log source = /proc/kmsg started.
Mar  6 08:09:23 MDKSERV kernel: Inspecting /boot/System.map-2.4.18-8.1mdksecure
Mar  6 08:09:24 MDKSERV kernel: Loaded 16536 symbols from /boot/System.map-2.4.18-8.1mdksecure.
Mar  6 08:09:24 MDKSERV kernel: Symbols match kernel version 2.4.18.
(etc...)

How can I disable whatever is causing the server to hang? This log doesn't tell me what the problem is. How can I find it?

Thanks! This has been driving me nuts for a long time!

theNbomr · 03-06-2007, 10:23 AM

The problem appears to be in a daily cron job. Look at the files in /etc/cron.daily/. You may be able to see which one is blowing up. If you can't see any obvious problem, start removing jobs until the pain stops. You may also be able to test individual components by running them at the command line (as root).

--- rod.

Avatar · 03-06-2007, 10:57 AM

Hi rod, thanks for your answer. I looked in my cron.daily directory but I did not see what could be causing a problem.

Do you think it would be possible for me to edit all those files with a line like

Code:

echo -n " Running <filename>"

and then run all the files at once with the command that is inside my /etc/crontab? i.e.

Code:

nice -n 19 run-parts /etc/cron.daily

What I am hoping this will do, is show me an output like

Code:

Running 0anacrontab
Running 0sarg
Running clean-naat
Running logrotate
(...)

and it would stop at the one causing the problem? Do you think this would work?

theNbomr · 03-06-2007, 12:23 PM

That should be a valid diagnostic. Note that some things that run as cron jobs will behave differently when run repeatedly in the manner we are discussing. Things that clean up log files, temp directories, etc. may not do anything on the second or third run in close succession.
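
For what it's worth, here is a rough sketch of that tracing idea as a wrapper, so the individual cron.daily files don't need editing. It is not what run-parts itself does: the function name trace_parts and the default log path are made up for illustration, and it does not copy every file-skipping rule run-parts applies. The sync calls are there so the last "Running" line reaches the disk before a hard hang.

```shell
#!/bin/sh
# trace_parts DIR [LOGFILE]
# Run every executable file in DIR in name order, logging each name
# before it starts and its exit status after it ends.
# Hypothetical helper; the default log path is an assumption.
trace_parts() {
    dir=$1
    log=${2:-/root/cron-trace.log}
    for script in "$dir"/*; do
        # Skip directories and anything that is not executable
        [ -f "$script" ] && [ -x "$script" ] || continue
        echo "$(date '+%b %e %T') Running $script" >> "$log"
        sync                      # flush the entry to disk before the job starts
        "$script"
        status=$?                 # capture the exit status immediately
        echo "$(date '+%b %e %T') Finished $script (exit $status)" >> "$log"
        sync
    done
}
```

Called as `trace_parts /etc/cron.daily` (as root), the last "Running" line without a matching "Finished" line after the reboot points at the suspect.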

--- rod.

TigerOC · 03-06-2007, 02:13 PM

Quote:

Mar 6 04:04:00 MDKSERV CROND[13048]: (root) CMD ( /usr/share/msec/promisc_check.sh)

Look at your crontab; the problem lies either in this line or, more likely, in the line after it. Examine the instruction carefully and see if it is valid.
Also check /var/log/dmesg and syslog for errors.

Avatar · 03-06-2007, 02:54 PM

Hi TigerOC,

That promisc_check.sh thing runs every single minute. So it would be in another directory, I think. The problem I have happens only once a day at just after 4:00 AM. How can I know which instruction would execute immediately after? It seems to me, and I'm just guessing, that the daily, hourly, and minute crons can run at the same time? As in, the minute one can run in between the daily ones?

The only cron.daily entry I see in the log is the one from "anacron" as there is an entry in /etc/cron.daily called "0anacron"

Edit: I checked /var/log/dmesg; it looks like all the stuff that my screen says when I boot up. There are no other kinds of things logged in there. If you want it, I will post it.

I also checked /var/log/syslog and was shocked to find it is 2.9 GB in size. I could hardly believe my eyes: 2.9 GB of TEXT?? Yikes. Looks like that log hasn't been rotated since December 10, 4:03 AM (right around the time the crashing started). I am trying to pull out only the stuff from today but it will take a while...
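
A small sketch for pulling one day's lines out of a log that size without opening it in an editor. day_entries is a hypothetical helper name; it assumes the standard syslog timestamp, where single-digit days are padded with a space ("Mar  6").

```shell
#!/bin/sh
# day_entries LOGFILE MONTH DAY
# Print only the lines whose timestamp starts with the given month and day.
# Hypothetical helper; assumes the classic syslog "Mon dd hh:mm:ss" prefix.
day_entries() {
    # %2d pads single-digit days with a space, matching syslog ("Mar  6", "Mar 16")
    prefix=$(printf '%s %2d ' "$2" "$3")
    grep "^$prefix" "$1"
}
```

For example, `day_entries /var/log/syslog Mar 6 > /tmp/syslog-mar6.txt` writes just that day's entries to a separate file (the output path is only an example).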

Avatar · 03-06-2007, 03:08 PM

Here is my syslog: It's the same as the other one.

Code:

(...)
Mar  6 04:00:26 MDKSERV adsl: adsl-start startup succeeded
Mar  6 04:00:59 MDKSERV CROND[12900]: (root) CMD (nice -n 19 run-parts /etc/cron.hourly)
Mar  6 04:00:59 MDKSERV CROND[12902]: (root) CMD (   /usr/share/msec/promisc_check.sh)
Mar  6 04:02:01 MDKSERV CROND[12928]: (root) CMD (nice -n 19 run-parts /etc/cron.daily)
Mar  6 04:02:01 MDKSERV CROND[12929]: (root) CMD (   /usr/share/msec/promisc_check.sh)
Mar  6 04:02:01 MDKSERV anacron[12939]: Updated timestamp for job `cron.daily' to 2007-03-06
Mar  6 04:03:00 MDKSERV CROND[12996]: (root) CMD (   /usr/share/msec/promisc_check.sh)
Mar  6 04:04:00 MDKSERV CROND[13048]: (root) CMD (   /usr/share/msec/promisc_check.sh)
Mar  6 08:09:23 MDKSERV syslogd 1.4.1: restart.
Mar  6 08:09:23 MDKSERV kernel: klogd 1.4.1, log source = /proc/kmsg started.
Mar  6 08:09:23 MDKSERV kernel: Inspecting /boot/System.map-2.4.18-8.1mdksecure
Mar  6 08:09:24 MDKSERV kernel: Loaded 16536 symbols from /boot/System.map-2.4.18-8.1mdksecure.
Mar  6 08:09:24 MDKSERV kernel: Symbols match kernel version 2.4.18.
Mar  6 08:09:24 MDKSERV kernel: Loaded 257 symbols from 11 modules.
(...)

BillyGalbreath · 03-06-2007, 03:09 PM

Is there a backup application installed on the server?

The last time I saw a server do this, it was due to a daily overnight backup system freezing because it ran out of memory. It was also at 4:07 AM every single night like clockwork.

Try disabling your backup system for a night or two and see what happens, or just upgrade your RAM (and maybe swap too) to at least double what you currently have.

TigerOC · 03-07-2007, 01:57 AM

There are some oddities here, and they also relate to the previous thread. Firstly, there is no way your syslog should be that big. Normally the system (mine anyway) starts a new log every day and the oldest one is dumped. So how old is the syslog? Apache is stopped and restarted once a week by cron and a new log started. This is not normal. I would say that this is not a crash but the system freezing up because of lack of resources? Is the system totally unresponsive to input? Do you have to reboot, and if so, are you using reset or powering off?

Avatar · 03-07-2007, 02:59 PM

Edit: to Billy: No, there is no backup system, but you are right, SOMETHING is freezing up.
Tiger: Yes, it is locking up completely; the keyboard is unresponsive. By the time we come in in the morning, the screen has gone to sleep, so I never saw the error message that was (apparently) being displayed (see below). We had to power the server off and on using the power button.

OK, so by removing everything from /etc/cron.d and then re-adding one at a time, I managed to trace it down to the logrotate script. (This explains the huge syslog file.)

So then I looked in /etc/logrotate.d and by process of elimination narrowed it down to squid's logrotate script. (I found that 2 of squid's logs were being rotated but not the other 2, so it must crash in between.) The actual error message causing the server to hang is

Code:

Serverworks OSB4 in impossible state.
Disable UDMA or if you are using Seagate then try switching disk types on this controller.
OSB4: Continuing might cause disk corruption

I have seen this error message before, and it is apparently a bug in the kernel version I am using. I tried to upgrade to the latest 2.4 kernel before, because this error would sometimes happen on boot, but that didn't work at all, so I had to revert.

Anyway, a workaround for now is to remove the squid script from logrotate entirely. The bad news is, we use squid and its logs are going to be huge.

Any suggestions welcome.

BillyGalbreath · 03-07-2007, 04:49 PM

Try upgrading to the newest 2.4 kernel again. If no dice, then try a 2.6 kernel. If no dice, disable DMA. If no dice, just don't run that script.
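
A sketch of the "disable DMA" step using hdparm. The device name /dev/hda is an assumption (use whichever disk hangs off the OSB4 controller), dma_off is just an illustrative name, and the DRY variable is only there so the commands can be previewed with DRY=echo before running them for real as root.

```shell
#!/bin/sh
# dma_off [DISK]
# Show the current DMA setting for an IDE disk, then switch DMA off.
# Hypothetical helper; /dev/hda is an assumed default device name.
dma_off() {
    disk=${1:-/dev/hda}
    ${DRY:-} hdparm -d "$disk"     # query the current using_dma setting
    ${DRY:-} hdparm -d0 "$disk"    # turn DMA off for this disk
}
```

Running `DRY=echo dma_off /dev/hda` only prints the two commands. The hdparm setting does not survive a reboot, so it would have to be reapplied from a boot script if it turns out to help.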

TigerOC · 03-08-2007, 01:52 AM

I would suggest installing a 2.6 kernel. Mandrake must have a package for download which would be easy to install. I am very surprised that you have not had corruption already either from hard reboots or the kernel bug. At least you know the cause and it should be fairly easy to correct. If all else fails install a new drive and dd the contents over.

Avatar · 03-09-2007, 12:56 PM

Thanks for the replies! I just wanted to confirm that it is that script: I moved it out of the logrotate.d directory and there have been no more lock-ups in 2 days!

I am installing Ubuntu Edgy 6.10, which has kernel 2.6.17, on another machine, and it will replace this one. Hopefully I will never see that error message again!

Thanks for the help, I appreciate it.