But due to new developments, it's not the problem I originally thought so I believe it is appropriate to start a new thread.
The problem is: my old, Mandrake Multi Network Firewall-based server (well, the server isn't old, but the OS is) crashes every single night just after 4:00 AM. We come in to find the server hung and have to do a hard reset to get it back up.
/var/log/messages log looks like this:
Code:
Mar 6 04:02:01 MDKSERV CROND[12928]: (root) CMD (nice -n 19 run-parts /etc/cron.daily)
Mar 6 04:02:01 MDKSERV CROND[12929]: (root) CMD ( /usr/share/msec/promisc_check.sh)
Mar 6 04:02:01 MDKSERV anacron[12939]: Updated timestamp for job `cron.daily' to 2007-03-06
Mar 6 04:03:00 MDKSERV CROND[12996]: (root) CMD ( /usr/share/msec/promisc_check.sh)
Mar 6 04:04:00 MDKSERV CROND[13048]: (root) CMD ( /usr/share/msec/promisc_check.sh)
Mar 6 08:09:23 MDKSERV syslogd 1.4.1: restart.
Mar 6 08:09:23 MDKSERV kernel: klogd 1.4.1, log source = /proc/kmsg started.
Mar 6 08:09:23 MDKSERV kernel: Inspecting /boot/System.map-2.4.18-8.1mdksecure
Mar 6 08:09:24 MDKSERV kernel: Loaded 16536 symbols from /boot/System.map-2.4.18-8.1mdksecure.
Mar 6 08:09:24 MDKSERV kernel: Symbols match kernel version 2.4.18.
(etc...)
How can I disable whatever is causing the server to hang? This log doesn't tell me what the problem is. How can I find it?
Thanks! This has been driving me nuts for a long time!
The problem appears to be in a daily cron job. Look in the files in /etc/cron.daily/.... You may be able to see which one is blowing up. If you can't see any obvious problem, start removing various jobs until the pain stops. You may be able to test individual components by running them at the command line (as root).
That should be a valid diagnostic. Note that some things that run as cron jobs will behave differently when run repetitively in the manner we are discussing. Things that clean up logfiles, temp directories, etc. may not do anything on the second or third iterations in close succession.
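A minimal sketch of running the daily jobs one at a time, assuming they are plain executables in a single directory. The function name `run_jobs_traced` and the trace file path are hypothetical. Each job name is written and flushed to disk before the job starts, so if the machine hangs, the last "starting" line in the trace should point at the suspect.

```shell
#!/bin/sh
# Run each executable file in a cron directory one at a time and record
# its name BEFORE it starts, so the trace survives a hard reset.
# run_jobs_traced and the trace file location are hypothetical names.
run_jobs_traced() {
    dir=$1
    trace=$2
    for job in "$dir"/*; do
        # Skip directories and files that are not executable.
        [ -f "$job" ] && [ -x "$job" ] || continue
        echo "$(date '+%b %e %H:%M:%S') starting $job" >> "$trace"
        sync    # flush the trace to disk before the job can hang the box
        status=0
        "$job" || status=$?
        echo "$(date '+%b %e %H:%M:%S') finished $job (exit $status)" >> "$trace"
    done
}

# Example (as root): run_jobs_traced /etc/cron.daily /root/cron-trace.log
```

A job with a "starting" line but no matching "finished" line is the one to look at first.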
Mar 6 04:04:00 MDKSERV CROND[13048]: (root) CMD ( /usr/share/msec/promisc_check.sh)
Look at your crontab and the problem either lies in this line or probably the succeeding line. Examine the instruction carefully and see if it is valid.
Also check /var/log/dmesg and syslog for errors.
That promisc_check.sh thing runs every single minute. So it would be in another directory, I think. The problem I have happens only once a day at just after 4:00 AM. How can I know which instruction would execute immediately after? It seems to me, and I'm just guessing, that the daily, hourly, and minute crons can run at the same time? As in, the minute one can run in between the daily ones?
The only cron.daily entry I see in the log is the one from "anacron" as there is an entry in /etc/cron.daily called "0anacron"
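On the scheduling question: cron starts every due entry as its own process, so a per-minute job can be logged while the daily run is still in progress. A small sketch for listing the active schedule lines, assuming the system-wide table is /etc/crontab (the per-minute entry may live elsewhere, such as root's own crontab). `show_schedule` is a hypothetical helper name.

```shell
# Print the active (non-comment, non-blank) lines of a crontab-style file.
# show_schedule is a hypothetical helper; /etc/crontab is an assumed location.
show_schedule() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Example: show_schedule /etc/crontab
```

The entry whose time fields match 04:02 is the one that launches run-parts on /etc/cron.daily, as seen in the log above.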
Edit: I checked /var/log/dmesg; it looks like all the stuff that my screen says when I boot up. There are no other kinds of things logged in there. If you want it, I will post it.
I also checked /var/log/syslog and was shocked to find it is 2.9 GB in size. I could hardly believe my eyes: 2.9 GB of TEXT?? Yikes. Looks like that log hasn't been rotated since December 10, 4:03 AM (right around the time the crashing started). I am trying to pull out only the stuff from today but it will take a while...
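A rough sketch for pulling a single day out of a very large log without opening it in an editor. It assumes classic syslog timestamps at the start of each line; those normally pad single-digit days with an extra space ("Mar  6"), so the pattern accepts one or more spaces. `extract_day` and the output path are hypothetical.

```shell
# Copy only one day's entries out of a large syslog-format file.
# The pattern allows one or more spaces between month and day because
# syslog pads single-digit days. extract_day is a hypothetical helper.
extract_day() {
    month=$1
    day=$2
    src=$3
    dest=$4
    grep -E "^$month +$day " "$src" > "$dest"
}

# Example: extract_day Mar 6 /var/log/syslog /tmp/syslog-mar6.txt
```

If only the moments before the hang matter, `tail -n 200 /var/log/syslog` reads from the end of the file and avoids scanning all 2.9 GB.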
Is there a backup application installed on the server?
Last time I've seen a server do this was due to a daily overnight backup system freezing because it ran out of memory. It was also at 4:07AM every single night like clockwork.
Try disabling your backup system for a night or two and see what happens. Or just upgrade your RAM (and maybe swap too) to at least double what you currently have.
There are some oddities here, and this also relates to the previous thread. Firstly, there is no way your syslog should be that big. Normally the system (mine anyway) starts a new log every day and the oldest one is dumped. So how old is the syslog? Apache is stopped and restarted once a week by cron and a new log started. This is not normal. I would say that this is not a crash but the system freezing up because of a lack of resources. Is the system totally unresponsive to input? Do you have to reboot, and if so, are you using reset or powering off?
Edit: to Billy: No there is no backup system, but you are right SOMETHING is freezing up.
Tiger: Yes it is locking up completely, keyboard is unresponsive. By the time we come in in the morning, the screen has gone to sleep and I never saw the error message that was (apparently) being displayed (see below). We had to power off/on the server by using the power button.
OK, so by removing everything from /etc/cron.d and then re-adding one at a time, I managed to trace it down to the logrotate script. (This explains the huge syslog file.)
So then I looked in /etc/logrotate.d and by process of elimination narrowed it down to squid's logrotate script. (I found that 2 of squid's logs were being rotated but not the other 2, so it must crash in between.) The actual error message causing the server to hang is:
Code:
Serverworks OSB4 in impossible state.
Disable UDMA or if you are using Seagate then try switching disk types on this controller.
OSB4: Continuing might cause disk corruption
I have seen this error message before, and it is apparently a bug in the kernel version I am using. I tried upgrading to the latest 2.4 kernel before, because this error would sometimes happen on boot, but that didn't work at all, so I had to revert.
Anyway, a workaround for now is to remove the squid script from logrotate entirely. The bad news is, we use squid and its logs are going to be huge.
I would suggest installing a 2.6 kernel. Mandrake must have a package for download which would be easy to install. I am very surprised that you have not had corruption already either from hard reboots or the kernel bug. At least you know the cause and it should be fairly easy to correct. If all else fails install a new drive and dd the contents over.
Thanks for the replies! I just wanted to confirm that it is that script: I moved it out of the logrotate.d directory and there have been no more lockups in 2 days!
I am installing Ubuntu Edgy 6.10 which has kernel 2.6.17, on another machine and I will replace this one. Hopefully I will never see that error message again!