Trace down system crashes?

J_Szucs · 03-31-2003, 10:29 PM

Our main FreeBSD server running many services crashes (freezes) too often, sometimes two times a day.
I have no clue what causes the crashes, as normally I do not find any error messages in /var/log/messages or in root's mailbox.
Memory and processor usage seem to be moderate in normal operation (although I do not know what the situation is immediately before the crashes).
Are there any known issues that can result in a system crash on FreeBSD?
Could you give me some hints on how to find the reason for the crashes?
E.g. is there a way to continuously monitor the server (the running or just started processes, system resources) and/or produce verbose logs so that the reason for the crashes can be found?
What do you do in such cases?

Quintesse · 04-01-2003, 01:38 AM

I have the same problem sporadically on my RH8 system and would like some help as well in figuring out what is going wrong. But I haven't got a clue where to start looking (it never happens when I'm at the keyboard).

Blackknight · 04-01-2003, 02:47 AM

Hi,
For FreeBSD (sorry, I don't have RH), if the system crashes with a kernel panic, you should look at keeping the trace by saving the core generated by the kernel. To do that, you should read these two articles by Michael Lucas:
http://www.onlamp.com/pub/a/bsd/2002...y_Daemons.html
http://www.onlamp.com/pub/a/bsd/2002...y_Daemons.html
If your system doesn't panic, that's another problem (in fact, it could be many).

Quintesse · 04-01-2003, 04:45 AM

Well yes, but that's the point: the problem that I experience (and J_Szucs at times as well, it seems) is that the computer freezes (no SSH access either) and the only thing left to do is reboot.
It DOES react to the Ctrl+alt+del combination so SOMETHING is still alive it seems.

leifton · 04-01-2003, 06:07 AM

Quintesse, what do you mean by "It DOES react to the Ctrl+alt+del"? How does it react? Does it shut down? Does it go back to something? Are you running XWindows or some other windowing system? What services (you said many, J_Szucs) are you running and what is their average load? What are your kernel and BSD versions? Are there other users on the system at the same time?

Let's start with that.

Quintesse · 04-01-2003, 06:39 AM

Sorry for not having been specific enough

First of all, when I find the system in its frozen state it is always with a blank/black screen. It does not react to any input from keyboard (except for ctrl+alt+del) or mouse. Trying to access any of the servers running on the system from the LAN fails (the computer can't be found at all).

CTRL+ALT+DEL: you can hear the harddrive start to whirr and after several minutes the system will automatically reboot.

Distro: RedHat 8
Kernel: 2.4.18-24.8.0
Desktop: KDE 3.1.1
Services: mostly standard RH8 services but include at least SSH, Samba, DHCP, DNS, iptables. No web, no ftp.
Load: unknown because I'm never there when it happens, but I would expect it to be very low because that is what the average system load is.
Users: 0 (hopefully :-)

Is that enough info?

leifton · 04-01-2003, 06:47 AM

First, look in your /var/log/messages (the latest file; sometimes there are several). Note that the last set of messages will be from your startup, so find the last ones before those. Also check the other log files related to your services in the /var/log dir.
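A rough sketch of "find the last ones before those", in Python. It assumes you can name some text that appears exactly once per boot, on the first line the boot writes to the log; that marker is something you have to pick from your own log, and the function name and the `count` parameter are just illustrative.

```python
def lines_before_last_boot(lines, boot_marker, count=20):
    """Return up to `count` log lines written just before the last boot.

    `boot_marker` is an assumption: a substring that shows up once per
    boot, on the first line the boot sequence logs. Pick it from your
    own log file.
    """
    last_boot = None
    for index, line in enumerate(lines):
        if boot_marker in line:
            last_boot = index  # remember the most recent boot seen so far
    if last_boot is None:
        return []  # marker never seen, nothing to report
    start = max(0, last_boot - count)
    return lines[start:last_boot]
```

You would feed it the lines of the messages file (for example, the result of reading /var/log/messages and calling splitlines() on it) and print whatever comes back. If the list is empty or shows nothing unusual, the machine probably froze before anything could be logged.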

If not there, then sometimes they have their own log files in other directories.

My thought is that something should be in the messages file, but maybe not...let's start there. Also check whether there is a core dump in root's $HOME or in the $HOME of the user who was logged on at the actual machine (not remotely) at the time of the crash.

Do you have the latest RH rpm's for your services (# up2date)?

Quintesse · 04-01-2003, 08:40 AM

I'll take a good look the next time it happens, but in J_Szucs' case there aren't any messages, so what would you do in such a case?

And where would I look for cores if nobody was logged in? (no $HOME)

Nah, don't use up2date, hate it. I use apt-get for my system but I can assure you that the FreshRPMs that I use are up-to-date as well.

There is one thing that I consider suspect: the fact that it always happens when the computer is left alone for a longish time might suggest that it is either a screensaver or a power saving function that causes the problem. I might try turning them off, but unfortunately the problem is rather sporadic for me (not twice a day like J_Szucs). In reality I don't even care that much, it's just that it would be cool to have uptimes running in the months (years??).

leifton · 04-01-2003, 09:56 AM

I usually start with /var/log/messages, unless I know of another log file more specific to what is causing the problems.

After that I would do some diagnostics:
Take all of his services down and run only one at a time for a day (that would seem to be within his usual crash frequency) until he has determined whether one of them is causing the problem.

If none of them is causing a problem, I would boot the machine with no unnecessary services running, I MEAN CORE SERVICES ONLY. Leave the machine running and see if it fails then. If it does, that would lead me to believe it is either a bad kernel or bad hardware.

There are many other logging services, core dump finders and readers and diagnostic tools available throughout your open source sites on the Internet.
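If you would rather roll your own, here is a minimal sketch in Python of a snapshot logger: every minute it appends the load averages and the process table to a file and asks the OS to flush it to disk, so the last entry before a freeze has a chance of surviving the reboot. The function names, the log path and the one-minute interval are my assumptions; `ps aux` is used because it is accepted on both FreeBSD and Linux.

```python
import os
import subprocess
import time


def format_snapshot(timestamp, load, process_table):
    """Build one log entry: a header line, then the raw process table."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    header = "=== {} load {:.2f} {:.2f} {:.2f} ===".format(stamp, *load)
    return header + "\n" + process_table.rstrip("\n") + "\n"


def append_durably(path, text):
    """Append text and ask the OS to push it to disk right away."""
    with open(path, "a") as log:
        log.write(text)
        log.flush()
        os.fsync(log.fileno())


def monitor(path, interval_seconds=60):
    """Loop forever, writing one snapshot per interval (stop with Ctrl+C)."""
    while True:
        result = subprocess.run(["ps", "aux"], capture_output=True, text=True)
        entry = format_snapshot(time.time(), os.getloadavg(), result.stdout)
        append_durably(path, entry)
        time.sleep(interval_seconds)
```

Start it with something like monitor("/var/tmp/snapshots.log") and leave it running. After the next crash, the tail of that file shows what was running and how loaded the box was shortly before it went down. It will not catch anything that happens between two snapshots, so treat it as a hint, not proof.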

As problems or more information are found at each step above, I may or may not continue to test the other things. For example, if a service is found to have a problem, still test the others to make sure there are not two causing problems together, and so on...

There really are so many other things to check...which ultimately is why Linux is so awesome...

I would have had a year of uptime with my Linux box running at home if I hadn't had to move (I got 8 months).

leifton · 04-02-2003, 06:19 AM

J_Szucs,

Are you still pursuing help through this thread?

J_Szucs · 04-03-2003, 09:27 AM

Yeah,

Here are some details:
FreeBSD-4.4-STABLE
Intel 400MHz, 256M Kingston SD 100 Registered RAM
2 SCSI and one IDE HDD
Services:
Ipfw, NAT (for 64k internet connection), Squid, Httpd, Smbd, Nmbd, Named, Sendmail, Procmail, Spamassassin, Anomy sanitizer, cron, Rsync, SSH.
Regular, extensive backups into tar.gz and zip each night.
There are some 50 users.
I also plan to fire up PHP and PostgreSQL on this machine.

A snapshot from top (not at peak load time):
load averages: 0.06, 0.03, 0.02
44 processes: 1 running, 43 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 39M Active, 149M Inact, 38M Wired, 14M Cache, 35M Buf, 8976K Free
Swap: 300M Total, 13M Used, 278M Free, 4% Inuse

The amount of free memory (8976K above) is usually about 200K - 2M at peak load time (when most of the clients are active), but there is still some 100M Inact memory at those times. Swap usage is usually about 200K (even immediately after restart); the 13M indicated above is unusual.

Processor usage is normally low: 0 - 2 %. I experienced the highest processor usage with gzip, it was about 85%.

I read about savecore, and I plan to use it (re-configure the server after the next crash).
However, I suspect that I need a debug kernel to use it (it is mentioned somewhere). Is that so, or will it work with my present (not debug) kernel?

Crashes occur at irregular times; once I had a 45-day uptime, which was followed by a hard week with 5 crashes. The 45-day uptime was exceptional; the average is more like 4 days.

The crashes are different, sometimes I can restart by CTRL+ALT+DEL, but mostly not.
Errors do not seem to be logged in any case. (Only once I had an error message on the screen; it was something about being out of memory)

leifton · 04-03-2003, 11:04 AM

I would highly suggest using savecore. The out of memory error message makes me suspicious though.

Is it possible that one of the services you are running has a memory leak? Did any one service have minimal use during your 45 days of uptime, and hard consistent use during your difficult week? If so, that would likely point to that service having the memory leak. The high swap usage may also point to a memory leak. Was there a patch applied to a service or the kernel that ended your 45-day uptime?

Have you read post 9 in this thread? The approach may be painful, but it works consistently for determining the cause of a crashing problem.

Does your kernel have logging turned on? If so, yes, it is verbose, but it may be worth checking out. It will log the crap out of anything that is going through the kernel, and has good state information at all times...

J_Szucs · 04-03-2003, 02:12 PM

I will use savecore, but does it need a debug kernel?

When I had the 45-days uptime, the server did the same work as before and after, I could not find any difference. There were no patches applied or new programs installed lately.

There may be programs leaking memory, but how do I find them?

I have, however, two other ideas:
The FreeBSD kernel: it is not fine-tuned for server use, as it is a default FreeBSD installation. Can it result in crashes? If so, what to change?
A related question: I saw the kernel is configured for some 20 users. Does it mean that there should not be more than 20 clients connecting to the server simultaneously via e.g. smbd? (I am in doubt, because the smbd processes are owned by root, not by the specific users.) If the 50 smbd processes count as one, then the 20-users kernel option is more than sufficient, if they count as 50, then it is a bottleneck. Which is the case?

The motherboard manufacturer (Intel) 'strongly recommends' the use of ECC RAM when using that motherboard in servers. The RAM in the server is not ECC, only registered. Can it result in crashes?

leifton · 04-04-2003, 05:49 AM

I believe it does need a debug kernel. It is usually good practice to keep a debug kernel image around that is exactly the same as your normal one except with debugging and logging turned on, for use in a situation like this. That way, a simple reboot still supplies your services (maybe slightly slower) and gives you much better diagnostic abilities.

I believe there is a program called memprof (http://www.gnome.org/projects/memprof/)
that can help you, but it may be just for GNOME. There are other programs like it that can report the number of malloc, calloc, realloc and free calls made. The counts can be given for one program or for everything running and the like. If you are not much of a programmer and may not understand how this would show a leak, there is sure to be documentation with them to show certain usages.
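A cruder alternative that needs no special tool is to sample the resident memory of every process a few times and see which ones only ever grow. Here is a minimal Python sketch of that idea. It assumes you collect the samples yourself as text with three columns (PID, RSS in kilobytes, command) under a header line, which is what something like `ps axo pid,rss,command` prints; the function names and the 1024K threshold are made up for illustration.

```python
def parse_rss(ps_output):
    """Parse 'PID RSS COMMAND' rows into {pid: (rss_kb, command)}."""
    table = {}
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[0].isdigit() and fields[1].isdigit():
            table[int(fields[0])] = (int(fields[1]), fields[2])
    return table


def steadily_growing(samples, min_growth_kb=1024):
    """Return (pid, command, growth_kb) for processes that never shrank.

    `samples` is a list of parse_rss() results, oldest first. Only
    processes present in every sample are considered.
    """
    suspects = []
    if len(samples) < 2:
        return suspects
    for pid, (_, command) in samples[-1].items():
        history = [s[pid][0] for s in samples if pid in s]
        if len(history) != len(samples):
            continue  # not alive for the whole window
        never_shrinks = all(b >= a for a, b in zip(history, history[1:]))
        growth = history[-1] - history[0]
        if never_shrinks and growth >= min_growth_kb:
            suspects.append((pid, command, growth))
    return sorted(suspects, key=lambda item: item[2], reverse=True)
```

Take a sample every hour or so for a day and pass the parsed list to steadily_growing(). Anything it returns is only a suspect: caches (Squid, for example) are supposed to grow for a while, so look for growth that never levels off.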

Sounds like at least the debugging and logging kernel should be used...