LinuxQuestions.org
Old 03-31-2003, 10:29 PM   #1
J_Szucs
Senior Member
 
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126

Rep: Reputation: 58
Trace down system crashes?


Our main FreeBSD server, which runs many services, crashes (freezes) too often, sometimes twice a day.
I have no clue what causes the crashes, as normally I do not find any error messages in /var/log/messages or in root's mailbox.
Memory and processor usage seem to be moderate in normal operation (however, I do not know what the situation is immediately before the crashes).
Are there any known issues that can result in a system crash on FreeBSD?
Could you give me some hints on how to find the reason for the crashes?
For example, is there a way to continuously monitor the server (running or newly started processes, system resources) and/or produce verbose logs so that the reason for the crashes can be found?
What do you do in such cases?
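
One possible way to get the continuous record asked about above is a small watcher that appends a timestamped process and memory snapshot to a file at a fixed interval and asks the OS to flush each write to disk, so that the last entry before a freeze has a chance of surviving the reboot. This is only a minimal Python sketch, not a tested monitoring tool: the function names are made up for illustration, and the log path, interval and command list are assumptions to adapt to the system.

```python
import os
import subprocess
import time


def append_durably(path, text):
    """Append text to path and ask the OS to flush it to disk."""
    with open(path, "a") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())


def snapshot(commands):
    """Run each command and return a timestamped block with their output."""
    parts = ["=== " + time.strftime("%Y-%m-%d %H:%M:%S") + " ===\n"]
    for cmd in commands:
        try:
            out = subprocess.run(
                cmd, capture_output=True, text=True, timeout=20
            ).stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing or hanging tool should not stop the watcher.
            out = "failed: %s\n" % exc
        parts.append("$ " + " ".join(cmd) + "\n" + out)
    return "".join(parts)


def monitor(path, commands, interval_s, iterations=None):
    """Append one snapshot every interval_s seconds.

    iterations=None means run until interrupted.
    """
    done = 0
    while iterations is None or done < iterations:
        append_durably(path, snapshot(commands))
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval_s)
```

As a usage example (the path and commands are assumptions), running `monitor("/var/log/crashwatch.log", [["uptime"], ["ps", "-axo", "pid,rss,vsz,command"], ["vmstat"]], 60)` as root would record one snapshot a minute; after a crash, the tail of that file shows what was running shortly beforehand.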

Last edited by J_Szucs; 04-01-2003 at 12:53 AM.
 
Old 04-01-2003, 01:38 AM   #2
Quintesse
LQ Newbie
 
Registered: Feb 2003
Posts: 6

Rep: Reputation: 0
I have the same problem sporadically on my RH8 system and would like some help as well in figuring out what is going wrong. But I haven't got a clue where to start looking (it never happens when I'm at the keyboard).
 
Old 04-01-2003, 02:47 AM   #3
Blackknight
Member
 
Registered: Apr 2002
Location: Rouen, France
Distribution: Slackware, FreeBSD
Posts: 34

Rep: Reputation: 15
Hi,
For FreeBSD (sorry, I don't have RH), if the system crashes with a kernel panic, you should look at keeping the trace by saving the core generated by the kernel. For that, you should read these two articles by Michael Lucas:
http://www.onlamp.com/pub/a/bsd/2002...y_Daemons.html
http://www.onlamp.com/pub/a/bsd/2002...y_Daemons.html
If your system doesn't panic, that's another problem (in fact, it could be many).
 
Old 04-01-2003, 04:45 AM   #4
Quintesse
LQ Newbie
 
Registered: Feb 2003
Posts: 6

Rep: Reputation: 0
Well yes, but that's the point: the problem that I experience (and J_Szucs at times as well, it seems) is that the computer freezes (no SSH access either) and the only thing left to do is reboot.
It DOES react to the Ctrl+alt+del combination so SOMETHING is still alive it seems.
 
Old 04-01-2003, 06:07 AM   #5
leifton
LQ Newbie
 
Registered: Dec 2002
Location: Cincinnati, Ohio, USA
Distribution: RedHat 7.3, FreeBSD 4.4
Posts: 15

Rep: Reputation: 0
Quintesse, what do you mean by "It DOES react to the Ctrl+alt+del"? How does it react? Does it shut down? Does it go back to something? Are you running X Windows or some other windowing system? J_Szucs, what services are you running (you said many) and what is their average load? What are your kernel and BSD versions? Are there other users on the system at the same time?

Let's start with that.
 
Old 04-01-2003, 06:39 AM   #6
Quintesse
LQ Newbie
 
Registered: Feb 2003
Posts: 6

Rep: Reputation: 0
Sorry for not having been specific enough

First of all, when I find the system in its frozen state it is always with a blank/black screen. It does not react to any input from keyboard (except for ctrl+alt+del) or mouse. Trying to access any of the servers running on the system from the LAN fails (the computer can't be found at all).

CTRL+ALT+DEL: you can hear the harddrive start to whirr and after several minutes the system will automatically reboot.

Distro: RedHat 8
Kernel: 2.4.18-24.8.0
Desktop: KDE 3.1.1
Services: mostly standard RH8 services but include at least SSH, Samba, DHCP, DNS, iptables. No web, no ftp.
Load: unknown because I'm never there when it happens, but I would expect it to be very low because that is what the average system load is.
Users: 0 (hopefully :-)

Is that enough info?
 
Old 04-01-2003, 06:47 AM   #7
leifton
LQ Newbie
 
Registered: Dec 2002
Location: Cincinnati, Ohio, USA
Distribution: RedHat 7.3, FreeBSD 4.4
Posts: 15

Rep: Reputation: 0
Try to first look in your /var/log/messages (latest file, sometimes there are several). Now make note that the last set of messages will be from your startup, so find the last ones before those. Check other log files related to your services in the /var/log dir also.

If not there, then sometimes they have their own log files in other directories.

My thoughts are that something should be in the messages file, but maybe not...let's start there. Also check whether there is a core dump in root's $HOME or in the $HOME of the user who was logged on at the actual machine (not remotely) at the time of the crash.
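
The "find the last messages before the startup ones" step can be sketched roughly as follows. This is a minimal Python illustration, not a finished tool: the function names are hypothetical, and the boot marker is an assumption that has to be checked against your own log (on FreeBSD the kernel's boot output usually begins with a `Copyright (c) 1992-` line, and on Red Hat syslogd logs a `restart` line, but verify what your system actually writes).

```python
def read_lines(path):
    """Read a log file into a list of lines, tolerating odd bytes."""
    with open(path, errors="replace") as f:
        return [line.rstrip("\n") for line in f]


def lines_before_last_boot(lines, boot_marker, count=20):
    """Return up to `count` lines logged just before the most recent boot.

    boot_marker is a substring assumed to appear once per boot, on the
    first line the system logs when it starts up.
    """
    hits = [i for i, line in enumerate(lines) if boot_marker in line]
    if not hits:
        # No boot found in this file: fall back to the end of the log.
        return lines[-count:]
    last_boot = hits[-1]
    return lines[max(0, last_boot - count):last_boot]
```

For example, `lines_before_last_boot(read_lines("/var/log/messages"), "Copyright (c) 1992-")` would show the last 20 entries written before the machine came back up. If that window is empty or shows only routine entries, the freeze left no trace in this file.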

Do you have the latest RH rpm's for your services (# up2date)?
 
Old 04-01-2003, 08:40 AM   #8
Quintesse
LQ Newbie
 
Registered: Feb 2003
Posts: 6

Rep: Reputation: 0
I'll take a good look the next time it happens, but in J_Szucs' case there aren't any messages, so what would you do in such a case?

And where would I look for cores if nobody was logged in? (no $HOME)

Nah, don't use up2date, hate it. I use apt-get for my system but I can assure you that the FreshRPMs that I use are up-to-date as well.

I do have one thing that I consider suspect: the fact that it always happens when the computer is left alone for a longish time might suggest that it is either a screensaver or a power saving function that causes the problem. I might try turning it off, but unfortunately the problem is rather sporadic for me (not twice a day like J_Szucs). In reality I don't even care that much, it's just that it would be cool to have uptimes running in the months (years??)
 
Old 04-01-2003, 09:56 AM   #9
leifton
LQ Newbie
 
Registered: Dec 2002
Location: Cincinnati, Ohio, USA
Distribution: RedHat 7.3, FreeBSD 4.4
Posts: 15

Rep: Reputation: 0
I usually start with /var/log/messages, unless I know of another log file more specific to what is causing the problems.

After that I would do some diagnostics:
Take all of his services down and run only one at a time for a day (a day seems to fit his usual crash frequency) until he has determined whether any one of them is causing the problem.

If none of them is causing a problem, I would boot the machine with no unnecessary services running, I MEAN CORE SERVICES ONLY. Leave the machine running and see if it fails then. If it does, that would lead me to suspect either a bad kernel or bad hardware.

There are many other logging services, core dump finders and readers and diagnostic tools available throughout your open source sites on the Internet.

As problems or more information turn up at each of the steps above, I may or may not continue to test the other things. For example, if a service is found to have a problem, still test the others to make sure there are not two causing problems together, and so on...

There really are so many other things to check...which ultimately is why Linux is so awesome...

I would have had a year with my Linux box running at home if I didn't have to move (8 months)
 
Old 04-02-2003, 06:19 AM   #10
leifton
LQ Newbie
 
Registered: Dec 2002
Location: Cincinnati, Ohio, USA
Distribution: RedHat 7.3, FreeBSD 4.4
Posts: 15

Rep: Reputation: 0
J_Szucs,

Are you still pursuing help through this thread?
 
Old 04-03-2003, 09:27 AM   #11
J_Szucs
Senior Member
 
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126

Original Poster
Rep: Reputation: 58
Yeah,

Here are some details:
FreeBSD-4.4-STABLE
Intel 400MHz, 256M Kingston SD 100 Registered RAM
2 SCSI and one IDE HDD
Services:
Ipfw, NAT (for 64k internet connection), Squid, Httpd, Smbd, Nmbd, Named, Sendmail, Procmail, Spamassassin, Anomy sanitizer, cron, Rsync, SSH.
Regular, extensive backups into tar.gz and zip each night.
There are some 50 users.
I also plan to fire up PHP and PostgreSQL on this machine.

A snapshot by top (not at peak load time):
load averages: 0.06, 0.03, 0.02
44 processes: 1 running, 43 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 39M Active, 149M Inact, 38M Wired, 14M Cache, 35M Buf, 8976K Free
Swap: 300M Total, 13M Used, 278M Free, 4% Inuse

The amount of free memory (8976K above) is usually about 200K - 2M at peak load time (when most of the clients are active), but there is still some 100M of Inact memory at those times. Swap usage is usually about 200K (even immediately after a restart); the 13M indicated above is unusual.

Processor usage is normally low: 0 - 2 %. I experienced the highest processor usage with gzip, it was about 85%.

I read about savecore, and I plan to use it (re-configure the server after the next crash).
However, I suspect that I need a debug kernel to use it (this is mentioned somewhere). Is that so, or will it work with my present (non-debug) kernel?

Crashes occur at irregular times; once I had a 45-day uptime, which was followed by a hard week with 5 crashes. The 45-day uptime was exceptional; my average is more like 4 days.

The crashes are different, sometimes I can restart by CTRL+ALT+DEL, but mostly not.
Errors do not seem to be logged in any case. (Only once I had an error message on the screen; it was something about being out of memory)

Last edited by J_Szucs; 04-03-2003 at 09:35 AM.
 
Old 04-03-2003, 11:04 AM   #12
leifton
LQ Newbie
 
Registered: Dec 2002
Location: Cincinnati, Ohio, USA
Distribution: RedHat 7.3, FreeBSD 4.4
Posts: 15

Rep: Reputation: 0
I would highly recommend using savecore. The out of memory error message makes me suspicious though.

Is it possible that one of the services you are running has a memory leak? Did any one service have minimal use during your 45-day uptime, and hard, consistent use during your difficult week? If so, that would likely point to that service having the memory leak. The high swap usage may also point to a memory leak. Was there a patch applied to a service or the kernel that ended your 45-day uptime?
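
One rough way to look for a leaking service is to compare the resident memory of each process between two snapshots taken some hours apart. The Python sketch below assumes the snapshots are the saved output of `ps -axo pid,rss,command` (RSS in kilobytes); the function names are hypothetical, and steady RSS growth is only a hint of a leak, not proof, since caches such as Squid's grow legitimately.

```python
def parse_ps(text):
    """Parse `ps -axo pid,rss,command` output into {pid: (rss_kb, command)}."""
    table = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, rss, command = fields
        if pid.isdigit() and rss.isdigit():
            table[int(pid)] = (int(rss), command)
    return table


def rss_growth(before, after):
    """List (growth_kb, pid, command) for processes in both snapshots.

    Sorted with the largest growth first. A pid that was reused by a
    different command between the snapshots is skipped.
    """
    rows = []
    for pid, (rss_after, command) in after.items():
        if pid in before and before[pid][1] == command:
            rows.append((rss_after - before[pid][0], pid, command))
    rows.sort(reverse=True)
    return rows
```

A process that tops this list in snapshot after snapshot, and never gives memory back, is the first candidate to run alone or to restart on a schedule while investigating.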

Have you read post 9 in this thread? The approach may be painful, but it works consistently for determining the cause of a crashing problem.

Does your kernel have logging turned on? If so, it is verbose, but it may be worth checking. It will log the crap out of anything that is going through the kernel, and it has good state information at all times...
 
Old 04-03-2003, 02:12 PM   #13
J_Szucs
Senior Member
 
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126

Original Poster
Rep: Reputation: 58
I will use savecore, but does it need a debug kernel?

When I had the 45-days uptime, the server did the same work as before and after, I could not find any difference. There were no patches applied or new programs installed lately.

There may be programs leaking memory, but how do I find them?

I have, however, two other ideas:
The FreeBSD kernel: it is not fine-tuned for server use, as it is a default FreeBSD installation. Can that result in crashes? If so, what should I change?
A related question: I saw the kernel is configured for some 20 users. Does it mean that there should not be more than 20 clients connecting to the server simultaneously via e.g. smbd? (I am in doubt, because the smbd processes are owned by root, not by the specific users.) If the 50 smbd processes count as one, then the 20-users kernel option is more than sufficient, if they count as 50, then it is a bottleneck. Which is the case?

The motherboard manufacturer (Intel) 'strongly recommends' the use of ECC RAM when using that motherboard in servers. The RAM in the server is not ECC, only registered. Can that result in crashes?

Last edited by J_Szucs; 04-03-2003 at 02:14 PM.
 
Old 04-04-2003, 05:49 AM   #14
leifton
LQ Newbie
 
Registered: Dec 2002
Location: Cincinnati, Ohio, USA
Distribution: RedHat 7.3, FreeBSD 4.4
Posts: 15

Rep: Reputation: 0
I believe it does. It is usually good practice to keep a debug kernel image around that is identical to your normal one except that debugging and logging are turned on, for use in a situation like this. That way, a simple reboot can still supply your services (maybe slightly slower) and you have much better diagnostic abilities.

I believe there is a program called memprof (http://www.gnome.org/projects/memprof/) that can help you, but it may be for gnome only. There are other programs like it that can report the number of malloc, calloc, realloc and free calls made, either for a single program or for everything running. If you are not much of a programmer and may not understand how this would show a leak, there is sure to be documentation with them to show certain usages.
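
To illustrate how such call counts can point to a leak: every block obtained from malloc or calloc should eventually be passed to free, so a count of outstanding blocks that keeps climbing between samples is suspicious. The Python sketch below is an illustration only, with hypothetical function names and made-up sample numbers; it ignores realloc, which normally resizes a block without changing the number of live blocks.

```python
def outstanding_blocks(counts):
    """Blocks allocated but not yet freed, from cumulative call counts.

    counts maps a call name ("malloc", "calloc", "free") to how many
    times it has been called so far. realloc is deliberately ignored.
    """
    allocated = counts.get("malloc", 0) + counts.get("calloc", 0)
    return allocated - counts.get("free", 0)


def looks_like_leak(samples):
    """True if outstanding blocks rise in every successive sample.

    samples is a list of count dictionaries taken at increasing times.
    Steady growth is a hint, not proof: a program that is still
    building up a cache behaves the same way.
    """
    levels = [outstanding_blocks(s) for s in samples]
    return len(levels) >= 2 and all(b > a for a, b in zip(levels, levels[1:]))
```

With samples such as `{"malloc": 1000, "free": 990}`, then `{"malloc": 2000, "free": 1900}`, then `{"malloc": 3000, "free": 2700}`, the outstanding count goes 10, 100, 300, which is the pattern a leak produces; a healthy long-running service tends to level off instead.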

Sounds like at least the debugging and logging kernel should be used...
 
  


All times are GMT -5.
