System Crashes
Usually my system runs a large variety of processes, but lately, after about
12-24 hours of uptime, it starts thrashing hard, and before long the system is too slow to log in: the password prompt times out. I run the 2.4.19 kernel and the latest stable versions of my software. The machine is a 233MHz Pentium with 200 megs of memory and usually about 475 megabytes of swap (which I turned off below so I could demonstrate the issues I was having, but yes, plenty of swap was available at the time of these incidents). Memory usage throughout the events described below showed all but about 5 megs of the RAM in use, plus about 20 megs of swap. Last night the system started its usual lockup and was showing unusually high load averages. I was compiling a kernel when this happened; I cancelled the compile and checked the load average.
Code:
2:31am up 12:25, 1 user, load average: 3.44, 3.49, 3.28
Code:
2:45am up 12:39, 1 user, load average: 3.06, 4.99, 4.7
While I was shutting things down, the one-minute load average spiked to 4.99. So finally I shut down most of my other services: nfsd, qmail, the cron daemon, sysklogd and inetd (which was running a CVS pserver).
Code:
2:57am up 12:50, 1 user, load average: 0.58, 1.55, 3.00
Still too high for a system that isn't doing anything except handling a single sshd session. After about half an hour of not touching the system, the load average finally settled somewhere around 0.08. So then I looked at the current memory usage, which was not very different from before, except that swap use was down to 1 megabyte. I decided to see if I could get any answers by trying to break the system in a controlled fashion. I swapped off the swap space, which took about 15 seconds. Immediately my ssh session died. I went to the console and saw an out of memory error. I decided to continue screwing around with it.
Code:
total       used       free     shared    buffers     cached
I decided to see if I could soak up whatever memory was left by writing to a file.
Code:
mount -t ramfs /dev/ram0 /mnt
Then I ran dd against the ramfs mount, and dd, bash and my login were all killed. Ok, cool: I now had all my memory used up, so I let it sit there until morning to see if the system would recover any further. Sleep ...... Today about noon-thirty I went over to the console, and there was no recovery; if anything, more memory was being used up, because agetty was dying with an out of memory error and then respawning. Any help anyone could offer is much appreciated, because I have no idea what to do next. Sorry for the long post, but I wanted to provide as much information as I could.
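In case anyone wants to reproduce it, the break-it-on-purpose sequence was roughly this (reconstructed from memory, not a transcript; the dd arguments and the /mnt/fill filename are just what I would use):
Code:
swapoff -a                              # drop all swap; RAM is now all there is
free                                    # see what is left
mount -t ramfs /dev/ram0 /mnt           # ramfs has no fixed size cap on 2.4
dd if=/dev/zero of=/mnt/fill bs=1024k   # fill the ramfs until processes start dying
|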
I'm guessing here, but it sounds like a memory leak. There was a thread about memory leaks a while back, if I remember correctly; try to find it (search the board), and meanwhile I'm sure some of the more knowledgeable people will help you out more.
Also, you might want to get a program called memtest86 to check your RAM, in case you suspect the RAM sticks are faulty. Sorry I couldn't be of much help... -NSKL |
Personally, I would like to see the output of top with your system normally loaded. Obviously something starts running that eats your RAM, causing your VMM to start thrashing stuff into and out of swap. Or something like logrotate, logcheck and maybe aide are all running at the same time; since they all do disk access, that can really bog down a system with a slower processor.
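One quick way to check whether those jobs pile up at the same time (just a sketch; the paths vary by distro):
Code:
cat /etc/crontab                      # when do cron.daily / cron.weekly fire?
ls /etc/cron.daily /etc/cron.weekly
crontab -l                            # plus any per-user jobs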
Just tossing out some ideas. Your mileage may vary. -mk |
Here is the output of top on a pretty normal day.
Code:
11:30pm up 10:22, 1 user, load average: 0.04, 0.05, 0.01 |
I also want to mention that the following kinds of activity happen in a given day.
- One cron job that runs updatedb.
- Logs are rotated manually, whenever they get to be more than a couple of megabytes in size.
- The email server probably only handles about 100 messages a day.
- The web server handles about 250 requests a day, which is about one request every 6 minutes.
- NFS transfers about 250 megabytes a day, though it can vary quite a bit.
- changedfiles syncs a directory on my system to a directory on another system over ssh sessions: anywhere from 10-15 transfers a day, ranging from 1-15 megabytes.
- I also do a lot of compiling some days, anywhere from 30 minutes to 3 hours.
Let me know if any other information is required. I plan on running memory tests tomorrow to see if that shows anything useful.
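If it would help, I could also log memory over time with something crude like this in /etc/crontab (the log path is arbitrary):
Code:
# append a timestamped memory/load snapshot every 5 minutes
*/5 * * * * root (date; free; uptime) >> /var/log/memwatch.log
|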
The output of top you posted, is that when the system is healthy or thrashing? -mk
|
The system was healthy when I grabbed the output of top.
As a side note, I ran memtest86 today and no errors were detected. |
I ran a memory testing script earlier that is supposed to run massive diffs in parallel to exercise memory management. I don't know how accurate it is supposed to be, but the system went into 'super-thrash mode' before the test could complete.
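I don't have the exact script in front of me, but it was along these lines (my reconstruction; the file size and the number of parallel loops are guesses):
Code:
#!/bin/sh
# stress the VM by copying and diffing a big file in several parallel loops
dd if=/dev/urandom of=/tmp/ref bs=1024k count=32
for i in 1 2 3 4; do
    ( while cp /tmp/ref /tmp/copy.$i && diff /tmp/ref /tmp/copy.$i; do :; done ) &
done
wait    # runs until a diff fails or you interrupt it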
The odd thing is that I had atop running every 60 seconds and logging to a file. It showed that only 15 megs of swap were in use when the system crashed and burned. Now, I know a 233 is nothing great as processors go, but I can't understand why the system can't handle paging 15 megs' worth of swap. If a memory leak really is the problem, how would I know? Would atop or top show the rogue process using more memory than it should, or does the memory just disappear off the face of the earth? I am gonna keep plugging at the problem, so let me know if you have any ideas on what I could or should do.
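For reference, the atop logging was set up roughly like this (flags from memory, so double-check against your version's man page):
Code:
atop -w /var/log/atop.raw 60 &    # write a raw snapshot every 60 seconds
atop -r /var/log/atop.raw         # replay the log afterwards
|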
When your system went into Super Thrash mode, using top, what were the top 3 CPU-intensive programs running? -mk
|
tar, gzip and diff.
If you think it would help, I can run some kind of test, log the heck out of it and post the results. It's not like I am too worried about crashing it at this point. |
If you would, grab the first 12 lines of output from top when the system is thrashing, and post them here. -mk
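If the box is too far gone for an interactive top, something like this from a shell you opened in advance should still capture it:
Code:
top -b -n 1 | head -12 >> /tmp/thrash-top.log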
|
9.31 Load Average.....
After about two days of running my system, a single web page hit did this ....
The load average was pretty low before I tried to load a web page, though even then there was a lot of thrashing going on and plenty of spikes in the load average.
Code:
7:04am up 1 day, 14:01, 1 user, load average: 9.31, 4.19, 1.73
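Next time I'll try to catch one of these spikes with vmstat running in a spare terminal (assuming the box stays responsive enough to start it); the si/so columns should show whether it really is swap I/O:
Code:
vmstat 1 60    # one-second samples for a minute; watch the si/so columns
|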
Top output shows an accumulation of things; what you want is a detailed overview of what goes on over time. Try running Atsar or Sysstat (see Freshmeat) with a low interval, and process the logs daily. Also review your system limits and your /proc/sys/vm settings: limits can do all sorts of mucking about, from denying logins to crashing X11. Proper (for your situation, that is) bdflush/kswapd values may cost you some performance but give less bursty I/O, which could be useful on an already I/O-bound box.
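As a starting point, something along these lines (sar syntax differs a little between sysstat versions, and the bdflush fields are 2.4-specific, so read Documentation/sysctl/vm.txt before changing anything):
Code:
sar -r 60 60             # memory/swap stats, one sample a minute for an hour
ls /proc/sys/vm/         # see which VM knobs this kernel exposes
cat /proc/sys/vm/bdflush # current buffer-flush tuning
ulimit -a                # per-process limits for this shell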
|