LinuxQuestions.org - System Crashes

- Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)

- - System Crashes (https://www.linuxquestions.org/questions/linux-general-1/system-crashes-32254/)

Usually my system runs a large variety of processes, but lately after about
12-24 hours of usage, it starts thrashing real hard and before you know it
the system is too slow to log in without having a timeout happen on the password
prompt. I run the 2.4.19 kernel and the latest stable versions of software.

The system is a 233MHz pentium based processor with 200 megs of memory in my system
and usually have about 475 megabytes of swap (which I turned off below so I could
demonstrate the issues I was having, but yes, lots of swap was available at the
time of these imcidents.)

The memory usgae thoughout the next part of this showed that all ecept about 5
megs of the RAM was being used, and 20 megs of swap was being utilized too.

Last night the system started it's usual lockup and was showing unusally high load
averages. When this happened I was compiling a kernel. I cancelled the compiling
and checked the load average.

Code:

2:31am up 12:25, 1 user, load average: 3.44, 3.49, 3.28

So I shut down apache and mysql ...

Code:

2:45am up 12:39, 1 user, load average: 3.06, 4.99, 4.7

which showed results that seemed pretty normal to me. except that while I was
shutting stuff down the load averge in the 1 minute colum spiked up 4.99.

So finally I shut down most of my other services, nfsd, qmail, cron daemon,
sysklogd and inetd (which was running a CVS pserver).

Code:

2:57am up 12:50, 1 user, load average: 0.58, 1.55, 3.00

Finally things seem like they are starting to improve, but 0.58 is still way
to high for a system that isn't doing anything except handle a single sshd session.

So finally after about a half hour of not touching the system, the load average hit
somewhere in the range of 0.08.

So now I start looking at the current memory usage (which is not very different from what
it was before. except that the swap was down to 1 megabyte. So I decided to see if I could
get any answers by trying to break the system in a controlled fashion.

I then swapped of the swap sspace which took about 15 seconds to do. Immediately me ssh
session died. So I go to the console and see an out of memory error. I decide to continue
screwing around with it.

Code:

            total      used      free    shared    buffers    cached

Mem:        192676    188324      4352          0        356      1060

-/+ buffers/cache:    186908      5768

Swap:            0          0          0





  procs                      memory      swap          io    system      cpu

 r  b  w  swpd  free  buff  cache  si  so    bi    bo  in    cs us sy id

 0  0  0      0  4344    336  1076  62  13  160  103  179  370  9  2 89





  3:53am  up 13:47,  3 users,  load average: 0.00, 0.00, 0.11

14 processes: 13 sleeping, 1 running, 0 zombie, 0 stopped

CPU states:  8.9% user,  2.3% system,  0.0% nice,  2.2% idle

Mem:  192676K av,  188788K used,    3888K free,      0K shrd,    348K buff

        1504K Active,              2428K Inactive

Swap:      0K av,      0K used,      0K free                    1412K cached



  PID USER    PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND

 6826 root      19  0  868  864  680 R    0.8  0.4  0:00 top -c -b -n 1

    1 root      8  0  136  132    52 S    0.0  0.0  0:06 init [3]

    2 root      9  0    0    0    0 SW    0.0  0.0  0:00 keventd

    3 root      19  19    0    0    0 SWN  0.0  0.0  0:00 ksoftirqd_CPU0

    4 root      11  0    0    0    0 SW    0.0  0.0  1:00 kswapd

    5 root      9  0    0    0    0 SW    0.0  0.0  0:00 bdflush

    6 root      10  0    0    0    0 SW    0.0  0.0  0:04 kupdated

    7 root      9  0    0    0    0 SW    0.0  0.0  0:00 kreiserfsd

21159 root      9  0    72  72    0 S    0.0  0.0  0:00 dhcpcd eth0

22673 root      9  0  304  304  180 S    0.0  0.1  0:21 /usr/sbin/sshd

19398 root      9  0  712  712    52 S    0.0  0.3  0:00 -bash

16918 root      9  0    68  68    0 S    0.0  0.0  0:00 /sbin/agetty tty1 9600

  336 root      13  0  516  516  336 S    0.0  0.2  0:00 /usr/sbin/sshd

18441 root      16  0  996  992  344 S    0.0  0.5  0:00 -bash^M

Lets crash it...
I decided to see if I could allocate any memory that was left into a file.

Code:

mount -t ramfs /dev/ram0 /mnt

cd /mnt

dd if=/dev/zero of=/mnt/memuseup bs=512k count=20

This should have created a 10 megabyte file in memory, but obviously it didn't finish
and dd, bash and my login was killed.

Ok cool, I now have all my memory used up, so I let it sit there until morning to see
if the system would recover any more.

Sleep ......

So today about noon-thirty, I went over to the console and there was no recovery, there
was more memory being used up, because agetty was dying with an out of memory error and
then respawning

Any help any one could offer is much appreciated because I have no idea what to do next.
Sorry for the long post but I wanted to provide you with as much information as I could.

Im guessing here, but it sounds like a memory leak, there was a thread about ,memory leaks a while back if i remember correctly, Try to find it (search the board) and meanwhile im sure some of the more knowledgable people will help you out more.
Also you might want to get a program called memtest86 to ckeck your RAM in case you suspect the RAM sticks are corrupted.
Sorry i couldnt be of much help...
-NSKL

Personally. I would like to see the output of top with your system normally loaded. Obviously something starts running that eats your ram causing your VMM to start thrashing stuff into and out of swap. Or, something like logrotate, logcheck and maybe aide are running at the same time. Since they are all doing disk access, this can really bog down a system with a slower processor.

Just tossing out some ideas. Your mileage may vary. -mk

Here is the output of top on a pretty normal day.

Code:

 11:30pm  up 10:22,  1 user,  load average: 0.04, 0.05, 0.01

66 processes: 65 sleeping, 1 running, 0 zombie, 0 stopped

CPU states:  0.0% user,  0.0% system,  0.0% nice, 99.8% idle

Mem:  192676K av,  99852K used,  92824K free,      0K shrd,  35600K buff

        10496K Active,              80932K Inactive

Swap:  465876K av,      0K used,  465876K free                  45644K cached



  PID USER    PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND

26884 root      17  0  892  888  680 R    3.8  0.4  0:00 top -c -b -n 1

    1 root      9  0  528  524  452 S    0.0  0.2  0:03 init [3]

    2 root      9  0    0    0    0 SW    0.0  0.0  0:00 keventd

    3 root      19  19    0    0    0 SWN  0.0  0.0  0:00 ksoftirqd_CPU0

    4 root      9  0    0    0    0 SW    0.0  0.0  0:00 kswapd

    5 root      9  0    0    0    0 SW    0.0  0.0  0:00 bdflush

    6 root      9  0    0    0    0 SW    0.0  0.0  0:00 kupdated

    7 root      9  0    0    0    0 SW    0.0  0.0  0:00 kreiserfsd

 8338 root      9  0  676  676  556 S    0.0  0.3  0:00 /usr/sbin/syslogd -m 0

29005 root      9  0  548  548  404 S    0.0  0.2  0:00 /usr/sbin/klogd

22666 root      9  0  472  472  400 S    0.0  0.2  0:00 dhcpcd eth0

 2257 bin        9  0  524  524  440 S    0.0  0.2  0:00 /sbin/portmap

 9133 root      9  0  820  820  672 S    0.0  0.4  0:00 /usr/sbin/rpc.mountd

 4048 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd

13697 root      9  0    0    0    0 SW    0.0  0.0  0:00 lockd

10311 root      9  0    0    0    0 SW    0.0  0.0  0:00 rpciod

12097 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd

10432 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd

18465 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd

30406 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd

 7239 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd

19878 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd

14648 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd

20219 root      9  0  772  772  648 S    0.0  0.4  0:00 /usr/sbin/rpc.statd

 6867 root      9  0  464  464  376 S    0.0  0.2  0:00 /usr/sbin/rpc.rquotad

 8155 root      8  0  564  564  472 S    0.0  0.2  0:00 /usr/sbin/madcron

 2314 root      9  0  1300 1300  1180 S    0.0  0.6  0:05 /usr/sbin/sshd

 8705 root      9  0  1124 1124  912 S    0.0  0.5  0:00 /bin/sh /usr/bin/safe_mysqld --datadir=/var/mysql --pid-file=/var/mysql/server1.pid

27978 mysql      9  0  4340 4340  1584 S    0.0  2.2  0:00 /usr/bin/mysqld --basedir=/usr --datadir=/var/mysql --user=mysql --pid-file=/var/mysql/server1.

31594 fetchmai  9  0  936  936  768 S    0.0  0.4  0:00 fetchmail --daemon 300 --syslog -f /etc/fetchmailrc

 6500 root      9  0  496  496  424 S    0.0  0.2  0:00 inetd /etc/inetd.conf

 9242 root      9  0  532  532  464 S    0.0  0.2  0:00 /sbin/agetty tty1 9600

 5488 root      9  0  532  532  464 S    0.0  0.2  0:00 /sbin/agetty tty2 9600

 1044 mysql      9  0  4340 4340  1584 S    0.0  2.2  0:00 /usr/bin/mysqld --basedir=/usr --datadir=/var/mysql --user=mysql --pid-file=/var/mysql/server1.

  478 mysql      9  0  4340 4340  1584 S    0.0  2.2  0:00 /usr/bin/mysqld --basedir=/usr --datadir=/var/mysql --user=mysql --pid-file=/var/mysql/server1.

 7350 mysql      9  0  4340 4340  1584 S    0.0  2.2  0:00 /usr/bin/mysqld --basedir=/usr --datadir=/var/mysql --user=mysql --pid-file=/var/mysql/server1.

 2868 root      8  0  4520 4520  4332 S    0.0  2.3  0:00 /usr/bin/httpd -DSSL

 3446 nobody    9  0  4636 4636  4400 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL

28014 nobody    9  0  4648 4648  4404 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL

25743 nobody    9  0  4636 4636  4400 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL

25613 nobody    9  0  4636 4636  4400 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL

17479 nobody    9  0  4636 4636  4400 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL

28563 root      13  0  1764 1764  1576 S    0.0  0.9  0:00 /usr/sbin/sshd

19854 root      11  0  1720 1716  1084 S    0.0  0.8  0:00 -bash

13844 root      9  0  1116 1116  912 S    0.0  0.5  0:00 /bin/sh /command/svscanboot

 3109 root      9  0  336  332  260 S    0.0  0.1  0:00 svscan /service

32534 root      9  0  260  260  208 S    0.0  0.1  0:00 readproctitle service errors: .................................................................

 6579 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise qmail-pop3d

26589 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise log

 1609 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise qmail-smtpd

 2060 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise log

31978 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise qmail-send

13374 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise log

27644 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise qmail-pop3ds

31614 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise log

11147 vpopmail  9  0  332  328  272 S    0.0  0.1  0:00 tcpserver -l 0 -R -H -v -u220 -g220 0 110 qmail-popup server1.ecolinux2.servebeer.com /home/vpo

31489 vpopmail  8  0  476  476  400 S    0.0  0.2  0:00 tcpserver -H -R -l 0 -x /home/vpopmail/etc/tcp.smtp.cdb -c 20 -u 220 -g 220 0 smtp qmail-smtpd

13620 qmails    9  0  384  380  292 S    0.0  0.1  0:00 qmail-send

 8421 qmaill    9  0  296  292  236 S    0.0  0.1  0:00 multilog t /var/log/qmail/pop3d

11589 qmaill    9  0  292  288  232 S    0.0  0.1  0:00 multilog t /var/log/qmail/smtpd

 6917 qmaill    9  0  296  292  236 S    0.0  0.1  0:00 multilog t /var/log/qmail

21340 vpopmail  9  0  332  328  272 S    0.0  0.1  0:00 tcpserver -l 0 -R -H -v -u220 -g220 0 995 stunnel -f -p /var/qmail/control/servercert.pem -l qm

18790 qmaill    9  0  296  292  236 S    0.0  0.1  0:00 multilog t /var/log/qmail/pop3ds

 8100 root      9  0  308  304  240 S    0.0  0.1  0:00 qmail-lspawn ./Maildir/

 3326 qmailr    8  0  340  336  268 S    0.0  0.1  0:00 qmail-rspawn

 7095 qmailq    9  0  328  324  260 S    0.0  0.1  0:00 qmail-clean

11633 root      11  0  948  948  756 S    0.0  0.4  0:00 changedfiles -c /etc/sync.conf

20422 root      11  0  948  948  756 S    0.0  0.4  0:00 changedfiles -c /etc/sync.conf

25132 root      11  0  948  948  756 S    0.0  0.4  0:00 changedfiles -c /etc/sync.conf

 6168 root      11  0  948  948  756 S    0.0  0.4  0:00 changedfiles -c /etc/sync.conf

I also want to mention that the following kinds of activity happen in a given day.

1 cron job that runs updatedb
logs are rotated manually, whenever they get to be more than a couple megabytes in size.

The email server probably only handles about 100 messages a day

The web server handles about 250 requests a day. which is about one request every 6 minutes.

NFS transfers about 250 megabytes a day, but it can vary quite a bit.

changedfiles syncs a directory on my system to a directory on another system using ssh sessions. Anywhere from 10-15 transfers a day and varies from 1-15 megabytes

I also do alot of compiling sometimes, anywhere from 30 minutes to 3 hours on a given day.

Let me know if any other information is required. I plan on running memory tests tomorow to see if that shows anything useful.

The output of top you posted, is that when the system is healthy or thrashing? -mk

The system was healthy when I grabbed the output of top.

As a side note, I ran memtest86 today and no errrors were detected.

I ran a memory testing script earlier that is supposed to run massive diffs in parallel to test memory management. I don't know how acurate this is supposed to be, but I ended up going into 'super-thrash mode' before the test could complete.

The odd thing is that I had atop running every 60 seconds and logging to a file. It showed that only 15 megs of swap was being used when it crashed and burned. Now I know a 233 processor is nothing great, but I can't understand why the system can't handle paging 15 megs worth of swap space.

If a memory leak is indeed what the problem is, how would I know. Would atop or top show the rogue process using more memory than it should or does memory just disappear off the face of the earth.

I am gonna keep plugging at the problem, so let me know if you have any ideas on what I could or should do.

When your system went iinto Super Thrash mode, using top, what were the top 3 cpu intensive programs running? -mk

tar, gzip and diff.

If you think it would help, I can run somekind of test, log the heck out of it and post the results. It's not like I am too worried about crashing it at this point.

If you would, grab the first 12 lines of output ftom top, when the system is thrashing, and post them here. -mk

9.31 Load Average.....

After about two days of running my system, a single web page hit did this ....

The load average was pretty low before I tried to load a web page. Before I loaded the page even though the load average was down, their was a lot of thrashing going on and lots of spikes to the load average.

Code:

  7:04am  up 1 day, 14:01,  1 user,  load average: 9.31, 4.19, 1.73

77 processes: 76 sleeping, 1 running, 0 zombie, 0 stopped

CPU states:  0.2% user,  2.3% system,  0.0% nice, 97.4% idle

Mem:  192748K av,  188300K used,    4448K free,      0K shrd,    252K buff

        1136K Active,              1336K Inactive

Swap:  465876K av,  16868K used,  449008K free                    964K cached



  PID USER    PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND

    4 root      14  0    0    0    0 SW    0.8  0.0  0:13 kswapd

32238 root      14  0  364  252  252 R    0.2  0.1  0:21 top

21843 nobody    11  0  2776  252  252 D    0.1  0.1  0:04 httpd

32290 nobody    10  0  2160  300  284 D    0.1  0.1  0:00 httpd

32292 nobody    10  0  2136  256  244 D    0.1  0.1  0:00 httpd

    1 root      9  0    92  24    24 S    0.0  0.0  0:03 init

    2 root      9  0    0    0    0 SW    0.0  0.0  0:00 keventd

    3 root      19  19    0    0    0 SWN  0.0  0.0  0:00 ksoftirqd_CPU0

    5 root      9  0    0    0    0 SW    0.0  0.0  0:00 bdflush

    6 root      9  0    0    0    0 DW    0.0  0.0  0:06 kupdated

    7 root      9  0    0    0    0 SW    0.0  0.0  0:00 kreiserfsd

  47 root      9  0  224  20    20 S    0.0  0.0  0:00 svscanboot

  55 root      9  0  128  96    96 D    0.0  0.0  0:00 svscan

  56 root      9  0    56    4    4 S    0.0  0.0  0:00 readproctitle

  59 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise

  60 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise

  61 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise

  62 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise

  63 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise

  64 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise

  65 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise

  66 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise

  68 vpopmail  8  0    96  20    20 S    0.0  0.0  0:00 tcpserver

  69 qmaill    9  0    64    8    8 S    0.0  0.0  0:00 multilog

  71 qmaill    9  0    60    4    4 S    0.0  0.0  0:00 multilog

  72 vpopmail  9  0    68  12    12 S    0.0  0.0  0:00 tcpserver

  73 qmails    9  0    96  16    16 S    0.0  0.0  0:01 qmail-send

  74 qmaill    9  0    64    8    8 S    0.0  0.0  0:00 multilog

  75 vpopmail  9  0    68  12    12 S    0.0  0.0  0:00 tcpserver

  81 qmaill    9  0    64    8    8 S    0.0  0.0  0:00 multilog

  89 root      8  0    76  12    12 S    0.0  0.0  0:00 qmail-lspawn

  90 qmailr    9  0    84  16    16 S    0.0  0.0  0:00 qmail-rspawn

Top output shows an accumulation of things, what you want to see is a detailed overview what goes on. Try running Atsar or Sysstat (see Freshmeat) with a low interval, and process the logs daily. Also review your system limits and your /proc/sys/vm settings, limits can do all sorts of mucking from denying logins to crashing X11. Proper (for your situation that is) bdflush/kswapd values may result in some performance downgrading but less bursting I/O which could be usefull on an already I/O bound box.