LinuxQuestions.org

linuxeco 10-08-2002 01:08 PM

System Crashes
 
Usually my system runs a large variety of processes, but lately, after about
12-24 hours of usage, it starts thrashing really hard, and before you know it
the system is so slow that logins time out at the password prompt. I run the
2.4.19 kernel and the latest stable versions of software.

The system is a 233MHz Pentium-based machine with 200 megs of memory and usually
about 475 megabytes of swap (which I turned off below so I could demonstrate the
issues I was having, but yes, lots of swap was available at the time of these
incidents).

The memory usage throughout the next part of this showed that all except about 5
megs of the RAM was being used, and 20 megs of swap was being used too.

Last night the system started its usual lockup and was showing unusually high load
averages. When this happened I was compiling a kernel. I cancelled the compile
and checked the load average.

Code:

2:31am  up 12:25,  1 user,  load average: 3.44, 3.49, 3.28
So I shut down apache and mysql ...

Code:

2:45am  up 12:39,  1 user,  load average: 3.06, 4.99, 4.7
which showed results that seemed pretty normal to me, except that while I was
shutting stuff down the load average in the 1 minute column spiked up to 4.99.

So finally I shut down most of my other services, nfsd, qmail, cron daemon,
sysklogd and inetd (which was running a CVS pserver).

Code:

2:57am  up 12:50,  1 user,  load average: 0.58, 1.55, 3.00
Finally things seem like they are starting to improve, but 0.58 is still way
too high for a system that isn't doing anything except handling a single sshd session.

So finally after about a half hour of not touching the system, the load average hit
somewhere in the range of 0.08.
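
For reference, the three figures uptime prints are the 1-, 5- and 15-minute load averages, in that order. Below is a minimal Python sketch, assuming only the uptime lines quoted above, that pulls the numbers out and reports whether load is rising or falling. The function names are made up for illustration.

```python
import re

def parse_load(uptime_line):
    """Return the (1-min, 5-min, 15-min) load averages from an uptime line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(x) for x in match.groups())

def trend(loads):
    """Compare the 1-minute figure with the 15-minute one."""
    one, _five, fifteen = loads
    if one > fifteen:
        return "rising"
    if one < fifteen:
        return "falling"
    return "flat"

# The three samples quoted in this post.
samples = [
    "2:31am  up 12:25,  1 user,  load average: 3.44, 3.49, 3.28",
    "2:45am  up 12:39,  1 user,  load average: 3.06, 4.99, 4.7",
    "2:57am  up 12:50,  1 user,  load average: 0.58, 1.55, 3.00",
]
for line in samples:
    loads = parse_load(line)
    print(loads, trend(loads))
```

In the 2:45am sample the 4.99 sits in the 5-minute position, so by the time that line was printed the 1-minute figure had already dropped back to 3.06.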

So now I start looking at the current memory usage (which is not very different from what
it was before, except that the swap was down to 1 megabyte). So I decided to see if I could
get any answers by trying to break the system in a controlled fashion.

I then turned off the swap space, which took about 15 seconds to do. Immediately my ssh
session died. So I went to the console and saw an out of memory error. I decided to continue
screwing around with it.

Code:

            total      used      free    shared    buffers    cached
Mem:        192676    188324      4352          0        356      1060
-/+ buffers/cache:    186908      5768
Swap:            0          0          0


  procs                      memory      swap          io    system      cpu
 r  b  w  swpd  free  buff  cache  si  so    bi    bo  in    cs us sy id
 0  0  0      0  4344    336  1076  62  13  160  103  179  370  9  2 89


  3:53am  up 13:47,  3 users,  load average: 0.00, 0.00, 0.11
14 processes: 13 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  8.9% user,  2.3% system,  0.0% nice,  2.2% idle
Mem:  192676K av,  188788K used,    3888K free,      0K shrd,    348K buff
        1504K Active,              2428K Inactive
Swap:      0K av,      0K used,      0K free                    1412K cached

  PID USER    PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 6826 root      19  0  868  864  680 R    0.8  0.4  0:00 top -c -b -n 1
    1 root      8  0  136  132    52 S    0.0  0.0  0:06 init [3]
    2 root      9  0    0    0    0 SW    0.0  0.0  0:00 keventd
    3 root      19  19    0    0    0 SWN  0.0  0.0  0:00 ksoftirqd_CPU0
    4 root      11  0    0    0    0 SW    0.0  0.0  1:00 kswapd
    5 root      9  0    0    0    0 SW    0.0  0.0  0:00 bdflush
    6 root      10  0    0    0    0 SW    0.0  0.0  0:04 kupdated
    7 root      9  0    0    0    0 SW    0.0  0.0  0:00 kreiserfsd
21159 root      9  0    72  72    0 S    0.0  0.0  0:00 dhcpcd eth0
22673 root      9  0  304  304  180 S    0.0  0.1  0:21 /usr/sbin/sshd
19398 root      9  0  712  712    52 S    0.0  0.3  0:00 -bash
16918 root      9  0    68  68    0 S    0.0  0.0  0:00 /sbin/agetty tty1 9600
  336 root      13  0  516  516  336 S    0.0  0.2  0:00 /usr/sbin/sshd
18441 root      16  0  996  992  344 S    0.0  0.5  0:00 -bash
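
The figures above can be cross-checked with a little arithmetic. The Python sketch below uses only numbers copied from the free and top output in this post (the variable names are illustrative). It reproduces the "-/+ buffers/cache" line that free prints and then subtracts the RSS of every listed process to see how much used memory no process accounts for.

```python
# Figures copied from the free output above, all in kilobytes.
used_kb = 188324
free_kb = 4352
buffers_kb = 356
cached_kb = 1060

# Reproduce the "-/+ buffers/cache" line.
used_minus_cache = used_kb - buffers_kb - cached_kb
free_plus_cache = free_kb + buffers_kb + cached_kb

# RSS column of the 14 processes top lists (kernel threads show 0).
rss_kb = [864, 132, 0, 0, 0, 0, 0, 0, 72, 304, 712, 68, 516, 992]
total_rss_kb = sum(rss_kb)

# Memory that is in use but not attributed to buffers, cache or any process.
unaccounted_kb = used_minus_cache - total_rss_kb

print("used minus buffers/cache:", used_minus_cache)  # 186908
print("free plus buffers/cache: ", free_plus_cache)   # 5768
print("sum of process RSS:      ", total_rss_kb)      # 3660
print("unaccounted for:         ", unaccounted_kb)    # 183248
```

Because RSS counts shared pages once per process, 3660 KB is if anything an overestimate of what user space holds, so roughly 183 MB is in use without showing up in any process row. That would be consistent with memory held inside the kernel rather than by a runaway program, though these numbers alone do not prove it.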

Let's crash it...
I decided to see if I could allocate any memory that was left into a file.

Code:

mount -t ramfs /dev/ram0 /mnt
cd /mnt
dd if=/dev/zero of=/mnt/memuseup bs=512k count=20

This should have created a 10 megabyte file in memory, but obviously it didn't finish,
and dd, bash and my login were killed.
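
The 10 megabyte figure follows from the dd arguments: bs=512k times count=20. Here is a small Python sketch of that arithmetic, assuming dd's usual meaning of the k and M suffixes (1024 and 1024*1024 bytes). The helper name is hypothetical, and the 5768 KB figure is the "free plus buffers/cache" value from the free output earlier in this post.

```python
def dd_bytes(bs, count):
    """Total bytes dd writes for bs= and count= (plain, k and M sizes only)."""
    multipliers = {"k": 1024, "M": 1024 * 1024}
    if bs[-1] in multipliers:
        block = int(bs[:-1]) * multipliers[bs[-1]]
    else:
        block = int(bs)
    return block * count

requested = dd_bytes("512k", 20)
available = 5768 * 1024  # free + buffers + cache from the free output, in bytes

print(requested, "bytes requested =", requested // (1024 * 1024), "MiB")
print("fits in what was left:", requested <= available)
```

With swap turned off and under 6 MB reclaimable, a 10 MiB file on ramfs had nowhere to go, which matches dd and the shell being killed.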

Ok cool, I now have all my memory used up, so I let it sit there until morning to see
if the system would recover any more.

Sleep ......

So today about noon-thirty, I went over to the console and there was no recovery. If
anything, more memory was being used up, because agetty was dying with an out of memory
error and then respawning.

Any help anyone could offer is much appreciated, because I have no idea what to do next.
Sorry for the long post, but I wanted to provide you with as much information as I could.

NSKL 10-08-2002 03:21 PM

I'm guessing here, but it sounds like a memory leak. There was a thread about memory leaks a while back, if I remember correctly. Try to find it (search the board), and meanwhile I'm sure some of the more knowledgeable people will help you out more.
Also, you might want to get a program called memtest86 to check your RAM in case you suspect the RAM sticks are faulty.
Sorry I couldn't be of much help...
-NSKL

mikek147 10-08-2002 03:45 PM

Personally, I would like to see the output of top with your system normally loaded. Obviously something starts running that eats your RAM, causing your VMM to start thrashing stuff into and out of swap. Or something like logrotate, logcheck and maybe aide are running at the same time. Since they all do disk access, this can really bog down a system with a slower processor.

Just tossing out some ideas. Your mileage may vary. -mk

linuxeco 10-08-2002 10:28 PM

Here is the output of top on a pretty normal day.

Code:

11:30pm  up 10:22,  1 user,  load average: 0.04, 0.05, 0.01
66 processes: 65 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  0.0% user,  0.0% system,  0.0% nice, 99.8% idle
Mem:  192676K av,  99852K used,  92824K free,      0K shrd,  35600K buff
        10496K Active,              80932K Inactive
Swap:  465876K av,      0K used,  465876K free                  45644K cached

  PID USER    PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
26884 root      17  0  892  888  680 R    3.8  0.4  0:00 top -c -b -n 1
    1 root      9  0  528  524  452 S    0.0  0.2  0:03 init [3]
    2 root      9  0    0    0    0 SW    0.0  0.0  0:00 keventd
    3 root      19  19    0    0    0 SWN  0.0  0.0  0:00 ksoftirqd_CPU0
    4 root      9  0    0    0    0 SW    0.0  0.0  0:00 kswapd
    5 root      9  0    0    0    0 SW    0.0  0.0  0:00 bdflush
    6 root      9  0    0    0    0 SW    0.0  0.0  0:00 kupdated
    7 root      9  0    0    0    0 SW    0.0  0.0  0:00 kreiserfsd
 8338 root      9  0  676  676  556 S    0.0  0.3  0:00 /usr/sbin/syslogd -m 0
29005 root      9  0  548  548  404 S    0.0  0.2  0:00 /usr/sbin/klogd
22666 root      9  0  472  472  400 S    0.0  0.2  0:00 dhcpcd eth0
 2257 bin        9  0  524  524  440 S    0.0  0.2  0:00 /sbin/portmap
 9133 root      9  0  820  820  672 S    0.0  0.4  0:00 /usr/sbin/rpc.mountd
 4048 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd
13697 root      9  0    0    0    0 SW    0.0  0.0  0:00 lockd
10311 root      9  0    0    0    0 SW    0.0  0.0  0:00 rpciod
12097 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd
10432 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd
18465 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd
30406 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd
 7239 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd
19878 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd
14648 root      9  0    0    0    0 SW    0.0  0.0  0:00 nfsd
20219 root      9  0  772  772  648 S    0.0  0.4  0:00 /usr/sbin/rpc.statd
 6867 root      9  0  464  464  376 S    0.0  0.2  0:00 /usr/sbin/rpc.rquotad
 8155 root      8  0  564  564  472 S    0.0  0.2  0:00 /usr/sbin/madcron
 2314 root      9  0  1300 1300  1180 S    0.0  0.6  0:05 /usr/sbin/sshd
 8705 root      9  0  1124 1124  912 S    0.0  0.5  0:00 /bin/sh /usr/bin/safe_mysqld --datadir=/var/mysql --pid-file=/var/mysql/server1.pid
27978 mysql      9  0  4340 4340  1584 S    0.0  2.2  0:00 /usr/bin/mysqld --basedir=/usr --datadir=/var/mysql --user=mysql --pid-file=/var/mysql/server1.
31594 fetchmai  9  0  936  936  768 S    0.0  0.4  0:00 fetchmail --daemon 300 --syslog -f /etc/fetchmailrc
 6500 root      9  0  496  496  424 S    0.0  0.2  0:00 inetd /etc/inetd.conf
 9242 root      9  0  532  532  464 S    0.0  0.2  0:00 /sbin/agetty tty1 9600
 5488 root      9  0  532  532  464 S    0.0  0.2  0:00 /sbin/agetty tty2 9600
 1044 mysql      9  0  4340 4340  1584 S    0.0  2.2  0:00 /usr/bin/mysqld --basedir=/usr --datadir=/var/mysql --user=mysql --pid-file=/var/mysql/server1.
  478 mysql      9  0  4340 4340  1584 S    0.0  2.2  0:00 /usr/bin/mysqld --basedir=/usr --datadir=/var/mysql --user=mysql --pid-file=/var/mysql/server1.
 7350 mysql      9  0  4340 4340  1584 S    0.0  2.2  0:00 /usr/bin/mysqld --basedir=/usr --datadir=/var/mysql --user=mysql --pid-file=/var/mysql/server1.
 2868 root      8  0  4520 4520  4332 S    0.0  2.3  0:00 /usr/bin/httpd -DSSL
 3446 nobody    9  0  4636 4636  4400 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL
28014 nobody    9  0  4648 4648  4404 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL
25743 nobody    9  0  4636 4636  4400 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL
25613 nobody    9  0  4636 4636  4400 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL
17479 nobody    9  0  4636 4636  4400 S    0.0  2.4  0:00 /usr/bin/httpd -DSSL
28563 root      13  0  1764 1764  1576 S    0.0  0.9  0:00 /usr/sbin/sshd
19854 root      11  0  1720 1716  1084 S    0.0  0.8  0:00 -bash
13844 root      9  0  1116 1116  912 S    0.0  0.5  0:00 /bin/sh /command/svscanboot
 3109 root      9  0  336  332  260 S    0.0  0.1  0:00 svscan /service
32534 root      9  0  260  260  208 S    0.0  0.1  0:00 readproctitle service errors: .................................................................
 6579 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise qmail-pop3d
26589 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise log
 1609 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise qmail-smtpd
 2060 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise log
31978 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise qmail-send
13374 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise log
27644 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise qmail-pop3ds
31614 root      9  0  312  304  248 S    0.0  0.1  0:00 supervise log
11147 vpopmail  9  0  332  328  272 S    0.0  0.1  0:00 tcpserver -l 0 -R -H -v -u220 -g220 0 110 qmail-popup server1.ecolinux2.servebeer.com /home/vpo
31489 vpopmail  8  0  476  476  400 S    0.0  0.2  0:00 tcpserver -H -R -l 0 -x /home/vpopmail/etc/tcp.smtp.cdb -c 20 -u 220 -g 220 0 smtp qmail-smtpd
13620 qmails    9  0  384  380  292 S    0.0  0.1  0:00 qmail-send
 8421 qmaill    9  0  296  292  236 S    0.0  0.1  0:00 multilog t /var/log/qmail/pop3d
11589 qmaill    9  0  292  288  232 S    0.0  0.1  0:00 multilog t /var/log/qmail/smtpd
 6917 qmaill    9  0  296  292  236 S    0.0  0.1  0:00 multilog t /var/log/qmail
21340 vpopmail  9  0  332  328  272 S    0.0  0.1  0:00 tcpserver -l 0 -R -H -v -u220 -g220 0 995 stunnel -f -p /var/qmail/control/servercert.pem -l qm
18790 qmaill    9  0  296  292  236 S    0.0  0.1  0:00 multilog t /var/log/qmail/pop3ds
 8100 root      9  0  308  304  240 S    0.0  0.1  0:00 qmail-lspawn ./Maildir/
 3326 qmailr    8  0  340  336  268 S    0.0  0.1  0:00 qmail-rspawn
 7095 qmailq    9  0  328  324  260 S    0.0  0.1  0:00 qmail-clean
11633 root      11  0  948  948  756 S    0.0  0.4  0:00 changedfiles -c /etc/sync.conf
20422 root      11  0  948  948  756 S    0.0  0.4  0:00 changedfiles -c /etc/sync.conf
25132 root      11  0  948  948  756 S    0.0  0.4  0:00 changedfiles -c /etc/sync.conf
 6168 root      11  0  948  948  756 S    0.0  0.4  0:00 changedfiles -c /etc/sync.conf


linuxeco 10-09-2002 12:12 AM

I also want to mention that the following kinds of activity happen in a given day.

1 cron job that runs updatedb
logs are rotated manually, whenever they get to be more than a couple megabytes in size.

The email server probably only handles about 100 messages a day

The web server handles about 250 requests a day, which is about one request every 6 minutes.

NFS transfers about 250 megabytes a day, but it can vary quite a bit.

changedfiles syncs a directory on my system to a directory on another system using ssh sessions. There are anywhere from 10-15 transfers a day, varying from 1-15 megabytes.

I also do a lot of compiling sometimes, anywhere from 30 minutes to 3 hours on a given day.

Let me know if any other information is required. I plan on running memory tests tomorrow to see if that shows anything useful.

mikek147 10-09-2002 05:54 AM

The output of top you posted, is that when the system is healthy or thrashing? -mk

linuxeco 10-09-2002 03:23 PM

The system was healthy when I grabbed the output of top.

As a side note, I ran memtest86 today and no errors were detected.

linuxeco 10-09-2002 04:18 PM

I ran a memory testing script earlier that is supposed to run massive diffs in parallel to test memory management. I don't know how accurate this is supposed to be, but I ended up going into 'super-thrash mode' before the test could complete.

The odd thing is that I had atop running every 60 seconds and logging to a file. It showed that only 15 megs of swap was being used when it crashed and burned. Now I know a 233MHz processor is nothing great, but I can't understand why the system can't handle paging 15 megs worth of swap space.

If a memory leak is indeed what the problem is, how would I know? Would atop or top show the rogue process using more memory than it should, or does memory just disappear off the face of the earth?
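
One hedged way to approach this question: a leak in an ordinary program normally shows up as that process's RSS growing from one sample to the next, whereas memory held by the kernel itself does not appear in any process row. The Python sketch below, with made-up function names, reads VmRSS from /proc/<pid>/status and compares two snapshots. The sample values at the bottom are hypothetical, not taken from this system.

```python
import os

def parse_rss_kb(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status; 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line

def snapshot(proc_root="/proc"):
    """Map each pid (as a string) to its resident set size in kB."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                result[entry] = parse_rss_kb(handle.read())
        except OSError:
            continue  # the process exited while we were looking
    return result

def growth(before, after):
    """Return {pid: kB gained} for processes that grew between two snapshots."""
    return {
        pid: after[pid] - before[pid]
        for pid in after
        if pid in before and after[pid] > before[pid]
    }

# Hypothetical snapshots, for illustration only.
earlier = {"2314": 1300, "27978": 4340}
later = {"2314": 1300, "27978": 9800}
print(growth(earlier, later))
```

On a live system one would call snapshot() twice, some minutes apart, and pass both results to growth(). If nothing grows while free memory keeps shrinking, the loss is probably not in user space.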

I am gonna keep plugging at the problem, so let me know if you have any ideas on what I could or should do.

mikek147 10-10-2002 02:34 AM

When your system went into Super Thrash mode, using top, what were the top 3 cpu intensive programs running? -mk

linuxeco 10-10-2002 02:56 AM

tar, gzip and diff.

If you think it would help, I can run some kind of test, log the heck out of it and post the results. It's not like I am too worried about crashing it at this point.

mikek147 10-10-2002 04:20 AM

If you would, grab the first 12 lines of output from top, when the system is thrashing, and post them here. -mk

linuxeco 10-13-2002 06:03 AM

9.31 Load Average.....
 
After about two days of running my system, a single web page hit did this ....

The load average was pretty low before I tried to load a web page. Even though the load average was down before I loaded the page, there was a lot of thrashing going on and lots of spikes in the load average.

Code:

  7:04am  up 1 day, 14:01,  1 user,  load average: 9.31, 4.19, 1.73
77 processes: 76 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  0.2% user,  2.3% system,  0.0% nice, 97.4% idle
Mem:  192748K av,  188300K used,    4448K free,      0K shrd,    252K buff
        1136K Active,              1336K Inactive
Swap:  465876K av,  16868K used,  449008K free                    964K cached

  PID USER    PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
    4 root      14  0    0    0    0 SW    0.8  0.0  0:13 kswapd
32238 root      14  0  364  252  252 R    0.2  0.1  0:21 top
21843 nobody    11  0  2776  252  252 D    0.1  0.1  0:04 httpd
32290 nobody    10  0  2160  300  284 D    0.1  0.1  0:00 httpd
32292 nobody    10  0  2136  256  244 D    0.1  0.1  0:00 httpd
    1 root      9  0    92  24    24 S    0.0  0.0  0:03 init
    2 root      9  0    0    0    0 SW    0.0  0.0  0:00 keventd
    3 root      19  19    0    0    0 SWN  0.0  0.0  0:00 ksoftirqd_CPU0
    5 root      9  0    0    0    0 SW    0.0  0.0  0:00 bdflush
    6 root      9  0    0    0    0 DW    0.0  0.0  0:06 kupdated
    7 root      9  0    0    0    0 SW    0.0  0.0  0:00 kreiserfsd
  47 root      9  0  224  20    20 S    0.0  0.0  0:00 svscanboot
  55 root      9  0  128  96    96 D    0.0  0.0  0:00 svscan
  56 root      9  0    56    4    4 S    0.0  0.0  0:00 readproctitle
  59 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise
  60 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise
  61 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise
  62 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise
  63 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise
  64 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise
  65 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise
  66 root      9  0    72  12    12 S    0.0  0.0  0:00 supervise
  68 vpopmail  8  0    96  20    20 S    0.0  0.0  0:00 tcpserver
  69 qmaill    9  0    64    8    8 S    0.0  0.0  0:00 multilog
  71 qmaill    9  0    60    4    4 S    0.0  0.0  0:00 multilog
  72 vpopmail  9  0    68  12    12 S    0.0  0.0  0:00 tcpserver
  73 qmails    9  0    96  16    16 S    0.0  0.0  0:01 qmail-send
  74 qmaill    9  0    64    8    8 S    0.0  0.0  0:00 multilog
  75 vpopmail  9  0    68  12    12 S    0.0  0.0  0:00 tcpserver
  81 qmaill    9  0    64    8    8 S    0.0  0.0  0:00 multilog
  89 root      8  0    76  12    12 S    0.0  0.0  0:00 qmail-lspawn
  90 qmailr    9  0    84  16    16 S    0.0  0.0  0:00 qmail-rspawn
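
A load average of 9.31 alongside 97.4% idle CPU looks contradictory, but on Linux the load average counts tasks in uninterruptible sleep (STAT beginning with D, typically waiting on disk I/O) as well as runnable ones. Here is a rough Python sketch that tallies the STAT column for the first rows of the top output above. The helper name is made up, and only 13 of the 77 processes are included, so the real D count may be higher.

```python
from collections import Counter

# First 13 process rows from the top output above.
TOP_ROWS = """\
    4 root      14  0    0    0    0 SW    0.8  0.0  0:13 kswapd
32238 root      14  0  364  252  252 R    0.2  0.1  0:21 top
21843 nobody    11  0  2776  252  252 D    0.1  0.1  0:04 httpd
32290 nobody    10  0  2160  300  284 D    0.1  0.1  0:00 httpd
32292 nobody    10  0  2136  256  244 D    0.1  0.1  0:00 httpd
    1 root      9  0    92  24    24 S    0.0  0.0  0:03 init
    2 root      9  0    0    0    0 SW    0.0  0.0  0:00 keventd
    3 root      19  19    0    0    0 SWN  0.0  0.0  0:00 ksoftirqd_CPU0
    5 root      9  0    0    0    0 SW    0.0  0.0  0:00 bdflush
    6 root      9  0    0    0    0 DW    0.0  0.0  0:06 kupdated
    7 root      9  0    0    0    0 SW    0.0  0.0  0:00 kreiserfsd
   47 root      9  0  224  20    20 S    0.0  0.0  0:00 svscanboot
   55 root      9  0  128  96    96 D    0.0  0.0  0:00 svscan
"""

def state_counts(rows):
    """Count processes by the first letter of top's STAT column."""
    counts = Counter()
    for row in rows.splitlines():
        fields = row.split()
        if len(fields) < 12:
            continue  # not a full process row
        counts[fields[7][0]] += 1
    return counts

counts = state_counts(TOP_ROWS)
print("running (R):        ", counts["R"])
print("uninterruptible (D):", counts["D"])
print("sleeping (S):       ", counts["S"])
```

Even in this partial listing the D-state tasks outnumber the single running one, which fits a box that is waiting on the disk rather than short of CPU.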


unSpawn 10-13-2002 08:31 AM

Top output shows an accumulation of things; what you want to see is a detailed overview of what goes on. Try running Atsar or Sysstat (see Freshmeat) with a low interval, and process the logs daily. Also review your system limits and your /proc/sys/vm settings; limits can cause all sorts of trouble, from denying logins to crashing X11. Proper (for your situation, that is) bdflush/kswapd values may cost some performance but give less bursty I/O, which could be useful on an already I/O-bound box.
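
As a starting point for reviewing the /proc/sys/vm settings, the Python sketch below simply dumps whatever tunables the running kernel exposes there. The function name is made up for illustration, and which files exist (bdflush, kswapd and so on) depends on the kernel version, so nothing here assumes a particular set.

```python
import os

def read_vm_settings(vm_dir="/proc/sys/vm"):
    """Return {filename: contents} for each regular file in vm_dir."""
    settings = {}
    for name in sorted(os.listdir(vm_dir)):
        path = os.path.join(vm_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as handle:
                settings[name] = handle.read().strip()
        except OSError:
            settings[name] = "<unreadable>"  # some entries are write-only
    return settings

# Only print when the directory exists, i.e. on a Linux system.
if os.path.isdir("/proc/sys/vm"):
    for name, value in read_vm_settings().items():
        print("%-28s %s" % (name, value))
```

Saving this output before and after changing a value gives a record of what was tuned, which helps when comparing one day's logs with the next.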


All times are GMT -5.