Trouble killing processes

plisken · 04-10-2012, 02:33 PM

I was recently forced to do a power-button reboot on my server. Basically, I couldn't ssh in with any username other than root, and when I ran ps -A I noticed there were hundreds of "sh" entries in there.

I also noticed my messages log showed multiple failed password attempts for root and the other usual suspects.

So basically my machine had several hundred processes running and CPU usage was through the roof. I'm wondering whether there is a way to limit the number of sh instances that can be opened, which might prevent this going forward.

Thanks as always...

MensaWater · 04-10-2012, 02:42 PM

If you limit the number of sh processes you'd still have the same problem, because at some point you wouldn't be able to log in, since your login opens a shell.

It sounds to me almost as if someone is exiting the system improperly and leaving sh processes running. You can kill those with a "kill -1 <pid>" but I'd try to track down the owners of the processes and find out how they're exiting the system. My guess is they're just closing windows or turning off workstations.
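
A minimal sketch of how the owners and parents of the stray shells might be listed, assuming a ps that accepts the POSIX -A and -o options. The function name list_by_name is made up for illustration.

```shell
#!/bin/sh
# list_by_name: hypothetical helper. Prints the ps header plus every process
# whose command name (4th column) equals the given name, with its PID,
# parent PID and owner.
list_by_name() {
    ps -A -o pid,ppid,user,comm,args | awk -v n="$1" 'NR == 1 || $4 == n'
}

# Show every process named "sh" together with its owner and parent.
list_by_name sh
```

The USER column answers "whose are they?" and the PPID column points at whatever started them. Once a PID is known, the "kill -1 <pid>" suggested above sends it SIGHUP.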

ponce · 04-10-2012, 02:45 PM

I think you might be interested in the "ulimit" section of "man bash", maybe the -u option.

Quote:

ulimit [-HSTabcdefilmnpqrstuvx [limit]]

Provides control over the resources available to the shell and to processes started by it, on systems that allow such control. The -H and -S options specify that the hard or soft limit is set for the given resource. A hard limit cannot be increased by a non-root user once it is set; a soft limit may be increased up to the value of the hard limit. If neither -H nor -S is specified, both the soft and hard limits are set. The value of limit can be a number in the unit specified for the resource or one of the special values hard, soft, or unlimited, which stand for the current hard limit, the current soft limit, and no limit, respectively. If limit is omitted, the current value of the soft limit of the resource is printed, unless the -H option is given.

When more than one resource is specified, the limit name and unit are printed before the value. Other options are interpreted as follows:

-a All current limits are reported
-b The maximum socket buffer size
-c The maximum size of core files created
-d The maximum size of a process's data segment
-e The maximum scheduling priority ("nice")
-f The maximum size of files written by the shell and its children
-i The maximum number of pending signals
-l The maximum size that may be locked into memory
-m The maximum resident set size (many systems do not honor this limit)
-n The maximum number of open file descriptors (most systems do not allow this value to be set)
-p The pipe size in 512-byte blocks (this may not be set)
-q The maximum number of bytes in POSIX message queues
-r The maximum real-time scheduling priority
-s The maximum stack size
-t The maximum amount of cpu time in seconds
-u The maximum number of processes available to a single user
-v The maximum amount of virtual memory available to the shell and, on some systems, to its children
-x The maximum number of file locks
-T The maximum number of threads

and you really should have a look also at "man initscript" (reference).
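
A small sketch of the -u option in use, assuming bash is installed (the quoted text is from "man bash", and other shells may spell this option differently). The wrapper name run_limited and the value 200 are only illustrative.

```shell
#!/bin/sh
# run_limited: hypothetical wrapper. Starts a bash, lowers its soft limit on
# the number of processes for the user (ulimit -S -u), then replaces that bash
# with the given command. The calling shell keeps its own limits.
run_limited() {
    bash -c 'ulimit -S -u "$1" || exit 1; shift; exec "$@"' run_limited "$@"
}

# Example (commented out): show the limit as seen from inside the wrapper.
# run_limited 200 bash -c 'ulimit -S -u'
```

Because only the soft limit is lowered, the command can still raise it again up to the hard limit. Setting the hard limit as well (-H) makes the cap stick for a non-root user.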

plisken · 05-21-2012, 08:22 AM

A few times now I've had to reboot my server as a result of hundreds of /bin/sh processes spawning. I'm not entirely sure of the cause yet, but regardless, I'm having problems killing these processes.

Using kill -9 PID or killall sh doesn't seem to remove any of them.

Now I'm assuming I can't kill the init process, so without actually rebooting the machine, are there any other options open to me until I find out why so many are spawning?

I can't change to runlevel 1 or run init 6; I'm forced to do a power off/on to get the server back to normal.

Thanks in advance...

pan64 · 05-21-2012, 08:34 AM

You need to know at least how it was started (what started this spawning?). You can probably find the root process by parent PID or by name.
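
One way to follow the parent PIDs upward, sketched under the assumption that ps accepts the POSIX -o and -p options. The function name ancestry is hypothetical.

```shell
#!/bin/sh
# ancestry: hypothetical helper. Prints PID, PPID and command name for a
# process and then for each of its ancestors, stopping at PID 1 or as soon as
# a lookup fails.
ancestry() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ]; do
        line=$(ps -o pid= -o ppid= -o comm= -p "$pid") || break
        [ -n "$line" ] || break
        echo "$line"
        if [ "$pid" -eq 1 ]; then
            break
        fi
        # The second field of the ps line is the parent PID.
        pid=$(echo "$line" | awk '{print $2}')
    done
}

# Example: the chain from the current shell up to init.
ancestry $$
```

If the process that started a shell has already exited, the shell is re-parented to init, so the chain for such a process is only two entries long and the original parent has to be found some other way (by name, or from logs).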

plisken · 05-21-2012, 08:53 AM

When I run pstree it seems to come from init.

Can't check anything else for now as server is down and I'm many miles away

ttk · 05-21-2012, 10:36 AM

Also check "tail -n 40 /var/log/messages" and "dmesg | tail -n 40" for clues.

-- TTK

unSpawn · 05-21-2012, 11:53 AM

@OP: I've merged your "Is it possible to limit number of /sbin/sh instances???" thread with this recent one as it is the same topic. Also note that you never responded to replies in that thread. If you did you might have solved or mitigated the problem over a month ago. Next to that it shouldn't just be one way traffic and any usable replies should warrant a response from you.

plisken · 05-21-2012, 12:45 PM

Sorry, I couldn't find that post actually. Thanks for the heads up.

Update:

I had a cron script that ran every 5 minutes. I'm now thinking this may have been the cause of the problem, and if so I just won't bother using it. I was using it to log ADSL connection drops.

With regard to checking the logs, all logs after my logrotate were empty, which shouldn't normally be the case, I know.

Boring bit: as was posted above, I didn't think that limiting the number of sh instances running was actually what I was looking for, which might have contributed to me not replying to the original post.

My cry for help now is that when this happened again, I was unable to kill the offending processes, and that's why I posted again.

Sorry for any hassle and thanks for all interest and posts.

plisken · 05-22-2012, 08:26 AM

A 'ps' I managed while the problem was there...

Code:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   488   68 ?        S    May14   0:08 init [3] 
root         2  0.0  0.0     0    0 ?        SW   May14   0:00 [keventd]
root         3  0.0  0.0     0    0 ?        SWN  May14   0:00 [ksoftirqd_CPU0]
root         4  0.0  0.0     0    0 ?        SWN  May14   0:00 [ksoftirqd_CPU1]
root         5  0.0  0.0     0    0 ?        SW   May14   0:22 [kswapd]
root         6  0.0  0.0     0    0 ?        SW   May14   0:00 [bdflush]
root         7  0.0  0.0     0    0 ?        Z    May14   0:44 [kupdated <defunct>]
root         9  0.0  0.0     0    0 ?        SW   May14   0:00 [ahc_dv_0]
root        10  0.0  0.0     0    0 ?        SW   May14   0:00 [ahc_dv_1]
root        11  0.0  0.0     0    0 ?        SW   May14   0:00 [scsi_eh_1]
root        12  0.0  0.0     0    0 ?        SW   May14   0:00 [scsi_eh_2]
root        13  0.0  0.0     0    0 ?        SW<  May14   0:00 [mdrecoveryd]
root        14  0.0  0.0     0    0 ?        SW   May14   0:03 [kreiserfsd]
root       400  0.0  0.0  1412  500 ?        S    May14   0:01 /usr/sbin/inetd
root       403  0.0  0.2  3212 1052 ?        S    May14   0:33 /usr/sbin/sshd
root       410  0.0  0.0  1560  436 ?        S    May14   1:15 /usr/sbin/crond -l10
root       415  0.0  0.2  4836 1380 ?        S    May14   0:39 sendmail: rejecting connections on daemon MSA: load average: 266
root       477  0.0  0.1  2204  924 ?        S    May14   0:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --pid-file=/var/run/mysql/mysql.pid --skip-networking
root       554  0.0  0.8 74896 4464 ?        S    May14   0:20 /usr/sbin/httpd
root       581  0.0  0.3  9184 1968 ?        S    May14   0:14 /usr/bin/perl /usr/local/webmin/miniserv.pl /etc/webmin/miniserv.conf
root       761  0.0  0.0  1368  416 tty1     S    May14   0:00 [agetty]
root       762  0.0  0.0  1368  416 tty2     S    May14   0:00                             
root       763  0.0  0.0  1368  416 tty3     S    May14   0:00                             
root       764  0.0  0.0  1368  416 tty4     S    May14   0:00  :??   @q??)   0u??    p???
root       765  0.0  0.0  1368  416 tty5     S    May14   0:00         ?        ???    
root       766  0.0  0.0  1368  416 tty6     S    May14   0:00                             
root      3653  0.0  0.6 17584 3208 ?        S    May15   1:52 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root      3654  0.0  0.6 17584 3208 ?        S    May15   2:37 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root      3657  0.0  0.6 17584 3208 ?        S    May15   1:41 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root      3658  0.0  0.6 17584 3208 ?        S    May15   2:16 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root      3662  0.0  0.6 17584 3208 ?        S    May15   2:10 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root      4410  0.0  8.0 45704 41692 ?       S    May19   1:37 /usr/bin/perl5.8.0 -T -w /usr/bin/spamd -c -d
root      4419  0.1  9.0 52612 46756 ?       S    May19   5:24 /usr/bin/perl5.8.0 -T -w /usr/bin/spamd -c -d
root      4420  0.0  8.8 48984 45456 ?       S    May19   0:16 /usr/bin/perl5.8.0 -T -w /usr/bin/spamd -c -d
root      6517  0.0  0.4  5772 2520 ?        S    May19   0:00 sendmail: ./q4JHR1EF006516 from queue 
root      6520  0.0  0.4  5772 2516 ?        S    May19   0:00 sendmail: ./q4JHTTEF006519 from queue 
root      6526  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHUeEF006525 from queue 
root      6531  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHXLEF006530 from queue 
root      6541  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHcBEF006540 from queue 
root      6554  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHiYEF006553 from queue 
root      6560  0.0  0.4  5776 2512 ?        S    May19   0:00 sendmail: ./q4JHk9EF006559 from queue 
root      6599  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHoWEF006598 from queue 
root      6602  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHovEF006601 from queue 
root      6608  0.0  0.4  5776 2524 ?        S    May19   0:00 sendmail: ./q4JHshEF006607 from queue 
root      6614  0.0  0.4  5776 2512 ?        S    May19   0:00 sendmail: ./q4JHw1EF006613 from queue 
root      6627  0.0  0.4  5780 2532 ?        S    May19   0:00 sendmail: ./q4JI2ZEF006626 from queue 
root      6637  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JI6HEF006635 from queue 
root      6640  0.0  0.4  5772 2508 ?        S    May19   0:00 sendmail: ./q4JI6YEF006639 from queue 
root      6644  0.0  0.4  5776 2520 ?        S    May19   0:00 sendmail: ./q4JI74EF006643 from queue 
root      6659  0.0  0.4  5776 2524 ?        S    May19   0:00 sendmail: ./q4JICQEF006658 from queue 
root      6662  0.0  0.4  5776 2520 ?        S    May19   0:00 sendmail: ./q4JICfEF006661 from queue 
root      6665  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JICrEF006664 from queue 
root      6672  0.0  0.4  5776 2504 ?        S    May19   0:00 sendmail: ./q4JIDSEF006671 from queue 
root     10177  0.0  0.1  2032  908 ?        S    May20   0:00 /bin/sh -c /usr/bin/run-parts /etc/cron.daily 1> /dev/null?
root     10178  0.0  0.1  2040  988 ?        S    May20   0:00 /bin/sh /usr/bin/run-parts /etc/cron.daily
root     10197  0.0  0.1  2024  900 ?        S    May20   0:00 /bin/sh /etc/cron.daily/logrotate
root     10198  0.0  0.1  1576  732 ?        D    May20   0:00 /usr/sbin/logrotate /etc/logrotate.conf
root     13534  0.0  0.1  2032  908 ?        S    04:41   0:00 /bin/sh -c /usr/bin/run-parts /etc/cron.daily 1> /dev/null?
root     13535  0.0  0.1  2040  988 ?        S    04:41   0:00 /bin/sh /usr/bin/run-parts /etc/cron.daily
root     13554  0.0  0.1  2024  900 ?        S    04:41   0:00 /bin/sh /etc/cron.daily/logrotate
root     13555  0.0  0.1  1560  644 ?        D    04:41   0:00 /usr/sbin/logrotate /etc/logrotate.conf
root     29247  0.0  0.3  5892 1840 ?        S    10:32   0:00 sshd: root@pts/1 
root     29266  0.0  0.2  2304 1316 pts/1    S    10:32   0:00 -bash
root     29589  0.0  0.0  1376  420 ?        S    10:37   0:00 /usr/sbin/popa3d
root     29596  0.0  0.1  2948 1004 pts/1    R    10:37   0:00 ps -ux
root     29597  0.0  0.1  2764  872 pts/1    S    10:37   0:00 mail ***.*******@*******.com
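
In the listing above, both /usr/sbin/logrotate processes (PIDs 10198 and 13555) show STAT D, which ps documents as uninterruptible sleep. A process in that state does not act on signals, kill -9 included, until the kernel call it is blocked in returns, so that is one possible reason the kills appeared to do nothing. A sketch for picking such entries out of a fuller listing, assuming a procps-style ps that supports the stat and wchan columns (the function names are hypothetical):

```shell
#!/bin/sh
# filter_stuck: hypothetical helper. Reads ps-style lines on stdin whose 4th
# column is the process state and keeps the header plus any line whose state
# starts with D (uninterruptible sleep) or Z (zombie).
filter_stuck() {
    awk 'NR == 1 || $4 ~ /^[DZ]/'
}

# stuck_procs: lists the D and Z processes on the running system. The wchan
# column names the kernel function a sleeping process is waiting in.
stuck_procs() {
    ps -eo pid,ppid,user,stat,wchan,args | filter_stuck
}

# Example (commented out):
# stuck_procs
```

What the D processes are waiting on (a disk, a filesystem, a network mount) is usually the thing to investigate, rather than the shells stacked up behind them.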

Mark Pettit · 05-22-2012, 09:42 AM

Quote:

Originally Posted by plisken

I also noticed my messages log showed multiple failed password attempts for root and the other usual suspects.

Are you not worried about this? It looks to me as if someone was trying to hack into your box. Perhaps consider using a program like 'denyhosts' or 'fail2ban'.

unSpawn · 05-22-2012, 09:51 AM

Quote:

Originally Posted by Mark Pettit

Are you not worried about this? It looks to me as if someone was trying to hack into your box. Perhaps consider using a program like 'denyhosts' or 'fail2ban'.

If you look at his process table you'll see he already does.

Mark Pettit · 05-22-2012, 10:13 AM

Ah - you are correct. I hadn't looked that deep. Well done to him then :-)

unSpawn · 05-22-2012, 07:39 PM

Quote:

Originally Posted by plisken

I had a cron script that ran every 5 minutes. I'm now thinking this may have been the cause of the problem

If that is the case then deleting the cron job should show the load go down after the processes die or get killed, or after the box is rebooted. And there's no need to "think": MensaWater is right about tracking down the UID of the processes first for analysis. If you can't do that manually then run atop, collectl or dstat to gather system stats automagically.
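
As a rough manual stand-in for those tools, a snapshot of how many processes each user is running per command can be taken with ps alone. This is only a sketch: the function names are made up, and it assumes a ps that accepts -e and -o.

```shell
#!/bin/sh
# count_pairs: hypothetical helper. Reads "user command" lines on stdin and
# prints each distinct pair with its count, most numerous first.
count_pairs() {
    sort | uniq -c | sort -rn
}

# proc_census: prints the top N (default 10) user/command pairs by number of
# running processes.
proc_census() {
    ps -e -o user= -o comm= | count_pairs | head -n "${1:-10}"
}

# Example: the five most numerous user/command pairs right now.
proc_census 5
```

Run while the problem is building up, the first line should show at a glance which account owns the pile of sh processes.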

Quote:

Originally Posted by plisken

With regard to checking the logs, all logs after my logrotate were empty, which shouldn't normally be the case, I know.

That is odd. Has this happened before?
Do you run a standard logrotate configuration?
Are all logs empty including all rotated ones?
Does your syslog, cron or any other daemon log show any anomalies around the time of the log rotation?
Are there any login (attempts) during or prior to this?

Quote:

Originally Posted by plisken

I was unable to kill the offending processes

See Ponce's advice: if the processes ran as root then see if the cron job can be run from an unprivileged account and apply a process limit.
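
A related guard, not suggested anywhere in this thread and offered only as an assumption about what might help: if the 5-minute job sometimes hangs, each new cron run adds another /bin/sh on top of the previous ones. A wrapper that refuses to start while an earlier run is still active keeps that from piling up. The function name run_once, the lock path and the exit code 75 are all arbitrary choices for this sketch.

```shell
#!/bin/sh
# run_once: hypothetical wrapper. Runs the given command only if the lock
# directory can be created; mkdir either creates it or fails in one step, so
# two runs cannot both succeed. The lock is removed when the command returns.
run_once() {
    lock=$1
    shift
    if mkdir "$lock" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$lock"
        return "$status"
    fi
    echo "previous run still active, skipping" >&2
    return 75
}

# Example (commented out; both paths are placeholders):
# run_once /tmp/adsl-check.lock /path/to/adsl-check
```

A stale lock directory left behind by a crash or a power-off has to be removed by hand (or at boot) before the job will run again.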

A few remarks if I may in random order:

Quote:

Originally Posted by plisken

Code:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root       403  0.0  0.2  3212 1052 ?        S    May14   0:33 /usr/sbin/sshd
root     29247  0.0  0.3  5892 1840 ?        S    10:32   0:00 sshd: root@pts/1 
root     29266  0.0  0.2  2304 1316 pts/1    S    10:32   0:00 -bash
root     29596  0.0  0.1  2948 1004 pts/1    R    10:37   0:00 ps -ux
root     29597  0.0  0.1  2764  872 pts/1    S    10:37   0:00 mail ***.*******@*******.com

- 'sshd' doesn't show the "[priv]" tag on your login process ID 29247 and I don't know if that's due to 0) your distro's (which one?) implementation of 'ps' (unlikely), 1) ps output doctored by you (please confirm), 2) your distro's implementation of 'sshd' (unlikely), or 3) you running OpenSSH without privilege separation. In case of the latter please verify your binaries' integrity and correct the configuration, as sshd shouldn't be configured to run without it.
- you seem to be logging in as root user. That is not a security best practice regardless of any seemingly mitigating arguments. Do use an unprivileged user account with pubkey auth to log in with.
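
A quick way to see what the sshd configuration currently says on these points, as a sketch only. The helper name sshd_setting is made up, /etc/ssh/sshd_config is assumed to be the config path, and UsePrivilegeSeparation is an option of older OpenSSH releases that newer ones no longer use. The helper ignores Match blocks and the "Keyword=value" spelling.

```shell
#!/bin/sh
# sshd_setting: hypothetical helper. Prints the value of the first uncommented
# occurrence of a keyword in an sshd_config-style file. Keywords are matched
# case-insensitively, and only the first occurrence is reported because sshd
# uses the first value it obtains for a keyword.
sshd_setting() {
    awk -v key="$1" 'tolower($1) == tolower(key) { print $2; exit }' "$2"
}

# Report the three settings discussed above; an empty value means the keyword
# is not set explicitly and the compiled-in default applies.
for k in PermitRootLogin PubkeyAuthentication UsePrivilegeSeparation; do
    printf '%s: %s\n' "$k" "$(sshd_setting "$k" /etc/ssh/sshd_config 2>/dev/null)"
done
```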

Quote:

Originally Posted by plisken

Code:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root       581  0.0  0.3  9184 1968 ?        S    May14   0:14 /usr/bin/perl /usr/local/webmin/miniserv.pl /etc/webmin/miniserv.conf

Please ensure your Webmin installation is current, access is restricted to "known good" IP (ranges?) and preferably over SSL.

Quote:

Originally Posted by plisken

Code:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root       761  0.0  0.0  1368  416 tty1     S    May14   0:00 [agetty]
root       762  0.0  0.0  1368  416 tty2     S    May14   0:00                             
root       763  0.0  0.0  1368  416 tty3     S    May14   0:00                             
root       764  0.0  0.0  1368  416 tty4     S    May14   0:00  :??   @q??)   0u??    p???
root       765  0.0  0.0  1368  416 tty5     S    May14   0:00         ?        ???    
root       766  0.0  0.0  1368  416 tty6     S    May14   0:00

I don't know what to make of this but it seems odd.

Quote:

Originally Posted by plisken

Code:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root       415  0.0  0.2  4836 1380 ?        S    May14   0:39 sendmail: rejecting connections on daemon MSA: load average: 266
root      6517  0.0  0.4  5772 2520 ?        S    May19   0:00 sendmail: ./q4JHR1EF006516 from queue 
root      6520  0.0  0.4  5772 2516 ?        S    May19   0:00 sendmail: ./q4JHTTEF006519 from queue 
root      6526  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHUeEF006525 from queue 
root      6531  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHXLEF006530 from queue 
root      6541  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHcBEF006540 from queue 
root      6554  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHiYEF006553 from queue 
root      6560  0.0  0.4  5776 2512 ?        S    May19   0:00 sendmail: ./q4JHk9EF006559 from queue 
root      6599  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHoWEF006598 from queue 
root      6602  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JHovEF006601 from queue 
root      6608  0.0  0.4  5776 2524 ?        S    May19   0:00 sendmail: ./q4JHshEF006607 from queue 
root      6614  0.0  0.4  5776 2512 ?        S    May19   0:00 sendmail: ./q4JHw1EF006613 from queue 
root      6627  0.0  0.4  5780 2532 ?        S    May19   0:00 sendmail: ./q4JI2ZEF006626 from queue 
root      6637  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JI6HEF006635 from queue 
root      6640  0.0  0.4  5772 2508 ?        S    May19   0:00 sendmail: ./q4JI6YEF006639 from queue 
root      6644  0.0  0.4  5776 2520 ?        S    May19   0:00 sendmail: ./q4JI74EF006643 from queue 
root      6659  0.0  0.4  5776 2524 ?        S    May19   0:00 sendmail: ./q4JICQEF006658 from queue 
root      6662  0.0  0.4  5776 2520 ?        S    May19   0:00 sendmail: ./q4JICfEF006661 from queue 
root      6665  0.0  0.4  5776 2516 ?        S    May19   0:00 sendmail: ./q4JICrEF006664 from queue 
root      6672  0.0  0.4  5776 2504 ?        S    May19   0:00 sendmail: ./q4JIDSEF006671 from queue

Apart from a load of 266 being ludicrous, please check the mail spool for clues as to why these messages aren't being sent.

plisken · 05-25-2012, 03:43 AM

Quote:

If that is the case then deleting the cron job should show the load go down after the processes die or get killed or after the box is rebooted.

I thought killing crond would have done this, but it, along with certain other processes, wouldn't die. But aye, rebooting returns things to normal.

Quote:

That is odd. Has this happened before?
Do you run a standard logrotate configuration?
Are all logs empty including all rotated ones?
Does your syslog, cron or any other daemon log show any anomalies around the time of the log rotation?
Are there any login (attempts) during or prior to this?

Never noticed it before. My logrotate is pretty much standard: a few extra entries in there for things, but nothing in it has been changed for years, with the exception of adding my fail2ban entry.
From what I remember, only the secure/messages/maillog entries were empty, rotated ones were populated as expected.

Additionally, I've killed webmin for the time being.
I could only login as root, all other users simply hung after password entry from console.

As always, all comments are appreciated. I'm looking into the other points mentioned, and when/if this happens again I'll try to gather better information to answer the questions I've been asked but have not yet been able to answer.

Oh, and this is 9.1