LinuxQuestions.org

thomas.nichols 08-30-2001 06:01 AM

Cannot kill process - please help!
 
Good Day,

I have a RedHat 7.0 box running Tomcat Java app server, which has become unstable a few times for reasons I cannot fathom. The current ps -ef listing shows a number of Java JVMs (IBM JDK 1.3, FWIW) - and some of them will not die, even with kill -9 as root. How else can I get rid of the little blighters?

"netstat -a | grep LISTEN" shows that the ports used by Tomcat are still being listened to -- presumably by these rogue processes.

Over the past week I've rebooted twice -- once because "rm -rf" on a directory froze (again, kill -9 did nothing) and again because HTTP and SSH connections abruptly disappeared (though PING remained OK). What diagnostics could I run here?

Thanks a lot for any suggestions,
Regards,
Thomas.

isajera 08-30-2001 04:24 PM

night of the living dead...
 
sounds like you've got a few zombie processes running. what happens is that the parent process that spawned them doesn't call wait() or waitpid(), so each exited child hangs around in the process table until the parent reaps it or dies. you can't get rid of a zombie with kill, since it has already exited... although i've noticed that the kde process manager can somehow get rid of zombies. don't ask me how. i haven't the faintest idea. the only way i know to get rid of them is to kill the parent process, but many times you don't want to do that.
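
A minimal sketch for checking whether stuck processes really are zombies, assuming a procps-style ps that accepts -eo: the STAT column shows Z for a zombie and D for a process in uninterruptible sleep (usually blocked on I/O), and neither reacts to kill -9. The helper name stuck_procs is made up for illustration.

```shell
#!/bin/sh
# Read "PID PPID STAT COMMAND" lines on stdin and keep only the processes
# whose state starts with Z (zombie) or D (uninterruptible sleep).
# stuck_procs is a hypothetical helper name, not a standard tool.
stuck_procs() {
    awk '$3 ~ /^[ZD]/'
}

# Typical use (assumes a procps-style ps):
#   ps -eo pid,ppid,stat,comm | stuck_procs
```

The PPID column in the output is the process to look at if the state is Z; for D the thing to chase is whatever I/O the process is blocked on.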

r3b00t 08-30-2001 05:12 PM

Gentlemen, we should think solutions :)
 
Maybe it's an idea to find out _what_ is causing the problem, so you can fix the problem at its source.

What software is running on the box, which kernel, etc etc

isajera 08-30-2001 05:29 PM

the process causing the problem is going to be the parent process... not the child. when you have processes that can't be killed - make a note of which process is the parent.

it's pretty strange that "rm -rf" would hang tho...

r3b00t 08-30-2001 05:39 PM

Well, if he can be sure he's running the latest stable software, and even rm hangs, it might really be possible that there's a hardware issue...

thomas.nichols 08-30-2001 06:06 PM

"Pretty strange" is my assessment also - here's the top of the ps listing:
$ ps -ef | grep IBMJ
thomasn 2248 1 0 Aug28 ? 00:01:15 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 2320 1 0 Aug28 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 2344 1 0 Aug28 ? 00:00:01 /opt/IBMJava2-13/jre/bin/exe/jav

... which I read as meaning that the parent of these three is PID 1 -- which is shown by a full ps to be:

UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Aug28 ? 00:00:07 init [3]

This looks Bad to me.

The java process was kicked off by a bash script (eestart), using nohup:
$ nohup eestart
which sends all the stdout stuff to nohup.out, and allows me to disconnect and leave the Java process (Tomcat) still running. So I'm guessing the problem is that something weird is happening to the nohup process.

Suspect #1: updatedb (!?) I got root mail a couple of days ago telling me that a cron job had failed, seems updatedb threw a segfault. Looking at the ps -ef again, I see, starting at 4:01 am this morning:

root 4956 599 0 04:01 ? 00:00:00 CROND
root 4957 4956 0 04:01 ? 00:00:00 bash /usr/bin/run-parts /etc/cro
root 5056 4957 0 04:02 ? 00:00:00 sh /etc/cron.daily/slocate.cron
root 5057 4957 0 04:02 ? 00:00:00 awk -v progname=/etc/cron.daily/
root 5058 5056 0 04:02 ? 00:00:02 /usr/bin/updatedb -f NFS,SMBFS,N
thomasn 5286 1 0 09:13 ? 00:00:03 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 5297 5286 0 09:13 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 5298 5297 0 09:13 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 5299 5297 0 09:13 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 5300 5297 0 09:13 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav

Note that pesky IBMJava process starting up one minute later. Then at 9:13, we start doing some work and find problems...

Can anyone suggest a better way of running the shell script? Do I use a crontab entry that invokes the script every minute, and have the script create a lock file that causes the next invocation of the script to exit quietly? When the script finishes, it can clean up the lock file. Yuchh.
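
The lock-file idea above could look something like the following sketch. It uses mkdir rather than a plain file because creating a directory either succeeds or fails in one step, so two overlapping runs cannot both get the lock. The names with_lock and /tmp/eestart.lock, and the path to eestart, are assumptions for illustration only.

```shell
#!/bin/sh
# Run a command under a simple lock so that overlapping invocations
# (for example from a once-a-minute crontab entry) exit quietly.
# with_lock is a hypothetical helper name.
with_lock() {
    lockdir=$1
    shift
    # mkdir fails if the directory already exists, which makes it a usable lock.
    if ! mkdir "$lockdir" 2>/dev/null; then
        return 0    # another run holds the lock: do nothing
    fi
    "$@"
    status=$?
    rmdir "$lockdir"    # release the lock once the command has finished
    return $status
}

# Example use (lock path and script path are placeholders):
#   with_lock /tmp/eestart.lock /path/to/eestart
```

One caveat: if the script is killed before rmdir runs, the stale lock has to be removed by hand.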

Now that I've started... any suggestions as to how I can get a 'monitor' process to check that the port is accepting connections, and kill and restart the server if not?
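
A rough sketch of such a monitor, not a tested recipe for this box: it assumes a netcat that supports -z (connect without sending data) and -w (timeout), and the port number 8080 and the restart script path in the example are placeholders. The helper names port_open and check_and_restart are made up.

```shell
#!/bin/sh
# Succeeds if something accepts a TCP connection on host $1, port $2.
# Assumes a netcat with -z and -w support.
port_open() {
    nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}

# Run the probe command given in $1; if it fails, run the restart
# command given in $2. Both are strings evaluated by the shell.
check_and_restart() {
    if eval "$1"; then
        echo "ok"
    else
        echo "restarting"
        eval "$2"
    fi
}

# Example use (port and restart script are placeholders):
#   check_and_restart 'port_open localhost 8080' '/path/to/restart-tomcat'
```

Run from cron every few minutes, the restart command would be the place to kill the old JVMs before starting a new one. It cannot help with a process that ignores kill -9, though.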

Thanks to you all,
T.

r3b00t 08-30-2001 06:23 PM

You can use the core files with gdb to (hopefully) backtrace where the problem occurred. (gdb <app> core, enter bt on prompt)

Question, do you get Signal 11's while compiling heavy (or maybe even light) software?

isajera 08-30-2001 06:29 PM

ok... no definite solution yet, but we're getting somewhere.

in addition to allowing you to log out and keep a process running, executing a program or script with nohup tells the process to ignore SIGHUP - this is the hangup signal that gets sent to your processes when the terminal goes away, and by default it kills them.

i think you can clean up by sending a "kill -s SIGHUP"

is there any way you can run the program without using the nohup?
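
The effect of ignoring SIGHUP can be seen with a small experiment. This is only a sketch: it uses the shell's own trap '' HUP to stand in for what nohup arranges, and the function name survives_hup is made up.

```shell
#!/bin/sh
# Start a background shell that ignores SIGHUP, send it a SIGHUP,
# and report whether it is still there afterwards.
# survives_hup is a hypothetical helper name.
survives_hup() {
    ( trap '' HUP; sleep 5 ) >/dev/null 2>&1 &
    pid=$!
    sleep 1                 # give the background shell time to set its trap
    kill -HUP "$pid"
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        echo "still running"
        kill -TERM "$pid"   # tidy up; SIGTERM is not ignored
    else
        echo "gone"
    fi
}

# Typical use:
#   survives_hup
```

A process that shrugs off SIGHUP like this should still die on SIGTERM or SIGKILL, so nohup by itself would not explain a process that survives kill -9.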

thomas.nichols 08-30-2001 06:50 PM

> if he can be sure he's running the latest stable software...

What do you need to assess this?
$ uname -a
Linux ns.delta123.net 2.2.19-7.0.1 #1 Tue Apr 10 00:55:03 EDT 2001 i686 unknown
$ rpm -q -a | grep glibc
glibc-common-2.2-12
glibc-2.2-12
glibc-devel-2.2-12

... but rpm then hangs!! ...
A separate terminal shows ps -ef:
thomasn 5889 5865 0 00:10 pts/15 00:00:01 /usr/lib/rpm/rpmq -q --all
root 5893 599 0 00:20 ? 00:00:00 CROND
root 5894 5893 0 00:20 ? 00:00:00 /bin/sh -c /sbin/rmmod -as

and I can run rpm -q -a without problems.
A complete rpm -q -a listing, FWIW, is at http://users.4mymail.co.uk/nexus10/12xu/rpm-q-a.txt

Nyaargh... I'm seriously baffled here, what do I try next? It's a remote rack-mount machine (I've never seen it) -- what filesystem checks should I be running?
Oh joy.
Thanks all,
Thomas.

thomas.nichols 08-30-2001 07:10 PM

>> i think you can clean up by sending a "kill -s SIGHUP" <<

'fraid not. Tried this as thomasn (the process owner) and as root. Also with -s SIGTERM and SIGKILL:
kill -s SIGKILL 2248
It Still Lives.

>> is there any way you can run the program without using the nohup? <<

If you can suggest one I'd be most grateful. It's a remote machine, access is via SSH. The only physical control I have is to ask for it to be rebooted. I need to have the Tomcat app server (a glorified Java web server with extras) running, as user 'thomasn'. How can I do this?

>> You can use the core files with gdb to (hopefully) backtrace where the problem occurred. (gdb <app> core, enter bt on prompt) <<

Thanks for this. Are you suggesting I do this with updatedb?

>> Question, do you get Signal 11's during compiling heavy (or maybe even light) software? <<

Not compiled much on here, mostly used rpms. Would this information help track the problem down? I'll try compiling a package if you suggest one.


Thanks again,
Thomas.

r3b00t 08-30-2001 07:13 PM

Well, a decent compile without sig11's would make it about 99% certain that this isn't a hardware problem.
The kernel is a good place to meet the sig11's. So maybe you could compile a new kernel?

Also have a look at http://www.bitwizard.nl/sig11/

[edit]

About updatedb: You must get a core file from the updatedb process. If it's running as root, it should dump a core. If it's running as another user, make sure that user has the right to dump a core. To enable coredumps for a user, do ulimit -c unlimited.

This is the only way to see why updatedb is crashing.

About the server: Do you have root access on the box? And are coredumps enabled under your account?

[/edit]
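
A small sketch for checking the core dump limit before waiting for the next crash. Two things worth knowing: in bash, plain ulimit with no option reports the file size limit (-f), not the core limit, and the dump is typically written as a file named core (sometimes core.<pid>) in the crashing process's current directory. The helper name core_dumps_enabled is made up for illustration.

```shell
#!/bin/sh
# Succeeds if the current shell's soft core-file size limit is non-zero,
# i.e. processes started from this shell may write a core file.
# core_dumps_enabled is a hypothetical helper name.
core_dumps_enabled() {
    [ "$(ulimit -c)" != "0" ]
}

# Typical use, before starting the program that crashes:
#   core_dumps_enabled || ulimit -c unlimited
```

The limit is per shell and inherited by child processes, so for a cron job it would have to be set inside the cron script itself.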

thomas.nichols 08-30-2001 08:50 PM

r3b00t,
Thanks for the swift response.
>> About updatedb: You must get a core file from the updatedb process. If it's running as root, it should dump a core. If it's running as another user, make sure that user has the right to dump a core. To enable coredumps for a user, do ulimit -c unlimited.
This is the only way to see why updatedb is crashing. <<

Ok - /usr/bin/updatedb is a symlink to /usr/bin/slocate -- there's no /usr/bin/*.core.
[/]$find . -name "*.core" | tee /tmp/core.find &
produces a zero-byte file.

The cron job is in /etc/cron.daily:
$ cat slocate.cron
#!/bin/sh
/usr/bin/updatedb -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net"

This script executes just fine when run as root.

>> About the server: Do you have root access on the box? <<

Thankfully, yes.

>> And are coredumps enabled under your account? <<
Not sure:
[thomasn@ns thomasn]$ ulimit
unlimited
[thomasn@ns thomasn]$ ulimit -c
1000000
[thomasn@ns thomasn]$

Re SIG11 - that is one very scary document. I've never tried rebuilding a kernel, do I just get a RedHat7.0 src tarball from rpmfind.net and ./configure;make on it?

Re system oddities - I installed compat-libstdc++-6.2-2.9.0.9.i386.rpm in a vain attempt to get the Sun Java VM running (it seg-faulted also, in fact. Hmm.)


FWIW, the output of 'top' is at
http://users.4mymail.co.uk/nexus10/12xu/top.txt

====
This is beginning to point to H/W, methinks.
Thanks,
Thomas.

isajera 09-01-2001 11:54 AM

i hate to ask this after two days, but, just making sure:

does the eestart run at all? i mean, does it run when you run it while logged in, but not when you try to run it with nohup?

also, i had the "kill -s SIGHUP" suggested to me as a way to clean up zombies, but it needs to be run on the parent process - process 1 in other words. "kill -s SIGHUP 1" - i've been trying to create a few zombies on my box to test this, but haven't been able to get anything that can't be kill-ed yet. however, i have tried the "kill -s SIGHUP 1" command, and it doesn't crash the init process... so it should be safe to try.

it does sound like it might be something else tho... i'd try compiling the kernel, even if just to see if it will compile.

thomas.nichols 09-01-2001 03:29 PM

Thanks for the thought isajera:
>> i hate to ask this after two days, but, just making sure:
does the eestart run at all? i mean, does it run when you run it while logged in, but not when you try to run it with nohup? <<

Runs fine without nohup whilst I'm connected -- but dies as soon as I disconnect (as expected). With nohup it also runs fine (apparently), but after a day I get this "zombified" process.

Next time I get some zombified processes I'll try a kill -s SIGHUP on the parent -- thanks for this suggestion.


I've found a "how do I recompile the kernel" FAQ at linuxdoc.org, will try that on Monday.

The company we're renting the server from (dedicated-servers.co.uk) seem very open to swapping out the hard disk into another box, so if it's a RAM error that might well resolve it - assuming the kernel or other system components haven't been corrupted by it. What's the best way to test whether we have a dodgy HDD? This would presumably mean a full OS reload, starting from scratch. And I'd thought we might get some work done...
Thanks again, Regards
Thomas.

thomas.nichols 09-04-2001 05:27 PM

I think the problem has been found - RAM. It's a 128 MB (Celeron) box, running RH7.0 - and at minimal load top is reporting ~1 MB free! (The top listing I posted earlier, at http://users.4mymail.co.uk/nexus10/12xu/top.txt , has 1,288 KB free.) This is in its 'idle' state; the servlet actually invokes a new Java VM (i.e. invokes 'java' again).
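
A low 'free' figure in top is worth reading with some care: Linux uses otherwise idle RAM for buffers and page cache and hands it back when programs need it, so 'free' alone tends to understate what is available. A rough sketch that adds the three figures together, assuming /proc/meminfo exposes MemFree, Buffers and Cached lines in kB (the helper name avail_estimate_kb is made up):

```shell
#!/bin/sh
# Sum the MemFree, Buffers and Cached values (in kB) from
# /proc/meminfo-style input on stdin.
# avail_estimate_kb is a hypothetical helper name.
avail_estimate_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { print sum + 0 }'
}

# Typical use on Linux:
#   avail_estimate_kb < /proc/meminfo
```

If this total is also tiny and swap is heavily used, that would support the low-memory explanation; if it is large, the 1 MB 'free' figure is less alarming than it looks.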

From the tech support people:
"Once you get a zombie kswapd (due to such a heavy load, it seems) the kernel
can't access the swap space, and that's causing segfaults. "

Does this sound reasonable? I had no idea low memory could cause seg faults; is this accurate?

Thanks again for your help in tracking this down,
Regards,
Thomas.


All times are GMT -5.