LinuxQuestions.org
Old 08-30-2001, 06:01 AM   #1
thomas.nichols
LQ Newbie
 
Registered: Jul 2001
Location: Worcester, UK
Distribution: Mandrake
Posts: 14

Rep: Reputation: 0
Cannot kill process - please help!


Good Day,

I have a RedHat 7.0 box running Tomcat Java app server, which has become unstable a few times for reasons I cannot fathom. The current ps -ef listing shows a number of Java JVMs (IBM JDK 1.3, FWIW) - and some of them will not die, even with kill -9 as root. How else can I get rid of the little blighters?

"netstat -a | grep LISTEN" shows that the ports used by Tomcat are still being listened to -- presumably by these rogue processes.

Over the past week I've rebooted twice -- once because "rm -rf" on a directory froze (again, kill -9 did nothing) and again because HTTP and SSH connections abruptly disappeared (though PING remained OK). What diagnostics could I run here?

Thanks a lot for any suggestions,
Regards,
Thomas.
 
Old 08-30-2001, 04:24 PM   #2
isajera
Senior Member
 
Registered: Jun 2001
Posts: 1,635

Rep: Reputation: 45
night of the living dead...

sounds like you've got a few zombie processes. a zombie is a child process that has already exited, but whose parent hasn't called wait() or waitpid() to collect its exit status, so its entry hangs around in the process table until the parent reaps it or dies. you can't get rid of it with kill... although i've noticed that the kde process manager can somehow get rid of zombies. don't ask me how. i haven't the faintest idea. the only way i know to get rid of them is to kill the parent process, but many times you don't want to do that.
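For anyone following along: one way to tell a true zombie from a process that is stuck inside the kernel is the STAT column of ps. This is only a minimal sketch, assuming a procps-style ps that accepts -eo; the function names are made up for illustration.

```shell
#!/bin/sh
# Translate a ps STAT value into a short description.
# Z = zombie (already dead, waiting for its parent to call wait()).
# D = uninterruptible sleep (usually blocked on I/O inside the kernel).
# Neither state reacts to kill -9: a zombie has nothing left to kill, and a
# D-state process only handles the signal once its kernel call returns.
describe_state() {
    case "$1" in
        Z*) echo "zombie" ;;
        D*) echo "uninterruptible sleep" ;;
        *)  echo "other" ;;
    esac
}

# Hypothetical helper: list every process in state Z or D with its parent PID.
# Assumes a ps that supports -eo (procps on Linux does).
list_stuck() {
    ps -eo pid,ppid,stat,comm | awk 'NR > 1 && $3 ~ /^[ZD]/ { print }'
}

# Usage:
#   list_stuck
#   describe_state $(ps -o stat= -p 2248)
```

If the stuck processes show D rather than Z, killing the parent will not help either, and the thing to look at is whatever they are blocked on (disk, NFS, swap).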
 
Old 08-30-2001, 05:12 PM   #3
r3b00t
Member
 
Registered: May 2001
Distribution: OpenBSD 3.0-beta
Posts: 50

Rep: Reputation: 15
Gentlemen, we should think solutions :)

Maybe it's an idea to find out _what_ is causing the problem, so you can fix the problem at its source.

What software is running on the box, which kernel, etc etc
 
Old 08-30-2001, 05:29 PM   #4
isajera
Senior Member
 
Registered: Jun 2001
Posts: 1,635

Rep: Reputation: 45
the process causing the problem is going to be the parent process... not the child. when you have processes that can't be killed - make a note of which process is the parent.

it's pretty strange that "rm -rf" would hang tho...
 
Old 08-30-2001, 05:39 PM   #5
r3b00t
Member
 
Registered: May 2001
Distribution: OpenBSD 3.0-beta
Posts: 50

Rep: Reputation: 15
Well, if he can be sure he's running the latest stable software, and even rm hangs, it might really be possible that there's a hardware issue...
 
Old 08-30-2001, 06:06 PM   #6
thomas.nichols
LQ Newbie
 
Registered: Jul 2001
Location: Worcester, UK
Distribution: Mandrake
Posts: 14

Original Poster
Rep: Reputation: 0

"Pretty strange" is my assessment also - here's the top of the ps listing:
$ ps -ef | grep IBMJ
thomasn 2248 1 0 Aug28 ? 00:01:15 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 2320 1 0 Aug28 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 2344 1 0 Aug28 ? 00:00:01 /opt/IBMJava2-13/jre/bin/exe/jav

... which I read as meaning that these three are owned by '1' -- which is shown by a full ps to be:

UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Aug28 ? 00:00:07 init [3]

This looks Bad to me.

The java process was kicked off by a bash script (eestart), using nohup:
$ nohup eestart
which sends all the stdout stuff to nohup.out, and allows me to disconnect and leave the Java process (Tomcat) still running. So I'm guessing the problem is that something weird is happening to the nohup process.

Suspect #1: updatedb (!?) I got root mail a couple of days ago telling me that a cron job had failed, seems updatedb threw a segfault. Looking at the ps -ef again, I see, starting at 4:01 am this morning:

root 4956 599 0 04:01 ? 00:00:00 CROND
root 4957 4956 0 04:01 ? 00:00:00 bash /usr/bin/run-parts /etc/cro
root 5056 4957 0 04:02 ? 00:00:00 sh /etc/cron.daily/slocate.cron
root 5057 4957 0 04:02 ? 00:00:00 awk -v progname=/etc/cron.daily/
root 5058 5056 0 04:02 ? 00:00:02 /usr/bin/updatedb -f NFS,SMBFS,N
thomasn 5286 1 0 09:13 ? 00:00:03 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 5297 5286 0 09:13 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 5298 5297 0 09:13 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 5299 5297 0 09:13 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav
thomasn 5300 5297 0 09:13 ? 00:00:00 /opt/IBMJava2-13/jre/bin/exe/jav

Note that pesky IBMJava process starting up one minute later. Then at 9:13, we start doing some work and find problems...

Can anyone suggest a better way of running the shell script? Do I use a crontab entry that invokes the script every minute, and have the script create a lock file that causes the next invocation of the script to exit quietly? When the script finishes, it can clean up the lock file. Yuchh.

Now that I've started... any suggestions as to how I can get a 'monitor' process to check that the port is accepting connections and kill and restart the server if not?

Thanks to you all,
T.
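
The lock-file and monitor ideas above could be sketched roughly as follows. This is an illustration under assumptions, not a tested watchdog: the function names, the lock path and the port are placeholders (the thread never says which port Tomcat uses), and the check only looks at the kind of netstat output quoted in the first post.

```shell
#!/bin/sh
# Succeeds if the netstat output on stdin shows a TCP socket listening on
# port $1. Feed it "netstat -an" so that ports stay numeric.
is_listening() {
    awk -v p=":$1" '
        $6 == "LISTEN" && substr($4, length($4) - length(p) + 1) == p { found = 1 }
        END { exit !found }
    '
}

# mkdir either creates the directory or fails, in one atomic step, so it can
# serve as a lock without a check-then-create race.
acquire_lock() { mkdir "$1" 2>/dev/null; }
release_lock() { rmdir "$1" 2>/dev/null; }

# Hypothetical cron entry point: $1 = port, $2 = lock directory.
check_once() {
    port=$1
    lock=$2
    acquire_lock "$lock" || return 0    # previous run still active: exit quietly
    if ! netstat -an | is_listening "$port"; then
        echo "port $port is not listening; restart the server here"
    fi
    release_lock "$lock"
}

# Usage (from a crontab entry that runs every minute):
#   check_once 8080 /tmp/eestart.lock
```

Note that a listening socket only shows that something holds the port. As the first post shows, a hung JVM can keep listening, so a real monitor would also need to make a test request.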
 
Old 08-30-2001, 06:23 PM   #7
r3b00t
Member
 
Registered: May 2001
Distribution: OpenBSD 3.0-beta
Posts: 50

Rep: Reputation: 15
You can use the core files with gdb to (hopefully) backtrace where the problem occurred. (gdb <app> core, enter bt on prompt)

Question, do you get Signal 11's during compiling heavy (or maybe even light) software?
 
Old 08-30-2001, 06:29 PM   #8
isajera
Senior Member
 
Registered: Jun 2001
Posts: 1,635

Rep: Reputation: 45
ok... no definite solution yet, but we're getting somewhere.

executing a program or script with nohup lets you log out and keep the process running because it tells the process to ignore SIGHUP - the hangup signal that is sent when the controlling terminal goes away, and which kills a process by default.

i think you can clean up by sending a "kill -s SIGHUP"

is there any way you can run the program without using nohup?
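
To see what nohup actually changes, a small experiment along these lines may help. It is a sketch with a made-up function name: it starts a short sleep with and without nohup, sends it SIGHUP, and prints the exit status the shell collects.

```shell
#!/bin/sh
# Start "sleep 2" in the background, send it SIGHUP after one second, and
# print the exit status reported by wait.
# $1 = "nohup" to start it under nohup, anything else to start it plainly.
hup_exit_status() {
    if [ "$1" = "nohup" ]; then
        nohup sleep 2 >/dev/null 2>&1 &
    else
        sleep 2 >/dev/null 2>&1 &
    fi
    pid=$!
    sleep 1                     # give the child time to start
    kill -HUP "$pid"
    status=0
    wait "$pid" || status=$?
    echo "$status"
}

# Usage:
#   hup_exit_status nohup    # sleep ignores the signal and exits normally
#   hup_exit_status plain    # sleep is killed, so the status is non-zero
```

nohup only affects SIGHUP. It does nothing to SIGTERM or SIGKILL, so a process that survives kill -9 is not being protected by nohup.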
 
Old 08-30-2001, 06:50 PM   #9
thomas.nichols
LQ Newbie
 
Registered: Jul 2001
Location: Worcester, UK
Distribution: Mandrake
Posts: 14

Original Poster
Rep: Reputation: 0

> if he can be sure he's running the latest stable software...

What do you need to assess this?
$ uname -a
Linux ns.delta123.net 2.2.19-7.0.1 #1 Tue Apr 10 00:55:03 EDT 2001 i686 unknown
$ rpm -q -a | grep glibc
glibc-common-2.2-12
glibc-2.2-12
glibc-devel-2.2-12

... but rpm then hangs!! ...
A separate terminal shows ps -ef:
thomasn 5889 5865 0 00:10 pts/15 00:00:01 /usr/lib/rpm/rpmq -q --all
root 5893 599 0 00:20 ? 00:00:00 CROND
root 5894 5893 0 00:20 ? 00:00:00 /bin/sh -c /sbin/rmmod -as

and I can run rpm -q -a without problems.
A complete rpm -q -a listing, FWIW, is at http://users.4mymail.co.uk/nexus10/12xu/rpm-q-a.txt

Nyaargh... I'm seriously baffled here, what do I try next? It's a remote rack-mount machine (I've never seen it) -- what filesystem checks should I be running?
Oh joy.
Thanks all,
Thomas.
 
Old 08-30-2001, 07:10 PM   #10
thomas.nichols
LQ Newbie
 
Registered: Jul 2001
Location: Worcester, UK
Distribution: Mandrake
Posts: 14

Original Poster
Rep: Reputation: 0
>> i think you can clean up by sending a "kill -s SIGHUP" <<

'fraid not. Tried this as thomasn (the process owner) and as root. Also with -s SIGTERM and SIGKILL:
kill -s SIGKILL 2248
It Still Lives.

>> is there any way you can run the program without using nohup? <<

If you can suggest one I'd be most grateful. It's a remote machine, access is via SSH. The only physical control I have is to ask for it to be rebooted. I need to have the Tomcat app server (a glorified Java web server with extras) running, as user 'thomasn'. How can I do this?

>> You can use the core files with gdb to (hopefully) backtrace where the problem occurred. (gdb <app> core, enter bt on prompt) <<

Thanks for this. Are you suggesting I do this with updatedb?

>> Question, do you get Signal 11's during compiling heavy (or maybe even light) software? <<

Not compiled much on here, mostly used rpms. Would this information help track the problem down? I'll try compiling a package if you suggest one.


Thanks again,
Thomas.
 
Old 08-30-2001, 07:13 PM   #11
r3b00t
Member
 
Registered: May 2001
Distribution: OpenBSD 3.0-beta
Posts: 50

Rep: Reputation: 15
Well, a decent compile without sig11's would rule out about 99% of this being a hardware problem.
The kernel is a good place to meet the sig11's. So maybe you could compile a new kernel?

Also have a look at http://www.bitwizard.nl/sig11/

[edit]

About updatedb: You must get a core file from the updatedb process. If it's running as root, it should dump a core. If it's running as another user, make sure that user has the right to dump a core. To enable coredumps for a user, do ulimit -c unlimited.

This is the only way to see why updatedb is crashing.

About the server: Do you have root access on the box? And are coredumps enabled under your account?

[/edit]
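
A small helper for reading the core-dump limit, as a sketch only (the function name is made up). The limit is inherited by child processes, so it has to be set in the shell or script that starts the program that crashes.

```shell
#!/bin/sh
# Interpret the value printed by "ulimit -c" (the soft core-size limit).
describe_core_limit() {
    case "$1" in
        0)         echo "disabled" ;;
        unlimited) echo "unlimited" ;;
        *)         echo "capped at $1" ;;   # units depend on the shell
    esac
}

# Usage (run as the user whose process crashes):
#   describe_core_limit "$(ulimit -c)"
#   ulimit -c unlimited    # raise the limit for this shell and its children
```

For a cron job such as updatedb, the ulimit line would have to go into the cron script itself, since a limit set in a login shell does not reach processes started by crond.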

Last edited by r3b00t; 08-30-2001 at 07:22 PM.
 
Old 08-30-2001, 08:50 PM   #12
thomas.nichols
LQ Newbie
 
Registered: Jul 2001
Location: Worcester, UK
Distribution: Mandrake
Posts: 14

Original Poster
Rep: Reputation: 0
r3b00t,
Thanks for the swift response.
>> About updatedb: You must get a core file from the updatedb process. If it's running as root, it should dump a core. If it's running as another user, make sure that user has the right to dump a core. To enable coredumps for a user, do ulimit -c unlimited.
This is the only way to see why updatedb is crashing. <<

Ok - /usr/bin/updatedb is a symlink to /usr/bin/slocate -- there's no /usr/bin/*.core.
[/]$ find . -name "*.core" | tee /tmp/core.find &
produces a zero-byte file.
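
A note on the search pattern: as far as I know, Linux by default names the dump plain "core" (some setups append the PID) and writes it to the crashing process's working directory, so "*.core" may never match even if a dump exists. A hedged sketch of a broader search, with a made-up function name:

```shell
#!/bin/sh
# List regular files named "core" or "core.<something>" under directory $1.
# -xdev keeps find on a single filesystem, so /proc and network mounts are
# not descended into.
find_cores() {
    find "$1" -xdev -type f \( -name core -o -name 'core.*' \) 2>/dev/null
}

# Usage:
#   find_cores /
#   gdb /usr/bin/slocate /path/to/core    # then type "bt" at the gdb prompt
```

Because of -xdev, the search has to be repeated for each mounted filesystem that might hold a dump (for example /var or /home if they are separate partitions).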

The cron job is in /etc/cron.daily:
$ cat slocate.cron
#!/bin/sh
/usr/bin/updatedb -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net"

This script executes just fine when run as root.

>> About the server: Do you have root access on the box? <<

Thankfully, yes.

>> And are coredumps enabled under your account? <<
Not sure:
[thomasn@ns thomasn]$ ulimit
unlimited
[thomasn@ns thomasn]$ ulimit -c
1000000
[thomasn@ns thomasn]$

Re SIG11 - that is one very scary document. I've never tried rebuilding a kernel, do I just get a RedHat7.0 src tarball from rpmfind.net and ./configure;make on it?

Re system oddities - I installed compat-libstdc++-6.2-2.9.0.9.i386.rpm in a vain attempt to get the Sun Java VM running (it seg-faulted also, in fact. Hmm.)


FWIW, the output of 'top' is at
http://users.4mymail.co.uk/nexus10/12xu/top.txt

====
This is beginning to point to H/W, methinks.
Thanks,
Thomas.
 
Old 09-01-2001, 11:54 AM   #13
isajera
Senior Member
 
Registered: Jun 2001
Posts: 1,635

Rep: Reputation: 45
i hate to ask this after two days, but, just making sure:

does the eestart run at all? i mean, does it run when you run it while logged in, but not when you try to run it with nohup?

also, i had the "kill -s SIGHUP" suggested to me as a way to clean up zombies, but it needs to be run on the parent process - process 1 in other words. "kill -s SIGHUP 1" - i've been trying to create a few zombies on my box to test this, but haven't been able to get anything that can't be kill-ed yet. however, i have tried the "kill -s SIGHUP 1" command, and it doesn't crash the init process... so it should be safe to try.

it does sound like it might be something else tho... i'd try compiling the kernel, even if just to see if it will compile.
 
Old 09-01-2001, 03:29 PM   #14
thomas.nichols
LQ Newbie
 
Registered: Jul 2001
Location: Worcester, UK
Distribution: Mandrake
Posts: 14

Original Poster
Rep: Reputation: 0
Thanks for the thought isajera:
>> i hate to ask this after two days, but, just making sure:
does the eestart run at all? i mean, does it run when you run it while logged in, but not when you try to run it with nohup? <<

Runs fine without nohup whilst I'm connected -- but dies as soon as I disconnect (as expected). With nohup it also runs fine (apparently), but after a day I get this "zombified" process.

Next time I get some zombified processes I'll try a kill -s SIGHUP on the parent -- thanks for this suggestion.


I've found a "how do I recompile the kernel" FAQ at linuxdoc.org, will try that on Monday.

The company we're renting the server from (dedicated-servers.co.uk) seem very open to swapping out the hard disk into another box, so if it's a RAM error that might well resolve it - assuming the kernel or other system components haven't been corrupted by it. What's the best way to test whether we have a dodgy HDD? This would presumably mean a full OS reload, starting from scratch. And I'd thought we might get some work done...
Thanks again, Regards
Thomas.
 
Old 09-04-2001, 05:27 PM   #15
thomas.nichols
LQ Newbie
 
Registered: Jul 2001
Location: Worcester, UK
Distribution: Mandrake
Posts: 14

Original Poster
Rep: Reputation: 0
I think the problem has been found - RAM. It's a 128MB (Celeron) box, running RH7.0 - and at minimal load top is reporting ~1MB free! (the top listing I posted earlier, at http://users.4mymail.co.uk/nexus10/12xu/top.txt , has 1,288KB free). This is in its 'idle' state; under load, the servlet actually invokes a new Java VM (i.e. invokes 'java' again).

From the tech support people:
"Once you get a zombie kswapd (due to such a heavy load, it seems) the kernel can't access the swap space, and that's causing segfaults."

Does this sound reasonable? I had no idea low memory could cause seg faults, is this accurate?

Thanks again for your help in tracking this down,
Regards,
Thomas.
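
One way to check the memory picture beyond the "free" figure in top is to read /proc/meminfo directly. A low free figure on its own can be normal, since Linux uses idle RAM for buffers and cache; the swap numbers usually say more. This is a rough sketch with made-up function names, assuming the "Field: value kB" layout of /proc/meminfo.

```shell
#!/bin/sh
# Print the value of one /proc/meminfo field (e.g. MemFree) from the text on
# stdin; prints nothing if the field is absent.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Free memory plus buffers and page cache, in kB: a rough figure for what the
# kernel could hand back to applications. Reads the file named in $1.
reclaimable_kb() {
    free=$(meminfo_kb MemFree < "$1")
    buffers=$(meminfo_kb Buffers < "$1")
    cached=$(meminfo_kb Cached < "$1")
    echo $(( ${free:-0} + ${buffers:-0} + ${cached:-0} ))
}

# Usage:
#   reclaimable_kb /proc/meminfo
#   meminfo_kb SwapFree < /proc/meminfo
```

If both the reclaimable figure and SwapFree are close to zero while Tomcat is idle, the box really is short of memory for this workload.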
 
  

