Debugging hanging child process with strace

Dogza · 06-06-2011, 01:56 AM

Hi All,

I am using a script that spawns child processes, but once in a while a child process won't go away, which causes the script to stop working.
I have done a strace on a child process that stops working, with the following results:

Quote:

rt_sigaction(SIGALRM, {SIG_IGN}, {SIG_DFL}, 8) = 0
alarm(0) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
accept(3,

A good working child process gives the following results:

Quote:

rt_sigaction(SIGALRM, {SIG_IGN}, {SIG_DFL}, 8) = 0
alarm(0) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
accept(3, {sa_family=AF_INET, sin_port=htons(34607), sin_addr=inet_addr("127.0.0.1")}, [17179869200]) = 4
rt_sigaction(SIGALRM, {SIG_IGN}, {SIG_IGN}, 8) = 0
alarm(0) = 0
close(3) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(2000), sin_addr=inet_addr("127.0.0.1")}, [17179869200]) = 0
select(5, [0 4], NULL, NULL, NULL) = 2 (in [0 4])
read(4, ""..., 8192) = 0
read(0, ""..., 8192) = 0
close(0) = 0
close(4) = 0
exit_group(0) = ?

I noticed the 'accept(3,' but I don't really know what it means or how to read it. I hope that someone can help me find the cause of this problem.

Thanks in advance!

paulsm4 · 06-06-2011, 06:56 PM

Hi -

It looks like this is a server program that listens on some TCP/IP port and accepts incoming client connections.

It also looks like the "accept()" code uses "alarm()" to trigger some kind of timeout: "Do something if a new connection doesn't arrive within N seconds".

By default, "accept()" will block: it will hang forever if nobody ever connects.
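A minimal sketch of bounding that wait with a SIGALRM timer, written in Python only for illustration (your program is not necessarily built this way, and the names accept_with_alarm and AcceptTimeout are made up for this example):

```python
import signal
import socket


class AcceptTimeout(Exception):
    """Raised by the SIGALRM handler to break out of a blocked accept()."""


def _on_alarm(signum, frame):
    raise AcceptTimeout()


def accept_with_alarm(listener, seconds):
    """Return (conn, addr), or None if nobody connects within `seconds`."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)            # arm the timer
    try:
        return listener.accept()     # blocks until a client connects or SIGALRM fires
    except AcceptTimeout:
        return None
    finally:
        signal.alarm(0)              # clear any alarm that is still pending
        signal.signal(signal.SIGALRM, previous)


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    print(accept_with_alarm(listener, 1))  # nobody connects in this demo
    listener.close()
```

This only works on Unix and from the main thread, since that is where Python delivers signals. Without the timer armed, the same accept() call simply waits until somebody connects.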

That's why they implemented the "alarm()" - to break out if a new connection DOESN'T arrive promptly.

"alarm(0)", as opposed to "alarm(n)", CLEARS any alarm that might be set. So both of your straces are CLEARING an alarm that was presumably set somewhere else. With the alarm cleared (and SIGALRM set to SIG_IGN just before), nothing is left to interrupt the "accept()" that follows.

The "good" case (your second example) is where a connection DID arrive.

The "bad" case (your first example) indicates that your child process is still waiting for an incoming connection, which will presumably never arrive. The trace stops at "accept(3," because the call has not returned yet.
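If it helps to see what the "alarm(0) = 0" lines in both traces are saying, here is a small sketch of alarm()'s return values, using Python's signal module as a stand-in for the C call (the function name is just for this example). alarm() returns the number of seconds left on any previously scheduled alarm, or 0 if there was none:

```python
import signal


def alarm_return_values():
    """Return what alarm() reports when arming and then clearing a timer."""
    armed = signal.alarm(30)   # arm a 30 s timer; returns 0 if none was pending
    cleared = signal.alarm(0)  # clear it; returns the seconds that were left
    again = signal.alarm(0)    # nothing pending now, so this returns 0
    return armed, cleared, again


if __name__ == "__main__":
    print(alarm_return_values())
```

So a result of 0 from "alarm(0)", as in your traces, means no alarm was pending at that point.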

So your mission is to figure out who is supposed to connect to the child, and why that somebody is failing to do so.
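As a starting point, a rough sketch of a probe that checks whether anything can reach the child's listening socket. The address 127.0.0.1:2000 is taken from the getsockname() line in your good trace, so I am assuming the stuck child listens there too; can_connect is a made-up helper name:

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if __name__ == "__main__":
    # Assumed address, read off the getsockname() line in the good trace.
    print(can_connect("127.0.0.1", 2000))
```

Be aware that the probe itself is a client: judging by your good trace, the child will accept it, read end-of-file and exit, so a successful probe also uses up the one connection that child was waiting for.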

'Hope that helps .. PSM

Dogza · 06-15-2011, 02:55 AM

Hi Paulsm4,

That helped me a lot. I noticed that Netcat is making the connections, and it is using TCP/IP. So I am going to try setting a timeout value to see if that fixes the problem.

Thank you very much!