waiting vs. axing child processes?

bean street · 12-19-2011, 05:08 PM

Hi everybody;

We are running a family of remote data collection devices (2.6.21 on ARM) that store data onto a USB memory stick. Because writes to the stick take so long, we have opted to have a child process handle the file copy so that the parent is unfettered in its quest to continue collecting geomagnetic data. This leaves us with an odd tradeoff decision (which I am sure the gurus here can answer).

We are using the traditional fork/execl and waitpid from the C API. New files are written to a ramdisk and need to be copied to memory stick every 12 minutes. File copy to memory stick takes ~2 minutes.

Choice 1 : child finishes file copy and exits, entering the defunct state. The parent then waits on the defunct child just before it forks another child. This means that the child is sitting in the defunct state for about 10 minutes before it is waited on.

Choice 2 : child finishes file copy and sleeps. The parent then axes the sleeping child just before it forks another child.

Other choices?

Anyone know which approach might give us the most stable system?

Thanks a lot!!

Nominal Animal · 12-19-2011, 09:23 PM

The first choice -- letting the child stay in defunct state -- is perfectly okay. The kernel releases the resources used by the child, only keeping the process ID and exit status, so there are really no downsides to this.

You can call waitpid(childpid,&status,WNOHANG) every now and then to see if the child has exited yet. It will return childpid if the child has exited, 0 if it is still running, and -1 (and errno set) if an error has occurred. In your case, I don't think it is necessary -- as I said, it is perfectly okay to let the child stay defunct until just before the next fork.
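As an illustration of the three return values described above, here is a minimal sketch. The helper name child_finished() is made up for this example and is not part of any API.

```c
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper: non-blocking check on one child.
 * Returns 1 if the child has exited (it is reaped and *status is filled in),
 * 0 if it is still running, and -1 on error (errno is set by waitpid). */
int child_finished(pid_t childpid, int *status)
{
    pid_t r = waitpid(childpid, status, WNOHANG);

    if (r == childpid)
        return 1;   /* exited; the defunct entry is gone now */
    if (r == 0)
        return 0;   /* still running */
    return -1;      /* error, for example ECHILD if already reaped */
}
```

Once the helper has returned 1, calling it again for the same PID yields -1, because the child has already been reaped.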

jailbait · 12-19-2011, 10:09 PM

You are assuming that the normal timing of a 2 minute write every 12 minutes will always hold true. Suppose you have some kind of error condition where the write has not completed in 12 minutes. In that case your solution 1 will wait for the write to finish. If the write actually finishes, but late, then solution 1 will preserve your file integrity. If the write never finishes then solution 1 will hang your parent process.

Under the same error condition solution 2 will lose the data still waiting to be written and possibly might corrupt the file. But the parent process will not hang.

I suggest that you use solution 1 but test the wait condition instead of issuing an unconditional wait. If the test shows that the child process is not finished then treat it as an error condition.
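A rough sketch of solution 1 with a tested wait, run once per copy cycle, might look like the following. The function name start_copy(), the use of /bin/cp, and the src/dst paths are assumptions for illustration only; the original post does not say which program is exec'd.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. *prev holds the PID of the previous copy child,
 * or 0 if there is none yet.
 * Returns 0 if a new copy was started, 1 if the previous child is still
 * running (error condition, nothing is started), -1 on failure. */
int start_copy(pid_t *prev, const char *src, const char *dst)
{
    int status;
    pid_t pid;

    if (*prev > 0) {
        pid = waitpid(*prev, &status, WNOHANG);
        if (pid == 0)
            return 1;           /* previous copy has not finished */
        if (pid == (pid_t)-1)
            return -1;          /* waitpid failed; errno is set */
        *prev = 0;              /* reaped; status could be checked here */
    }

    pid = fork();
    if (pid == (pid_t)-1)
        return -1;
    if (pid == 0) {
        /* Child: /bin/cp is an assumed copy command. */
        execl("/bin/cp", "cp", src, dst, (char *)NULL);
        _exit(127);             /* reached only if execl failed */
    }

    *prev = pid;
    return 0;
}
```

The caller decides what a return value of 1 means: log it, skip this cycle, or escalate to killing the child.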

---------------------
Steve Stites

Edit: After carefully rereading Nominal Animal's reply I think that his answer is the same as mine, just with different wording.

Nominal Animal · 12-20-2011, 03:17 AM

Yes, it is definitely a good idea to use waitpid(childpid,&status,WNOHANG) to check on the child process, reaping it if it has already exited, without blocking. If sufficient time has passed to indicate the child has hung, kill the child via kill(childpid,SIGKILL); and reap it using waitpid(childpid,&status,0) .
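The check-then-kill sequence could be sketched roughly as below. The helper name reap_or_kill() is hypothetical, and the caller is assumed to have already decided that the child has run for too long.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper: call when the child has overrun its time budget.
 * Returns 0 with *status filled in once the child is reaped, -1 on error. */
int reap_or_kill(pid_t childpid, int *status)
{
    pid_t r = waitpid(childpid, status, WNOHANG);

    if (r == childpid)
        return 0;               /* had already exited; now reaped */
    if (r == (pid_t)-1)
        return -1;              /* errno set by waitpid */

    /* Still running: SIGKILL cannot be caught or ignored. */
    if (kill(childpid, SIGKILL) == -1)
        return -1;

    /* Blocking wait, retried if some other signal interrupts it. */
    do {
        r = waitpid(childpid, status, 0);
    } while (r == (pid_t)-1 && errno == EINTR);

    return (r == childpid) ? 0 : -1;
}
```

One caveat: a process stuck in uninterruptible sleep (for example blocked inside the kernel on device I/O) does not die until that operation returns, so the final blocking waitpid() can still take a while in that situation.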

If the application is critical, you can install a dummy SIGALRM signal handler (empty body), and set a timeout (using alarm(seconds)) before the waitpid() call; the alarm signal will interrupt the waitpid() call, which will then return (pid_t)-1 with errno set to EINTR. The handler has to be installed using sigaction() without the SA_RESTART flag, since otherwise the interrupted waitpid() call is restarted automatically instead of returning. If the waitpid() call is successful, alarm(0) will defuse the timeout. This is extremely robust and reliable.

As to the other choices, you could always install a SIGCHLD handler, and use waitpid(childpid,&status,WNOHANG) in the signal handler -- waitpid() is async-signal-safe, thus okay to use in a signal handler -- to reap the child whenever it exits. It is a bit tricky to implement correctly, because the status must be accessed atomically (as other parts of the code might be midway reading the status just when the signal handler is triggered), and the child may exit at any time after the parent fork()s. In particular, you cannot assume that the main program can save the child PID (to be tried by the signal handler) before the SIGCHLD signal handler may be run; the signal handler must rely on the si_pid field in the siginfo_t structure, or you risk missing a signal. Considering the complexity, I would only use the signal approach in an asynchronous and/or multithreaded program.