I know what the problem here is, but not how to fix it, and am wondering if anybody has a brilliant idea.
I have a temperature/fan control daemon for a large rack system that has to run all the time in the background. If it dies or is killed by the user, it needs to restart. The application is critical in that it's the one controlling the fan speeds and shutting down power to the system if it gets too hot.
The application starts up at init and runs fine as it is; everything's working. It's operating as a daemon, doing all it needs to do, communicating with all the parts it needs to. All is well, and there are no piles of melted chips and solder on the bottom of the chassis. (I've seen this happen elsewhere. It's pretty, but quite embarrassing, not to mention expensive.)
The problem that I'm running into is getting the daemon to restart via inittab:respawn.
This is an embedded system running busybox on the control processor. That shouldn't matter, though, as this is really a generic Linux/inittab/daemon question.
Here's the problem.
1) All the standard Linux documentation that I have seen says that a daemon app should fork() from the parent, so it can set the session ID of the child properly (setsid()), and close off stdin/stdout/stderr so that it runs properly in the background. The parent process returns (dies a clean death), and the net effect is that the daemon's PID moves to a new value (that of the child process).
2) At the very bottom of the busybox implementation of the RESPAWN option is this (in init/init.c):
    if (a->action_type & (RESPAWN | ASKFIRST)) {
        /* Only run stuff with pid == 0. If pid != 0, it is already running */
        if (a->pid == 0)
            a->pid = run(a);
    }
In run(), it's 1) doing a fork() from init, and then 2) calling exec() to execute my daemon application, and returning its PID to the code above.
There's some other code in init that scans the PIDs that it has against the system PID table, and zeros out those values in its table that no longer have valid entries.
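To make that bookkeeping concrete, here is a minimal sketch of the fork/exec-and-record step plus the "zero out dead PIDs" step. This is not busybox's actual code: struct entry, run_entry() and reap_exited() are names I made up, and I'm assuming the cleanup is done with waitpid().

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for one inittab entry; not busybox's struct. */
struct entry {
    char *const *argv;  /* command to run */
    pid_t pid;          /* 0 means "not running" */
};

/* Fork, exec the command in the child, and return the child's PID
 * to the caller (or -1 if fork() failed). */
pid_t run_entry(const struct entry *e)
{
    pid_t pid = fork();

    if (pid == 0) {
        execvp(e->argv[0], e->argv);
        _exit(127);     /* only reached if exec failed */
    }
    return pid;
}

/* Collect every child that has exited and clear its slot, so that a
 * "pid == 0" respawn check would start it again. */
void reap_exited(struct entry *tab, size_t n)
{
    pid_t pid;
    int status;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        for (size_t i = 0; i < n; i++)
            if (tab[i].pid == pid)
                tab[i].pid = 0;
    }
}
```

The point to notice is that the only PID the table ever sees is the one fork() returned in run_entry(). Anything the command forks later is invisible to it.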
The problem here, of course, is that the PID init gets back for my daemon is that of the parent before the fork, which will no longer exist shortly after it has forked off the child.
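The effect can be reproduced outside of init with a small test. This is only a sketch under my own assumptions: daemonize() is a stripped-down version of the textbook fork-and-exit sequence, and tracked_pid_is_stale() is a made-up stand-in for the supervisor that records the PID from its own fork().

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Textbook daemonization: fork, parent exits, child detaches. */
static void daemonize(void)
{
    pid_t pid = fork();

    if (pid < 0)
        _exit(1);
    if (pid > 0)
        _exit(0);       /* the PID the supervisor recorded ends here */
    setsid();           /* child: new session, no controlling terminal */
}

/* Returns 1 if the PID the supervisor tracked has exited while the real
 * daemon is still alive, 0 if not, -1 on setup errors. */
int tracked_pid_is_stale(void)
{
    int ctl[2], report[2], status, result;
    pid_t tracked, daemon_pid = 0;

    if (pipe(ctl) < 0)
        return -1;
    if (pipe(report) < 0) {
        close(ctl[0]);
        close(ctl[1]);
        return -1;
    }

    tracked = fork();   /* what the supervisor's run() step does */
    if (tracked < 0)
        return -1;
    if (tracked == 0) {
        char c;

        close(ctl[1]);
        close(report[0]);
        daemonize();    /* only the grandchild returns from this */
        daemon_pid = getpid();
        if (write(report[1], &daemon_pid, sizeof daemon_pid) < 0)
            _exit(1);
        /* "Work" until the supervisor closes its end of ctl. */
        if (read(ctl[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(ctl[0]);
    close(report[1]);

    /* Returns as soon as the intermediate parent exits. */
    waitpid(tracked, &status, 0);

    if (read(report[0], &daemon_pid, sizeof daemon_pid)
            != (ssize_t)sizeof daemon_pid)
        result = -1;
    else
        result = (daemon_pid != tracked && kill(daemon_pid, 0) == 0);

    close(ctl[1]);      /* lets the daemon's read() see EOF and exit */
    close(report[0]);
    return result;
}
```

When this returns 1, the supervisor has already seen its tracked PID die even though the daemon is running under a different PID, which is the situation init is in.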
Inittab:respawn gives me the start at boot time that I need, and all the background code that does the restarting of the application if it dies.
The problem is that because the "parent" is dead, init treats the daemon as gone and keeps triggering the inittab:respawn entry to bring it back to life. It's only because my code has a built-in "run no more than one instance of the application" check that I don't have thousands of copies running after a few hours.
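For anybody wondering what such a single-instance check can look like: one common approach is an exclusive, non-blocking flock() on a lock file. This is only a sketch of that general technique, not necessarily how my daemon does it, and acquire_instance_lock() and the path passed to it are hypothetical.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to become the only running instance.
 * Returns an open descriptor that must stay open for the life of the
 * process, or -1 if another instance holds the lock (or on error). */
int acquire_instance_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        close(fd);      /* somebody else already has the lock */
        return -1;
    }
    return fd;
}
```

The kernel drops the lock when the last descriptor referring to it is closed, including when the process dies, so there is no stale lock file to clean up after a crash.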
I don't have any working keepalive code on the system at this time, and would prefer not to have to implement it just for this one feature.
Obviously this has got to be a common problem, but I'm missing something here in coming up with a common solution.
Any ideas on how to make this scenario (busybox inittab:respawn with a standard daemon whose parent exits after forking the child) work?
Thanks