[Bash/unknown] Cron creating Defunct Processes from Script
Hi all,
I am working on a server that has to restart a daemon every day. I have created a script to reduce the workload on our company's tech support team, since they are not experienced enough with Linux to know what to do. I have the script set up in cron to restart the process; however, it only ever produces a zombie (defunct) process that is killed without being waited on. The script I wrote is Bash-based. The daemon is started by a binary file, and since several parts of the script have to be run, I would prefer to keep everything in the script. I am trying to find out where I am going wrong, or whether there is a way around this; any help would be appreciated. I have posted the script I wrote and the crontab below, with some redactions due to company policy. Crontab Code:
0 4,20 * * * /usr/local/bin/<XXXXX> stop
The word BINARY denotes the name of the script. Code:
#!/bin/bash
Here is an image of the processes spawned from that daemon. http://i46.tinypic.com/53nv69.jpg Thanks in advance! |
It is better to implement the restart in a single script. Once the service has exited or been killed and no children are left, you start the service. As it stands, you start the service regardless of what happened when stopping it, which is not reliable at all.
(I'm sorry, but I'm not interested in debugging an approach that will not work reliably anyway. This is why I won't analyze your script.) Assuming BINARY is a Java process -- I'm assuming that because you use pgrep BINARY instead of ps -o pid= -C BINARY --, make sure your script executes it via setsid so that it starts in a new session (it is perfectly okay if it daemonizes and happens to start yet another new session). Then, Code:
sessions="$(ps -o sess= -o cmd= -C java | awk '/BINARY/ { i = int($1); if (i > 1) s[i] } END { sp="" ; for (i in s) { printf("%s%d", sp, i); sp=" " } }')"
That list is extremely useful. For example, Code:
[ -n "$(ps -o pid= -s "$sessions")" ]
Code:
pids="$(ps -o pid= -s "$sessions")"
In Linux, I like to use date -u +%s to get the current Unix timestamp (seconds since Epoch) when killing processes. The following approach seems to work well for most situations: Code:
TIMEOUT_SLEEP=1 # Duration to sleep before resending the signal
If used in crontab, I'd consider writing the actual work in a subshell (within ( ... )) and capturing the output (either in a variable or a temporary file). That way, if the restart is successful, you can just discard the output; if it fails, you can append the output to a log file or send the restart log via e-mail. |
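The timed-kill script itself did not survive in this post, so the following is only a minimal sketch of the idea described above, not the original code: send SIGTERM, poll with date -u +%s until a deadline, then fall back to SIGKILL. The names stop_pids, alive and TIMEOUT_TOTAL are assumptions made for illustration; only TIMEOUT_SLEEP appears in the original.

```shell
#!/bin/bash
# Sketch only: stop_pids, alive and TIMEOUT_TOTAL are illustrative names.
TIMEOUT_TOTAL=5   # Seconds to wait after SIGTERM before sending SIGKILL
TIMEOUT_SLEEP=1   # Duration to sleep between liveness checks

# Print those PIDs from the argument list that are still running.
# Zombies (state Z) count as gone: they only wait for their parent to reap them.
alive() {
    for sp_pid in "$@"; do
        sp_state=$(ps -o stat= -p "$sp_pid" 2>/dev/null)
        case "$sp_state" in
            ''|Z*) ;;
            *) printf '%s ' "$sp_pid" ;;
        esac
    done
}

# stop_pids PID...: returns 0 once none of the PIDs is running any more.
stop_pids() {
    [ "$#" -gt 0 ] || return 0
    kill -TERM "$@" 2>/dev/null
    sp_deadline=$(( $(date -u +%s) + TIMEOUT_TOTAL ))
    while [ -n "$(alive "$@")" ] && [ "$(date -u +%s)" -lt "$sp_deadline" ]; do
        sleep "$TIMEOUT_SLEEP"
    done
    sp_left=$(alive "$@")
    [ -z "$sp_left" ] && return 0
    # Word splitting of $sp_left is intentional: it holds a space-separated PID list.
    kill -KILL $sp_left 2>/dev/null
    sleep "$TIMEOUT_SLEEP"
    [ -z "$(alive "$@")" ]
}
```

With the pids variable from above, a call would look like stop_pids $pids (unquoted, so that each PID becomes its own argument); the start step should then run only if the function returns 0.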
:doh:
My apologies, Nominal; I realize I was very vague in the original post, since I had higher-ups here looking over my shoulder telling me what I can and cannot type out. The issue I am having is with the Start portion of the script and getting it to work with Cron. The Stop portion works flawlessly in Cron, and the reason I have the script broken up is so that I could debug it (in cron) and also not have to automatically resort to restarting the daemon every time. The screenshot I showed is from Cron executing the 'Start' command of the script. If the user executes the script on their own everything works fine, so the problem I am having is related to how Cron forks the child processes and waits for them to complete. I will say that is a very elegant shutdown script. However, this monstrosity of a program I am dealing with doesn't acknowledge any signal other than SIGKILL, and that only because, as you know, the kernel forces it to shut down. Also, we fired all the developers for anything *NIX based over 2 years ago, so there is no support for this; I just happen to be "the guy who knows Linux" and got thrown on the project. --UPDATE-- Managed to find someone with at least partial knowledge of this program. It appears to be written in C, since it has to be portable across Linux and AIX. |
Quote:
It is possible the program requires a terminal to run. It would be weird, but if it is designed to be run by a user (and, say, displays some tables or menus), it could be a curses-type application, and not work unless it has a terminal. Note that this does not mean that the actual service needs a terminal, only that the program that eventually forks the service processes needs one. I haven't run into this in real life, but you can either use a dedicated console TTY (one without a getty running, say /dev/tty63 ) or use a utility similar to screen . If you use a dedicated console TTY, the cron script needs to make sure the console owner and mode are suitable first; that requires root access, so you'd better use a wrapper script that does that before running sudo, if the BINARY needs to be run as a specific user. (That console will, however, be accessible from the physical console on the server too.) In any case I'd recommend that you write a trivial wrapper script. Instead of running the BINARY directly, run something like the following script: Code:
#!/bin/bash Code:
#!/bin/bash Code:
#!/bin/bash
When restarting, I'd still make sure the old processes are killed before I'd try to start a new one, and make sure there is a long enough delay in between for socket resources to be released (by the kernel). TCP in particular is tricky: TCP connections that were open when the process was killed can linger in the TIME_WAIT state until a timeout expires. I think the typical default timeout is one minute, but it might be as long as 15 minutes on older AIX. Any attempt to bind to those sockets, even by the same service (since it will be a new process!), will fail until the timeout expires, unless the service sets SO_REUSEADDR on its listening socket. You can check using nc -l [local-ip] port to see whether port port on local-ip is still in TIME_WAIT state: it will only succeed if the port is free (not used and not in TIME_WAIT state). |
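As an alternative to probing with nc, lingering TIME_WAIT sockets can be counted from netstat -ant output. This is a minimal sketch assuming Linux-style netstat columns (local address as ip:port in the fourth field, state in the sixth); AIX prints addresses differently, so the field handling would need adjusting there. The function name count_time_wait and port 8080 are hypothetical.

```shell
#!/bin/bash
# Sketch only: count_time_wait is an illustrative helper, not part of any tool.
# Reads `netstat -ant` style lines on stdin and prints how many sockets
# whose LOCAL port equals $1 are in the TIME_WAIT state.
count_time_wait() {
    awk -v p=":$1" '
        # $4 = local address (ip:port), $6 = TCP state
        $6 == "TIME_WAIT" && substr($4, length($4) - length(p) + 1) == p { n++ }
        END { print n + 0 }
    '
}

# Usage (8080 is a placeholder port):
#   netstat -ant | count_time_wait 8080
```

A restart script could loop, sleeping a few seconds between checks, until the count drops to 0 before starting the service again.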
Sorry for the late response, was out for the weekend.
Code:
netstat -ant |grep <ports>
In case this doesn't work, and I think this may be the solution: is there a way to take the processes spawned by the cron daemon (i.e. the wrapper script and anything run by the wrapper script) and force them to fork off as children of the init system (PID 1), and NOT as children of either the cron daemon or the wrapper script? |
Quote:
I only pointed that out because I did not notice you're using per-user crontab entries; sorry. Quote:
Code:
#!/bin/bash Quote:
Quote:
You see, the ( command... & ) starts command... in the background of a subshell. When the subshell exits, command... is reparented to init (process 1). setsid runs it in a new session (and a new process group), and redirecting standard input, output and error (away from the parent) completely detaches it from the parent process. Here is a practical example: Code:
nominal@farm:~$ ps -o pid,ppid,pgrp,tty,cmd $$ $PPID |
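Since the practical example above is cut off, here is a minimal sketch of the same detachment pattern that can be run safely. It uses sleep 30 as a stand-in for the service and a temporary file to learn the detached PID; both, along with the variable names, are assumptions for illustration. It uses >/dev/null 2>&1, which is equivalent to the &>/dev/null used elsewhere in the thread.

```shell
#!/bin/bash
# Sketch only: PIDFILE, DETACHED_PID and "sleep 30" are illustrative stand-ins.
PIDFILE=$(mktemp)

# Subshell + background + setsid: the subshell exits at once, so the command
# is orphaned and adopted by init (or a subreaper), in a session of its own.
( setsid sh -c 'echo $$ > "$1"; exec sleep 30' sh "$PIDFILE" </dev/null >/dev/null 2>&1 & )

# Wait briefly for the detached process to record its PID.
tries=0
while [ ! -s "$PIDFILE" ] && [ "$tries" -lt 5 ]; do
    sleep 1
    tries=$((tries + 1))
done
DETACHED_PID=$(cat "$PIDFILE")
rm -f "$PIDFILE"

# PPID should no longer be this shell; SESS and PGRP should equal the PID itself.
ps -o pid,ppid,sess,pgrp,cmd -p "$DETACHED_PID"
```

If the service still misbehaves when started this way, the parent process and session are unlikely to be the cause, which narrows the search to things like the working directory or the environment.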
Quote:
Thanks for all of your help! |
Update-
I attempted both versions of what you suggested Nominal and none of them came to fruition. Under the required user account, using Code:
( setsid /opt/files/iws/BINARY -start </dev/null &>/dev/null & )
Code:
( setsid ./opt/files/iws/BINARY -start </dev/null &>/dev/null & )
Any ideas left? I know I am out of them. |
Quote:
Hmm.. could it be the current working directory? What is the actual sequence of commands you use to start it successfully? If it is Code:
cd /opt/files/iws/ Code:
( cd /opt/files/iws/ && setsid ./BINARY -start </dev/null &>/dev/null & )
This expression works just like the earlier ones, except that the cd part will first change the current working directory, and the && will cause the command to be run only if the working directory change was successful. The working directory change is only effective in the subshell, so the working directory is not changed for the caller/parent. |
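The two properties described above (the cd stays inside the subshell, and && skips the command when cd fails) can be seen in a small sketch. The directory names are placeholders, and echo stands in for starting the service.

```shell
#!/bin/bash
# Sketch only: /tmp and the nonexistent directory are placeholders.
START_DIR=$(pwd)

# The cd happens inside the subshell created by $( ... ) only.
INSIDE=$( cd /tmp && pwd )
AFTER_DIR=$(pwd)

echo "inside subshell: $INSIDE"
echo "caller before:   $START_DIR"
echo "caller after:    $AFTER_DIR"

# A failed cd short-circuits the && chain, so the command is never run.
SKIPPED=$( cd /nonexistent-demo-dir 2>/dev/null && echo "started" ) || true
echo "output after failed cd: '$SKIPPED'"
```

The caller's directory is the same before and after, and nothing is printed by the chain whose cd failed.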
Quote:
I am now back to what I got in the screenshot in the original post. Code:
UID PID PPID C STIME TTY TIME CMD |
Quote:
All I've seen thus far is commands that do not work. For all I know, the service itself could be completely shot, and there might be no way to start it properly at all. More likely, it could be random: it mostly succeeds when started by a human on the command line, but due to timing differences or black magic or the phase of the moon, happens to consistently fail when started by a script. I really need you to verify it can be repeatedly, reliably started using a specific command, for me to help you much further. Also, ps -u iws -o pid,ppid,sess,pgrp,cmd would be quite interesting. The CMD column is not interesting for me, but you need it to determine which processes are relevant for this service, I think. I'm only interested in the first four fields: process ID, parent process ID, session ID, and process group ID. If I am right, the latter two will be the same for all (and kill -KILL -thatnumber will kill the entire service cleanly). Quote:
It is quite possible BINARY relies on the parent process ID not changing. It would be a stupid assumption, but if it is highly-priced closed-source software, or sensitive in-house stuff, it could have checks built-in to make sure the parent process is legit. To verify, you could try starting it by using the command Code:
( exec </dev/null &>/dev/null ; cd /opt/files/iws/ && setsid bash -c './BINARY -start ; exit $?' & )
If that fails, you could also try with /dev/tty63 (after making sure user iws owns /dev/tty63) instead of /dev/null. It should not matter, but who knows with such weird software. If that fails too, then I recommend you next try to start BINARY using strace -f -o log ./BINARY -start : it will log all syscalls made by BINARY, and by any children it forks, to the file log. Comparing the log of a successful launch to that of a faulty launch might tell you where the two diverge. |
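The process-group kill mentioned earlier in this post (kill -KILL -thatnumber) can be tried out on a harmless stand-in before pointing it at the real service. This is a minimal sketch assuming a non-interactive shell (no job control), where setsid makes the background job itself the session and group leader; the sleep 300 processes and the variable names are illustrative.

```shell
#!/bin/bash
# Sketch only: the "sleep 300" tree stands in for the service processes.
# In a non-interactive shell the background job is not a group leader, so
# setsid turns that same process into the leader: $! is also SESS and PGRP.
setsid sh -c 'sleep 300 & sleep 300 & wait' </dev/null >/dev/null 2>&1 &
LEADER=$!
sleep 1    # give setsid a moment to create the new session

# All members should show the leader's PID in the SESS and PGRP columns.
ps -o pid,ppid,sess,pgrp,cmd -s "$LEADER"

# A negative PID makes kill signal every process in that process group.
kill -KILL "-$LEADER"

KILL_STATUS=0
wait "$LEADER" || KILL_STATUS=$?
echo "leader exit status: $KILL_STATUS"
```

An exit status of 128 plus the signal number (9 for SIGKILL) indicates the leader died from the signal rather than exiting on its own.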
Quote:
Code:
cd /opt/files/iws Quote:
Quote:
Thanks for your help in this, I really do appreciate it |
Quote:
Actually, I think I've overlooked something obvious: environment variables. You probably want to check env output for environment variables that might be relevant to ./Binary. You see, because per-user cron jobs are run by a plain non-interactive shell (instead of an interactive user shell), the per-user interactive shell startup files do not get executed. It is possible that some environment variables are set in e.g. ~iws/.bashrc, which is only executed for interactive shells. Fortunately, you can just use an interactive login shell to run ./Binary : Code:
( setsid bash -lic 'cd /opt/files/iws && exec ./Binary -start' </dev/null &>/dev/null & )
I'm off to sleep myself,
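To find out which variables differ, the env output of an interactive session can be compared with one captured from cron. This is a minimal sketch; the helper name only_in_first and the /tmp file paths are hypothetical.

```shell
#!/bin/bash
# Sketch only: only_in_first and the file paths below are illustrative.
# only_in_first A B: print the VAR=value lines found in dump A but not in dump B.
only_in_first() {
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    comm -23 "$1.sorted" "$2.sorted"
    rm -f "$1.sorted" "$2.sorted"
}

# Usage:
#   1. In an interactive login as the service user:  env > /tmp/env.interactive
#   2. From a temporary crontab entry:                env > /tmp/env.cron
#   3. only_in_first /tmp/env.interactive /tmp/env.cron
```

Anything listed is set (or set differently) in the interactive session only, and is a candidate for what the binary is missing under cron.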
Small update,
Since I have disabled the script and, by some miracle, the program hasn't crashed, I haven't been able to restart the process to check the new commands. However, here is the output of the ps -u iws -o pid,ppid,sess,pgrp,cmd command Code:
PID PPID SESS PGRP CMD |
Quote:
Thank you very much for the help Nominal, this final solution is working perfectly with Cron and everything works as it should (or however much this pos binary program can). Cheers sir! |