LinuxQuestions.org

C-Sniper 06-01-2012 09:04 AM

[Bash/unknown] Cron creating Defunct Processes from Script
 
Hi all,

I am working on a server which has to restart a daemon every day. I have created a script to reduce the workload on our company's tech support team, since they are not experienced enough in Linux to know what to do. I have the script set up in cron to restart the process; however, it only ever returns a zombie process that is killed without waiting.

The script I wrote is BASH based, however the daemon uses some binary file to start up and since there are multiple parts of the script that have to be run I would prefer to keep it in the script.

I am trying to find out where I am going wrong or if there is a way around this and any help would be appreciated. I have posted the script I wrote and the crontab below with some redactions due to company policy.

Crontab
Code:

0 4,20 * * * /usr/local/bin/<XXXXX> stop
2 4,20 * * * exec /usr/local/bin/<XXXXXX> start >>/dev/null 2>&1

Script
the word BINARY denotes the name of the script
Code:

#!/bin/bash

#
# BINARY automated init script
#
#
# used to stop, start, and restart the BINARY software with minimal
# user input and to prevent mistakes from occurring.

case "$1" in
        start)
               
                # Start daemon.
                DIR=$(pwd)
                if [ $DIR != "BINARY_DIR" ]
                then
                        echo "Changing to BINARY directory."
                        echo " "
                        cd BINARY_DIR               
                fi       
               
                # Start the server and dbserver
                exec ./BINARY -start
                RETVAL=$?
               
                # Log it
                echo  "BINARY Server started"
                [ $RETVAL -eq 0 ]
                ;;

        stop)
       
                # Stop daemon.
                DIR=$(pwd)
                if [ $DIR != "BINARY_DIR" ]
                then
                        echo "Changing to BINARY directory."
                        echo " "
                        cd BINARY_DIR
                fi

                ids[0]=$(pgrep BINARY)

                echo "Getting current process tree..."

                # Initialize the index numbers
                index=0
                quit=0
               
                while [ $quit -eq 0 ]
                do
                        ((index++))

                        # Get all the child processes spawned by this/these ppid/s
                        ids[$index]=$(ps -o pid --ppid ${ids[index-1]} | \pcregrep '\d' | tr \\n ' ')

                        # Once no child processes are found, exit the loop
                        if [ ! "${ids[$index]}" ]
                        then
                                ((quit++))
                        fi
                done

                # Start shutting down the processes and database
                echo "Shutting down BINARY: "
                echo "Please wait 15s for service shutdown"
                BINARY -stop -n 10
                sleep 15s

                # Check to see if there are any remaining processes
                # If so, kill **** and all the spawned processes.

                echo "Checking for any remaining processes"

                # Kill the process from the parent to all child processes
                for i in {0..1}
                do
               
                                # Try to kill the process gracefully
                                if [ "${ids[$i]}" ]
                                then
                                        kill -9 ${ids[$i]}
                                        sleep 3s
                                fi
                done


                echo "Shutdown complete."
                RETVAL=$?

                # Log it
                echo "**** Process Shutdown"
                [ $RETVAL -eq 0 ]
                ;;
               
        restart)
                echo "Restarting server"
                $0 stop
                $0 start
                ;;

        *)
                echo $"Usage: $0 {start|stop|restart}"
                RETVAL=1
                ;;
esac

exit $RETVAL

--UPDATE--

Here is an image of the processes spawned from that daemon.

http://i46.tinypic.com/53nv69.jpg

Thanks in advance!

Nominal Animal 06-01-2012 10:20 AM

It is better to implement the restart in a single script: once the service has exited or been killed and no children are left, you start the service. Right now, you start the service regardless of what happened when stopping it. That's just not reliable at all.

(I'm sorry, but I'm not interested in debugging an approach that will not work reliably anyway. This is why I won't analyze your script.)

Assuming BINARY is a Java process (I'm assuming that because you use pgrep BINARY instead of ps -o pid= -C BINARY), make sure your script executes it via setsid so that it starts in a new session (it is perfectly okay if it daemonizes and happens to start yet another new session). Then,
Code:

sessions="$(ps -o sess= -o cmd= -C java | awk '/BINARY/ { i = int($1); if (i > 1) s[i] } END { sp="" ; for (i in s) { printf("%s%d", sp, i); sp=" " } }')"
gives you all the session IDs for all java processes containing BINARY on the command line, as a space-delimited list.

That list is extremely useful. For example,
Code:

[ -n "$(ps -o pid= -s "$sessions")" ]
evaluates to true only if there are processes still left in those sessions. You can use e.g.
Code:

pids="$(ps -o pid= -s "$sessions")"
[ -n "$pids" ] && kill -HUP $pids

to send a HUP signal to all processes belonging to those sessions. (Java reacts to HUP by exiting normally; similar to what closing the window does to GUI Java apps.)

In Linux, I like to use date -u +%s to get the current Unix timestamp (seconds since Epoch) when killing processes. The following approach seems to work well for most situations:
Code:

TIMEOUT_SLEEP=1  # Duration to sleep before resending the signal
TIMEOUT_HUP=30    # Allow 30 seconds for graceful shutdown
TIMEOUT_TERM=15  # Allow further 15 seconds for exit
TIMEOUT_KILL=15  # Allow further 15 seconds for killed processes before failing

# $sessions is the space-delimited session ID list obtained above.
pids="$(ps -o pid= -s "$sessions")"

# HUP phase.
if [ -n "$pids" ]; then
    printf 'Sending HUP to existing BINARY processes .. ' >&2
    kill -HUP $pids

    now=$(date -u +%s)
    limit=$((now + TIMEOUT_HUP))
    while [ $now -lt $limit ]; do
        # "pid=" suppresses the header, so the result is empty
        # once no processes are left.
        pids="$(ps -o pid= -s "$sessions")"
        [ -n "$pids" ] || break
        # Unquoted on purpose: each PID must be a separate argument.
        kill -HUP $pids
        sleep $TIMEOUT_SLEEP
        now=$(date -u +%s)
    done
fi

# TERM phase.
if [ -n "$pids" ]; then
    printf 'TERM .. ' >&2
    kill -TERM $pids

    now=$(date -u +%s)
    limit=$((now + TIMEOUT_TERM))
    while [ $now -lt $limit ]; do
        pids="$(ps -o pid= -s "$sessions")"
        [ -n "$pids" ] || break
        kill -TERM $pids
        sleep $TIMEOUT_SLEEP
        now=$(date -u +%s)
    done
fi

# KILL phase.
if [ -n "$pids" ]; then
    printf 'KILL .. ' >&2
    kill -KILL $pids

    now=$(date -u +%s)
    limit=$((now + TIMEOUT_KILL))
    while [ $now -lt $limit ]; do
        pids="$(ps -o pid= -s "$sessions")"
        [ -n "$pids" ] || break
        kill -KILL $pids
        sleep $TIMEOUT_SLEEP
        now=$(date -u +%s)
    done
fi

if [ -n "$pids" ]; then
    printf 'Failed.\n' >&2

    # TODO: Log failure to restart the service!
    exit 1
else
    printf 'Success.\n' >&2

    # Service stopped successfully.
    # Wait for ~ 15 seconds or more,
    # to make sure socket resources are released.
    printf 'Waiting for 15 seconds .. ' >&2
    sleep 15
    printf 'Done.\n' >&2

    # TODO: Now start the service.
fi

Note that it works even for just starting the service, since it will not try to stop anything if there are no processes to stop; it'll just sail right through to the wait 15 seconds part, then start the service.
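
The three phases above differ only in the signal and the timeout, so they could be folded into one helper. The following is a minimal sketch, not the script from this post: signal_until_gone is a made-up name, and unlike the original it takes a fixed list of PIDs instead of re-querying the sessions with ps on every pass.

```shell
#!/bin/bash
# signal_until_gone SIGNAL TIMEOUT PID...
# Hypothetical helper: re-sends SIGNAL once per second to every PID that
# is still alive. Returns 0 as soon as none is left, 1 after TIMEOUT seconds.
signal_until_gone() {
    local signal=$1 timeout=$2
    shift 2
    local limit=$(( $(date -u +%s) + timeout ))
    local pid alive

    while :; do
        alive=0
        for pid in "$@"; do
            # kill -0 sends nothing; it only tests whether the PID exists.
            if kill -0 "$pid" 2>/dev/null; then
                alive=1
                kill "-$signal" "$pid" 2>/dev/null
            fi
        done
        [ "$alive" -eq 0 ] && return 0
        [ "$(date -u +%s)" -ge "$limit" ] && return 1
        sleep 1
    done
}
```

It could then be chained as signal_until_gone HUP 30 $pids || signal_until_gone TERM 15 $pids || signal_until_gone KILL 15 $pids. Note that kill -0 still succeeds for a defunct process that has not been reaped, so such a process counts as alive here.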

If used in crontab, I'd consider writing the actual stuff in a subshell (within ( ... )), capturing the output (either to a variable or a temporary file). That way, if the restart is successful, you can just discard the output; but, if there is a failure, you can append it to a log file, or send the restart log via e-mail.
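
A rough sketch of that idea, capturing the output in a variable. The helper name run_quietly and the log path in the usage note are assumptions for illustration only:

```shell
#!/bin/bash
# run_quietly LOGFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, keeps its combined output in memory,
# and appends it to LOGFILE only when COMMAND exits with a nonzero status.
run_quietly() {
    local logfile=$1
    shift
    local output status

    # The command substitution runs COMMAND in a subshell and captures
    # both standard output and standard error.
    output=$( "$@" 2>&1 )
    status=$?

    if [ "$status" -ne 0 ]; then
        {
            printf '%s: "%s" failed with status %d\n' "$(date)" "$*" "$status"
            printf '%s\n' "$output"
        } >> "$logfile"
    fi
    return "$status"
}
```

A cron-driven restart script could call it as run_quietly /var/log/binary-restart.log restart_service, where restart_service is whatever function or script does the actual work. On success nothing is written, so cron has nothing to mail.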

C-Sniper 06-01-2012 11:23 AM

My apologies Nominal, but I realize I was being very vague in the original post since I had higher-ups here looking over my shoulder telling me what I can and cannot type out.

The issue I am having is on the Start portion of the script and getting it to work with Cron. The Stop portion works flawlessly in Cron, and the reason I have the script broken up is so that I could debug it (in cron) and also not have to automatically resort to restarting the daemon every time.

Pretty much the screenshot I showed is from Cron executing the 'Start' command to the script. If the user executes the script on their own everything works fine but the problem I am having is related to how Cron forks out the child processes and waits for them to complete.

Although I will say that is a very elegant shutdown script. However, this monstrosity of a program I am dealing with doesn't respond to any signal other than SIGKILL, and that is only because, as you know, the kernel forces it to shut down. Also, we fired all the developers for anything *NIX based over 2 years ago so there is no support for this; I just happen to be "the guy who knows Linux" and got thrown on the project.

--UPDATE--

Managed to find someone with at least partial knowledge of this program. It appears to be written in C since it has to be able to go across Linux and AIX

Nominal Animal 06-01-2012 02:18 PM

Quote:

Originally Posted by C-Sniper (Post 4693144)
The issue I am having is on the Start portion of the script and getting it to work with Cron.

Ah, okay. You are aware that cron runs scripts as root? I wonder if you need to start the script as some specific user instead. If you have sudo installed, use sudo -u username command arguments-if-any.. .

It is possible the program requires a terminal to run. It would be weird, but if it is designed to be run by a user (and, say, displays some tables or menus), it could be a curses-type application, and not work unless it has a terminal. Note that this does not mean that the actual service needs a terminal, but that the program that eventually forks the service processes needs one. I haven't run into this in real life, but you can either use a dedicated console TTY (one without a getty running, say /dev/tty63) or use a utility similar to screen. If you use a dedicated console TTY, you'll need the cron script to make sure the console owner and mode are suitable first; that requires root access, so you'd better use a wrapper script that does that before running sudo, if the BINARY needs to be run as a specific user. (That console will, however, be accessible from the physical console on the server too.)

In any case I'd recommend you to write a trivial wrapper script. Instead of running the BINARY directly, run something like the following script:
Code:

#!/bin/bash
( setsid /usr/local/bin/BINARY "$@" </dev/null &>/dev/null & )

or, if it ought to run as a specific user,
Code:

#!/bin/bash
[ "$(id -un)" = "user" ] || exec sudo -u user "$0" "$@"
( setsid /usr/local/bin/BINARY "$@" </dev/null &>/dev/null & )

The tty console wrapper could be something like
Code:

#!/bin/bash
user="user"
ttydev="/dev/tty63"

if [ "$(id -u)" = "0" ]; then
    chown "$user" "$ttydev" &>/dev/null
    chmod 0660 "$ttydev" &>/dev/null
    exec sudo -u "$user" "$0" "$@"
    exit $?
fi

[ "$(id -un)" = "$user" ] || exit 1

( setsid /usr/local/bin/BINARY "$@" <"$ttydev" &>"$ttydev" & )

Edited to add:

When restarting, I'd still make sure the old processes are killed before I'd try to start a new one, and make sure there is a long enough delay in between for socket resources to be released (by the kernel). TCP in particular is tricky: any TCP sockets open when the process was killed are subject to TCP timeout, since they will be left in the TIME_WAIT state. I think the typical default timeout is one minute, but it might be as long as 15 minutes on older AIX. Any attempt to bind to those sockets, even by the same service (since it will be a new process!), will fail until the timeout expires. You can use nc -l [local-ip] port to check whether that port on the local IP is still in TIME_WAIT state: it will only succeed if the port is free (not used and not in TIME_WAIT state).
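
On Linux, lingering sockets can also be counted directly from the kernel's TCP table. This is only a sketch under assumptions: count_time_wait is a made-up helper, it reads text in the layout of /proc/net/tcp from standard input, and port 8080 in the usage note is a placeholder, not the daemon's real port.

```shell
#!/bin/bash
# count_time_wait PORT
# Hypothetical helper: reads a table in /proc/net/tcp format on standard
# input and prints how many sockets on local PORT are in TIME_WAIT
# (state code 06 in that table).
count_time_wait() {
    local port_hex
    # The table stores ports as four upper-case hex digits.
    port_hex=$(printf '%04X' "$1")
    awk -v port="$port_hex" '
        NR > 1 {
            # Field 2 is the local address as HEXIP:HEXPORT,
            # field 4 is the connection state.
            n = split($2, addr, ":")
            if (toupper(addr[n]) == port && toupper($4) == "06") count++
        }
        END { print count + 0 }
    '
}
```

Usage would be count_time_wait 8080 < /proc/net/tcp (and the same with /proc/net/tcp6 for IPv6 sockets); a result of 0 suggests nothing on that port is waiting out the timeout.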

C-Sniper 06-04-2012 08:46 AM

Sorry for the late response, was out for the weekend.

Quote:

Originally Posted by Nominal Animal (Post 4693273)
Ah, okay. You are aware that cron runs scripts as root? I wonder if you need to start the script as some specific user instead. If you have sudo installed, use sudo -u username command arguments-if-any.. .

I was under the impression that an individual user's crontab would run as that user and not root. This issue could explain it but I thought I remembered reading the man page for cron and it seemed to say that only the system crontab would run as root and the rest would run as their respective user.

Quote:

It is possible the program requires a terminal to run. It would be weird, but if it is designed to be run by a user (and, say, displays some tables or menus), it could be a curses-type application, and not work unless it has a terminal. Note that this does not mean that the actual service needs a terminal, but that the program that eventually forks the service processes needs one. I haven't run into this in real life, but you can either use a dedicated console TTY (one without a getty running, say /dev/tty63) or use a utility similar to screen. If you use a dedicated console TTY, you'll need the cron script to make sure the console owner and mode are suitable first; that requires root access, so you'd better use a wrapper script that does that before running sudo, if the BINARY needs to be run as a specific user. (That console will, however, be accessible from the physical console on the server too.)
This program is a "basic" server side daemon which handles communications processing. All the configuration is handled by separate scripts. I will give the dedicated TTY a try since it could be a way to remove this process from being a child underneath the cron daemon which is what is causing the issues.

Quote:

In any case I'd recommend you to write a trivial wrapper script. Instead of running the BINARY directly, run something like the following script:
Code:

#!/bin/bash
( setsid /usr/local/bin/BINARY "$@" </dev/null &>/dev/null & )

or, if it ought to run as a specific user,
Code:

#!/bin/bash
[ "$(id -un)" = "user" ] || exec sudo -u user "$0" "$@"
( setsid /usr/local/bin/BINARY "$@" </dev/null &>/dev/null & )

The tty console wrapper could be something like
Code:

#!/bin/bash
user="user"
ttydev="/dev/tty63"

if [ "$(id -u)" = "0" ]; then
    chown "$user" "$ttydev" &>/dev/null
    chmod 0660 "$ttydev" &>/dev/null
    exec sudo -u "$user" "$0" "$@"
    exit $?
fi

[ "$(id -un)" = "$user" ] || exit 1

( setsid /usr/local/bin/BINARY "$@" <"$ttydev" &>"$ttydev" & )


I will give it a try!

Quote:

Edited to add:

When restarting, I'd still make sure the old processes are killed before I'd try to start a new one, and make sure there is a long enough delay in between for socket resources to be released (by the kernel). TCP in particular is tricky: any TCP sockets open when the process was killed are subject to TCP timeout, since they will be left in the TIME_WAIT state. I think the typical default timeout is one minute, but it might be as long as 15 minutes on older AIX. Any attempt to bind to those sockets, even by the same service (since it will be a new process!), will fail until the timeout expires. You can use nc -l [local-ip] port to check whether that port on the local IP is still in TIME_WAIT state: it will only succeed if the port is free (not used and not in TIME_WAIT state).
The script I have originally takes the whole process tree and kills it, which does a force close on the TCP connection with a 10s timeout delay. This 10s is a part of the shutdown procedure for the server daemon and is not something that I can alter since it is hard-coded in. I can usually poll the status using
Code:

netstat -ant |grep <ports>
and haven't had an issue with any connections dropping.


In case this doesn't work, and I think this may be the solution, is there a way that you can take the cron daemon's spawned processes (i.e. the wrapper script and anything run by the wrapper script) and force it to fork off as a parent process running as a child process of the init system (PID 1) and NOT a child process of either the Cron daemon or the wrapper script?

Nominal Animal 06-04-2012 10:12 AM

Quote:

Originally Posted by C-Sniper (Post 4695130)
I was under the impression that an individual user's crontab would run as that user and not root.

That is correct, so it should. It is just a remote possibility; there are a number of different cron implementations, and most of them can be misconfigured to run individual users' crontab entries as root.

I only pointed that out because I did not notice you're using per-user crontab entries; sorry.

Quote:

Originally Posted by C-Sniper (Post 4695130)
I will give the dedicated TTY a try since it could be a way to remove this process from being a child underneath the cron daemon which is what is causing the issues.

If that is the only problem, then use
Code:

#!/bin/bash
( setsid /usr/local/bin/PROGRAM </dev/null &>/dev/null & )

as the launcher script in crontab. It detaches PROGRAM completely from the caller. The script will return immediately, and never output anything. PROGRAM will be reparented to init.


Quote:

Originally Posted by C-Sniper (Post 4695130)
The script I have originally takes the whole process tree and kills it, which does a force close on the TCP connection with a 10s timeout delay. This 10s is a part of the shutdown procedure for the server daemon and is not something that I can alter since it is hard-coded in.

No, actually it is something the kernel enforces when it has to close a TCP/IP socket, instead of the process calling close() on it. It is required by the TCP/IP spec, I believe.

Quote:

Originally Posted by C-Sniper (Post 4695130)
In case this doesn't work, and I think this may be the solution, is there a way that you can take the cron daemon's spawned processes (i.e. the wrapper script and anything run by the wrapper script) and force it to fork off as a parent process and NOT a child process of either the Cron daemon or the wrapper script?

That's exactly what the above snippet does.

You see, the ( command... & ) starts command.. in the background of a subshell. When the subshell exits, the command.. is reparented to init (process 1). setsid runs it in a new session (and thus a new process group), and redirecting standard input, output and error (away from the parent) completely detaches it from the parent process.

Here is a practical example:
Code:

nominal@farm:~$ ps -o pid,ppid,pgrp,tty,cmd $$ $PPID
  PID  PPID  PGRP TT      CMD
7426    1  1959 ?        /usr/bin/xfce4-terminal --maximize
7428  7426  7428 pts/0    bash

nominal@farm:~$ ( setsid sleep 30 </dev/null &>/dev/null & )

nominal@farm:~$ ps -o pid,ppid,pgrp,tty,cmd -C sleep $$ $PPID
  PID  PPID  PGRP TT      CMD
7426    1  1959 ?        /usr/bin/xfce4-terminal --maximize
7428  7426  7428 pts/0    bash
8184    1  8184 ?        sleep 30

As you can see from the output, the sleep 30 command is completely detached from my shell and terminal; it does not have a controlling TTY. If I first make sure say /dev/tty63 is accessible to this user, then I can use that (instead of /dev/null) to redirect the standard streams, and it will be visible in the 63rd console.
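
The same check can be scripted without ps by reading /proc directly. A minimal sketch, assuming Linux; proc_ids is a made-up helper name:

```shell
#!/bin/bash
# proc_ids PID
# Hypothetical helper: prints the parent PID, process group and session
# of PID, taken from /proc/PID/stat (Linux only).
proc_ids() {
    local stat rest
    stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    # The command name sits in parentheses and may contain spaces,
    # so drop everything up to the last ") " before splitting.
    rest=${stat##*) }
    set -- $rest
    # Remaining fields: state, ppid, pgrp, session, ...
    printf 'ppid=%s pgrp=%s sess=%s\n' "$2" "$3" "$4"
}

# Example: show the IDs of the current shell.
proc_ids $$
```

For a process started with ( setsid command & ), pgrp and sess should both equal the PID of the process that setsid started, which is what the PGRP column shows for sleep 30 above.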

C-Sniper 06-04-2012 10:22 AM

Quote:

Originally Posted by Nominal Animal (Post 4695203)
/Big Snip

Ok, thank you very much! I will give this a try on the server.

Thanks for all of your help!

C-Sniper 06-06-2012 10:23 AM

Update-

I attempted both versions of what you suggested, Nominal, and neither of them worked.

Under the required user account, using
Code:

( setsid /opt/files/iws/BINARY -start </dev/null &>/dev/null & )
or
Code:

( setsid ./opt/files/iws/BINARY -start </dev/null &>/dev/null & )
resulted in a non-start of the Daemon. I attempted to use the code you provided with the other BASH start-up script but that also did not work from the Cron daemon.

Any ideas left? I know I am out of them.

Nominal Animal 06-06-2012 12:31 PM

Quote:

Originally Posted by C-Sniper (Post 4696903)
Under the required user account, using
Code:

( setsid /opt/files/iws/BINARY -start </dev/null &>/dev/null & )
resulted in a non-start of the Daemon.

The ps output linked to your first post tells that the daemon detaches from the terminal, so it should not be a problem with ttys or terminals at all.

Hmm.. could it be the current working directory?

What is the actual sequence of commands you use to start it successfully? If it is
Code:

cd /opt/files/iws/
./BINARY -start

then could you please check if
Code:

( cd /opt/files/iws/ && setsid ./BINARY -start </dev/null &>/dev/null & )
starts the daemon correctly? If it does, you need to replace all setsid lines in my script to match.

This expression works just like the earlier ones, except that the cd part will first change the current working directory, and the && will cause the command to only be run if the working directory change was successful. The working directory change is only effective in the subshell, so working directory is not changed for the caller/parent.

C-Sniper 06-06-2012 03:08 PM

Still no dice with the new code.

I am now back to what I got in the screenshot in the original post.

Code:

UID        PID  PPID  C STIME TTY          TIME CMD
iws      22556    1  0 10:14 ?        00:00:00 ./Binary -sTART
iws      22560    1  0 10:14 ?        00:00:05 dbserver
iws      22567 22556  0 10:14 ?        00:00:00 [Binary] <defunct>
iws      22570 22556  0 10:14 ?        00:00:00 [Binary] <defunct>
iws      22571 22556  0 10:14 ?        00:00:00 [Binary] <defunct>
iws      22576 22556  0 10:14 ?        00:00:02 [Binary] <defunct>
iws      22577 22556  0 10:14 ?        00:00:10 [Binary] <defunct>
iws      22586 22556  0 10:14 ?        00:00:01 [Binary] <defunct>
iws      22587 22556  0 10:14 ?        00:00:08 [Binary] <defunct>
iws      22588 22556  0 10:14 ?        00:00:32 [Binary] <defunct>
iws      22589 22556  0 10:14 ?        00:00:06 [Binary] <defunct>
iws      22590 22556  0 10:14 ?        00:00:00 [Binary] <defunct>
iws      22591 22556 37 10:14 ?        00:25:31 [Binary] <defunct>
iws      22592 22556  0 10:14 ?        00:00:01 [Binary] <defunct>
iws      22593 22556  0 10:14 ?        00:00:00 [Binary] <defunct>
iws      23385 22556 21 10:58 ?        00:04:58 [Binary] <defunct>
iws      23682 23670  0 11:15 ?        00:00:00 sshd: iws@pts/1 
iws      23683 23682  0 11:15 pts/1    00:00:00 -bash
iws      23871 23683  0 11:22 pts/1    00:00:00 ps -fu iws

In the listing above, each [Binary] <defunct> is a child process of the main command (they are each different modules that are loaded). The ./Binary -sTART is the main daemon. Also, the -sTART is correct for starting it with -start.
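
For anyone who wants to watch for this condition from a script, the defunct children of one parent can be picked out of /proc. This is only a sketch, assuming Linux; list_zombie_children is a made-up name, and 22556 in the usage note is simply the parent PID from the listing above.

```shell
#!/bin/bash
# list_zombie_children PARENT_PID
# Hypothetical helper (Linux only): prints the PID of every defunct
# (state Z) process whose parent is PARENT_PID.
list_zombie_children() {
    local parent=$1 f stat rest pid
    for f in /proc/[0-9]*/stat; do
        # A process may vanish between the glob and the read.
        stat=$(cat "$f" 2>/dev/null) || continue
        pid=${stat%% *}
        # Skip past the "(command name)" field, which may contain spaces.
        rest=${stat##*) }
        set -- $rest
        # $1 is now the state letter, $2 the parent PID.
        if [ "$1" = "Z" ] && [ "$2" = "$parent" ]; then
            printf '%s\n' "$pid"
        fi
    done
}
```

Running list_zombie_children 22556 against the state shown above would be expected to print the PIDs of the [Binary] <defunct> entries.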

Nominal Animal 06-06-2012 05:08 PM

Quote:

Originally Posted by C-Sniper (Post 4697164)
Still no dice with the new code.

What is the exact command sequence you need to run (I'm guessing as the iws user) to start the service successfully?

All I've seen thus far is commands that do not work. For all I know, the service itself could be completely shot, and there might be no way to start it properly at all. More likely, it could be random: it mostly succeeds when started by a human on the command line, but due to timing differences or black magic or the phase of the moon, happens to consistently fail when started by a script. I really need you to verify it can be repeatedly, reliably started using a specific command, for me to help you much further.

Also, ps -u iws -o pid,ppid,sess,pgrp,cmd would be quite interesting. The CMD column is not interesting for me, but you need it to determine which processes are relevant for this service, I think. I'm only interested in the first four fields: process ID, parent process ID, session ID, and process group ID. If I am right, the latter two will be the same for all (and kill -KILL -thatnumber will kill the entire service cleanly).
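
To spell out the negative-number form of kill: a minimal sketch, where kill_group is a made-up wrapper and the group ID would come from the PGRP column of the ps output requested above.

```shell
#!/bin/bash
# kill_group SIGNAL PGID
# Hypothetical helper: sends SIGNAL to every process in process group PGID.
# A negative PID argument makes kill address the whole group, and "--"
# keeps that negative number from being parsed as an option.
kill_group() {
    local signal=$1 pgid=$2
    kill "-$signal" -- "-$pgid"
}
```

For example, kill_group KILL 1858 would signal every process whose PGRP is 1858; the number is only a placeholder. A process that has moved itself into a different group is not reached this way and would need its own call.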

Quote:

Originally Posted by C-Sniper (Post 4697164)
Also, the -sTART is correct for starting it with -start.

That tells me quite a bit about the daemon programmers. Modifying the command line parameters like that is a very tricky technique; the changes are visible in the process list (and thus useful for monitoring), but changing the string length is not portable; you can only change the individual characters. The programmers that I know that use such techniques tend to be quite set in their ways -- shall we say, emphatically traditional?

It is quite possible BINARY relies on the parent process ID not changing. It would be a stupid assumption, but if it is highly-priced closed-source software, or sensitive in-house stuff, it could have checks built-in to make sure the parent process is legit.

To verify, you could try starting it by using command
Code:

( exec </dev/null &>/dev/null ; cd /opt/files/iws/ && setsid bash -c './BINARY -start ; exit $?' & )
This one redirects the standard streams to /dev/null first. Then it changes to the correct working directory (assuming that is the correct one; I wouldn't know). The backgrounded command this time is actually an explicit shell running in a new session. It will run BINARY, but the parent process (a Bash shell) will not exit until BINARY exits. This way BINARY will not be reparented to init, and its parent process will not change.

If that fails, you could also try with /dev/tty63 (after making sure user iws owns /dev/tty63) instead of /dev/null. It should not matter, but who knows with such weird software.

If that fails too, then I recommend you next try to start BINARY using strace -f -o log ./BINARY -start : it will log all syscalls made by BINARY and by the processes it forks to the file log. Comparing the log of a successful launch to that of a faulty launch might tell you where the two diverge.

C-Sniper 06-06-2012 07:41 PM

Quote:

Originally Posted by Nominal Animal (Post 4697268)
What is the exact command sequence you need to run (I'm guessing as the iws user) to start the service successfully?

as the iws user
Code:

cd /opt/files/iws
./Binary -start

Quote:

All I've seen thus far is commands that do not work. For all I know, the service itself could be completely shot, and there might be no way to start it properly at all. More likely, it could be random: it mostly succeeds when started by a human on the command line, but due to timing differences or black magic or the phase of the moon, happens to consistently fail when started by a script. I really need you to verify it can be repeatedly, reliably started using a specific command, for me to help you much further.
The code I posted above with regards to the exact sequence works every single time.


Quote:

Also, ps -u iws -o pid,ppid,sess,pgrp,cmd would be quite interesting. The CMD column is not interesting for me, but you need it to determine which processes are relevant for this service, I think. I'm only interested in the first four fields: process ID, parent process ID, session ID, and process group ID. If I am right, the latter two will be the same for all (and kill -KILL -thatnumber will kill the entire service cleanly).


That tells me quite a bit about the daemon programmers. Modifying the command line parameters like that is a very tricky technique; the changes are visible in the process list (and thus useful for monitoring), but changing the string length is not portable; you can only change the individual characters. The programmers that I know that use such techniques tend to be quite set in their ways -- shall we say, emphatically traditional?

It is quite possible BINARY relies on the parent process ID not changing. It would be a stupid assumption, but if it is highly-priced closed-source software, or sensitive in-house stuff, it could have checks built-in to make sure the parent process is legit.

To verify, you could try starting it by using command
Code:

( exec </dev/null &>/dev/null ; cd /opt/files/iws/ && setsid bash -c './BINARY -start ; exit $?' & )
This one redirects the standard streams to /dev/null first. Then it changes to the correct working directory (assuming that is the correct one; I wouldn't know). The backgrounded command this time is actually an explicit shell running in a new session. It will run BINARY, but the parent process (a Bash shell) will not exit until BINARY exits. This way BINARY will not be reparented to init, and its parent process will not change.

If that fails, you could also try with /dev/tty63 (after making sure user iws owns /dev/tty63) instead of /dev/null. It should not matter, but who knows with such weird software.

If that fails too, then I recommend you next try to start BINARY using strace -f -o log ./BINARY -start : it will log all syscalls made by BINARY and by the processes it forks to the file log. Comparing the log of a successful launch to that of a faulty launch might tell you where the two diverge.
I will get you this information tomorrow since I do not have access to the PC right now.

Thanks for your help in this, I really do appreciate it

Nominal Animal 06-06-2012 10:48 PM

Quote:

Originally Posted by C-Sniper (Post 4697361)
as the iws user
Code:

cd /opt/files/iws
./Binary -start

works every single time.

Excellent.

Actually, I think I've overlooked something obvious: Environment variables. You probably want to check env output for environment variables that might be relevant to ./Binary.

You see, because per-user cron scripts are run using a standard shell (instead of an interactive user shell), the per-user interactive shell startup files do not get executed. It is possible that some environment variables are set in e.g. ~iws/.bashrc, which is only executed for interactive shells.

Fortunately, you can just use an interactive shell to run ./Binary :
Code:

( setsid bash -lic 'cd /opt/files/iws && exec ./Binary -start' </dev/null &>/dev/null & )
Since it is possible the interactive login shell changes to the user home directory, it is best to do the cd within that shell, just before executing the Binary.
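
To find out which variables actually differ, one option is to dump env once from a temporary cron entry and once from a login session, then compare the variable names. A sketch under assumptions: env_only_in_first is a made-up helper, and the file names in the usage note are placeholders.

```shell
#!/bin/bash
# env_only_in_first FILE_A FILE_B
# Hypothetical helper: both files hold `env` output. Prints the names of
# variables that are set in FILE_A but missing from FILE_B.
env_only_in_first() {
    # Keep only the NAME part of each NAME=value line, sort, and let
    # comm -23 print the names unique to the first list.
    comm -23 \
        <(sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" | sort -u) \
        <(sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$2" | sort -u)
}
```

With env > /tmp/login-env.txt run in the iws login shell and env > /tmp/cron-env.txt run from a one-off crontab line, env_only_in_first /tmp/login-env.txt /tmp/cron-env.txt lists the variables that cron jobs do not get.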

I'm off to sleep myself,

C-Sniper 06-08-2012 09:42 AM

Small update,

Since I have disabled the script and, by some miracle, the program hasn't crashed, I haven't been able to restart the process to check the new commands. However, here is the output of the ps -u iws -o pid,ppid,sess,pgrp,cmd command:

Code:

  PID  PPID  SESS  PGRP CMD
 1858    1  1858  1858 ./Binary -sTART
 1862    1  1862  1862 dbserver
 1869  1858  1858  1858 Binary Module 1
 1872  1858  1858  1858 Binary Module 2
 1873  1858  1858  1858 Binary Module 3
 1878  1858  1858  1858 Binary Module 4
 1879  1858  1858  1858 Binary Module 5
 1880  1858  1858  1858 Binary Module 6
 1881  1858  1858  1858 Binary Module 7
 1882  1858  1858  1858 Binary Module 8
 1892  1858  1858  1858 Binary Module 9
 1893  1858  1858  1893 Binary Module 10
 1894  1858  1858  1858 Binary Module 11
 1895  1858  1858  1858 Binary Module 12
 1897  1858  1858  1858 Binary Module 13
 7712  1858  1858  1858 Binary Module 14

The last PID is so different because that module keeps crashing and restarting itself every hour or so.

C-Sniper 06-12-2012 02:02 PM

Quote:

Originally Posted by Nominal Animal (Post 4697472)
Code:

( setsid bash -lic 'cd /opt/files/iws && exec ./Binary -start' </dev/null &>/dev/null & )

THIS WORKS!!

Thank you very much for the help Nominal, this final solution is working perfectly with Cron and everything works as it should (or however much this pos binary program can).


Cheers sir!


All times are GMT -5.