LinuxQuestions.org - [SOLVED] Need help running program in background in BASH

Need help running program in background in BASH

Somehow, I managed to create a program - in C, if that matters - which solves a differential equation (don't worry if you don't understand that). We need the program to run for a very long time (multiple weeks!) on a cluster. I need to figure out how to run it in the background, since my fine ISP doesn't seem to want to give me a consistent internet connection (and I want to disconnect and watch the new Arrested Development during the next few weeks). I am trying to do so using the "&" and "screen" features of BASH, but having very little luck. I am connecting to the cluster using SSH from a Cygwin terminal on my fine Windows Vista PC.

First off, how do you get the "&" feature (I don't know what it's called) to work right? I'm told the "&" after a command should execute that command in the background. But when I try to use that, the background job stops the moment I try to do something else. Then I have to call the job back into the foreground (using "fg"), but then it's no longer running in the background and I have to wait for it to finish. For example:

Code:

prompt>mpirun 4 MYPROG 6 0 &
[1] 11571
prompt>PID 20705 executing on node3.
PID 22374 executing on node4.
PID 14730 executing on node5.
PID 14216 executing on node2.

 Simulating 64x64 oscillators for 9999 timesteps

 Number of PEs =  4

t = 00000. Overwriting files: /raid/data/N=064x064_Beta=0.00_t=0000000_T.csv,        /raid/data/N=064x064_Beta=0.00_t=0000000_U.csv
ls
total 1292
drwxr-xr-x 3 schwarz schwarz  4096 Jun  2 20:55 ./
...

[1]+  Stopped                mpirun -machinefile $HOME/utils/Host_file -np 4 FPU 6 0
prompt>jobs
[1]+  Stopped                mpirun -machinefile $HOME/utils/Host_file -np 4 FPU 6 0
prompt>fg 1
mpirun -machinefile $HOME/utils/Host_file -np 4 MYPROG 6 0

Notice where I typed "ls" (I tried to bold it) and got a file listing, and was then informed that job 1 was stopped. I was then forced to bring the job to the foreground using "fg 1".

Also, what happens if a background program has some output, or requires input? I can find no documentation on "&" and could use a good, clear explanation.

Second (is this post too long?), how do I use the "screen" command? I have searched the web far and wide and I see that you create a new screen with "screen -S SCNNAME", which works fine. I checked that it works by adding $STY to my prompt. But then everything I read says that to switch screens you type "Ctrl-a c" or "Ctrl-a n", and none of those Ctrl-a sequences are working for me. Am I using screen in an environment that doesn't support it? Am I reading the right instructions? Am I doing something else wrong? Please tell me how to use "screen" and where I can get good documentation for it (not the "screen --help" or "man screen" crypto-help).

So how can I get my long running program to run so that I can disconnect and come back later, and still see the relevant output? TIA.

man page:
Screen does not understand the prefix "C-" to mean control, although this
notation is used in this manual for readability. Please use the caret
notation ("^A" instead of "C-a") as arguments to e.g. the escape command
or the -e option. Screen will also print out control characters in caret
notation.

I'm not sure what you're trying to tell me. Can you clarify please?

If you're going to run a program in the background via '&', you have to redirect any output not already directed to a file (e.g. stdout and stderr) to a file.
It should NOT require any input, or it will likely hang waiting for that input. You'd have to bring it back to the foreground to supply it.
If you want it to continue after you have logged out, you need to prefix it with 'nohup', e.g.

Code:

nohup ./myprog >myprog.log 2>&1 &
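
One detail the command above leaves open is standard input. A background job that tries to read from the terminal is stopped (it receives SIGTTIN), which may be what produced the "Stopped" message earlier in the thread; that is an assumption, the thread never confirms it. A minimal sketch that also points stdin at /dev/null, with a hypothetical stand-in command in place of the real program:

```shell
# Stand-in for the real program (hypothetical): writes to stdout and stderr.
# nohup      : ignore the hangup signal sent when the login session ends
# >job.log   : stdout goes to job.log
# 2>&1       : stderr goes to the same place as stdout
# </dev/null : stdin sees end-of-file at once instead of touching the terminal
# &          : run in the background
nohup sh -c 'echo "step 1 done"; echo "warning: check input" >&2; echo "step 2 done"' \
    >job.log 2>&1 </dev/null &
job_pid=$!            # PID of the background job just started

wait "$job_pid"       # only for this demo; a real session would simply log out
cat job.log
```

The `wait` is there only so the demo can print the log; with a real multi-week job you would note the PID and disconnect.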

I am using "screen" in place of "nohup". I understand where you put ">myprog.log" at the end of the execution statement - that is to send output to the file myprog.log. What does "2>&1" do? What do I search on to read about that?

File handle numbers 0, 1 and 2 are, respectively, standard input, standard output and standard error. These are created by default and are available to every process.

By your command line, the output (file handle 1) stands redirected to myprog.log

2>&1 redirects the standard error messages to the standard output which already stands redirected to myprog.log

So myprog.log contains BOTH the standard output and standard errors.

Thus all output is redirected, and the program won't wait indefinitely on a non-existent output device.

Questions to answer.
(1) What's the role of & in 2>&1?
(2) Would 2>myprog.log work as well?

OK

Thanks, Anantha.

The output is not working quite right. I'm running the program as

Code:

mpirun -machinefile $HOME/utils/Host_file -np 30 MYPROG 6 4 >MYPROG.log 2>&1

But it doesn't seem to be capturing everything. I can look at the output directory and see that it is creating output files. The file MYPROG.log has a message for the creation of the first output file, but none of the ones after that. Any idea why MYPROG.log did not contain all the standard output from my program?

To answer your questions:
(1) The role of "&" is obviously some type of delimiter which tells linux not to interpret "1" as a file name.
(2) My first impression is that would work, but I'm guessing this is a trick question and there would be some file conflict issues. By redirecting standard error to standard output, it would combine the two outputs.
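
On question (2), a quick experiment suggests why `2>myprog.log` next to `>myprog.log` is not equivalent to `2>&1`. Each redirection opens the file separately, so the two descriptors keep independent write positions and can overwrite each other. A small sketch, with a hypothetical stand-in command and hypothetical file names:

```shell
# Stand-in command (hypothetical): one line to stdout, then one line to stderr.
emit() { sh -c 'echo "out"; echo "err" >&2'; }

# File opened twice: each descriptor starts writing at offset 0,
# so the second write lands on top of the first.
emit >twice.log 2>twice.log

# File opened once, then duplicated: both descriptors share one write position.
emit >shared.log 2>&1

echo "twice.log:";  cat twice.log
echo "shared.log:"; cat shared.log
```

On a typical Linux system shared.log holds both lines, while twice.log holds only one of them because the other was overwritten.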

In file descriptors/redirections, '&' represents a 'file duplicator'. In natural language terms it could be translated as 'the same place as'.

>MYPROG.log 2>&1 means "send stdout to MYPROG.log, and also send stderr to the same place as stdout's current setting" (i.e. also to MYPROG.log).
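
Because `2>&1` copies whatever stdout points to at that moment, the order of the redirections matters. A minimal sketch, with a hypothetical stand-in command and file names:

```shell
# Stand-in command (hypothetical): one line to stdout, one line to stderr.
emit() { echo "to stdout"; echo "to stderr" >&2; }

# stdout -> file first, then stderr -> "same place as stdout":
# both lines land in the file.
emit >both.log 2>&1

# stderr -> "same place as stdout" first (still the original stdout),
# then stdout -> file: only the stdout line lands in the file.
emit 2>&1 >stdout_only.log

echo "both.log:";        cat both.log
echo "stdout_only.log:"; cat stdout_only.log
```

In the second case the stderr line goes wherever stdout pointed before the redirection, which at a prompt is the terminal.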

File descriptors are defined for the main command process launched on the line, mpirun in this case.

redirections and file descriptors explained

Since you've already mentioned screen though, I think you should forget about backgrounding anything and just use a dedicated screen session for it. Set it running, detach it, and open a new terminal for general use. You can re-attach to it at any time for control and monitoring. It's exactly the kind of thing screen was designed for.
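
For anyone landing here later, a rough sketch of that workflow. The session name and the stand-in command are hypothetical, and the block skips itself if screen is not installed. Interactively, the usual sequence is `screen -S longrun`, start the job, press Ctrl-a and then d to detach, and later run `screen -r longrun` to re-attach. The non-interactive form below starts the session already detached:

```shell
session=longrun    # hypothetical session name

if command -v screen >/dev/null 2>&1; then
    # -d -m: start the session detached; -S: give it a name.
    # The sh -c command is a stand-in for the real job.
    screen -dmS "$session" sh -c 'echo "job started" >screen_demo.log; sleep 30'

    sleep 1
    screen -ls || true                           # list sessions; exit status varies by version
    [ -f screen_demo.log ] && cat screen_demo.log

    # In real use: re-attach with  screen -r longrun  and detach again with Ctrl-a d.
    screen -S "$session" -X quit || true         # clean up the demo session
else
    echo "screen is not installed on this machine"
fi
```

The job keeps running inside the detached session after the SSH connection drops, so no `&` or `nohup` is needed for it.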

(Don't ask me how to do it though, I don't have much experience with screen either. ;))

If you are using a cluster, then normally the cluster configuration includes a batch process that will do this for you (though this depends on the cluster - a cheap thrown together cluster might not... but then it also isn't really a cluster as it is more just a bunch of nodes on a net).

You can also read the man pages on "batch" and "cron" (on most Linux systems batch is part of the at package and is run by the atd daemon; it implements a simple batch queuing system).

Quote:

Originally Posted by David the H. (Post 4965383)

Hi David. Thanks for the link. I'll read that page this afternoon.

In the meantime, do you have any idea why MYPROG.log is not receiving output? It seems like the first few lines are sent to that file, but no further output. It seems like the OS might be buffering what it writes to MYPROG.log. But it's now 20 hours since I submitted the job - it is clearly still running because it is creating new output files - but there are no more messages being sent to MYPROG.log. If the file is being buffered, then it will be complete when the job finishes (which will be about 30 hours from now). But if the job hangs - like it did last time - I get no further output.

Any idea what I can do about that? I hope I'm describing it well.

As to using screen (whoever is reading this thread), if I submit the job in the foreground how do I get out of the current screen? Using Ctrl-a c isn't working for me, and I don't understand what linosaurusroot, in the second post, was trying to tell me.

Quote:

Originally Posted by jpollard (Post 4965763)

Hi jp. I'll read about batch and cron this afternoon. Thanks for the pointer to them.

Yes, MPI (Message Passing Interface) submits the job to all of the cluster's processors. But the command line that issues the mpi command

Code:

>mpirun -machinefile $HOME/utils/Host_file -np 30 MYPROG 6 4 >MYPROG.log 2>&1

is in the foreground of the node that I'm on and is waiting for the program to finish - approximately 50 hours - and in the meantime it is displaying the standard output and error output.

The file MYPROG.log has received the first few lines of the standard output, but nothing more. Also, if the job hangs - which it did after 30 hours the last time I ran it - I never get the rest of the output. Is it buffering the output before writing to MYPROG.log? Did it stop sending output to MYPROG.log when I changed the job to the background (using Ctrl-z)?

Thanks.
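
A likely explanation (an assumption, since the thread does not show the program's source): C stdio switches stdout from line buffering to block buffering when it is redirected to a file, so lines sit in memory until the buffer fills or the program exits. Inside the C program, calling fflush(stdout) after each progress message, or setvbuf(stdout, NULL, _IOLBF, 0) once at startup, avoids that. From the shell, GNU coreutils' stdbuf can request line buffering for programs that keep the stdio defaults. The sketch below uses cut as a stand-in for the real program; whether the setting reaches MPI ranks launched on other nodes is not something this thread settles.

```shell
# A slow producer piped through a stdio-based filter (cut) whose output is a file.
# stdbuf -oL asks cut to flush stdout at every newline rather than in blocks.
{ echo "t=1 written"; sleep 2; echo "t=2 written"; } \
    | stdbuf -oL cut -d' ' -f1 >progress.log &

sleep 1
echo "while still running:"; cat progress.log   # the first line should already be there

wait
echo "after it finished:";   cat progress.log
```

Watching the log with `tail -f progress.log` from another terminal shows the same effect on a longer run.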

Most clusters will be running a batch system, something like PBS or LSF. There are others:

http://en.wikipedia.org/wiki/Job_scheduler

specifically the section on queuing for HPC clusters.

....

If you just use Ctrl-z, you didn't put the process in the background - you just suspended it. If you want it to continue running, you need to use the "bg" command to resume it.
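
The difference between suspending and backgrounding can be seen with signals. At an interactive prompt, Ctrl-z sends SIGTSTP to the foreground job and `bg` resumes it with SIGCONT. The sketch below imitates that from a script with SIGSTOP and SIGCONT on a hypothetical stand-in job, since job control is normally off in scripts:

```shell
# Stand-in job (hypothetical): prints three ticks, one per second.
sh -c 'for i in 1 2 3; do echo "tick $i"; sleep 1; done' >ticks.log 2>&1 </dev/null &
pid=$!

kill -STOP "$pid"      # suspended, as after Ctrl-z: the job makes no further progress
sleep 2
echo "while suspended:"; cat ticks.log

kill -CONT "$pid"      # resumed, which is what bg does: the job runs on in the background
wait "$pid"
echo "after resuming:";  cat ticks.log
```

At a real prompt the equivalent is Ctrl-z, then `bg`, then `jobs` to confirm the job is listed as Running rather than Stopped.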

Well, it seems to keep running after I used Ctrl-z. Using redirection and "&" didn't supply me with adequate output (Ctrl-z may not, either).

The only process suspended is the one attached to the terminal. Any processes spawned by mpirun before the suspension will still be running. But any status data they send back to the mpirun control process will go unprocessed (and, if sent over UDP, could be dropped).

It is for the logging that you want one of the cluster-based batch systems; errors from remote processes may get lost otherwise.

I think I get the first paragraph - basically, my terminal is suspended, but not running in the background. The cluster's processors are running the "mpispawn" jobs in the background and sending output back to the suspended terminal. The terminal stores the standard output and standard error so that when I call it to the foreground, I can see it. But if the job dies before I return it to the foreground, I will miss the important why-it-died information.

Can you clarify your second paragraph?