I have been facing a weird problem lately when using Sun Grid Engine (SGE).
I have a script which submits jobs to SGE in batches of 30 (200 jobs in total). These are then distributed across about 10 machines which we have set up in a queue.
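To make the setup concrete, here is a minimal sketch of this kind of batched submission. It is not my actual script: the helper names make_batches and submit_batch, the job script names, and the queue name are all hypothetical, and it assumes qsub is on the PATH.

```python
import subprocess

def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def submit_batch(scripts, queue):
    """Submit each job script with qsub; -q names the destination queue."""
    for script in scripts:
        # check=True raises CalledProcessError if qsub exits non-zero.
        subprocess.run(["qsub", "-q", queue, script], check=True)
```

With 200 job scripts and a batch size of 30, make_batches gives 7 batches (six of 30 and a last one of 20), and each batch would then be passed to submit_batch together with the queue name.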
Now the problem is that some jobs get stuck on a few machines in the "r" (running) state.
In the execution host's messages file I have:
reaping job "54988" ptf complains: Job does not exist
and then 1 second later in the qmaster file I have:
"job 54988.1 finished on host "
I searched the net a lot, and the closest match to my problem that I could find is: https://arc.liv.ac.uk/trac/SGE/ticket/495
(extract pasted above)
The problems described in that ticket are the same as the ones I see. The ticket mentions restarting SGE as a fix that may help, so I asked our IT department to restart SGE, but the jobs still get stuck in the same way.
There is no error in $SGE_ROOT/../spool/qmaster/messages or $SGE_ROOT/../spool//messages
"qstat -j $job_id" does not show any error either.
Every now and then a few jobs get stuck on some machines and stay stuck forever. They have to be killed manually, which is obviously getting irritating.
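As a stopgap, the stuck jobs could at least be spotted automatically. Below is a rough sketch that scans qstat output for jobs that have been in the "r" state longer than some limit. It assumes the default qstat column layout (job-ID, prior, name, user, state, start date, start time, queue, ...) with dates printed as MM/DD/YYYY HH:MM:SS; the function name find_long_running and the age threshold are my own invention, and a long-running job is not necessarily a stuck one.

```python
from datetime import datetime, timedelta

def find_long_running(qstat_text, now, max_age):
    """Return (job_id, queue_instance, start_time) for jobs in state 'r'
    whose start time is more than max_age before now."""
    suspects = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and malformed lines.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        if fields[4] != "r":
            continue
        # Assumed layout: fields[5] is the date, fields[6] is the time.
        started = datetime.strptime(fields[5] + " " + fields[6],
                                    "%m/%d/%Y %H:%M:%S")
        if now - started > max_age:
            suspects.append((fields[0], fields[7], started))
    return suspects
```

The text passed in would be the captured output of something like "qstat -u '*' -s r", with datetime.now() as the second argument and, for example, timedelta(hours=2) as the limit. The job IDs it returns still have to be checked and killed by hand.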
I'd appreciate any help on this issue.
Thanks a lot!