Sun Grid Engine: jobs stuck in "r" state - no errors - runs don't finish
I have been facing a weird problem lately when using Sun Grid Engine (SGE).
I have a script which submits jobs to SGE in batches of 30 (200 jobs in total). These are then distributed across the ~10 machines which we have set up in a queue.
Now the problem is that some jobs get stuck on a few machines in "r" state.
In the execution host messages file I have: reaping job "54988" ptf complains: Job does not exist
and then 1 second later in the qmaster file I have:
"job 54988.1 finished on host "
I searched a lot on the net and the closest I could get to my problem is: https://arc.liv.ac.uk/trac/SGE/ticket/495 (extract pasted above)
The problems described there are the same ones I am seeing. That ticket suggests that restarting SGE may help, so I asked our IT dept. to restart SGE, but afterwards the jobs still got stuck in the same way.
There is no error in $SGE_ROOT/../spool/qmaster/messages or $SGE_ROOT/../spool//messages
"qstat -j $job_id" does not show any error either.
Every now and then a few jobs get stuck on some machines and remain stuck forever. They have to be killed manually, and this is obviously getting irritating :)
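To make the manual check less tedious, the stuck jobs could be spotted automatically. The following is only a rough sketch, not something taken from the SGE documentation: it assumes the default plain-text qstat layout (job-ID, prior, name, user, state, start date, start time, queue, slots), and the names find_long_running and report_stuck_jobs as well as the 6 hour threshold are made up for illustration.

```python
import subprocess
from datetime import datetime, timedelta


def find_long_running(qstat_text, now, max_age):
    """Return (job_id, queue_instance, start_time) for every job that is in
    state 'r' and started more than max_age before now."""
    stuck = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; this skips the header,
        # the dashed separator and blank lines.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        if state != "r":
            continue
        # Assumed timestamp format: MM/DD/YYYY HH:MM:SS
        started = datetime.strptime(fields[5] + " " + fields[6],
                                    "%m/%d/%Y %H:%M:%S")
        if now - started > max_age:
            stuck.append((job_id, fields[7], started))
    return stuck


def report_stuck_jobs(max_age=timedelta(hours=6)):
    """Run qstat for all users and print jobs running longer than max_age.
    The 6 hour default is an arbitrary assumption; pick a value well above
    the normal runtime of a job."""
    out = subprocess.run(["qstat", "-u", "*"], capture_output=True,
                         text=True, check=True).stdout
    for job_id, queue, started in find_long_running(out, datetime.now(), max_age):
        print(job_id, queue, started)
```

Calling report_stuck_jobs() on the submit host would only list the suspicious jobs and the queue instance they sit on; it does not kill anything, so the listed jobs would still have to be removed with qdel after checking them.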
I'd appreciate any help on this issue.
Thanks a lot!