Sun Grid Engine: jobs stuck in "r" state, no errors, runs don't finish
Hi,
I have been facing a weird problem lately when using Sun Grid Engine (SGE). I have a script which submits jobs to SGE in batches of 30 (200 jobs in total). These are then distributed across ~10 machines which we have set up in a queue. The problem is that some jobs get stuck on a few machines in the "r" state.

In the execution host messages file I have:

reaping job "54988" ptf complains: Job does not exist

and then 1 second later in the qmaster file I have:

"job 54988.1 finished on host "

I searched a lot on the net, and the closest I could get to my problem is: https://arc.liv.ac.uk/trac/SGE/ticket/495 (extract pasted above). The problems described there are the same ones I see. That ticket mentions restarting SGE as a solution which may help, so I asked our IT dept. to restart SGE, but the jobs still get stuck in the same way.

There is no error in $SGE_ROOT/../spool/qmaster/messages or $SGE_ROOT/../spool//messages, and "qstat -j $job_id" does not show any error either.

Every now and then a few jobs get stuck on some machines and stay stuck forever. They have to be killed manually, which is obviously getting irritating :)

I'd appreciate any help on this issue. Thanks a lot!
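In case it helps anyone reproduce or monitor this, below is a minimal sketch of how the stuck jobs could be spotted automatically instead of by eye. It is not part of my submit script. It assumes the default plain-text `qstat` column layout (job-ID, prior, name, user, state, start date, start time, queue, ...), and the function name `find_long_running` and the age threshold are my own hypothetical choices.

```python
from datetime import datetime, timedelta


def find_long_running(qstat_text, now, max_age):
    """Return (job_id, queue_instance, start_time) for every job that is in
    state "r" and started more than max_age before now.

    Assumes the default qstat listing, where whitespace-separated columns are:
    job-ID, prior, name, user, state, date, time, queue, slots[, ja-task-ID].
    """
    stuck = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header row, the dashed separator, and blank lines.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        if state != "r":
            continue  # pending ("qw") and other states are not of interest here
        started = datetime.strptime(fields[5] + " " + fields[6],
                                    "%m/%d/%Y %H:%M:%S")
        if now - started > max_age:
            stuck.append((job_id, fields[7], started))
    return stuck
```

The idea would be to feed it the captured output of `qstat` together with `datetime.now()` and a `timedelta` somewhat longer than a normal run. Anything it returns is only a candidate for a stuck job, and the host part of the queue column shows which machine to check.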