LinuxQuestions.org

-   -   Sun grid engine; jobs stuck in "r" state - no errors - runs don't finish (http://www.linuxquestions.org/questions/linux-general-1/sun-grid-engine%3B-jobs-stuck-in-r-state-no-errors-runs-dont-finish-881753/)

r_r 05-20-2011 04:25 AM

Hi,

I have been facing a weird problem lately when using Sun Grid Engine (SGE).

I have a script which submits jobs to SGE in batches of 30 (200 jobs in total). These are then distributed across ~10 machines which we have set up in a queue.
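For reference, the batching can be sketched roughly like this. It is only a minimal sketch, not my actual script: the job script names, the helper names `batches` and `submit_batch`, and the dry-run switch are made up for illustration, and it assumes each job is submitted with a plain `qsub <script>` call.

```python
import subprocess
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def submit_batch(scripts: Sequence[str], dry_run: bool = True) -> List[List[str]]:
    """Build one qsub command per script; execute them only if dry_run is False."""
    commands = [["qsub", script] for script in scripts]
    if not dry_run:
        for cmd in commands:
            # Assumption: a bare "qsub <script>" is enough for these jobs.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical job script names, 200 in total as in the post.
    job_scripts = [f"job_{i:03d}.sh" for i in range(200)]
    for n, batch in enumerate(batches(job_scripts, 30), start=1):
        cmds = submit_batch(batch, dry_run=True)
        print(f"batch {n}: {len(cmds)} jobs")
```

With 200 jobs and a batch size of 30 this gives six full batches and a final batch of 20. How long to wait between batches is left out here on purpose.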

Now the problem is that some jobs get stuck on a few machines in the "r" state.

In the execution host messages file I have: reaping job "54988" ptf complains: Job does not exist

and then 1 second later in the qmaster file I have:

"job 54988.1 finished on host "

I searched a lot on the net, and the closest match to my problem that I could find is: https://arc.liv.ac.uk/trac/SGE/ticket/495 (extract pasted above)

The problems described there are the same ones I am seeing. That ticket suggests restarting SGE as a possible fix, so I asked our IT dept. to restart SGE, but the jobs still get stuck in the same way.

There are no errors in $SGE_ROOT/../spool/qmaster/messages or $SGE_ROOT/../spool//messages

"qstat -j $job_id" does not show any error either.

Every now and then a few jobs get stuck on some machines and remain stuck forever. They need to be killed manually, and this is obviously getting irritating :)
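To spot the stuck jobs without checking by hand, something like the following could be used. It is a minimal sketch, assuming the default plain-text `qstat` layout (job-ID, prior, name, user, state, start date, start time, queue, slots) and an arbitrary two-hour limit. The function name `long_running` and the sample listing are made up for illustration; they are not real output from our cluster.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def long_running(qstat_text: str, now: datetime,
                 limit: timedelta) -> List[Tuple[str, str]]:
    """Return (job_id, queue_instance) for jobs in state 'r' for longer than `limit`."""
    stuck = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; skip the header and separator.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        if state != "r":
            continue
        # Assumption: start time is printed as MM/DD/YYYY HH:MM:SS.
        started = datetime.strptime(f"{fields[5]} {fields[6]}", "%m/%d/%Y %H:%M:%S")
        if now - started > limit:
            stuck.append((job_id, fields[7]))
    return stuck


if __name__ == "__main__":
    # Made-up listing in the assumed qstat layout.
    sample = """\
job-ID  prior   name     user  state submit/start at     queue         slots ja-task-ID
---------------------------------------------------------------------------------------
  54988 0.55500 sim_run  r_r   r     05/19/2011 22:10:03 all.q@node03      1 1
  54990 0.55500 sim_run  r_r   r     05/20/2011 04:20:11 all.q@node07      1
  54991 0.00000 sim_run  r_r   qw    05/20/2011 04:21:00                   1
"""
    now = datetime(2011, 5, 20, 4, 25, 0)
    for job_id, queue in long_running(sample, now, timedelta(hours=2)):
        print(f"job {job_id} still 'r' on {queue}")
```

In real use the text would come from running `qstat`, and the reported jobs would still have to be removed with `qdel`. This only finds the stuck jobs; it does not explain why they hang.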

I'd appreciate any help on this issue.

Thanks a lot!


All times are GMT -5.