LinuxQuestions.org

gumbyjnm 10-23-2009 12:39 PM

How does Sun Grid Engine 6.2u3 kick off jobs?
 
Hello, new to the group and hoping for some help. I am a software engineer, mainly concerned with writing simulation code. I get forced into sys admin/setup tasks from time to time, which is where I usually run into trouble! :-)

I am posting here as suggested for a first post, not sure where this question belongs so if there is a better place please let me know.

I have googled around on this problem and found some suggested fixes, none of which sounded very elegant. Anyway here is some background on my problem with questions at the end.

I have a couple of CentOS 5 machines on which I have installed Sun Grid Engine 6.2u3. One machine runs both an execution host and the QMaster, and the other runs only an execution host. The problem I am having is on the execution-host-only machine. I can schedule and run the simple.sh example provided with the Grid install, but when I try to run my actual program I get an error related to the display.

The first error I got was that DISPLAY was not set, so I set it to DISPLAY=:0.0 in the script. Then I would get a pair of errors: Xlib connection refused and Xlib no protocol. I have seen the suggestion of having /etc/profile check which user is logging in and set the XAUTHORITY environment variable to /home/username/.Xauthority, but this does not seem very elegant. I have also seen the xhost + option, which everyone seems to discourage.
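A minimal sketch of the XAUTHORITY idea described above, done inside the job script itself rather than in /etc/profile. The function name setup_x_env is hypothetical. The sketch assumes an X server is already running on display :0.0 of the execution host and that the job's user has a valid cookie in ~/.Xauthority; it does nothing for the case where no X server is running at all.

```shell
#!/bin/sh
# Hypothetical helper for an SGE job script: give the job a DISPLAY and an
# XAUTHORITY so Xlib can find the X server and its authorization cookie.
setup_x_env() {
    # Keep a DISPLAY that is already set; otherwise assume the local display.
    DISPLAY="${DISPLAY:-:0.0}"
    export DISPLAY

    # Point Xlib at the user's cookie file only if the file exists and
    # XAUTHORITY has not been set already.
    if [ -z "${XAUTHORITY:-}" ] && [ -f "$HOME/.Xauthority" ]; then
        XAUTHORITY="$HOME/.Xauthority"
        export XAUTHORITY
    fi
}
```

Calling setup_x_env near the top of the job script, before the program starts, keeps the change local to that job instead of affecting every login on the machine.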

If I log into the execution host, look up the value of the DISPLAY variable there, and then set DISPLAY in my Sun Grid script to the same value, the program will execute. I do not want to have to be logged in, though.

My suspicion is that these problems are related to the X11 forwarding and root security features. I admit that I do not understand the X Windows setup of security and forwarding very well and this is probably the root of my confusion. So after all of that rambling here are my questions:

Does anyone here know how Sun Grid is kicking off the jobs? Is the sge_execd process, running as root, just kicking the jobs off as the user who scheduled the job?

Is there an elegant solution to the X Windows problem of setting up the display if you have passed through the root user?

The program I am trying to run does not actually display anything; it just needs the X environment to do off-screen buffering.

I don't know if any of that made any sense. If you have questions please ask and I will try to clarify what I am trying to do....

Thanks in advance...Help me Obi Wan, you're my only hope!

John

archlinux_jessica 10-23-2009 04:25 PM

I can't really help with the issue but I can give some information that I know of at least. You may or may not already know this much and I'm sorry if my post might not be very helpful.

As I understand it, your Xorg server is currently running as your user, not as root. That means if you try to run a GUI as root it will report that there is no display, because root does not have Xorg running. I'm not sure what your security needs are, but I find that if you run sudo instead of su, you can run superuser commands while still remaining the user you're logged in as, so Xorg will still be available. And I mean run sudo as the user, not as root.

-Jessica-

gumbyjnm 10-23-2009 05:59 PM

Understand about sudo. I guess my description wasn't very clear. The Sun Grid execution service, sge_execd, is actually starting the process on the remote machine. I think the sge_qmaster process on the master talks to the sge_execd process on the hosts over tcp. The sge_execd is then kicking off the scheduled job somehow, not sure how that is done, but the job runs as the user specified in the submit jobs routine on the master.

This service is running in the background, so no one actually has to be logged into the host machine. Maybe that is the problem, there is no xterm because no one is logged in??

Thank you for taking a shot at it, though.

I have tried looking through the Sun Grid docs and they are less than helpful. The install process is not that hard, but you would never know it reading the documentation.

Another bit of info on the problem that I forgot in the original description: when I dump the environment variables from the script that Sun Grid starts, the variable TERM is set to linux, not xterm as I would expect. I have told the script to start as a login shell:

#!/bin/sh --login
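For comparing what the job actually inherits with what an interactive login sees, a small dump of the X-related variables can go at the top of the job script. This is only a sketch; the function name dump_x_env and the choice of variables are assumptions, not something taken from the Grid Engine documentation.

```shell
#!/bin/sh
# Hypothetical helper: print the variables relevant to the X problem, one per
# line, marking the ones the job did not inherit.
dump_x_env() {
    for name in TERM DISPLAY XAUTHORITY USER HOME; do
        # eval gives portable indirect expansion in plain sh; the names come
        # from the fixed list above, never from user input.
        if eval "[ \"\${$name+set}\" = set ]"; then
            eval "value=\$$name"
        else
            value='(not set)'
        fi
        printf '%s=%s\n' "$name" "$value"
    done
}
```

Running dump_x_env once from a normal login and once from a submitted job makes the differences between the two environments easy to see side by side.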

Still looking for Obi Wan to come get me off of Alderaan!!!

gumbyjnm 10-25-2009 11:12 PM

Update and mostly solved
 
OK, I have this working well enough to get by now. What I did:

Ran gdmsetup and checked the option to allow TCP connections. Not sure if I really needed to do this one or not, but will test tomorrow. I also added a file to /etc called X0.0.hosts, then in the script that runs on the execution node I set DISPLAY to the master and exported it. The only problem with this is that if I log out of the master, the remote process dies with a broken pipe. Which makes sense, because the X server instance goes away.
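The "set DISPLAY to the master and export" step might look roughly like the sketch below. The host name master and the helper name remote_display are placeholders, and the sketch assumes the master's X server is display 0, accepts TCP connections, and allows the execution host to connect.

```shell
#!/bin/sh
# Hypothetical helper: build a DISPLAY value that points at an X server on
# another machine, in the usual host:display.screen form.
remote_display() {
    host="$1"            # machine running the X server (placeholder name)
    number="${2:-0}"     # display number, defaults to 0
    screen="${3:-0}"     # screen number, defaults to 0
    printf '%s:%s.%s\n' "$host" "$number" "$screen"
}

# In the job script: send X traffic to the master's display.
# "master" is a placeholder host name, not taken from the thread.
DISPLAY=$(remote_display master)
export DISPLAY
```

Because the job then depends on that one X session, it ends when the session does, which matches the broken pipe seen after logging out of the master.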

So I guess my new question is: is there an X server instance running when no one is logged into a machine?

Deviathan 10-26-2009 11:34 AM

Couple of things here. I used to admin a 60 node SGE cluster so I'll chime in on some things here. There should be no need to do anything elaborate to run X style programs either on the client or the node the job is getting sent to. But it does depend on what command you use to send the job to the scheduler. You'll want to look at all the different commands to use when submitting a job, some are for running X type jobs and others will not allow such a thing.

Also, if you have to, you can set the DISPLAY variable in the user's startup script (e.g. .bashrc), along with any other settings needed to open up access to the display. Remember, when you run an SGE submit command, it should be sending the job to the master, which is usually the scheduler; the scheduler will then decide which node on the cluster to send the job to. The X server would be the one you are using on the client side; the client program, however, will be running on the SGE node the job was sent to.
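As an alternative to editing .bashrc, DISPLAY can be handed to the job at submit time: qsub documents -v NAME=value for passing a single variable and -V for exporting the whole submitting environment. The sketch below only prints the command line it would run, so it can be tried without a scheduler; the helper name qsub_with_display, the host name master, and the script name sim_job.sh are all placeholders.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the qsub command that would submit a job
# script with DISPLAY set for the job, without contacting the scheduler.
qsub_with_display() {
    display="$1"   # DISPLAY value the job should see, e.g. master:0.0
    script="$2"    # job script to submit (placeholder name below)
    # -v NAME=value passes one environment variable to the job.
    printf 'qsub -v DISPLAY=%s %s\n' "$display" "$script"
}

qsub_with_display master:0.0 sim_job.sh
```

Whether the job can actually open that display still depends on the X server's access control, so this only covers getting the variable to the node.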

gumbyjnm 10-26-2009 10:09 PM

OK, will look into the different submit commands. I was only using the qmon GUI to try different options. I have used the qsub CLI before but never any of the others. Looks like maybe qrsh or qtcsh might be the way I need to go.

Thanks for the help.....


All times are GMT -5.