SGE (Sun Grid Engine) advice please
I'd like some advice on SGE, for distributing compile and test tasks across several CentOS systems connected by gigabit LAN.
I looked at the SGE site, read through the FAQ, and looked for the "introduction" it mentions, but found no obvious link. I couldn't find any page that describes the basics: something a level deeper than "it distributes work for you" but much less deep than most of the material there, which assumed you already knew a lot about using SGE.
Is it the right tool at all? Or is there some other Linux tool to do the job I want?
What is the per-job overhead? Would it make sense to hand it thousands of tasks, most of which take only a second each? Or do I need to group tasks up to a reasonable size to avoid adding lots of overhead?
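One way SGE amortizes per-job overhead, as I understand it, is an array job: thousands of similar tiny tasks become a single submission, and each task reads its index from `$SGE_TASK_ID`. A dry-run sketch (it prints the command rather than running it, and `run_one.sh` is a hypothetical per-task script):

```shell
# Sketch: build an array-job submission line instead of submitting
# thousands of individual jobs. Printed, not executed, so it can be
# inspected without an SGE install; run_one.sh is a placeholder.
batch_submit() {
    # $1 = number of tasks, $2 = per-task script (reads $SGE_TASK_ID)
    echo "qsub -t 1-$1 -b y -cwd $2"
}

batch_submit 1000 ./run_one.sh
```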
Is it practical to have a lightweight task on the requesting computer waiting for the requested task to finish (usually on some other computer)? Is it practical to specify some notification task to be run after a requested task finishes?
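From what I've read, both of these look possible: `qsub -sync y` blocks the submitting process until the job finishes (and propagates its exit status), and a notification task can be a second job held until the first completes. A dry-run sketch that builds the command lines as strings (so it can be checked without SGE installed; `compile_one.sh` and `notify.sh` are hypothetical scripts):

```shell
# Sketch, not verified against a live SGE install: construct the qsub
# command lines rather than executing them.

# Submit and block locally until the remote job finishes; with
# -sync y, qsub's own exit status is the exit status of the job.
blocking_submit() {
    echo "qsub -sync y -b y -cwd $1"
}

# Fire-and-forget, plus a notification job that SGE holds until the
# first job (identified by name via -N) completes.
submit_with_notify() {
    echo "qsub -N $2 -b y -cwd $1"
    echo "qsub -hold_jid $2 -b y -cwd ./notify.sh"
}

blocking_submit ./compile_one.sh
submit_with_notify ./compile_one.sh compile_job
```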
Is it practical to specify a task not start until after a listed set of previous tasks has finished?
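This one appears to be directly supported: `-hold_jid` takes a comma-separated list of job names or IDs that must finish first. Another dry-run sketch (`link_step.sh` and the job names are made up):

```shell
# Sketch: express "don't start until these finish" with -hold_jid,
# which accepts a comma-separated list of prerequisite jobs.
hold_submit() {
    deps=$1; shift
    echo "qsub -hold_jid $deps -b y -cwd $*"
}

hold_submit "compile_a,compile_b" ./link_step.sh
```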
One major use I have is doing compile and link tasks in a project with thousands of .cpp files to be compiled and linked to create dozens of .so files, all multiplied by several different tool sets (32 bit vs. 64 bit target, debug build vs. release build, Intel compiler vs. GCC, different versions of each, etc.)
Currently it is all launched by an obsolete version of bjam, which has a -j switch telling it how many compiles/links to run in parallel; I base that on either the number of cores or the amount of RAM on the system (whichever is more limiting). A top-level script runs several bjam commands (strictly sequentially), one for each of a few major subprojects times each of the currently enabled tool sets. That whole script is launched manually on whichever of a few CentOS systems isn't busy.
If something like SGE has very low overhead for tiny tasks, then I would want bjam to launch a very lightweight local task for every compile and every link, with bjam set to a much higher -j value (number in parallel). Each lightweight local task would request the individual compile or link and wait for it (because bjam needs to think the task is local). So the top-level script, bjam, and all those lightweight tasks would run on one manually selected system, and the real work would occur on several SGE-selected systems. That would represent a minimum change from the current structure.
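If per-job overhead turned out to be acceptable, the lightweight local task could be as simple as a compiler wrapper: bjam is pointed at the wrapper instead of the real compiler, and each invocation blocks on a remote SGE job. A dry-run sketch (whether this is viable is exactly the overhead question above; `-V` forwards the submitter's environment):

```shell
# Sketch of the lightweight wrapper idea, printed rather than
# executed. With -sync y, qsub waits and propagates the compiler's
# exit status, so bjam sees the "local" task succeed or fail as usual.
remote_compile() {
    echo "qsub -sync y -b y -cwd -V g++ $*"
}

remote_compile -c foo.cpp -o foo.o
```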
But I expect something like SGE has too much overhead for that idea. Another alternative would be handing whole bjam commands to SGE from the top-level script. Most can run in parallel, but some cannot start until specific earlier ones complete, so I would need extra logic to manage that.
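That extra ordering logic might reduce to naming each bjam job and declaring its prerequisites with `-hold_jid`, letting SGE do the sequencing. A dry-run sketch (the subproject and toolset names are invented for illustration):

```shell
# Sketch: submit each whole-bjam invocation as a named SGE job and
# encode the ordering with -hold_jid, instead of hand-rolled waiting.
submit_bjam() {
    name=$1; deps=$2; shift 2
    if [ -n "$deps" ]; then
        echo "qsub -N $name -hold_jid $deps -b y -cwd bjam $*"
    else
        echo "qsub -N $name -b y -cwd bjam $*"
    fi
}

submit_bjam core_gcc64 ""          toolset=gcc address-model=64
submit_bjam app_gcc64  core_gcc64  toolset=gcc address-model=64 app
```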