Inconsistent job run times

koullislp · 01-18-2013, 11:02 AM

Hi, this is my first post, so please excuse me for any errors.

I have an issue with my new servers.
All of them using SLES 11.2 with kernel 3.0.42.

The thing is that when I run jobs using the queuing system(Platform LSF) on a specific set of servers it is about ten times slower than running that job directly from the CLI. Either way, the same command is used on both occasions. Please keep in mind that a job runs only on one server and the servers are locked for one job only at a time. Moreover the command used to run the job specifies number of CPUs and memory to use.

Therefore the OS, Kernel, number of CPUs, RAM are exactly the same on both occasions.

However, the job run from CLI is 10x times faster!!

At first I thought that it might has something to do with the Linux Task Scheduler. The scheduler obviously gives RealTime priority to the job which is run by the CLI. However I tried increasing the priority during runtime of the job which is run by the queuing system, but it didn't make any difference.

What do you think? Any advice would be much appreciated...

jpollard · 01-18-2013, 10:02 PM

It depends on the system. I am assuming you are referring to elapsed time, and not CPU time.

Batch jobs are normally placed at a low priority. Cron jobs (actually nearly anything else) get much higher priority.

Normally, this isn't a problem - batch jobs are just that, unattended jobs run after anything else is done. Elapsed time can be significantly impacted as any other process may flush the buffers, force paging, even cause the batch job to suspend, whenever they need the memory, CPU, or I/O. It all depends on the general system load.

This should be adjustable by the batch scheduler - just give the batch job (or queue) a higher priority (check the batch configuration - what it considers a high priority does not necessarily correspond to the system task switching priory). One warning... don't ever make a batch system have a priority higher than interactive - it is a sure way to reduce the elapsed time, but it will lock your system from use until the batch job finishes. I did this early in my career - fortunately, the batch job running only needed 10 minutes... but for those 10 minutes nobody could do anything.

koullislp · 01-19-2013, 07:46 AM

Thanks for your reply.

Exactly as you said I was not talking about CPU time. That's quite similar. The elapsed time is what bathers me.

I did raise the priority of that job using nice/renice/ionice but I couldn't make it any better. It had literally no impact on the job when I raised the priority. It was running with priority at 39, nice at 0, sched_other and I changed it to priority 39, nice -20 and sched_batch. However it didn't help at all.

Do you know how can I give the job an RT priority? Thanks...

unSpawn · 01-19-2013, 08:53 AM

You're running IBM Platform LSF so what does its documentation says about configuring (and troubleshooting) LSF scheduling? And since it's a commercially licensed product you may be entitled to support, that's what you pay IBM for, right?

koullislp · 01-19-2013, 11:05 AM

Hmmm.. Not really. The queuing system works fine... It's the Linux task scheduler that I believe is the problem here....

jpollard · 01-20-2013, 05:54 AM

Quote:

Originally Posted by koullislp

Hmmm.. Not really. The queuing system works fine... It's the Linux task scheduler that I believe is the problem here....

No. It depends on the queue job scheduler. ESPECIALLY if it is using control groups (cgroups)...

If a cgroup is active for the queue (which is what they were originally designed for) nice values only have significance within a cgroup. Not system wide.

Cgroups allow the system to be partitioned for scheduling purposes. And if the cgroup your batch job runs in is low, then anything else will take priority over it. The advantage cgroups give batch jobs is that entire jobs can be controlled relatively easily. It prevents processes within the cgroup from causing issues with processes outside that cgroup. In Fedora, each user is put into a cgroup on login - and activity that that user causes cannot affect other processes (mostly system daemons) and cause performance problems by pushing them out of memory, CPU, or I/O. It does introduce other problems (sluggish and slow copying of large files - which compete for memory and I/O) but the problems only affect the user job initiating the activity. Other users (and other jobs) are not affected (any thrashing is constrained to the job initiating the activity).

You can get an overview at http://www.kernel.org/doc/Documentat...ps/cgroups.txt

You can think of it as a "fair share scheduler" imposed on top of the nice scheduling, which now only works within a cgroup.

How cgroups are actually used by your job queue you would have find in the documentation.

koullislp · 01-20-2013, 06:55 AM

Hmmm... Interesting. In these case I will do my reading and I will let you know.

Thank you very much for your help.

Please let me know if you come up with something else...

koullislp · 01-22-2013, 04:11 AM

As it turned up, we are not using any cgroups, neither does the queuing system. The queuing system simply runs a script that we wrote in order to initiate the procedures and dispatch the job. I have checked that script and it looks fine to me...

jpollard · 01-22-2013, 07:59 AM

According to the documentation I can find, LFS uses a fairshare scheduling...

Code:

Dynamic User Priority
  LSF calculates a dynamic user priority for individual users or for a group, depending
  on how the shares are assigned. The priority is dynamic because it changes as soon
  as any variable in formula changes. By default, a user’s dynamic priority gradually
  decreases after a job starts, and the dynamic priority immediately increases when
  the job finishes.

How LSF calculates dynamic priority
  By default, LSF calculates the dynamic priority for each user based on:
  ◆ The number of shares assigned to the user
  ◆ The resources used by jobs belonging to the user:
    ❖ Number of job slots reserved and in use
    ❖ Run time of running jobs
    ❖ Cumulative actual CPU time (not normalized), adjusted so that recently
      used CPU time is weighted more heavily than CPU time used in the distant
      past

If you enable additional functionality, the formula can also involve additional
resources used by jobs belonging to the user:
  ◆ Historical run time of finished jobs
  ◆ Committed run time, specified at job submission with the -W option of bsub, or
    in the queue with the RUNLIMIT parameter in lsb.queues

How LSF measures fairshare resource usage
  LSF measures resource usage differently, depending on the type of fairshare:
...

With most fairshare schedulers, they will automatically adjust the process priority downward as it runs. And the more CPU time they take, the faster the priority goes down. My past experience with fairshare (not using LSF though), was that they recalculate priorities roughly every 5-10 seconds, which should be a configuration item.

So the LSF queuing could easily be getting your job slowed down.

Have you considered running it using the "batch" or "at" commands (ie cron, rather than LSF) mostly just to see if it runs differently?

koullislp · 01-23-2013, 02:56 AM

I suppose you are right...

How can you explain the difference in runtimes between server types using the same number of cores and memory?

Bad server architecture??

koullislp · 01-23-2013, 04:23 AM

Thanks for that. So here is the scenario:
LSF uses fairshare scheduler and no other non-system job runs at that time. Keeping this in mind when somebody submits a job to the queue, that job should have the highest priority and use as many resources as it is specified and not less. Right?

But when a job runs through LSF to a server, it is 7++ times slower than running it manually from command line(ssh). Therefore we derive that the problem is the LSF. Right?

jpollard · 01-23-2013, 05:14 AM

Quote:

Originally Posted by koullislp

Thanks for that. So here is the scenario:
LSF uses fairshare scheduler and no other non-system job runs at that time. Keeping this in mind when somebody submits a job to the queue, that job should have the highest priority and use as many resources as it is specified and not less. Right?

It depends on what is a "system job". If they are also running under LSF, then they are likely given a high priority queue, and thus get more time.

If there are no LSF jobs except the one in question, it should get the same or nearly the same CPU time as an interactive login. NOTE: Any interactive login may cause LSF to drop the priority. This is how it maintains the "interactivity" of logins not under its control.

The problem here is that it is expected that long running LSF jobs will be CPU bound. That means there is no "interaction" between people (that complain) and the computer. The difference between interaction and batch is that people expect a computer to respond with I/O quickly after hitting the enter key... with long waits before more computation. So short computation periods, with long delays. LSF will downgrade itself to maintain that level of interaction.

Quote:

But when a job runs through LSF to a server, it is 7++ times slower than running it manually from command line(ssh). Therefore we derive that the problem is the LSF. Right?

That is what it appears to be. 7 times slower is extreme though. It is possible, but should only be due to higher priority jobs, or some misconfiguration (checkpointing a large memory job could take a very long time). The expected slowdown is mostly due to the system recalculating the priority, and any queue management actions, and that should be under 5% - unless a lot of checkpointing occurs.

When the site I was working at ran queues they had 3 levels. High priority was a queue used for production jobs that needed to be finished ASAP. We were doing weather prediction runs - so they got the high priority. The middle priority was used for testing new prediction runs and upcoming changes to production runs. The low priority queue was provided for compiling large programs - so they always got bumped when any other job came in. They would proceed, but only on the leftover CPU time after everything else. There were other restrictions - the high and middle priority queues also tended to have large memory quota too (though not the same limits). The low priority queue also had smaller memory quotas as the compilers did not need lots of memory (compared to the production jobs that is).

User interactive logins was limited, to none. Operators only, and even then, just to make sure nothing happened to the system itself. Most operations were done via the queue manager as any interactive login caused the system to give that login additional time, and took away from the time (and memory) available for any batch job.

Our queuing system was much earlier (about 15 years ago now) and did not support distributed queues even though we had 3 servers. If your site has multiple servers (not indicated though) it is possible that LSF moved short running jobs around to where CPU time was available. Such jobs (if higher priority) would impact any low priority job.