I am an engineer on an application that solves a computationally expensive problem and can do so in a highly parallel fashion. A user can specify that it should run P processes to solve the problem, and off it goes. Each process is likely to tie up one core and use several hundred megabytes of physical memory. If too many processes are started, the system thrashes and performance suffers.
The trouble is that we really want to determine the number of processes automatically instead of making users guess P. We should react to dynamically changing conditions on the machine, creating and destroying processes as needed. The application's resource usage should grow toward the capacity of the machine without degrading other applications running on the system.
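To make that intent concrete, here is a rough sketch of the kind of feedback loop I have in mind, written in Python and Linux-specific (it reads /proc/meminfo). The per-process memory figure, the thresholds, and the sampling interval are placeholder assumptions, and the worker body is a stub standing in for our real solver; this illustrates the shape of the problem rather than a solution we are happy with.

```python
import os
import time
import multiprocessing as mp

PER_PROCESS_MEM_MB = 300     # rough per-worker footprint; a made-up figure
SAMPLE_INTERVAL_SEC = 5      # how often to reassess; arbitrary

def mem_available_mb():
    """MemAvailable from /proc/meminfo, in MB (Linux-specific)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024
    return 0

def worker(stop):
    """Stand-in for one solver process; the real work (which would
    actually load a core and use memory) would go here."""
    while not stop.is_set():
        time.sleep(0.1)

def control_loop():
    cores = os.cpu_count() or 1
    pool = []  # list of (process, stop_event)
    while True:
        load1, _, _ = os.getloadavg()
        free_mb = mem_available_mb()
        if load1 < cores - 1 and free_mb > 2 * PER_PROCESS_MEM_MB:
            # Room to grow: start one more worker.
            stop = mp.Event()
            p = mp.Process(target=worker, args=(stop,))
            p.start()
            pool.append((p, stop))
        elif pool and (load1 > cores + 1 or free_mb < PER_PROCESS_MEM_MB):
            # Under pressure: ask one worker to exit.
            p, stop = pool.pop()
            stop.set()
            p.join()
        time.sleep(SAMPLE_INTERVAL_SEC)

if __name__ == "__main__":
    control_loop()
```

The obvious weakness is that the thresholds are guesses: the loop has no better idea than we do of how much of the machine it is entitled to.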
Furthermore, we would like a reasonable degree of fairness. If a user runs multiple instances of our application, or other applications with similarly elastic resource requirements, resources should be split roughly evenly (assuming all applications strive to be socially minded in this fashion). For example, the first application to spin up should not claim all the resources.
We have solved this problem in a simplistic manner by inspecting available memory and the number of logical cores on the system. However, I think there is a lot of room for improvement, and I'm wondering how well we can really do without OS assistance. How much of the file system cache should we assume is "up for grabs"? It seems to me that the OS is in the best position to judge whether there is too much memory pressure to admit another process. I suppose we could exit processes when our application starts incurring too many page faults, but that seems awfully reactive, and I can imagine loopholes.
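To show roughly what that simplistic sizing looks like, here is a minimal Python sketch (Linux-specific; the per-process memory figure is a rough guess rather than a measured value, and the function names are just illustrative). The last function reads the major-fault counter I alluded to above, only to show where such a reactive signal would come from.

```python
import os

PER_PROCESS_MEM_MB = 300  # rough per-worker footprint; a made-up figure

def mem_available_mb():
    """MemAvailable from /proc/meminfo, in MB (Linux-specific)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024
    return 0

def initial_process_count():
    """Static guess for P: bounded by logical cores and by available memory."""
    cores = os.cpu_count() or 1
    by_memory = mem_available_mb() // PER_PROCESS_MEM_MB
    return max(1, min(cores, by_memory))

def major_faults():
    """Major page faults this process has taken (12th field of /proc/self/stat)."""
    stat = open("/proc/self/stat").read()
    fields = stat[stat.rindex(")") + 2:].split()  # skip pid and (comm), which may contain spaces
    return int(fields[9])
```

The obvious problems are that MemAvailable is itself only the kernel's guess at how much of the page cache is reclaimable, that the estimate is taken once at startup, and that nothing here knows about other elastic applications competing for the same machine.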
There is a more detailed write-up of what we ended up doing here:
Since writing that post I have discovered cgroups: http://en.wikipedia.org/wiki/Cgroups
They don't really fit the bill, if I understand the article correctly.
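For anyone following the same trail, my understanding is that cgroups let an administrator impose caps from the outside rather than tell a cooperative application how much it ought to use. As a rough illustration, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup, the relevant controllers enabled, and permission to write there (the group name and the limits below are arbitrary examples):

```python
import os
from pathlib import Path

# Assumes cgroup v2 at /sys/fs/cgroup and sufficient privileges.
cg = Path("/sys/fs/cgroup/ourapp")
cg.mkdir(exist_ok=True)

(cg / "memory.max").write_text(str(4 * 1024 ** 3))  # hard memory cap of 4 GiB for the group
(cg / "cpu.max").write_text("400000 100000")        # at most 4 CPUs' worth of time per period
(cg / "cgroup.procs").write_text(str(os.getpid()))  # move this process into the group
```

Someone still has to pick those numbers, which is exactly the quantity we are trying to discover, so this limits the damage rather than answering the sizing question.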
Mac OS X has a feature that helps a little:
Basically, a process can declare itself expendable, meaning the OS can kill it at will. I'm not sure how well this works in practice.
Bear in mind that we have little control over the computers the application is deployed to or run on, so we must assume a healthy ecosystem of competing resource demands outside our control.
With the prevalence of cloud computing and highly parallel applications, other developers must have run into this sort of issue. What has and has not worked? How can we improve on the status quo? How would you go about solving this problem? Discuss.