Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
I have often been frustrated by that same problem on Windows, and I've never found any kind of workaround there. It is a flaw in that OS's memory management algorithms, and there doesn't seem to be anything a user can do about it.
I haven't seen much of this behavior in Linux and I never really investigated it in Linux. I expect Linux gives the expert user more control over such things, but I don't know the details of what you ought to adjust.
From your description, I can't quite tell whether you fully understand the problem behavior, so I'll try to fill in some details:
The program is actively using more memory than the OS is letting the program keep resident. So the program is constantly soft faulting pages from the cache into its resident set, while the OS is bumping other pages out of that resident set into the cache. Then those other pages will be soft faulted soon after.
So if the memory management algorithms were tuned better, the process would have a larger resident set and there would be fewer pages pushed back into cache.
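One quick way to see this thrashing in action is to watch a process's fault counters and resident set over time. Here's a minimal sketch using procps `ps` on Linux; the PID used below is just the current shell for illustration, so substitute the PID of your actual job:

```shell
# Substitute your job's PID here; $$ (this shell) is used only for illustration.
PID=$$

# min_flt: soft (minor) faults -- pages pulled back from cache, no disk I/O.
# maj_flt: hard (major) faults -- pages that had to be read from disk.
# rss:     resident set size in kB.
ps -o pid,min_flt,maj_flt,rss -p "$PID"

# Sample again after a pause. If min_flt climbs rapidly between samples
# while rss stays flat, the process is cycling pages between its
# resident set and the cache, as described above.
sleep 1
ps -o pid,min_flt,maj_flt,rss -p "$PID"
```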
Whenever there is a performance problem of any kind, point-in-time snapshots of the system state are usually not particularly helpful. What was the state when the job started? Did it immediately change, or was the change gradual? Was the state constant, or did it change over time?
These are all critical questions to better understanding the overall behavior.
If you really want to take a different approach, and it's really pretty easy, install collectl and turn it on. It will sample almost everything the system is doing and write the results to a log in /var/log/collectl, taking samples every 10 seconds. Not to worry, the load is ~0.1% of a CPU.
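For reference, getting it recording typically looks something like the following. The -i (sample interval) and -f (log location) flags are from collectl's documentation, but check your version's man page; on most distros simply enabling the packaged service does the same thing:

```shell
# Run collectl as a daemon, sampling every 10 seconds and writing
# logs under /var/log/collectl (its usual default location):
sudo collectl -D -i 10 -f /var/log/collectl

# Or, more commonly, just enable the packaged service:
sudo systemctl enable --now collectl    # older systems: service collectl start
```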
Now, wait a day, or at least until your application has run for a while or, better yet, finished - and does the state return to normal when it does?
At this point, if you install collectl-utils on a system that has a webserver running on it, you'll find a nifty tool called colplot. Browse to http://hostname/colplot and you should see colplot start up. You point it at a directory containing collectl plot files and tell it to plot everything. You'll see 24-hour plots of virtually everything your system is doing, and hopefully the answer lies in the data.
btw - this same technique will work with ANY data but you need:
- the data visible as plots
- sufficient types of data: cpu, network, disk, memory at minimum but more types are better
- samples taken at a reasonable frequency and 10 seconds seems to work very well
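If you don't want to install anything, the same idea can be sketched in a few lines of shell. This is a toy sampler, not a collectl replacement; the /tmp log path and the three-iteration loop are just for illustration (a real sampler would loop forever with a 10-second sleep):

```shell
# Toy sampler: one CSV row per sample with a timestamp, the 1-minute
# load average, and free memory. Loops 3 times for illustration only.
LOG=/tmp/sysstats.csv
echo "epoch,loadavg1,mem_free_kb" > "$LOG"
for i in 1 2 3; do
    load=$(cut -d' ' -f1 /proc/loadavg)
    memfree=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    printf '%s,%s,%s\n' "$(date +%s)" "$load" "$memfree" >> "$LOG"
    # in a real sampler: sleep 10
done
cat "$LOG"
```

Plot the resulting CSV with whatever you like; the point is simply regular, timestamped samples of several resource types in one place.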
collectl and collectd are 2 totally different tools. I had never heard of collectd when I wrote collectl and don't even know which came first. collectl is based on Ron Urban's collect tool, which ran on DEC's Tru64 Unix. When Linux was becoming more visible in High Performance Computing, I ported the collect functionality to collectl; hence the name: collect for Linux.
While I can't tell you the difference, the focus of collectl has always been
- support the broadest set of performance counters around (and I think it really does)
- run with a relatively frequent monitoring rate of sampling every 10 seconds, though process sampling runs at once a minute since it is heavier-weight
- be lightweight enough that people will just turn it on and leave it running, and it does tend to use ~0.1% of a CPU
As an aside, it's not unusual to find collectl on some of the largest and fastest clusters in the world. If you look at the list of the top 500 clusters, collectl runs on most, if not all, of the HP systems there.
I am wondering if the user that is running the application is hitting a limit situation. Might want to check into how system resources are allowed to be used by checking ulimit -a as the user that is running the application.
The only relevant limit would be resident set size. I'm pretty sure that in Linux ulimit tracks that value (RLIMIT_RSS), but it isn't backed by any enforcement in the OS.
It appears something is limiting the resident set size, but I'm pretty sure that something isn't the value managed by ulimit.
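For reference, checking the limits is straightforward in bash; the relevant one here is "max memory size" (ulimit -m, i.e. RLIMIT_RSS), which, as noted above, modern Linux kernels record but no longer enforce:

```shell
# Show all resource limits for the current session:
ulimit -a

# Just the max resident set size (RLIMIT_RSS), in kB.
# Usually "unlimited"; even when set, modern kernels ignore it.
ulimit -m
```

Remember to run this as the user that owns the application, since limits are per-user and per-session.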
Thank you all for your help. Interesting to know about collectl; it sounds quite useful, so I'll give it a try.
With respect to the initial post, the behaviour I was reporting happened on three servers at the same time, even though they had already run similar processes from the queue. After stopping the jobs and resubmitting them, they ran flawlessly, so it must have been a transient situation that halted them in that strange state.
I'd bet on something related to the shared file system or the network, but what I don't know yet is why it affected only those three and not the rest. I'll have to keep an eye on them to see if it happens again.
Aha! You raise an interesting question: did this happen at exactly the same time, or approximately the same time? This is why it's so important to run a tool like collectl. It actually synchronizes its sampling down to the msec level, so if you're running ntp on all your machines, every 10 seconds all counters will be sampled within a couple of msec of each other. Then when the problem reoccurs, which it probably will, you will be able to compare behaviors. Perhaps it happens on one machine first and 'spreads' to the others, or maybe it happens simultaneously. Could it be a shared resource like the network? Hard to say without detailed data (and I do mean detailed).