Understanding the out-of-memory state and the oom-killer
I have recently been seeing servers run out of memory (physical and swap). The server then becomes completely unresponsive. What I would expect in this circumstance is that the out-of-memory killer kicks in and kills a process, but this does not seem to happen every time. Even when it does, it can take a few minutes from the server completely running out of memory and going unresponsive to the oom-killer terminating any processes. By that point various timeout errors have started and the server effectively needs to be rebooted anyway. Is there any threshold that can be tuned to allow the out-of-memory killer to kick in sooner, or alternatively is there a way to debug this to find out why the out-of-memory killer is not able to work effectively?
Consider my servers as a farm of machines for scientists to test code on. Memory leaks are not uncommon, but I need the server to terminate the offending process and continue to work. I would prefer to use the out-of-memory infrastructure, if it is suitable for this purpose, rather than write a script. I have tried invoking the oom-killer manually with "echo f > /proc/sysrq-trigger" and it (a) works and (b) makes a sensible choice of which process to kill.
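For anyone wanting to see ahead of time which process the oom-killer is likely to pick, here is a rough Python 3 sketch that ranks processes by the badness score the kernel exposes in /proc/<pid>/oom_score (a higher score means a more likely victim). The function name rank_by_oom_score and its parameters are made up for illustration, and it assumes oom_score is present under /proc on the kernel in question, so check that on your own machines first.

```python
import os

def rank_by_oom_score(proc_root="/proc", top=5):
    """Return up to `top` (score, pid, name) tuples, highest oom_score first.

    proc_root is a parameter only so the function can be pointed at a
    test directory; on a real system it should stay "/proc".
    """
    candidates = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            # The command name is the parenthesised second field of stat.
            name = stat[stat.index("(") + 1:stat.rindex(")")]
        except (OSError, ValueError):
            continue  # process exited or a file was unreadable; skip it
        candidates.append((score, int(entry), name))
    # Highest score first; ties are broken by lowest pid.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return candidates[:top]
```

Calling rank_by_oom_score() with no arguments and printing the tuples gives a quick view of the current top candidates. It only reads /proc and does not change which process gets killed.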
The machines have 16G RAM and 16G swap, and run kernels
2.6.18-308.20.1.el5 x86_64 and 2.6.18-308.24.1.el5 x86_64.
lowmem_reserve_ratio 256 256 32
swap_token_timeout 300 0
I don't have any systems anywhere near that old, but I'd be looking at that min_free_kbytes. For comparison, this 8 Gig F16 laptop has 67584.
Have a read of this (it's on wayback, so could take a while to load the page).
If your systems get really low on memory, there may not be any room left to allocate what the oom-killer itself needs without it fighting with kswapd. That might explain your observations.
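To get a feel for how close a box is running to that reserve, something like the Python 3 sketch below can compare MemFree from /proc/meminfo with the min_free_kbytes setting. The helper names (parse_meminfo, free_headroom_kb, read_text) are invented for this example. Treat the result as a rough indicator only: the kernel splits min_free_kbytes into per-zone watermarks, so a single system-wide subtraction is a simplification.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def free_headroom_kb(meminfo_text, min_free_kbytes):
    """Return MemFree minus min_free_kbytes, in kB.

    A value near zero or below suggests the system is already eating
    into the reserve. This ignores per-zone watermarks (simplification).
    """
    info = parse_meminfo(meminfo_text)
    return info["MemFree"] - min_free_kbytes

def read_text(path):
    """Read a whole file as text."""
    with open(path) as f:
        return f.read()
```

On a live system it would be called as free_headroom_kb(read_text("/proc/meminfo"), int(read_text("/proc/sys/vm/min_free_kbytes"))), for example from a cron job that logs the value so you can see what it looked like just before a hang.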
Dicking with the vm tunables is very much a black art - pick a system that is expendable to play on.
> I don't have any systems anywhere near that old, but I'd be looking at that min_free_kbytes.
Thank you for your suggestion. I changed min_free_kbytes to values up to 1G. Unfortunately, all that appears to achieve is that the system holds back 1G of RAM which, as far as I can determine, seems to be unusable by any process. I have been able to reproduce a full system hang, using a process with an artificial memory leak, even with values up to 1G.
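The leak program used for that test is not shown in the thread. As a hypothetical stand-in, a Python 3 sketch along these lines allocates memory in chunks and touches every page so the kernel has to back it with real memory. The function name leak, its parameters, and the 4096-byte page size are all assumptions for illustration. It stops at a mandatory limit, so raising limit_mb above RAM plus swap is what would push a machine towards the out-of-memory state; only do that on an expendable system.

```python
import time

PAGE = 4096  # assumed page size in bytes

def leak(chunk_mb, limit_mb, delay_s=0.0):
    """Allocate chunk_mb at a time until at least limit_mb are held.

    Returns the list of held blocks so the memory stays referenced.
    """
    held = []
    chunk_bytes = chunk_mb * 1024 * 1024
    while len(held) * chunk_mb < limit_mb:
        block = bytearray(chunk_bytes)
        # Write one byte per page so the pages are actually resident,
        # not just reserved address space.
        for offset in range(0, chunk_bytes, PAGE):
            block[offset] = 1
        held.append(block)
        if delay_s:
            time.sleep(delay_s)  # slow the leak down so it can be watched
    return held
```

For example, leak(64, 1024, delay_s=0.1) would hold roughly 1G after a couple of seconds; the slow ramp makes it easier to watch free memory and swap drain in another terminal.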