LinuxQuestions.org - Understanding the out-of-memory state and the oom-killer

Hi,

I have recently been seeing servers run out of memory (physical and swap). The server then becomes completely unresponsive. What I would expect in this circumstance is that the out-of-memory killer kicks in and kills a process, but this does not seem to happen every time. Even when it does, it can take a few minutes from the server running out of memory and going unresponsive to the oom-killer terminating any processes. By that point various timeout errors have started and the server effectively has to be rebooted. Is there a threshold that can be tuned to make the oom-killer kick in sooner? Alternatively, is there a way to debug this and find out why the oom-killer is not able to work effectively?

Consider my servers as a farm of machines for scientists to test code on. Memory leaks are not uncommon, but I need the server to terminate the offending process and continue to work. I would prefer to use the out-of-memory infrastructure, if it is suitable for this purpose, rather than write a script. I have tried invoking the oom-killer manually with "echo f > /proc/sysrq-trigger" and it (a) works and (b) makes a sensible choice of which process to kill.
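
If a script does turn out to be necessary, a minimal sketch of one approach might look like the following. It is only an illustration. The function names (headroom_kb, low_memory) and the 524288 kB threshold are made up for this example, and treating MemFree + Buffers + Cached + SwapFree as the remaining headroom is a rough assumption, not how the kernel itself decides that it is out of memory.

```shell
#!/bin/sh
# Rough estimate of memory that could still be handed out, in kB.
# $1: a meminfo-format file (defaults to /proc/meminfo).
# Assumption: MemFree + Buffers + Cached + SwapFree approximates headroom.
headroom_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" || $1 == "SwapFree:" { sum += $2 }
         END { print sum + 0 }' "${1:-/proc/meminfo}"
}

# Succeeds (exit status 0) when the estimated headroom is below $1 kB.
# $2: optional meminfo-format file, passed through to headroom_kb.
low_memory() {
    [ "$(headroom_kb "$2")" -lt "$1" ]
}

# Hypothetical watchdog loop (run as root; the threshold is an example value):
#   while sleep 5; do
#       low_memory 524288 && echo f > /proc/sysrq-trigger
#   done
```

The loop is left commented out because it asks the kernel to kill a process every time the check fires. A real version would probably want some back-off after each trigger.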

The machines have 16G RAM and 16G swap, and run kernels 2.6.18-308.20.1.el5 x86_64 and 2.6.18-308.24.1.el5 x86_64.

/proc/sys/vm:

block_dump 0
dirty_background_bytes 0
dirty_background_ratio 10
dirty_bytes 0
dirty_expire_centisecs 3000
dirty_ratio 40
dirty_writeback_centisecs 500
drop_caches 0
flush_mmap_pages 1
hugetlb_shm_group 0
laptop_mode 0
legacy_va_layout 0
lowmem_reserve_ratio 256 256 32
max_map_count 65536
max_reclaims_in_progress 0
max_writeback_pages 1024
min_free_kbytes 32527
min_slab_ratio 5
min_unmapped_ratio 1
mmap_min_addr 4096
nr_hugepages 0
nr_pdflush_threads 2
overcommit_memory 0
overcommit_ratio 50
pagecache 100
page-cluster 3
panic_on_oom 0
percpu_pagelist_fraction 0
swappiness 60
swap_token_timeout 300 0
topdown_allocate_fast 0
vfs_cache_pressure 100
vm_devzero_optimized 1
zone_reclaim_interval 30
zone_reclaim_mode 1
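
For anyone who wants to compare these values against their own machine, a listing in the same "name value" form can be produced with something like the sketch below. The function name dump_sysctls is made up for this example. Any entry that cannot be read is printed with an empty value.

```shell
#!/bin/sh
# Print "name value" for every regular file in a sysctl directory.
# $1: directory to dump (defaults to /proc/sys/vm).
dump_sysctls() {
    dir="${1:-/proc/sys/vm}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Multi-value entries are tab-separated; show them with spaces.
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f" 2>/dev/null | tr '\t' ' ')"
    done
}

# Example: dump_sysctls /proc/sys/vm
```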