We run a set of servers, each with 32GB of RAM, running Slackware 13.37 with the standard 2.6.37.6 kernel (recompiled with the 64GB memory model, i.e. the 32-bit PAE extensions, to access all 32GB).
Our application does not use all that much memory (max of about 2.4GB per process), so a lot of the memory gets allocated to the disk cache over time (which is fine).
However, when the disk cache reaches approximately 25GB (as reported by 'free'), the kernel decides there is no memory left on the machine and the OOM killer starts killing off normal processes (e.g. sshd, httpd):
Dec 24 19:59:54 server kernel: [621984.510826] Out of memory: Kill process 30237 (java) score 4 or sacrifice child
Dec 24 19:59:54 server kernel: [621984.510905] Killed process 30237 (java) total-vm:2460272kB, anon-rss:153936kB, file-rss:9988kB
At the time, 'free' reported available RAM as:
             total       used       free     shared    buffers     cached
Mem:      32788272   25309772    7478500          0       2928   24677912
-/+ buffers/cache:     628932   32159340
Swap:       995992          0     995992
When we clear the disk cache using:
echo 1 > /proc/sys/vm/drop_caches
the problem goes away until the disk cache reaches approximately 25GB again. As a short-term workaround we've created a crontab entry that periodically clears the disk cache using the above command, and as a longer-term measure we've added mem=16G to the kernel command line to effectively limit the size of the disk cache.
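For reference, the two changes look roughly like this (the hourly schedule is only an example, and the lilo.conf snippet assumes LILO is the boot loader, as is the Slackware default):

# root crontab entry -- periodically drop the page cache (schedule is illustrative)
0 * * * * echo 1 > /proc/sys/vm/drop_caches

# /etc/lilo.conf -- cap the RAM the kernel uses at 16GB (run 'lilo' afterwards to apply)
append="mem=16G"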
Has anybody come across this issue?
Our test servers do not have this amount of RAM yet, but they will shortly, and we will then test the latest 3.7.x kernel branch to see whether we hit the same problem.
Note that this is a 32-bit kernel (out of necessity).