
BusyBeeBop 02-22-2008 07:01 AM

Out of memory (oom) killer causes system crash?
 
My (virtual) Red Hat Enterprise Linux 4 (update 5) server _seems_ (on
preliminary analysis) to have crashed because the oom killer killed
processes such as "sshd", "udevd" and a few others. The last process to
have been killed, according to /var/log/messages, is "udevd".


Could the termination of "udevd" be the reason the server crashed?
If so: why does Linux whack processes whose loss causes the server to crash?

rayfordj 02-22-2008 07:08 AM

It's possible; I've seen the oom-killer even nuke init (that's usually when things go really south ;)). Somewhere in the list of processes the oom-killer killed is most likely the culprit.


http://www.redhat.com/archives/taroo.../msg00006.html
and
http://linux-mm.org/OOM_Killer

"It is the job of the linux 'oom killer' to sacrifice one or more processes in order to free up memory for the system when all else fails. It will also kill any process sharing the same mm_struct as the selected process, for obvious reasons. Any particular process leader may be immunized against the oom killer if the value of it's /proc/<pid>/oomadj is set to the constant OOM_DISABLE (currently defined as -17)."



For further review: http://www.google.com/linux?hl=en&sa...er&btnG=Search
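
If you want to keep a critical daemon like sshd off the oom-killer's hit list while you investigate, a rough sketch (assuming your kernel exposes /proc/<pid>/oom_adj as described above) would be:
Code:

# check the current oom adjustment for sshd (0 by default)
cat /proc/$(pidof -s sshd)/oom_adj

# set it to OOM_DISABLE (-17) so the oom-killer skips this process
echo -17 > /proc/$(pidof -s sshd)/oom_adj

Keep in mind that only hides the symptom; if memory really runs out, the oom-killer will just pick something else to kill.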

BusyBeeBop 02-22-2008 08:37 AM

Thanks for the reply.


Similar to Mr. Sisler in the Red Hat link, we're running virtual RHEL 4 servers (on an ESX server in our case). We have, however, other virtual RHEL 4 servers running the exact same kernel, Java application, and so forth. One of these servers had the same problem with the oom killer whacking processes, but that server never crashed. By setting lower_zone_protection to 250 (as Mr. Sisler suggests) the whole oom killer problem went away.

But the RHEL 4 server in question does not seem to respond to this hack. _And_ it is the only server that has crashed due to the oom killer.
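
For anyone else trying it: the setting is just a sysctl, roughly like this (treat it as a sketch, the exact commands aren't spelled out in Mr. Sisler's post):
Code:

# runtime change, no reboot needed
sysctl -w vm.lower_zone_protection=250

# equivalent via /proc
echo 250 > /proc/sys/vm/lower_zone_protection

# to persist across reboots, add this line to /etc/sysctl.conf:
#   vm.lower_zone_protection = 250

As far as I know this knob only exists on older 32-bit kernels like RHEL 4's 2.6.9 (later kernels replaced it with lowmem_reserve_ratio), so it's really a band-aid for low-memory pressure.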

I'm thinking that the problems we're having may have something to do with the interaction between ESX and the virtual RHEL 4 server. I'm not quite sure, but it seems likely there's something there. Am I way off?

rayfordj 02-22-2008 06:19 PM

You could be on to something. Depending on the memory-shares allocation, it is possible that the vmware-tools balloon driver is reclaiming memory from your guest so it can be allocated to a guest with higher shares that needs more physical RAM, and that this induces the OOM condition. (Robbing Peter to pay Paul -- or something like that.) I've not personally seen this as a problem myself...

There are some things you could consider implementing (and I'm sure others may have more, or more robust, implementations and/or recommendations than these) to help track down what may be going on.

The first is to make sure that the sysstat package is installed in RHEL4:
Code:

rpm -q sysstat
cat /etc/cron.d/sysstat

This will collect system activity (a snapshot every 10 minutes by default) and generate a text report nightly (sometime around 4am by default); the data files and reports end up under /var/log/sa/.
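
Once that has been collecting for a while you can pull the memory/swap history back out with sar, for example (DD being the day of the month):
Code:

# memory and swap utilization from today's data
sar -r

# same, but from a specific day's binary data file
sar -r -f /var/log/sa/saDD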

Also, you could configure top to sort by memory usage and dump the output to a file every X minutes via cron.

Start top, press M (this should sort by memory), then W (this should pop a quick confirmation just beneath the memory summary and above the process list saying it wrote ~/.toprc), then q (to quit).

Then add an entry to cron.d to capture the output every X minutes.

Every 5 minutes, for example:
Code:

*/5 * * * * root /usr/bin/top -d 1 -n 1 -b >> /root/top.out 2>/dev/null

Then, after the problem has happened, you can refer back to this file for a timeline of process activity sorted by memory usage and see what the big hitters were.
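
A quick way (just a sketch) to skim the captured snapshots for memory pressure is to pull out the summary lines:
Code:

# memory/swap summary from each captured snapshot
grep -E '^(Mem|Swap):' /root/top.out | less

# or page through the full snapshots around the crash time
less /root/top.out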

syg00 02-22-2008 08:17 PM

Killing user processes is unlikely to cause a full-on system crash - killing init is a bit drastic though. I would have thought the code was smarter than that. Haven't looked at it for a while though, and I certainly wasn't looking for that ...:p
More likely resource exhaustion - maybe low memory as suggested.
Go for a 64-bit kernel if you can - for everything.

Sysstat would certainly help - with RH you probably already have it. If you need to do the top trick instead, put it in a script, add a /proc/meminfo dump, and anything else you can find on Google.
A bit of legwork and digging around required, I'd reckon.
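
Something along these lines would do it (only a sketch; the script name, output file and interval are made up):
Code:

#!/bin/sh
# /usr/local/sbin/memsnap.sh (hypothetical name) - append a timestamped
# snapshot of top (sorted by memory via root's ~/.toprc) plus /proc/meminfo
OUT=/root/memsnap.out
{
  date
  /usr/bin/top -b -n 1
  cat /proc/meminfo
  echo "----"
} >> "$OUT" 2>/dev/null

Then point the cron.d entry at the script instead of calling top directly, e.g. */5 * * * * root /usr/local/sbin/memsnap.sh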

BusyBeeBop 02-26-2008 04:52 AM

Thanks for the tip on sysstat and top+cron. I'll try to implement these.

Unfortunately I'm stuck with 32-bit for now. :/

pierwelzien 06-02-2008 01:42 AM

Hello everyone. We are experiencing the same kind of problems in my company:

- We are running several RHEL 4 VMs under a VMware ESX server. Often, some of the RHEL 4 VMs have "Out of Memory" problems ...

This is really strange because we collect a lot of logs from all the RHEL 4 VMs, and when we analyze them, the processes are far from consuming all the available memory.

Also, the stats presented by the ESX server don't show that much memory consumption ...

Does anyone have any idea of what the problem could be? Thanks in advance

