Linux - Newbie
This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Hello,
I have a process that will run just fine for weeks at a time, but after about two weeks of uptime the process will just be gone. I suspect it is exhausting all the memory on the machine, and since there is no swap set up, the kernel just kills the process. A long time ago I found a log that showed processes that get reaped because of memory consumption; however, I cannot for the life of me find this log now. Does anyone know where such a log would exist?
My Linux version : Linux 2.6.35.14-97.44.amzn1.i686 i686
(Amazon AMI)
Typically, if a program has a memory leak, or is otherwise consuming all the system memory, it will just use it all up until the kernel seizes. If the program itself has memory management built in, then perhaps it is reaping itself, in which case perhaps it has its own logging facility. Or perhaps there is another generic process on the system that manages memory?
The syslog, typically /var/log/messages, is the first place I would look, but it sounds like you've already checked there.
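For example, a quick search of the usual places might look like this (log paths assumed for a typical RHEL-style layout such as the Amazon AMI; adjust for your distro):

    # search the kernel ring buffer for OOM killer activity
    dmesg | grep -i 'killed process'

    # search syslog, including rotated copies
    grep -i -e 'out of memory' -e 'oom-killer' /var/log/messages*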
I do not think the previous post is correct. During memory shortages, processes are killed by the Linux OOM (Out Of Memory) Killer, as many as necessary for the kernel to continue operating.
I'm not an expert on this, but I believe it is all done by the kernel. There doesn't seem to be a separate log for it; I think OOM information just gets dumped to the regular kernel log, which can be read with dmesg, but you should be able to filter these messages out to a separate file with syslogd if you wish.
According to the above references, you can mark a process so that it cannot be killed by the OOM Killer.
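A minimal sketch of that, assuming a 2.6.x kernel like the OP's where the older oom_adj interface applies (kernels 2.6.36 and later use oom_score_adj instead):

    # $PID is assumed to hold the PID of the process to protect; requires root
    echo -17 > /proc/$PID/oom_adj        # -17 = OOM_DISABLE on older kernels
    # on >= 2.6.36 the equivalent would be:
    # echo -1000 > /proc/$PID/oom_score_adj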
OOM stuff should appear in dmesg and /var/log/messages, though, shouldn't it? Something like:
"Out of Memory: Killed process 18254 (ntop)."
That's why I suspected something else was killing the process; the kernel is usually pretty clear when it is doing such things.
So it looks like you are right, the OOM killer is what would log such an action. However, following the ideas in this link http://stackoverflow.com/questions/6...nux-oom-killer, I am not seeing that my process was reaped, so most likely it is not consuming too much memory. Let me give you some more information: the process that is running is a GlassFish 3.1.2.2 server instance, and the logging that GlassFish provides is not helping. One second it is serving requests like normal, then the next the entire process is gone. I am not sure where else to look for a solution.
Maybe it would be better to run a small system activity recorder (sar, Atop, Dstat, Collectl, whatever else you fancy) and actually collect system statistics first? That, together with reviewing any bug tickets with respect to Java and GlassFish and reviewing your GlassFish server settings, might be a more efficient approach, because IMHO looking for log entries here is reactive, an after-the-fact operation, and that by itself won't change or improve anything.
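For instance, a rough sketch with sysstat's sar (assuming the sysstat package is installed; the interval, count, and output path are just examples):

    # record memory statistics every 60 seconds, 60 samples, to a binary file
    sar -r 60 60 -o /tmp/memstats.sar

    # play the file back later to see memory usage leading up to a crash
    sar -r -f /tmp/memstats.sar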
Are you already starting it in verbose mode? e.g.:
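Something like this, assuming the stock asadmin tool and the default domain name:

    # start the domain in the foreground, logging everything to the console
    asadmin start-domain --verbose domain1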
Haven't seen much activity here on this over the last few days. Any progress?
One thing about OOM I've discovered with collectl: normally, collectl runs every monitoring interval at exactly the same time, with no drift at all (well, maybe an occasional msec), and it never misses a sample. Whenever OOM runs, at just about the highest priority it can, a side effect is that collectl stalls and misses sampling intervals. In fact, on some systems I've seen collectl stall for over a couple of minutes! When this happens, you can often find a kernel daemon running at 100% in the process log either just before collectl stalls or when it comes back; I forget which, since I don't see it very often.
In other words, if you see long stalls in collectl logs, there's a good chance OOM was running and if there are no stalls it probably wasn't.
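For instance, to look for such gaps after the fact, a playback along these lines should work (the raw file path is just an example; it depends on how collectl was started):

    # replay recorded memory samples with timestamps; a jump in the
    # time column marks a stall
    collectl -p /var/log/collectl/myhost-20121015-000000.raw.gz -sm -oT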
-mark