LinuxQuestions.org - Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
Is it OOM Killer - how to tell from sar? (https://www.linuxquestions.org/questions/linux-general-1/is-it-oom-killer-how-to-tell-from-sar-722249/)

mohitanchlia 04-28-2009 09:12 AM

Is it OOM Killer - how to tell from sar?
 
We are on 32-bit Red Hat ES 4 (Nahant Update 4).

What we saw was that the JBoss app servers on all of our front-end boxes died. The times when they died were several hours apart. I looked in /var/log/messages but couldn't find anything. Then I looked in sar and it showed me that mem used was 99%. How can I tell:

1. If the OOM killer killed the process. I don't see anything in /var/log/messages. Is there any other way of confirming that?
2. I am using sar -rR option. Is there any other option I can use to get more granular or precise details of the process?
3. How can I tell if it's a bug in Red Hat or something in Red Hat caused our app to die?
4. How do I interpret free -lm output to tell that we have a problem?

tommylovell 04-28-2009 07:25 PM

Well no one has taken a stab at this yet, so maybe I'll give it a try.

Memory management is a pretty deep topic. Too involved to describe completely in a web post. And we don't have the complete output from your free command to comment on, but...

Quote:

I looked in sar and it showed me that mem used was 99%.
This is normal on an active system once it has been up for a while (especially after your backup product has run).

What were the kbcached and kbswpused values around the time the problem occurred?

If low and high respectively, then that is not good. Does any process have very high 'res' memory usage that keeps increasing over time? Or virtual memory, for that matter, that keeps creeping?

If high and low respectively, then that's ok.
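To pull those values up for the day of the failure, something like this should work (a rough sketch, assuming the default sysstat setup that keeps daily data files under /var/log/sa; replace DD with the day of the month):

Code:

# memory and swap history for the day of the incident
sar -r -f /var/log/sa/saDD

Watch how kbcached and kbswpused move in the intervals leading up to the failure.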

Quote:

1. If the OOM killer killed the process. I don't see anything in /var/log/messages. Is there any other way of confirming that?
If you had an OOM condition, believe me, you'd see it in /var/log/messages.

OOM occurs when you've run out of free memory and swap space, and the cached area cannot be pruned any more.
This is a serious situation and shouldn't happen on a normal system. When it does, the OOM killer, as a last resort, kills what it thinks are expendable processes to free up memory.

Quote:

2. I am using sar -rR option. Is there any other option I can use to get more granular or precise details of the process?
'sar's interval can be modified from its default of 10 minutes, but using 'vmstat' with a delay and count makes more sense.

You can also watch what is going on with 'top'. With interactive 'top' you can use 'f' to change the columns displayed, then 'F' to sort on a desired column.

There are other options, too (like writing some code to scrape memory stats out of the /proc filesystem).
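For instance, something along these lines (a rough sketch; the intervals, counts and log file name are arbitrary):

Code:

# system-wide memory/swap activity every 5 seconds, 120 samples (10 minutes)
vmstat 5 120

# or crudely log /proc/meminfo once a minute with a timestamp
while true; do
    date
    grep -E 'MemFree|Buffers|Cached|SwapFree' /proc/meminfo
    sleep 60
done >> /var/tmp/meminfo.log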

Quote:

4. How do I interpret free -lm output to tell that we have a problem?
'total' = 'used' + 'free' is pretty obvious. (And this is, of course, real memory being reported.)

What isn't obvious is that much or most of 'cached' can often be "trimmed back" or "pruned", and can be thought of as 'free'.

So, "truly free" = 'free' + "most of 'cached'"
and "truly used" = 'total' - 'free' - "most of 'cached'"

When your 'cached' value is very low and your 'Swap free' is nearly exhausted, your system is in trouble.
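By the way, 'free' already does that arithmetic for you on its "-/+ buffers/cache:" line (a sketch):

Code:

# the "-/+ buffers/cache:" line shows used/free after buffers and cache
# are counted as reclaimable; its 'free' column is roughly the "truly free"
# value described above
free -lm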

Quote:

3. How can I tell if it's a bug in Red Hat or something in Red Hat caused our app to die?
Sorry, that's the tough one. Does your app have a memory leak or some other bad behaviour? If you can gather info and categorize what is happening, maybe someone with jboss experience can comment.

Or open a ticket with Redhat and ask for guidance.

Maybe someone else can comment, too. Or correct anything I've said. Good luck.

mohitanchlia 04-28-2009 07:38 PM

Since we don't have the offending process running, is it possible to find out from sar how that process was consuming resources? I am running sar -rR. How do I interpret the output?

tommylovell 04-28-2009 10:55 PM

The short answer is no. The 'sar -rR' isn't going to help you.

First of all, it probably defaults to 10-minute granularity, and with 10 minutes between each set of metrics, 'sar' is probably worthless.
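(If you do want finer-grained sar history going forward, the collection interval normally comes from the sysstat cron job. A sketch, and the exact path and entry may differ on your box:)

Code:

# /etc/cron.d/sysstat - sa1 takes one sample every 10 minutes by default
*/10 * * * * root /usr/lib/sa/sa1 1 1
# e.g. change to every 2 minutes instead:
# */2 * * * * root /usr/lib/sa/sa1 1 1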

The second problem is that it is system wide, not process specific. A 'man sar' describes what each field contains.

How did the other metrics change leading up to the incident? You would want to look at it for trends,
if memory is indeed your problem. As I mentioned earlier, the "mem used was 99%" that you stated in
your original post is immaterial.

Running low on memory makes your system run slowly; it doesn't cause software to fail until you actually run
out of memory, and you didn't find any OOM messages in syslog.

syg00 04-28-2009 11:27 PM

I have to agree with tommylovell - if OOM-killer had been at work, you'd be able to find evidence of such in the logs.
Taskstats has been available for a while - sysstat exposes these via pidstat. You should be able to check that, depending on kernel level.
Hmmm - maybe not; just noticed this
Quote:

Red HAt ES 4

tommylovell 04-29-2009 07:42 AM

syg00, I was unaware of 'pidstat'. A nice addition to 'sysstat'. I'll have to play with it at home (Fedora 9, sysstat 8.0.4).

'sysstat' doc says it was made available in 7.1.4 (development version) and 8.0.0 (stable version).

So I won't see it at work for a while. And mohitanchlia does not have it either.

Release / kernel / sysstat
RHEL4.6 / 2.6.9 / 5.0.5
RHEL5.3 / 2.6.18 / 7.0.2
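
For reference, once you're on a sysstat level that includes it, per-process memory sampling looks something like this (a sketch based on the sysstat 8.x docs; interval and count are arbitrary):

Code:

# per-process paging and memory (VSZ/RSS), every 5 seconds, 12 samples
pidstat -r 5 12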

Thanks for the tip.

mohitanchlia 04-29-2009 10:41 AM

I am a little confused. Does it mean that our version of Linux will not print that message from OOM?

Another question: is there a way other than sar to see how the process was behaving in terms of memory at that time?

tommylovell 04-29-2009 12:05 PM

Quote:

Does it mean that our version of Linux will not print that message from OOM?
No. It will print OOM messages.

If you had an OOM condition, your syslog (/var/log/messages) would be filled with OOM messages.
I have a Red Hat Enterprise Linux 4.6 system that had an OOM situation, and there were a LOT of messages.

The other discussion was whether you had 'pidstat'. No, you don't have it. It's available in Fedora
but not for Redhat yet.

Quote:

Another question: is there a way other than sar to see how the process was behaving in terms of memory at that time?
No.

To get meaningful detailed granular history, you'd need to install a product like Teamquest, Tivoli or possibly Sarcheck, to collect and archive performance metrics to a database. Those products are expensive.

mohitanchlia, you seem certain that it was memory. Did your sar report show that you ran out of memory?

mohitanchlia 04-29-2009 12:35 PM

How do I check in sar if I ran out of memory? I looked at the virtual memory and it seemed to have gone above 4GB.

tommylovell 04-29-2009 02:17 PM

'sar -rR' will tell you about "real memory". If kbcached was stable for a period of time, then dropped to a much lower value, and at the same time you saw a rise in kbswpused, and %swpused started to approach 100%, that would indicate you were running out of real memory. Did that happen?
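You can also narrow the report to the window around the failure with sar's start/end time options, e.g. (a sketch; substitute the right daily file and the actual times):

Code:

# memory/swap stats between 02:00 and 04:00 from that day's data file
sar -r -f /var/log/sa/saDD -s 02:00:00 -e 04:00:00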

Also, you indicated that you couldn't find oom messages in /var/log/messages. 'cd /var/log', then 'grep oom-killer mess*' and 'grep "Out of Memory" mess*' to confirm. If that's true, no oom messages, you are NOT out of memory.
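In other words, something like this (case-insensitive, and it also picks up the rotated copies of messages):

Code:

cd /var/log
grep -i 'oom-killer' mess*
grep -i 'out of memory' mess*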


Total virtual memory can safely exceed the amount of real memory that you have on your system. It depends. It is too involved to explain here. But that is generally not a problem. If it was a problem, it would put pressure on real memory and swap, and you'd see it manifested there.

An individual process can run out of addressable virtual memory. On your 32-bit OS, each process can only address 4GB of memory (and in practice user space typically gets around 3GB of that). I think it would be up to your app to put out "malloc failed" types of messages if that were the case. You won't find that in sar either.
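If you suspect that, watch the JBoss JVM's virtual size while it runs, e.g. (a sketch; 'java' as the process name and <pid> are just examples):

Code:

# virtual (VSZ) and resident (RSS) size, in KB, of the java processes
ps -o pid,vsz,rss,comm -C java

# or for one specific process
grep -E 'VmSize|VmRSS' /proc/<pid>/status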

mohitanchlia 04-29-2009 03:32 PM

So the information that I see in /proc/meminfo, where HighTotal and LowTotal together are around 4GB - is that applicable to all the processes that the OS is handling? I read about meminfo but couldn't really understand how to read the output.

tommylovell 04-29-2009 05:30 PM

Quote:

So the information that I see in /proc/meminfo, where HighTotal and LowTotal together are around 4GB - is that applicable to all the processes that the OS is handling?
That's real memory, as is MemTotal. The kernel, loaded modules, kernel stack space, i/o buffers, cache, every process's resident set, you name it, it's there.
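You can see those totals directly; LowTotal + HighTotal should roughly add up to MemTotal:

Code:

grep -E 'MemTotal|LowTotal|HighTotal|SwapTotal' /proc/meminfo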

Quote:

I read about meminfo but couldn't really understand how to read the output.
You're not alone. You can dig some of the more esoteric information out of one of the kernel books. "Understanding the Linux Kernel, 3rd Ed." and "Linux Kernel Development, 2nd Ed." are both good. But some things you can only find out by looking at the kernel code.

But you didn't answer my question about whether you had any of the OOM symptoms. Were you out of memory?

mohitanchlia 04-29-2009 07:12 PM

What we saw was that we went beyond 4GB of virtual memory, and I think that's why the app just died, because of the 32-bit OS limit.

