Old 04-28-2009, 10:12 AM   #1
mohitanchlia
Member
 
Registered: Aug 2008
Posts: 60

Rep: Reputation: 15
Is it OOM Killer - how to tell from sar?


We are on 32-bit Red Hat ES 4 (Nahant Update 4).

What we saw was that the JBoss app servers on all of our front-end boxes died. The times they died were several hours apart. I looked in /var/log/messages but couldn't find anything. Then I looked in sar and it showed me that mem used was 99%. How can I tell:

1. If the OOM killer killed the process? I don't see anything in /var/log/messages. Is there any other way of confirming that?
2. I am using the sar -rR option. Is there any other option I can use to get more granular or precise details about the process?
3. How can I tell whether it's a bug in Red Hat, or whether something in Red Hat caused our app to die?
4. How do I interpret free -lm output to tell that we have a problem?
 
Old 04-28-2009, 08:25 PM   #2
tommylovell
Member
 
Registered: Nov 2005
Distribution: Fedora, Redhat
Posts: 372

Rep: Reputation: 101
Well no one has taken a stab at this yet, so maybe I'll give it a try.

Memory management is a pretty deep topic. Too involved to describe completely in a web post. And we don't have the complete output from your free command to comment on, but...

Quote:
I looked in sar and it showed me that mem used was 99%.
This is normal on an active system once it has been up for a while (especially after your backup product has run).

What were the kbcached and kbswpused values around the time the problem occurred?

If low and high respectively, then that is not good. Does any process have very high 'res' memory usage that keeps increasing over time? Or virtual memory, for that matter, that keeps creeping?

If high and low respectively, then that's ok.

Quote:
1. If OOM killer killed the process. I don't see anything in /var/log/messages. Is there any other way of confirming that.
If you had an OOM condition, believe me, you'd see it in /var/log/messages.

OOM occurs when you've run out of free memory and swap space, and the cached area cannot be pruned any more.
This is a serious situation and shouldn't happen on a normal system. When it does, the OOM killer, as a last resort, kills what it thinks are expendable processes to free up memory.

Quote:
2. I am using sar -rR option. Is there any other option I can use to get more granular or precise details of the process?
'sar's interval can be modified from its default of 10 minutes, but using 'vmstat' with a delay and count makes more sense.
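For example (the 5-second interval and 12-sample count here are arbitrary):

Code:
# sample memory, swap and paging activity every 5 seconds, 12 times (one minute total)
vmstat 5 12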

You can also watch what is going on with 'top'. With interactive 'top' you can use 'f' to change the columns displayed, then 'F' to sort on a desired column.

There are other options, too (like writing some code to scrape memory stats out of the /proc filesystem).
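A minimal sketch of that last idea, assuming you can get hold of the app server's PID (the PID, log file name, and 60-second interval below are all placeholders):

Code:
# append a timestamped resident-set-size sample for one process every minute
JBOSS_PID=1234        # placeholder; substitute your app server's actual PID
while true; do
    echo "$(date '+%F %T') $(grep VmRSS /proc/$JBOSS_PID/status)" >> /var/tmp/jboss-mem.log
    sleep 60
done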

Quote:
4. How do I interpret free -lm output to tell that we have a problem?
'total' = 'used' + 'free' is pretty obvious. (And this is, of course, real memory being reported.)

What isn't obvious is that much or most of 'cached' can often be "trimmed back" or "pruned", and can be thought of as 'free'.

So, "truly free" = 'free' + "most of 'cached'"
and "truly used" = 'total' - 'free' - "most of 'cached'"

When your 'cached' value is very low and your 'Swap free' is nearly exhausted, your system is in trouble.
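As a rough sketch of that arithmetic, reading straight from /proc/meminfo (I'm lumping 'Buffers' in with 'cached' here, which is an approximation):

Code:
# approximate "truly free" and "truly used", in MB
awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} /^Buffers:/ {b=$2} /^Cached:/ {c=$2}
     END {printf "truly free ~ %d MB\ntruly used ~ %d MB\n", (f+b+c)/1024, (t-f-b-c)/1024}' /proc/meminfo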

Quote:
3. How can I tell if it's a bug in Red Hat or something in Red Hat caused our app to die?
Sorry, that's the tough one. Does your app have a memory leak or some other bad behaviour? If you can gather info and categorize what is happening, maybe someone with jboss experience can comment.

Or open a ticket with Redhat and ask for guidance.

Maybe someone else can comment, too. Or correct anything I've said. Good luck.
 
Old 04-28-2009, 08:38 PM   #3
mohitanchlia
Member
 
Registered: Aug 2008
Posts: 60

Original Poster
Rep: Reputation: 15
Since we don't have the offending process running any more, is it possible to find out from sar how that process was consuming resources? I am running sar -rR. How do I interpret the output?
 
Old 04-28-2009, 11:55 PM   #4
tommylovell
Member
 
Registered: Nov 2005
Distribution: Fedora, Redhat
Posts: 372

Rep: Reputation: 101
The short answer is no. The 'sar -rR' isn't going to help you.

First of all, it probably defaults to 10-minute granularity, and with 10 minutes between each set of metrics 'sar' is probably worthless here.
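If you want finer-grained sar data going forward, the collection interval comes from the sysstat cron job. A sketch, assuming the stock Red Hat layout; the file name and sa1 path may differ on your box:

Code:
# /etc/cron.d/sysstat -- the stock entry collects once every 10 minutes:
#   */10 * * * * root /usr/lib/sa/sa1 1 1
# collecting once a minute instead:
*/1 * * * * root /usr/lib/sa/sa1 1 1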

The second problem is that it is system wide, not process specific. A 'man sar' describes what each field contains.

How did the other metrics change leading up to the incident? You would want to look at it for trends,
if memory is indeed your problem. As I mentioned earlier, the "mem used was 99%" that you stated in
your original post is immaterial.

Running low on memory makes your system run slowly; it doesn't cause software to fail until you actually
run out of memory, and you didn't find any OOM messages in syslog.

Last edited by tommylovell; 04-29-2009 at 01:06 PM. Reason: typo
 
Old 04-29-2009, 12:27 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,501

Rep: Reputation: 1077
I have to agree with tommylovell - if OOM-killer had been at work, you'd be able to find evidence of such in the logs.
Taskstats has been available for a while - sysstat exposes these via pidstat. You should be able to check that, depending on your kernel level.
Hmmm - maybe not; just noticed this
Quote:
Red Hat ES 4
 
Old 04-29-2009, 08:42 AM   #6
tommylovell
Member
 
Registered: Nov 2005
Distribution: Fedora, Redhat
Posts: 372

Rep: Reputation: 101
syg00, I was unaware of 'pidstat'. A nice addition to 'sysstat'. I'll have to play with it at home (Fedora 9, sysstat 8.0.4).

'sysstat' doc says it was made available in 7.1.4 (development version) and 8.0.0 (stable version).

So I won't see it at work for a while. And mohitanchlia does not have it either.

Release / kernel / sysstat
RHEL4.6 / 2.6.9 / 5.0.5
RHEL5.3 / 2.6.18 / 7.0.2

Thanks for the tip.
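For anyone reading this on a newer sysstat, a quick sketch of what it looks like (the PID below is just a placeholder):

Code:
# per-process memory stats every 5 seconds, 12 samples
pidstat -r 5 12
# or for a single process
pidstat -r -p 1234 5 12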
 
Old 04-29-2009, 11:41 AM   #7
mohitanchlia
Member
 
Registered: Aug 2008
Posts: 60

Original Poster
Rep: Reputation: 15
I am a little confused. Does it mean that our version of Linux will not print that message from OOM?

Another question: is there a way other than sar to see how the process was behaving in terms of memory at that time?
 
Old 04-29-2009, 01:05 PM   #8
tommylovell
Member
 
Registered: Nov 2005
Distribution: Fedora, Redhat
Posts: 372

Rep: Reputation: 101
Quote:
Does it mean that our version of Linux will not print that message from OOM?
No. It will print OOM messages.

If you had an OOM condition, your syslog (/var/log/messages) would be filled with OOM messages.
I have a Red Hat Enterprise Linux 4.6 system that had an OOM situation and there were a LOT of messages.

The other discussion was whether you had 'pidstat'. No, you don't have it. It's available in Fedora
but not for Redhat yet.

Quote:
Another question, Is there a way other than sar to see how process was behaving in terms of memory at that time?
No.

To get meaningful detailed granular history, you'd need to install a product like Teamquest, Tivoli or possibly Sarcheck, to collect and archive performance metrics to a database. Those products are expensive.

mohitanchlia, you seem certain that it was memory. Did your sar report show that you ran out of memory?
 
Old 04-29-2009, 01:35 PM   #9
mohitanchlia
Member
 
Registered: Aug 2008
Posts: 60

Original Poster
Rep: Reputation: 15
How do I check in sar if I ran out of memory? I looked at the virtual memory and it seemed to have gone above 4GB.
 
Old 04-29-2009, 03:17 PM   #10
tommylovell
Member
 
Registered: Nov 2005
Distribution: Fedora, Redhat
Posts: 372

Rep: Reputation: 101
'sar -rR' will tell you about 'real memory'. If kbcached was stable for a period of time, then dropped to a much lower value, and at the same time you saw a rise in kbswpused, and %swpused started to approach 100%, that would indicate you were running out of real memory. Did that happen?
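To look back at the saved data for the day of the failure, you can point sar at the daily file and bound it by time; something like this, with the day and times made up for the example:

Code:
# memory/swap history from the saved daily file, limited to a time window
sar -r -f /var/log/sa/sa28 -s 09:00:00 -e 11:00:00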

Also, you indicated that you couldn't find oom messages in /var/log/messages. 'cd /var/log', then 'grep oom-killer mess*' and 'grep "Out of Memory" mess*' to confirm. If that's true and there are no oom messages, you are NOT out of memory.


Total virtual memory can safely exceed the amount of real memory that you have on your system. By how much depends on the workload; it is too involved to explain here. But that is generally not a problem. If it were a problem, it would put pressure on real memory and swap, and you'd see it manifested there.

An individual process can, however, run out of addressable virtual memory: on your 32-bit OS, each process can only address 4GB of memory. I think it would be up to your app to put out "malloc failed" types of messages if that were the case. You won't find that in sar either.
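While the process is alive, you can at least see how close it is getting to that per-process ceiling; a sketch, assuming JBoss shows up as a 'java' process:

Code:
# virtual size (VSZ) and resident size (RSS) in kB; compare VSZ against ~4194304 kB (4GB)
ps -o pid,vsz,rss,comm -C java --sort=-vsz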
 
Old 04-29-2009, 04:32 PM   #11
mohitanchlia
Member
 
Registered: Aug 2008
Posts: 60

Original Poster
Rep: Reputation: 15
So the information that I see in /proc/meminfo, where HighTotal and LowTotal add up to around 4GB: is that applicable to all the processes that the OS is handling? I read about meminfo but couldn't really understand how to read the output.
 
Old 04-29-2009, 06:30 PM   #12
tommylovell
Member
 
Registered: Nov 2005
Distribution: Fedora, Redhat
Posts: 372

Rep: Reputation: 101
Quote:
So the information that I see in /proc/meminfo, where HighTotal and LowTotal add up to around 4GB: is that applicable to all the processes that the OS is handling?
That's real memory, as is MemTotal. The kernel, loaded modules, kernel stack space, i/o buffers, cache, every process's resident set, you name it, it's there.
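To pull just the fields being discussed out of /proc/meminfo (the HighTotal/LowTotal lines only appear on 32-bit kernels with highmem):

Code:
grep -E '^(MemTotal|MemFree|Buffers|Cached|HighTotal|HighFree|LowTotal|LowFree|SwapTotal|SwapFree):' /proc/meminfo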

Quote:
I read about meminfo but couldn't really understand how to read the output.
You're not alone. You can dig some of the more esoteric information out of one of the kernel books. "Understanding the Linux Kernel, 3rd Ed." and "Linux Kernel Development, 2nd Ed." are both good. But some things you can only find out by looking at the kernel code.

But you didn't answer my question about whether you had any of the OOM symptoms. Were you out of memory?

Last edited by tommylovell; 04-29-2009 at 06:31 PM. Reason: grammar
 
Old 04-29-2009, 08:12 PM   #13
mohitanchlia
Member
 
Registered: Aug 2008
Posts: 60

Original Poster
Rep: Reputation: 15
What we saw was that we went beyond 4GB of virtual memory, and I think that's why the app just died, because of the 32-bit OS limit.
 
  

