Well, no one has taken a stab at this yet, so maybe I'll give it a try.
Memory management is a pretty deep topic. Too involved to describe completely in a web post. And we don't have the complete output from your free command to comment on, but...
Quote:
|
I looked in sar and it showed me that mem used was 99%.
|
This is normal on an active system once it has been up for a while (especially after your backup product has run).
What were the kbcached and kbswpused values around the time the problem occurred?
If kbcached was low and kbswpused was high, then that is not good. Does any process have very high 'res' memory usage that keeps increasing over time? Or virtual memory, for that matter, that keeps creeping up?
If kbcached was high and kbswpused was low, then that's ok.
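To check whether a process's resident memory keeps creeping, one option is to sample the VmRSS line of /proc/<pid>/status at intervals and compare the readings. The following is only a rough sketch: the function names and the "every sample larger than the last" rule are my own assumptions, not a standard tool.

```python
def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text, or None.

    Kernel threads have no VmRSS line, so None is a normal result for them.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def keeps_growing(samples_kb, min_growth_kb=0):
    """True if each sample is larger than the previous one and the total
    growth exceeds min_growth_kb (a hypothetical threshold)."""
    if len(samples_kb) < 2:
        return False
    rising = all(b > a for a, b in zip(samples_kb, samples_kb[1:]))
    return rising and (samples_kb[-1] - samples_kb[0]) > min_growth_kb
```

In practice you would read /proc/<pid>/status every few minutes, feed the text to parse_vmrss_kb, and pass the collected values to keeps_growing. A process that legitimately warms up will also trip this check, so treat it as a hint, not proof of a leak.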
Quote:
|
1. If OOM killer killed the process. I don't see anything in /var/log/messages. Is there any other way of confirming that.
|
If you had an OOM condition, believe me, you'd see it in /var/log/messages.
OOM occurs when you've run out of free memory and swap space, and the cached area cannot be pruned any more.
This is a serious situation and shouldn't happen on a normal system. When it does, the OOM killer, as a last resort, kills what it thinks are expendable processes to free up memory.
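If you want to double-check the log rather than eyeball it, something like the sketch below could help. The exact wording of the kernel's OOM messages varies between kernel versions, so the pattern is deliberately loose and is my assumption about what to look for, not an exhaustive list.

```python
import re

# Loose match on phrases the kernel commonly logs when the OOM killer runs,
# e.g. "invoked oom-killer" or "Out of memory: Kill process ...".
OOM_PATTERN = re.compile(r"out of memory|oom-killer", re.IGNORECASE)


def find_oom_lines(log_lines):
    """Return the log lines that look like OOM killer activity."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan a log file on disk; reading it usually requires root."""
    with open(path, errors="replace") as log:
        return find_oom_lines(log)
```

Calling scan_log() with no arguments checks the current /var/log/messages. Rotated logs (messages.1 and so on) would need to be passed in by path if the problem happened a while ago.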
Quote:
|
2. I am using sar -rR option. Is there any other option I can use to get more granular or precise details of the process?
|
'sar's interval can be modified from its default of 10 minutes, but using 'vmstat' with a delay and count makes more sense.
You can also watch what is going on with 'top'. With interactive 'top' you can use 'f' to change the columns displayed, then 'F' to sort on a desired column.
There are other options, too (like writing some code to scrape memory stats out of the /proc filesystem).
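As a rough idea of what scraping /proc might look like, here is a minimal sketch that parses /proc/meminfo into a dictionary. The function names are made up for illustration. Most fields are reported in kB, but a few (such as HugePages_Total) are plain counts.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Values are kB for most fields; lines without a numeric value are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            stats[name.strip()] = int(fields[0])
    return stats


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling read_meminfo() from a loop with a sleep, and logging fields such as MemFree, Cached and SwapFree with a timestamp, would give you finer-grained history than the default sar interval.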
Quote:
|
4. How do I interpret free -lm output to tell that we have a problem?
|
'total' = 'used' + 'free' is pretty obvious. (And this is, of course, real memory being reported.)
What isn't obvious is that much or most of 'cached' can often be "trimmed back" or "pruned", and can be thought of as 'free'.
So, "truly free" = 'free' + "most of 'cached'"
and "truly used" = 'total' - 'free' - "most of 'cached'"
When your 'cached' value is very low and your 'Swap free' is nearly exhausted, your system is in trouble.
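The arithmetic above can be sketched as follows. Note the assumptions: "most of 'cached'" is represented by a made-up reclaimable_fraction of 0.9, and the thresholds for "very low" and "nearly exhausted" are arbitrary placeholders you would tune for your own system. All values are in whatever unit 'free' printed (MB with -m).

```python
def estimate_memory(total, free, cached, reclaimable_fraction=0.9):
    """Return (truly_free, truly_used), treating a fraction of 'cached'
    as reclaimable. The 0.9 default is an assumption, not a kernel value."""
    reclaimable = cached * reclaimable_fraction
    truly_free = free + reclaimable
    truly_used = total - truly_free
    return truly_free, truly_used


def looks_in_trouble(total, cached, swap_total, swap_free,
                     low_cache_ratio=0.05, low_swap_ratio=0.10):
    """True when 'cached' is very low AND free swap is nearly exhausted.

    Both ratios are hypothetical thresholds. A system with no swap
    configured (swap_total == 0) is not judged by this check.
    """
    cache_is_low = cached < total * low_cache_ratio
    swap_nearly_gone = swap_total > 0 and swap_free < swap_total * low_swap_ratio
    return cache_is_low and swap_nearly_gone
```

For example, a box showing 16000 total, 200 free and 10000 cached looks 99% used at first glance, yet estimate_memory reports most of that as effectively available.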
Quote:
|
3. How can I tell if it's a bug in Red Hat or something in Red Hat caused our app to die?
|
Sorry, that's the tough one. Does your app have a memory leak or some other bad behaviour? If you can gather info and categorize what is happening, maybe someone with JBoss experience can comment.
Or open a ticket with Red Hat and ask for guidance.
Maybe someone else can comment, too. Or correct anything I've said. Good luck.