How to interpret a kernel stack trace
I have been unable to find any kind of tutorial or clue on how to determine why a crash occurred in the kernel. I am running a Red Hat EE 3.0 kernel and received the following crash, which appears to be in kswapd:
Code:
Aug 19 17:30:57 host1 login(pam_unix)[9816]: session closed for user someuser
The load was pretty high at:
Code:
kbmemfree kbmemused %memused kbmemshrd kbbuffers kbcached kbswpfree kbswpused %swpused
Any clue or any pointers to howto info would be a great help. Thanks! |
I'm no kernel guru, but it does appear that your system was attempting to free some swap space to allocate it to NFS. Perhaps the pointer being referred to was supposed to point to the next free block of memory, or something like that. In any case, null pointer dereferences are quite bad and IMHO that shows a bug in the kernel.
You'll see a similar report here that has a lot of similarities (minus NFS, but otherwise the branch followed by kswapd looks almost identical). That was in 2002, and there's a post by Andrew Morton saying that most of the developers thought it was just bad RAM, but due to the overwhelming number of reports they were getting, he was starting to think it was a kernel bug.
Sounds like your best bet is to get the most recent kernel. If the problems persist, test your RAM with memtest86 and/or consider swapping out the RAM sticks with known good RAM.
Another thing to point out is that you had nearly exhausted your swap space, which should really never happen. It seems like one or more of the applications you're running has some severe memory leaks in it. Another option would be to create more swap space. |
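To put a number on "nearly exhausted", here is a minimal Python sketch that computes swap usage from the kbswpfree and kbswpused columns of the sar output. The function names, the 90% threshold, and the sample figures are all assumptions for illustration; they are not values from the original report.

```python
def swap_used_percent(kbswpfree, kbswpused):
    """Percentage of swap in use, from sar's kbswpfree/kbswpused columns."""
    total = kbswpfree + kbswpused
    if total == 0:
        # No swap configured at all; report 0% rather than divide by zero.
        return 0.0
    return 100.0 * kbswpused / total


def swap_nearly_exhausted(kbswpfree, kbswpused, threshold=90.0):
    """True when swap usage is at or above the (assumed) threshold."""
    return swap_used_percent(kbswpfree, kbswpused) >= threshold


if __name__ == "__main__":
    # Hypothetical sample: 100 kB of swap free, 900 kB in use.
    print(swap_used_percent(100, 900))
    print(swap_nearly_exhausted(100, 900))
```

Where to set the threshold is a judgment call; the point is only that a machine sitting near 100% swap use has almost no headroom left.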
Quote:
Now to find out if this was fixed or not in the newer kernel! Korey |
It should be noted that even if the new kernel solves the crash issue, you're going to need a lot more RAM to continue running that load, since you're swapping out a ton of memory. Like I said, one of your applications probably is leaking memory.
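One rough way to spot a leaking application is to record each process's memory use at intervals and look for one that only ever grows. The Python sketch below shows that idea under stated assumptions: the process names and sample values are made up, and "never shrinks and ends higher than it started" is a crude heuristic, not a reliable leak detector.

```python
def is_steadily_growing(samples):
    """True if memory samples never decrease and the last exceeds the first."""
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and samples[-1] > samples[0]


def leak_suspects(history):
    """Return the names in history whose samples look like steady growth."""
    return sorted(name for name, samples in history.items()
                  if is_steadily_growing(samples))


if __name__ == "__main__":
    # Hypothetical resident-memory samples in kB, one list per process.
    history = {
        "app_a": [1000, 1500, 2200, 3100],   # grows on every sample
        "app_b": [4000, 3900, 4100, 4000],   # fluctuates
    }
    print(leak_suspects(history))
```

A process that caches aggressively can look the same over a short window, so samples taken over hours or days say more than a few minutes' worth.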
|
Quote:
Can both processors address 4 GB of physical RAM, or is it bound to the kernel's addressing capabilities? We were using the SMP kernel from Red Hat. |
Whoops, shows how much I was paying attention... Now that I looked at the numbers, yes that's quite an impressive battery of RAM.
So once again, I'm not that great with Linux kernel internals, but from what I can tell the limit is 4GB per process. Apparently the memory limitation doesn't have anything to do with the number of CPUs, it's either what the kernel's max is, or what the hardware memory controller can handle. |
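The 4GB-per-process figure follows from pointer width: assuming a 32-bit kernel (the thread does not say so explicitly), a process can name at most 2^32 distinct byte addresses. The short Python sketch below works through that arithmetic. It also checks that the crash address quoted in this thread, 0xc017c767, lies above the 3 GiB mark, which is consistent with the common 3G/1G user/kernel split on 32-bit x86; that split is an assumption about this machine, not something the report confirms.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_bytes(pointer_bits):
    """Number of distinct byte addresses a pointer of this width can name."""
    return 2 ** pointer_bits


if __name__ == "__main__":
    # A 32-bit pointer covers 4 GiB of virtual address space.
    print(addressable_bytes(32) / GIB)

    # Assumed kernel base for a 3G/1G split on 32-bit x86.
    kernel_base = 0xC0000000
    print(kernel_base / GIB)

    # The instruction pointer from the crash report sits above that base.
    print(0xC017C767 >= kernel_base)
```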
Okay, we are going to need to play around a little here. The problem may have started in kswapd but we can tell exactly where it actually happened.
We are going to have to use gdb (and I am not positive off the top of my head if we need to take action for a compressed kernel... I'll check on that).
gdb -k /path/to/kernel
This should spit out the introduction and leave you with a prompt of:
(kgdb)
Now, try
(kgdb) disas 0xc017c767
That number is the address of the instruction pointer where the problem occurred. It should spit out the function -- starting from the top in assembly. The assembly might not help you but at least you will know the name of the function that "broke."
If you have a core dump and a debugging kernel there is a lot more we can do. With a proper core dump we can examine the exact data that caused the problem and the exact state of the machine. Sadly, it is far more likely you don't have a core dump (I've been bitten more than once and every time fate conspires to do it when I have the core dump ability turned off).
I have looked only very briefly into the compressed kernel question but don't have the ability to try anything at work. For all I know, it could be a non-issue. It won't hurt anything to try the steps above.
Also... a very minor thing... when posting output could you please use the [.code.] and [./code.] tags around the output? (without the .'s) My window here is very small and it wraps lines in horrible places... and messes with the format in other subtle ways. It is a minor thing but it makes the output easier to read. |
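If gdb is not handy, the function name can also be found by looking the address up in the System.map file that matches the running kernel: the function is the symbol with the highest address at or below the instruction pointer. The Python sketch below illustrates that lookup. It is a sketch under assumptions: the symbol names and addresses in SAMPLE_MAP are invented for illustration, and a real run would read the actual System.map for the exact kernel build that crashed.

```python
import bisect


def parse_system_map(lines):
    """Return a sorted list of (address, name) from System.map-style lines."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        try:
            symbols.append((int(parts[0], 16), parts[2]))
        except ValueError:
            continue  # first field was not a hex address
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) of the nearest symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address is below every known symbol
    base, name = symbols[index]
    return name, address - base


# Hypothetical System.map excerpt; names and addresses are made up.
SAMPLE_MAP = """\
c017c000 T example_func_a
c017c700 T example_func_b
c017c900 T example_func_c
""".splitlines()

if __name__ == "__main__":
    symbols = parse_system_map(SAMPLE_MAP)
    hit = resolve(symbols, 0xC017C767)
    if hit is not None:
        name, offset = hit
        print("%s+0x%x" % (name, offset))
```

The offset says how far into the function the faulting instruction sits, which helps when matching it against the disassembly.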