How to interpret a kernel stack trace

kklier · 08-20-2004, 08:55 AM

I have been unable to find any tutorial or pointers on how to determine why a crash occurred in the kernel. I am running a Red Hat EE 3.0 kernel and received the following crash, which appears to be in kswapd:

Code:

Aug 19 17:30:57 host1 login(pam_unix)[9816]: session closed for user someuser
Aug 19 17:30:59 host1 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000107
Aug 19 17:30:59 host1 kernel:  printing eip:
Aug 19 17:30:59 host1 kernel: c017c767
Aug 19 17:30:59 host1 kernel: *pde = 00003001
Aug 19 17:30:59 host1 kernel: *pte = 00000000
Aug 19 17:30:59 host1 kernel: Oops: 0000
Aug 19 17:30:59 host1 kernel: nfs nfsd lockd sunrpc lp parport autofs tg3 e100 floppy sg microcode keybdev mousedev hid input usb-ohci usbcore ext3 jbd mptscsih mptbase sd_mod scsi_mod
Aug 19 17:30:59 host1 kernel: CPU:    1
Aug 19 17:30:59 host1 kernel: EIP:    0060:[<c017c767>]    Not tainted
Aug 19 17:30:59 host1 kernel: EFLAGS: 00010202
Aug 19 17:30:59 host1 kernel:
Aug 19 17:30:59 host1 kernel: EIP is at iput [kernel] 0x37 (2.4.21-9.ELsmp/i686)
Aug 19 17:30:59 host1 kernel: eax: 000000ef   ebx: f3120a80   ecx: f3120a90   edx: e8b72c80
Aug 19 17:30:59 host1 kernel: esi: 000000ef   edi: df7ba800   ebp: 0000035e   esp: f7fa3f6c
Aug 19 17:30:59 host1 kernel: ds: 0068   es: 0068   ss: 0068
Aug 19 17:30:59 host1 kernel: Process kswapd (pid: 7, stackpage=f7fa3000)
Aug 19 17:31:00 host1 kernel: Stack: 00000000 c0179610 f8d99ad7 e8b72c98 e8b72c80 f3120a80 c0179b1a f3120a80
Aug 19 17:31:00 host1 kernel:        f3120a80 c03a3d00 00003108 00000040 000001d0 c0179ee8 0000038b 00000040
Aug 19 17:31:00 host1 kernel:        c015388a 00000006 000001d0 00000014 0000312c 00000000 00040f42 ffffffff
Aug 19 17:31:00 host1 kernel: Call Trace:   [<c0179610>] dput [kernel] 0x30 (0xf7fa3f70)
Aug 19 17:31:00 host1 kernel: [<f8d99ad7>] nfs_dentry_iput [nfs] 0x57 (0xf7fa3f74)
Aug 19 17:31:00 host1 kernel: [<c0179b1a>] prune_dcache [kernel] 0x18a (0xf7fa3f84)
Aug 19 17:31:00 host1 kernel: [<c0179ee8>] shrink_dcache_memory [kernel] 0x68 (0xf7fa3fa0)
Aug 19 17:31:00 host1 kernel: [<c015388a>] do_try_to_free_pages_kswapd [kernel] 0x13a (0xf7fa3fac)
Aug 19 17:31:00 host1 kernel: [<c0153a38>] kswapd [kernel] 0x68 (0xf7fa3fd0)
Aug 19 17:31:00 host1 kernel: [<c01539d0>] kswapd [kernel] 0x0 (0xf7fa3fe4)
Aug 19 17:31:00 host1 kernel: [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf7fa3ff0)
Aug 19 17:31:00 host1 kernel:
Aug 19 17:31:00 host1 kernel: Code: 8b 46 18 85 c0 0f 85 b1 02 00 00 c7 44 24 04 9c 86 3a c0 8d
Aug 19 17:31:00 host1 kernel:
Aug 19 17:31:00 host1 kernel: Kernel panic: Fatal exception
Aug 19 17:31:00 host1 kernel:

But, I cannot figure out why this happened.

The load was pretty high at:

Code:

            kbmemfree kbmemused  %memused kbmemshrd kbbuffers  kbcached kbswpfree kbswpused  %swpused
16:50:00       391080   3734860     90.52         0    230052   2293084   1569572   6816220     81.28
17:00:02       329112   3796828     92.02         0    230080   2295696   1623204   6762588     80.64
17:10:01       324004   3801936     92.15         0    230084   2295600   1623480   6762312     80.64
17:20:01       322776   3803164     92.18         0    230104   2295768   1623524   6762268     80.64

But it does not look like all the resources were completely exhausted.

Any clues or pointers to how-to information would be a great help.

Thanks!

chort · 08-20-2004, 11:34 AM

I'm no kernel guru, but it does appear that your system was attempting to free some swap space to allocate it to NFS. Perhaps the pointer being referred to was supposed to point to the next free block of memory, or something like that. In any case, null pointer dereferences are quite bad and IMHO that shows a bug in the kernel.

You'll see a similar report here that has a lot of similarities (minus NFS, but otherwise the branch followed by kswapd looks almost identical). That was in 2002 and there's a post by Andrew Morton that most of the developers thought it was just bad RAM, but due to the overwhelming number of reports they were getting he was starting to think it was a kernel bug.

Sounds like your best bet is to get the most recent kernel. If the problems persist, test your RAM with memtest86 and/or consider swapping out the RAM sticks with known good RAM. Another thing to point out is that you had nearly exhausted your swap space, which should really never happen. It seems like one or more of the applications you're running has some severe memory leaks in it. Another option would be to create more swap space.
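To put numbers on the swap remark, here is a minimal Python sketch that recomputes the percentages from the first sar row quoted above. The dictionary and helper names are made up for illustration, and the "used minus buffers and cache" figure is only a rough estimate of application memory, not something sar reports directly.

```python
# First row (16:50:00) of the sar output quoted earlier in the thread, in kB.
row = {
    "kbmemfree": 391080,
    "kbmemused": 3734860,
    "kbbuffers": 230052,
    "kbcached": 2293084,
    "kbswpfree": 1569572,
    "kbswpused": 6816220,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


mem_total_kb = row["kbmemfree"] + row["kbmemused"]
swap_total_kb = row["kbswpfree"] + row["kbswpused"]

mem_used_pct = percent(row["kbmemused"], mem_total_kb)
swap_used_pct = percent(row["kbswpused"], swap_total_kb)

# Buffers and page cache can be reclaimed, so subtract them for a rough
# estimate of what the applications themselves are holding in RAM.
app_used_kb = row["kbmemused"] - row["kbbuffers"] - row["kbcached"]
app_used_pct = percent(app_used_kb, mem_total_kb)

print("RAM used:            %.2f%%" % mem_used_pct)
print("RAM used minus cache: %.2f%%" % app_used_pct)
print("swap used:           %.2f%%" % swap_used_pct)
```

The first and last figures reproduce the %memused and %swpused columns. The middle one suggests that much of the "used" RAM is cache, while the swap figure is the one that really stands out.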

kklier · 08-20-2004, 01:38 PM

Quote:

Originally posted by chort
I'm no kernel guru,....

Thanks chort. Any bit of info helps. We are limited in the kernels that we can use. We are forced to use the updates from Red Hat as they come out. I will however be switching from 2.4.21-9.EL to 2.4.21-15.0.2.EL, the one provided in Update 2.

Now to find out if this was fixed or not in the newer kernel!

Korey

chort · 08-20-2004, 10:36 PM

It should be noted that even if the new kernel solves the crash issue, you're going to need a lot more RAM to continue running that load since you're swapping out a ton of memory. Like I said, one of your applications probably is leaking memory.

kklier · 08-21-2004, 10:52 PM

Quote:

Originally posted by chort
It should be noted that even if the new kernel solves the crash issue, you're going to need a lot more RAM to continue running that load since you're swapping out a ton of memory. Like I said, one of your applications probably is leaking memory.

Speaking of RAM... these are dual-processor Xeons with 4GB of RAM and 8GB of swap. As near as I can tell there were two simulations running (one on each processor), but we are not sure whether the remaining RAM or swap space was sucked up during the last 10 minutes before sar stopped reporting. These sims are known to eat RAM, so no surprise.

Can both processors address 4GB of physical RAM, or is that bound by the kernel's addressing capabilities? We were using the SMP kernel from Red Hat.

chort · 08-22-2004, 01:40 AM

Whoops, shows how much I was paying attention... Now that I looked at the numbers, yes that's quite an impressive battery of RAM.

So once again, I'm not that great with Linux kernel internals, but from what I can tell the limit is 4GB per process. Apparently the memory limitation doesn't have anything to do with the number of CPUs; it's either what the kernel's max is, or what the hardware memory controller can handle.
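As a rough sketch of where the 4GB figure comes from, assuming a 32-bit x86 kernel (the i686 in the oops suggests one): the size of an address space is set by the width of an address, not by the CPU count. The function name below is made up for illustration. The 36-bit case is what x86 PAE provides for physical addresses; whether this particular Red Hat kernel uses it is not something the thread states.

```python
def addressable_gib(address_bits):
    """Return how many GiB can be addressed with the given address width."""
    return 2 ** address_bits / 2 ** 30


# 32-bit virtual addresses: the ceiling on what one process can map.
virtual_gib = addressable_gib(32)

# 36-bit physical addresses (x86 PAE): the ceiling on installed RAM the
# kernel could reach, if it was built with PAE support (an assumption here).
pae_physical_gib = addressable_gib(36)

print("32-bit address space: %.0f GiB" % virtual_gib)
print("36-bit (PAE) physical: %.0f GiB" % pae_physical_gib)
```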

frob23 · 08-22-2004, 09:24 AM

Okay, we are going to need to play around a little here. The problem may have started in kswapd but we can tell exactly where it actually happened.

We are going to have to use gdb (and I am not positive off the top of my head if we need to take action for a compressed kernel... I'll check on that).

gdb -k /path/to/kernel

This should spit out the introduction and leave you with a prompt of:
(kgdb)

Now, try
(kgdb) disas 0xc017c767

That number is the address of the instruction pointer where the problem occurred. It should spit out the function -- starting from the top in assembly. The assembly might not help you but at least you will know the name of the function that "broke."
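The syslog output already carries the same information in symbolic form ("EIP is at iput [kernel] 0x37"), so the function name can also be pulled straight from the log. Here is a minimal Python sketch that does that for the 2.4-style format shown in this thread. The regular expressions and the `parse_oops` name are my own and only fit lines shaped like the ones posted above; other kernel versions print the trace differently.

```python
import re

# A few lines copied from the oops earlier in this thread.
SAMPLE = """\
Aug 19 17:30:59 host1 kernel: EIP:    0060:[<c017c767>]    Not tainted
Aug 19 17:30:59 host1 kernel: EIP is at iput [kernel] 0x37 (2.4.21-9.ELsmp/i686)
Aug 19 17:31:00 host1 kernel: Call Trace:   [<c0179610>] dput [kernel] 0x30 (0xf7fa3f70)
Aug 19 17:31:00 host1 kernel: [<f8d99ad7>] nfs_dentry_iput [nfs] 0x57 (0xf7fa3f74)
Aug 19 17:31:00 host1 kernel: [<c0179b1a>] prune_dcache [kernel] 0x18a (0xf7fa3f84)
"""

# "EIP is at <function> [<module>] 0x<offset>"
EIP_RE = re.compile(r"EIP is at (\w+) \[(\w+)\] 0x([0-9a-f]+)")
# "[<address>] <function> [<module>] 0x<offset>"
FRAME_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\w+)\s+\[(\w+)\]\s+0x([0-9a-f]+)")


def parse_oops(text):
    """Return (fault, frames) parsed from 2.4-style oops text.

    fault  is (function, module, offset) or None.
    frames is a list of (address, function, module, offset).
    """
    fault = None
    frames = []
    for line in text.splitlines():
        m = EIP_RE.search(line)
        if m:
            fault = (m.group(1), m.group(2), int(m.group(3), 16))
            continue
        m = FRAME_RE.search(line)
        if m:
            frames.append(
                (int(m.group(1), 16), m.group(2), m.group(3), int(m.group(4), 16))
            )
    return fault, frames


fault, frames = parse_oops(SAMPLE)
print("faulted in %s [%s] at offset 0x%x" % fault)
for addr, func, module, offset in frames:
    print("  0x%08x %s [%s] +0x%x" % (addr, func, module, offset))
```

The trace entries are listed most recent first. As far as I know, kernels of this era build the list by scanning the stack for anything that looks like a code address, so treat each entry as a candidate and not a guaranteed caller.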

If you have a core dump and a debugging kernel there is a lot more we can do. With a proper core dump we can examine the exact data that caused the problem and the exact state of the machine. Sadly, it is far more likely you don't have a core dump (I've been bitten more than once and every time fate conspires to do it when I have the core dump ability turned off).
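Even without a core dump, the register dump and the "Code:" bytes in the oops give a small part of that picture. The Python sketch below decodes the first instruction on the "Code:" line and checks it against the fault address. It assumes that on this 2.4 kernel the "Code:" line starts at the faulting EIP (I believe it does, but have not checked this exact build), and the decoder is a hypothetical helper that handles only this one instruction form.

```python
# Values copied from the oops at the top of the thread.
FAULT_ADDRESS = 0x00000107           # "virtual address 00000107"
REGISTERS = {
    "eax": 0x000000EF, "ebx": 0xF3120A80, "ecx": 0xF3120A90, "edx": 0xE8B72C80,
    "esi": 0x000000EF, "edi": 0xDF7BA800, "ebp": 0x0000035E, "esp": 0xF7FA3F6C,
}
CODE = bytes.fromhex("8b 46 18")     # first three bytes of the "Code:" line

# x86 32-bit register numbering used in the ModRM byte.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_load_disp8(code):
    """Decode opcode 8B (MOV r32, r/m32) with an 8-bit displacement, no SIB.

    Returns (destination register, base register, displacement).
    """
    opcode, modrm = code[0], code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 1 or rm == 4:
        raise ValueError("only handles MOV r32, [reg+disp8]")
    disp = int.from_bytes(code[2:3], "little", signed=True)
    return REG32[reg], REG32[rm], disp


dest, base, disp = decode_mov_load_disp8(CODE)
effective = (REGISTERS[base] + disp) & 0xFFFFFFFF

print("instruction: mov 0x%x(%%%s),%%%s" % (disp, base, dest))
print("effective address: 0x%08x" % effective)
print("matches fault address:", effective == FAULT_ADDRESS)
```

If the assumption holds, the kernel was reading a field at offset 0x18 from a pointer whose value was 0xef, which is a small garbage value and not a clean NULL. That fits either memory corruption or a kernel bug; the log alone cannot say which, or which structure the pointer was supposed to reference.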

I have done some very brief looking about the compressed kernel question but don't have the ability to try anything at work. For all I know, it could be a non-issue. It won't hurt anything to try the steps above.

Also... a very minor thing... when posting output could you please use the [.code.] and [./code.] tags around the output? (without the .'s) My window here is very small and it wraps lines in horrible places... and messes with the format in other subtle ways. It is a minor thing but it makes the output easier to read.