Kernel Call Trace Order - Is it top to bottom OR vice-versa

nkataria · 04-12-2009, 04:53 AM

Hello All,

I am facing a strange problem with a program I wrote: it goes into a zombie state. When I run "echo t > /proc/sysrq-trigger", I get the following in the "/var/log/messages" file.

----------some-code-----------
Apr 12 14:51:12 localhost kernel: Call Trace:
Apr 12 14:51:12 localhost kernel: [<021209d0>] do_exit+0x386/0x390
Apr 12 14:51:12 localhost kernel: [<02106693>] do_divide_error+0x0/0xaa
Apr 12 14:51:12 localhost kernel: [<02118df5>] do_page_fault+0x2fd/0x4b4
Apr 12 14:51:12 localhost kernel: [<0214e2d3>] sys_close+0x0/0x61
Apr 12 14:51:12 localhost kernel: [<0214e2d3>] sys_close+0x0/0x61
Apr 12 14:51:12 localhost kernel: [<02142e0e>] __vma_link+0x4e/0x93
Apr 12 14:51:12 localhost kernel: [<02142eaf>] vma_link+0x5c/0x8d
Apr 12 14:51:12 localhost kernel: [<02140c53>] follow_page+0x128/0x134
Apr 12 14:51:12 localhost kernel: [<0214c50b>] rw_vm+0x20b/0x234
Apr 12 14:51:12 localhost kernel: [<02118af8>] do_page_fault+0x0/0x4b4
Apr 12 14:51:12 localhost kernel: [<0214e2d3>] sys_close+0x0/0x61

----------some-code-----------

Can someone explain the order of the function calls here?
Did "do_exit" call "do_divide_error", or vice versa?

Does the "do_divide_error" entry mean that my program performs a divide-by-zero operation somewhere?

Thanks and Regards,
Navneet Kataria
P.S. I am using Fedora Core 2 with the 2.6.5-1.358smp kernel.

titan22 · 04-13-2009, 08:28 AM

It's bottom (caller) to top (callee). At some point in time do_divide_error() called do_exit().
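
To make the ordering concrete, here is a minimal Python sketch that reads log lines like the ones in the first post and prints the symbols oldest caller first. The helper names (`symbols_in_log_order`, `symbols_in_call_order`) are made up for this illustration, and the script only reorders what the log shows; it does not check whether every entry is a real stack frame.

```python
import re

# Matches trace entries such as "[<021209d0>] do_exit+0x386/0x390".
ENTRY = re.compile(
    r"\[<([0-9a-fA-F]+)>\]\s+(\S+?)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)"
)


def symbols_in_log_order(lines):
    """Return symbol names as they appear in the log (callee first)."""
    symbols = []
    for line in lines:
        match = ENTRY.search(line)
        if match:  # skips headers such as "Call Trace:"
            symbols.append(match.group(2))
    return symbols


def symbols_in_call_order(lines):
    """Return symbol names oldest caller first (the log order reversed)."""
    return list(reversed(symbols_in_log_order(lines)))


# First three entries of the trace quoted in the original post.
sample = [
    "Apr 12 14:51:12 localhost kernel: Call Trace:",
    "Apr 12 14:51:12 localhost kernel: [<021209d0>] do_exit+0x386/0x390",
    "Apr 12 14:51:12 localhost kernel: [<02106693>] do_divide_error+0x0/0xaa",
    "Apr 12 14:51:12 localhost kernel: [<02118df5>] do_page_fault+0x2fd/0x4b4",
]

if __name__ == "__main__":
    # prints: do_page_fault -> do_divide_error -> do_exit
    print(" -> ".join(symbols_in_call_order(sample)))
```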

nkataria · 04-14-2009, 11:57 PM

Thanks for the reply, titan22.

Can you also help me find the cause of the problem? For now, I think it is caused by a hardware issue, because the same application runs fine on another set of hardware with the same OS loaded. What kind of hardware issue might it be?

--
Thanks and Regards,
Navneet Kataria

titan22 · 04-17-2009, 10:06 AM

Your hardware issue is likely in the driver that implements "close". The driver may need a different implementation for different hardware. Sometimes /var/log/messages shows the instruction (EIP) that is causing the problem. You can decode it with gdb.

For example:
kernel: EIP is at d_instantiate+0x2d/0x56
[11:05am] /usr/src/linux-2.6.6-1.435.2.3.lair6smp
44 > gdb vmlinux
(gdb) info line *d_instantiate+0x2d
Line 66 of "list.h" starts at address 0xc0167d87
and ends at 0xc0167d8a <d_instantiate+48>.

Only limited output is available, and the stack dump does not give a very clear picture. Two close() calls appear in one stack, which is impossible. The best guess is that it crashed during do_page_fault().

Apr 12 14:51:12 localhost kernel: [<021209d0>] do_exit+0x386/0x390
Apr 12 14:51:12 localhost kernel: [<02106693>] do_divide_error+0x0/0xaa
Apr 12 14:51:12 localhost kernel: [<02118df5>] do_page_fault+0x2fd/0x4b4
Apr 12 14:51:12 localhost kernel: [<0214e2d3>] sys_close+0x0/0x61
Apr 12 14:51:12 localhost kernel: [<0214e2d3>] sys_close+0x0/0x61

Translate "do_page_fault+0x2fd/0x4b4" and "sys_close+0x0/0x61" to the corresponding line numbers. Check whether a divide by zero could occur around those lines. do_page_fault() itself does not perform any "divide" operation, so the divide is likely done in a function called by do_page_fault(), for example handle_mm_fault() (just a guess). close() can be implemented by either a filesystem or a network driver.
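
As a rough illustration of that translation step, the Python sketch below splits a `symbol+0xOFFSET/0xSIZE` entry and builds the matching `info line` command in the same form as the gdb example earlier in this post. The helper names (`decode`, `gdb_command`, `is_function_entry`) are hypothetical. The zero-offset check rests on one assumption about reading such dumps: an offset of 0x0 is the function's first instruction, so it cannot be the return address of a call made inside that function and may just be a leftover value on the stack.

```python
import re

# Matches "symbol+0xOFFSET/0xSIZE", e.g. "do_page_fault+0x2fd/0x4b4".
ENTRY = re.compile(r"(\w+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)")


def decode(entry):
    """Split a trace entry into (symbol, offset, function size)."""
    match = ENTRY.search(entry)
    if match is None:
        raise ValueError("not a symbol+offset/size entry: %r" % entry)
    return match.group(1), int(match.group(2), 16), int(match.group(3), 16)


def gdb_command(entry):
    """Build the gdb command that maps the entry to a source line."""
    symbol, offset, _size = decode(entry)
    return "info line *%s+%#x" % (symbol, offset)


def is_function_entry(entry):
    """True when the address is the very start of the function (offset 0)."""
    _symbol, offset, _size = decode(entry)
    return offset == 0


if __name__ == "__main__":
    for entry in ("do_page_fault+0x2fd/0x4b4", "sys_close+0x0/0x61"):
        note = "  (offset 0: treat with suspicion)" if is_function_entry(entry) else ""
        print(gdb_command(entry) + note)
```

The printed commands can be pasted at the (gdb) prompt after loading a vmlinux that matches the running kernel.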

You can also add assert() or panic() around the possible candidates to narrow down the problem. That should tell you the exact line number where the unexpected happens.

nkataria · 04-27-2009, 04:30 AM

Thanks once again Titan22 !!

I tried my application on Red Hat ES. The issue seems to be resolved, since the application and the OS have been running fine for 10 days.

BTW, I could not debug the FC2 kernel in my case because I could not locate the "vmlinux" file for it. I guess I need to compile the kernel (from the provided source) to generate vmlinux.

So it looks like a hardware compatibility issue with FC2, which was resolved with RH-ES4.

One more thing I want to ask: why do different distributions (like Fedora and Debian) modify the standard kernel? And why don't they clearly specify the changes they have made?