Zombie Process, But Threads Running. Logs say Unable to handle kernel paging request.

nkataria · 01-07-2009, 05:13 AM

Hello,

I have written an application that has 10 periodic threads, plus the main process, which handles refreshing the application GUI.

One of the periodic threads also starts other background threads based on the packets received on the serial link. It also allocates and deallocates memory. These threads terminate after finishing their work.

The problem is that the application/GUI sometimes hangs. When I look at the process list, my process is marked as a zombie: the list displays <defunct> for it. But all the other 10 threads continue to work without any problem. (I confirmed this by observing the output the threads are supposed to generate.)

When I look at the /var/log/messages file, I see the following messages logged at the same time my application (process name: display) hung.

----------------------------------------------------------
Dec 21 04:24:31 localhost kernel: Unable to handle kernel paging request at virtual address 0214ebaf
Dec 21 04:24:31 localhost kernel: printing eip:
Dec 21 04:24:31 localhost kernel: 0214ebaf
Dec 21 04:24:31 localhost kernel: *pde = 0041b027
Dec 21 04:24:31 localhost kernel: Oops: 0000 [#1]
Dec 21 04:24:31 localhost kernel: SMP
Dec 21 04:24:31 localhost kernel: CPU: 0
Dec 21 04:24:31 localhost kernel: EIP: 0060:[<0214ebaf>] Not tainted
Dec 21 04:24:31 localhost kernel: EFLAGS: 00210246 (2.6.5-1.358smp)
Dec 21 04:24:31 localhost kernel: EIP is at sys_read+0x0/0x42
Dec 21 04:24:31 localhost kernel: eax: 00000003 ebx: 104b5fc4 ecx: 09a3c940 edx: 00389000
Dec 21 04:24:31 localhost kernel: esi: 00000004 edi: 09a3c940 ebp: 104b5000 esp: 104b5fc0
Dec 21 04:24:31 localhost kernel: ds: 007b es: 007b ss: 0068
Dec 21 04:24:31 localhost kernel: Process display (pid: 3911, threadinfo=104b5000 task=12cf26d0)
Dec 21 04:24:31 localhost kernel: Stack: fffeb200 0000000c 09a3c940 00000004 00000004 09a3c940 fef5a928 00000003
Dec 21 04:24:31 localhost kernel: fffe007b 0000007b 00000003 00953402 00000073 00200246 fef5a900 0000007b
Dec 21 04:24:31 localhost kernel: Call Trace:
Dec 21 04:24:31 localhost kernel:
Dec 21 04:24:31 localhost kernel: Code: 56 be f7 ff ff ff 53 50 8b 44 24 10 89 e2 e8 5f 0b 00 00 85
-------------------------------------------------------------------------

I am using kernel version 2.6.5-1.358smp on the Fedora Core 2 distribution.
I searched the net for this message and found that some people say it is a hardware/RAM problem. Is that true? Or should I look into the logic of my code to find the problem?

I also used the Magic SysRq key to list the kernel's view of the threads while the system was hung. I got the following output.
------------------------------------------------------------------------
......................................
Dec 24 10:21:52 localhost kernel: display Z F70B02E8 0 3909 3908 (L-TLB)
Dec 24 10:21:52 localhost kernel: 10486e8c 00200046 12b9a1b0 f70b02e8 20284300 0000000b 033f4ce0 0003d090
Dec 24 10:21:52 localhost kernel: 666cee00 000f9a17 12b9a1b0 12b9a360 2167cdc0 20284328 12b9a1b0 0000000b
Dec 24 10:21:52 localhost kernel: 021209d0 10486000 022b661c 00000000 00000008 02106693 0000000b 000000e3
Dec 24 10:21:52 localhost kernel: Call Trace:
Dec 24 10:21:52 localhost kernel: [<021209d0>] do_exit+0x386/0x390
Dec 24 10:21:52 localhost kernel: [<02106693>] do_divide_error+0x0/0xaa
Dec 24 10:21:52 localhost kernel: [<02118df5>] do_page_fault+0x2fd/0x4b4
Dec 24 10:21:52 localhost kernel: [<0214ebaf>] sys_read+0x0/0x42
Dec 24 10:21:52 localhost kernel: [<0214ebaf>] sys_read+0x0/0x42
Dec 24 10:21:52 localhost kernel: [<02118af8>] do_page_fault+0x0/0x4b4
Dec 24 10:21:52 localhost kernel: [<0214ebaf>] sys_read+0x0/0x42
Dec 24 10:21:52 localhost kernel:
.................................................

The output for the rest of the working threads was as follows:
..................................................................
Dec 24 10:21:52 localhost kernel: display S 00000000 0 3970 1 3971 13718 (NOTLB)
Dec 24 10:21:52 localhost kernel: 15e08f68 00200006 00013e45 00000000 148d42b0 148d42d0 033fcce0 00000000
Dec 24 10:21:52 localhost kernel: d98b5340 000fcfa5 208b39b0 208b3b60 0944470f 0944470f 000f41a7 00000001
Dec 24 10:21:52 localhost kernel: 022a020e 033fdf24 033fe900 0944470f 00000001 4b87ad6e 021253e3 208b39b0
Dec 24 10:21:52 localhost kernel: Call Trace:
Dec 24 10:21:52 localhost kernel: [<022a020e>] schedule_timeout+0x86/0xa1
Dec 24 10:21:52 localhost kernel: [<021253e3>] process_timeout+0x0/0x5
Dec 24 10:21:52 localhost kernel: [<021254c4>] sys_nanosleep+0xcc/0x13f
Dec 24 10:21:52 localhost kernel:
.......................................................................
---------------------------------------------------------------------

Can anybody guide me in this case?

Thanks and Regards,
Navneet Kataria

jailbait · 01-08-2009, 03:33 PM

Quote:

Originally Posted by nkataria

should I look into logic in my code for finding out the problem.

I would start by looking for a problem with a shared variable. Intermittent problems of this type in a multi-threaded application are often caused by failing to lock/unlock a shared resource correctly.

-----------------------
Steve Stites

P.S. If a problem is unpredictably intermittent, that is a good indication that it is a hardware problem rather than a software problem. So I can see why people are inclined to diagnose your problem as a hardware problem.

But software problems with lock/unlock logic can be intermittent due to the random timing of when two different threads happen to access a shared resource incorrectly at the same time.

In C with POSIX threads (pthreads), the locking primitive is the mutex. Here is an explanation of mutexes.

http://www.comptechdoc.org/os/linux/..._pgcmutex.html

nkataria · 01-12-2009, 04:46 AM

Can't figure the problem out.

One more thing I have noted is that the reported virtual address is the same every time.

"Unable to handle kernel paging request at virtual address 0214ebaf"
i.e. the address 0214ebaf does not change when the problem recurs.

Every time the system hangs, the address remains the same.

Can anybody help?

I am thinking of changing my design so that few or no new function calls are made after initialization.

nkataria · 04-04-2009, 03:41 AM

Is there a way to somehow generate a core dump for this?

I think that since the kernel log shows the function "do_divide_error", there might be a divide-by-zero bug in my program. And since I have many threads, it might be possible that before the default interrupt service routine (for divide_error) returns, one of my threads executes the read system call, which results in a protection fault in the kernel.

Is there a way to produce a core dump for a process that has gone into the zombie state while the threads it created are still running?

Can anybody tell me how core dumps work? I know that one has to run "ulimit -c unlimited" in the terminal. Does the core dump contain the history of all the states the process has gone through, or just the present process state?

If I send "kill -s SIGSEGV" to my zombie process after running "ulimit -c unlimited" in a newly opened terminal, will it produce a core dump?

Why is the core size limit terminal-dependent? Does it have to be set in every newly opened terminal? Will opening a new terminal and then sending the kill command to my process result in a core dump?