Embedded kernel+app running on 256MB but random crashes on 128MB

numa · 04-10-2012, 08:33 AM

Hi,
I am not sure if anybody can help but here is my problem:
I developed a Linux platform using Wind River Linux for an embedded device with a VIA Nehemiah processor. I have only one application running but it needs to run 24/7.
My first system has 256 MB of RAM and the application runs with no problem, but on my second system, which has only 128 MB, I am getting totally random crashes: kernel oopses or application crashes. These crashes happen after anything between 2 minutes and 2 days.

Would somebody have any idea where to start looking to solve this issue? Any advice would be great, as I have been trying to solve this problem for weeks now.

More Info:
Kernel 2.6.34.8
The whole OS is loaded into RAM and everything runs from RAM. There is no swap partition.
Board: HS2604

Cheers.

jefro · 04-10-2012, 04:26 PM

Right now we can't say whether it is the amount of RAM or the board.

Without knowing any memory stats, we also can't say. Do you have any terminal access to get any stats?

Contacting Wind River may shed some light. They are selling you the software and should know the basic limits of this OS.

irey · 04-10-2012, 07:52 PM

If there is no swap and the kernel needs memory for some critical operation, it may kill a process without asking. You should be able to see that in the kernel logs (dmesg). This may explain the application crashes.
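To make the dmesg check concrete, here is a minimal sketch that filters kernel log text for lines that look like OOM-killer activity. The function name `find_oom_lines` is hypothetical, and the two search strings ("oom-killer" and "out of memory") are an assumption about the log wording, which varies between kernel versions, so they are matched case-insensitively.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Return the lines of a kernel log that look like OOM-killer activity.
// Assumption: such lines contain "oom-killer" or "out of memory"; the
// match ignores case because the wording differs between kernel versions.
std::vector<std::string> find_oom_lines(const std::string& kernel_log)
{
    std::vector<std::string> hits;
    std::istringstream in(kernel_log);
    std::string line;
    while (std::getline(in, line)) {
        // Lower-case a copy so the original line is returned unchanged.
        std::string lower(line);
        std::transform(lower.begin(), lower.end(), lower.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (lower.find("oom-killer") != std::string::npos ||
            lower.find("out of memory") != std::string::npos) {
            hits.push_back(line);
        }
    }
    return hits;
}
```

Feed it the saved output of dmesg; an empty result suggests the kernel did not kill the process for lack of memory, at least not within the part of the log that was captured.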

Sorry I don't know much about the oops but this may help: http://www.kernel.org/doc/Documentat...ps-tracing.txt.

numa · 04-11-2012, 02:53 AM

Hi, thank you for your replies,

I have a screen connected to the device.
top shows that on the 128MB system we are using almost 100% of the memory; however, this memory is not really used, only allocated (number of threads * 8MB of stack). I reduced the thread stack size to 2MB, but the behaviour is the same even though the app now only uses 30% of the memory.
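For reference, one common way to shrink the per-thread stack is to pass an explicit size through a thread attribute object. This is only a sketch of that approach, not necessarily how it was done in the application: `worker` and `start_thread_with_stack` are hypothetical names, and the 8MB default is assumed to come from the usual Linux stack limit.

```cpp
#include <pthread.h>
#include <cstddef>

// Hypothetical worker; stands in for one of the application's threads.
static void* worker(void* arg)
{
    return arg;
}

// Start a joinable thread with an explicit stack size instead of the
// default. Returns 0 on success or the pthread error code on failure.
int start_thread_with_stack(pthread_t* thread, std::size_t stack_bytes)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);  // the attribute must be initialised first
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(thread, &attr, worker, nullptr);

    // The attribute object is no longer needed once the thread exists.
    pthread_attr_destroy(&attr);
    return rc;
}
```

Calling `start_thread_with_stack(&t, 2 * 1024 * 1024)` would request a 2MB stack. The stack is address space reserved up front, which is why top can report a large figure even when little of it has been touched.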

Below are some examples of the oops messages that I am getting; they never point to the same part of the code. I have many more of them.

I don't think it is a memory allocation problem: I created an app with a memory leak to check what message I would get if the system was running out of memory, and I am not getting that dmesg output when my app crashes.
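A leak test of that kind can be as small as the sketch below. It is an illustration only, not the actual test app: `leak_memory` is a hypothetical name, and the 4096-byte page size is an assumption that holds for typical x86 Linux. Each block is touched once per page so the kernel really has to back it with RAM.

```cpp
#include <cstddef>
#include <cstdlib>

// Allocate `chunks` blocks of `chunk_bytes` each, write to every page so
// the memory is actually committed, and never free the blocks.
// Returns the number of bytes leaked before malloc failed (if it did).
std::size_t leak_memory(std::size_t chunks, std::size_t chunk_bytes)
{
    const std::size_t page_bytes = 4096;  // assumption: 4 KB pages
    std::size_t leaked = 0;
    for (std::size_t i = 0; i < chunks; ++i) {
        void* block = std::malloc(chunk_bytes);
        if (block == nullptr)
            break;
        // volatile keeps the compiler from optimising the writes away.
        volatile unsigned char* p = static_cast<unsigned char*>(block);
        for (std::size_t off = 0; off < chunk_bytes; off += page_bytes)
            p[off] = 0xA5;
        leaked += chunk_bytes;
    }
    return leaked;
}
```

Calling it in a loop with a large chunk size until the process is killed, then reading dmesg, gives a reference for what a real out-of-memory event looks like on the target.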

I would ask Wind River, but they are very protective of the help they give if you don't pay for dedicated support. But yes, I could ask whether they are aware of any issue with a system running on 128MB.

I am totally puzzled!

Code:

BUG: unable to handle kernel paging request at 00203b3c
IP: [<c128ee1c>] __rb_rotate_left+0x3c/0x70
*pde = 00000000 
Oops: 0000 [#1] PREEMPT 
LTT NESTING LEVEL : 0
last sysfs file: /sys/devices/pci0000:00/0000:00:07.2/usbmon/usbmon1/dev
Modules linked in: comsync(P) ip_tables iptable_filter

Pid: 218, comm: lcc Tainted: P           2.6.34.8-WR4.1.0.0_standard #3 PT-2200/Uknown
EIP: 0060:[<c128ee1c>] EFLAGS: 00010002 CPU: 0
EIP is at __rb_rotate_left+0x3c/0x70
EAX: c4708034 EBX: 00203b34 ECX: c457bb34 EDX: c15954e8
ESI: 00203b35 EDI: 00000000 EBP: c46fb988 ESP: c46fb97c
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process lcc (pid: 218, ti=c46fa000 task=c74b1620 task.ti=c46fa000)
Stack:
 c4708034 c457ac74 c457bb34 c46fb9a4 c128ef3e c457ac74 c15954e8 0ff01d63
<0> 00000537 c457bb2c c46fb9d0 c101e4a2 c457ac7c c15954e8 c15954cc 00000000
<0> ffda2adc ffffffff c15954cc c457bb2c 00000001 c46fb9e4 c1021548 c457bb2c
Call Trace:
 [<c128ef3e>] ? rb_insert_color+0x7e/0x110
 [<c101e4a2>] ? __enqueue_entity+0xa2/0xc0
 [<c1021548>] ? enqueue_entity+0x68/0xd0
 [<c10215f6>] ? enqueue_task_fair+0x46/0x80
 [<c101b129>] ? enqueue_task+0x39/0x50
 [<c101b279>] ? activate_task+0x29/0x30
 [<c102409e>] ? try_to_wake_up+0xde/0x210
 [<c1049ee0>] ? hrtimer_wakeup+0x0/0x20
 [<c10241ff>] ? wake_up_process+0xf/0x20
 [<c1049ef8>] ? hrtimer_wakeup+0x18/0x20
 [<c104a377>] ? __run_hrtimer+0x67/0x1e0
 [<c104a5e7>] ? hrtimer_run_queues+0xf7/0x170
 [<c1037868>] ? run_local_timers+0x8/0x20
 [<c10378ae>] ? update_process_times+0x2e/0x60
 [<c1053cb4>] ? tick_periodic+0x24/0x90
 [<c1053d38>] ? tick_handle_periodic+0x18/0x90
 [<c1005194>] ? timer_interrupt+0x14/0x20
 [<c1073478>] ? handle_IRQ_event+0x58/0x230
 [<c10214d8>] ? dequeue_task_fair+0x68/0x70
 [<c1075758>] ? handle_level_irq+0x88/0x130
 [<c1004c78>] ? handle_irq+0x18/0x30
 [<c10048ef>] ? do_IRQ+0x3f/0xa0
 [<c104a164>] ? hrtimer_try_to_cancel+0x44/0xf0
 [<c1003129>] ? common_interrupt+0x29/0x30
 [<c10cd8a6>] ? fget_light+0x66/0xd0
 [<c12b6120>] ? n_tty_poll+0x0/0x150
 [<c10dc14a>] ? do_sys_poll+0x1ca/0x510
 [<c10dbe60>] ? __pollwait+0x0/0x20
 [<c10dbe80>] ? pollwake+0x0/0x60
 [<c10dbe80>] ? pollwake+0x0/0x60
 [<c1420efc>] ? fib4_rule_action+0x4c/0x60
 [<c13ca1f4>] ? fib_rules_lookup+0xe4/0x120
 [<c14201d0>] ? fib_table_lookup+0xd0/0x100
 [<c1420efc>] ? fib4_rule_action+0x4c/0x60
 [<c13ca1f4>] ? fib_rules_lookup+0xe4/0x120
 [<c1420fed>] ? fib_lookup+0x2d/0x40
 [<c13e48d8>] ? ip_route_input_slow+0x168/0xad0
 [<c13d9de8>] ? nf_conntrack_in+0x238/0x620
 [<c13d8e96>] ? nf_conntrack_free+0x46/0x60
 [<c13e714e>] ? ip_rcv_finish+0x8e/0x320
 [<c13ae77a>] ? __kfree_skb+0x3a/0x90
 [<c13ae830>] ? kfree_skb+0x30/0x90
 [<c13e714e>] ? ip_rcv_finish+0x8e/0x320
 [<c13e75d6>] ? ip_rcv+0x1f6/0x2a0
 [<c80832d8>] ? comsync_ioctl+0x628/0x1100 [comsync]
 [<c13b7526>] ? netif_receive_skb+0x366/0x4c0
 [<c10da72a>] ? vfs_ioctl+0xaa/0xb0
 [<c10478cf>] ? thread_group_cputime+0xbf/0xe0
 [<c100744d>] ? pit_read+0x7d/0xf0
 [<c104fd1b>] ? ktime_get_ts+0xdb/0x110
 [<c1468775>] ? system_call_done+0x0/0x4
Code: 04 89 7c 24 08 8b 79 08 83 e3 fc 85 ff 89 78 04 74 09 8b 37 83 e6 03 09 c6 89 37 8b 31 89 41 08 83 e6 03 09 de 85 db 89 31 74 24 <39> 43 08 74 27 89 4b 04 8b 10 8b 1c 24 8b 74 24 04 8b 7c 24 08 
EIP: [<c128ee1c>] __rb_rotate_left+0x3c/0x70 SS:ESP 0068:c46fb97c
CR2: 0000000000203b3c
---[ end trace e3c0297a3494e9de ]---

or

Code:

BUG: unable to handle kernel NULL pointer dereference at 000000d8
IP: [<c11265ab>] do_task_stat+0xeb/0x710
*pde = 00000000 
Oops: 0000 [#1] PREEMPT 
LTT NESTING LEVEL : 0
last sysfs file: /sys/devices/pci0000:00/0000:00:07.3/usbmon/usbmon2/dev
Modules linked in: comsync(P) ip_tables iptable_filter

Pid: 208, comm: top Tainted: P           2.6.34.8-WR4.1.0.0_standard #10 PT-2200/Uknown
EIP: 0060:[<c11265ab>] EFLAGS: 00010086 CPU: 0
EIP is at do_task_stat+0xeb/0x710
EAX: c1ba1ec0 EBX: 00000000 ECX: 00000001 EDX: c4014000
ESI: 00000000 EDI: c1bf8b70 EBP: c4015edc ESP: c4015d48
 DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
Process top (pid: 208, ti=c4014000 task=c1bf9310 task.ti=c4014000)
Stack:
 c1ba4b80 c153b564 000000ca c4015ebc 00000053 00000035 000000ca 000000ca
<0> 00000000 ffffffff 00402140 00000126 00000000 00000000 00000000 0000002e
<0> 00000001 00000000 00000000 00000014 00000000 00000001 000018d6 00000000
Call Trace:
 [<c10f56ed>] ? seq_open+0x6d/0x120
 [<c112406f>] ? pid_revalidate+0x8f/0x170
 [<c1122d60>] ? proc_single_show+0x0/0x80
 [<c1045277>] ? get_pid_task+0x47/0x70
 [<c1121234>] ? proc_single_open+0x24/0x40
 [<c10d62df>] ? __dentry_open+0x1cf/0x2f0
 [<c10e141f>] ? generic_permission+0x1f/0xb0
 [<c10d6400>] ? dentry_open+0x0/0x90
 [<c1126bf0>] ? proc_tgid_stat+0x20/0x30
 [<c1122db5>] ? proc_single_show+0x55/0x80
 [<c10f5385>] ? seq_read+0xf5/0x3f0
 [<c126a4c4>] ? security_file_permission+0x14/0x20
 [<c10d8afb>] ? vfs_read+0x9b/0x130
 [<c10f5290>] ? seq_read+0x0/0x3f0
 [<c10d8cdb>] ? sys_read+0x4b/0xe0
 [<c149574c>] ? system_call_done+0x0/0x4
Code: 00 00 00 c7 45 c4 00 00 00 00 c7 45 d8 00 00 00 00 c7 45 dc 00 00 00 00 e8 e3 51 f1 ff 85 c0 0f 84 9b 05 00 00 8b 9f ac 02 00 00 <8b> 83 d8 00 00 00 85 c0 0f 84 5f 05 00 00 e8 42 36 1a 00 8b 55 
EIP: [<c11265ab>] do_task_stat+0xeb/0x710 SS:ESP 0068:c4015d48
CR2: 00000000000000d8
---[ end trace 69d6dbd78290fb94 ]---

irey · 04-11-2012, 08:40 AM

Ok, then it's probably not an out of memory condition.

You say you always get the oops at a different point in the code; however, in both oopses I see the same module:

Quote:

Modules linked in: comsync(P) ip_tables iptable_filter

Do you always get it there? If so, does it help if you remove that kernel module? I know it may be needed for your purpose but this is just to identify the problem.

Another hypothesis: a hardware problem? Could your RAM be faulty? If so, it could cause all kinds of random crashes, in both the app and the kernel. VIA Nehemiah is x86 compatible, right? www.memtest86.com - it's worth a try.

numa · 04-11-2012, 09:22 AM

Hi irey,
Thanks for your reply again.
I think "Modules linked in" lists all the modules that are loaded, not specifically the modules that are causing the oops. I unloaded the comsync module, as it is not part of the original kernel, but the crash still occurs (oops message at the end of this post).

I tried the same kernel and same app on two (or three?) different machines and the crash is still happening.

I ran the app under Valgrind which, among other errors, gave me the following:
Conditional jump or move depends on uninitialised value(s)
at ....: pthread_mutex_init

The mutex code is the following:

Code:

void CriticalSection::initialise()
{
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE ) != 0)
    {
        perror("CriticalSection::initialise: pthread_mutexattr_settype: ");
        exit(1);
    }
    
    if (pthread_mutex_init(&m_mutex, &attr) != 0)
    {
        perror("CriticalSection::initialise: pthread_mutex_init: ");
        exit(1);
    }
    
    // destroy the mutex attribute after use (not the mutex itself)
    pthread_mutexattr_destroy(&attr);
}

And I changed it to:

Code:

void CriticalSection::initialise()
{
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0)
    {
        perror("CriticalSection::initialise: pthread_mutexattr_init: ");
        exit(1);
    }

    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE ) != 0)
    {
        perror("CriticalSection::initialise: pthread_mutexattr_settype: ");
        exit(1);
    }
    
    if (pthread_mutex_init(&m_mutex, &attr) != 0)
    {
        perror("CriticalSection::initialise: pthread_mutex_init: ");
        exit(1);
    }
    
    // destroy the mutex attribute after use (not the mutex itself)
    pthread_mutexattr_destroy(&attr);
}

Could this really be causing so many issues? At least it stops Valgrind complaining.
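A side note on the error handling: the pthread functions return the error number directly instead of setting errno, so perror() may print an unrelated message. The sketch below shows one possible variant that reports the returned code via strerror(); `init_recursive_mutex` is a hypothetical free function used for illustration, not part of the original class.

```cpp
#include <pthread.h>
#include <cstring>
#include <string>

// Initialise a recursive mutex. Returns an empty string on success, or a
// description of the first failing call built from its return code.
std::string init_recursive_mutex(pthread_mutex_t* mutex)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return std::string("pthread_mutexattr_init: ") + std::strerror(rc);

    std::string error;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc != 0) {
        error = std::string("pthread_mutexattr_settype: ") + std::strerror(rc);
    } else {
        rc = pthread_mutex_init(mutex, &attr);
        if (rc != 0)
            error = std::string("pthread_mutex_init: ") + std::strerror(rc);
    }

    // Destroy the attribute on every path once it has been initialised.
    pthread_mutexattr_destroy(&attr);
    return error;
}
```

Returning the message instead of calling exit(1) leaves the decision about how to fail to the caller, which is a design choice of this sketch rather than a requirement.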

----------------------

Code:

BUG: unable to handle kernel paging request at 9a3f6850
IP: [<9a3f6850>] 0x9a3f6850
*pde = 00000000 
Oops: 0000 [#1] PREEMPT 
LTT NESTING LEVEL : 0
last sysfs file: /sys/devices/pci0000:00/0000:00:07.2/usbmon/usbmon1/dev
Modules linked in: ip_tables iptable_filter ipt_REJECT [last unloaded: comsync]

Pid: 207, comm: lcc Tainted: P           2.6.34.8-WR4.1.0.0_standard #17 PT-2200/Uknown
EIP: 0060:[<9a3f6850>] EFLAGS: 00010206 CPU: 0
EIP is at 0x9a3f6850
EAX: c4901980 EBX: c4901980 ECX: 00000063 EDX: c4a9ec00
ESI: c48e0830 EDI: 001aa8a9 EBP: c4aa7cb8 ESP: c4aa7c74
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process lcc (pid: 207, ti=c4aa6000 task=c490a130 task.ti=c4aa6000)
Stack:
 c1400043 c4aa7ca8 275c1408 c21b2a80 000016a0 00000020 00000063 c48e0848
<0> 00000002 00000000 001aa8a9 46a54b82 00000000 00000000 c4901980 c48e0780
<0> c48e0798 c4aa7d14 c14024a2 00000020 c48e0780 c4aa7cf4 00000202 c4aa7d90
Call Trace:
 [<c1400043>] ? tcp_transmit_skb+0x473/0x7c0
 [<c14024a2>] ? tcp_write_xmit+0x182/0x940
 [<c1402cc8>] ? __tcp_push_pending_frames+0x28/0x80
 [<c13f5bd2>] ? tcp_sendmsg+0x792/0xa20
 [<c13a6717>] ? sock_sendmsg+0xd7/0x170
 [<c13d6d54>] ? nf_hook_slow+0xc4/0x100
 [<c1020199>] ? cpuacct_charge+0x59/0x80
 [<c10211e0>] ? update_curr+0x150/0x240
 [<c13e7193>] ? ip_rcv_finish+0xd3/0x320
 [<c13a6f52>] ? sys_sendto+0xb2/0xe0
 [<c10214a0>] ? dequeue_task_fair+0x30/0x70
 [<c10016a4>] ? __switch_to+0x164/0x180
 [<c10255a8>] ? T.1372+0x38/0x90
 [<c14664fd>] ? schedule+0x1fd/0x440
 [<c104a164>] ? hrtimer_try_to_cancel+0x44/0xf0
 [<c104a229>] ? hrtimer_cancel+0x19/0x20
 [<c13a6fb2>] ? sys_send+0x32/0x40
 [<c13a89cd>] ? sys_socketcall+0x20d/0x2e0
 [<c1468775>] ? system_call_done+0x0/0x4
Code:  Bad EIP value.
EIP: [<9a3f6850>] 0x9a3f6850 SS:ESP 0068:c4aa7c74
CR2: 000000009a3f6850
---[ end trace 3ea5428ee428cf0c ]---

irey · 04-11-2012, 03:36 PM

I think Valgrind was right. Even if the manpage for pthread_mutexattr_init() doesn't say it's mandatory, I assume it is since I would never use an uninitialized object.

Maybe you're right: "modules linked in" doesn't necessarily mean those modules were responsible for the problem, and your stack traces show completely different system calls.
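With many oopses collected, one way to check whether they really land in unrelated places is to tally the symbol on each "EIP is at" line. The following is a rough sketch under that assumption; `count_faulting_symbols` is a hypothetical helper, and it only understands the "EIP is at symbol+offset/size" format seen in the traces posted here.

```cpp
#include <map>
#include <sstream>
#include <string>

// Count how often each symbol appears on an "EIP is at ..." line.
// The "+0x3c/0x70" style suffix is stripped so that hits in the same
// function are grouped together; a bare address is kept as-is.
std::map<std::string, int> count_faulting_symbols(const std::string& log)
{
    const std::string marker = "EIP is at ";
    std::map<std::string, int> counts;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type pos = line.find(marker);
        if (pos == std::string::npos)
            continue;
        std::string symbol = line.substr(pos + marker.size());
        symbol = symbol.substr(0, symbol.find('+'));
        // Drop trailing spaces or a carriage return left by the capture.
        while (!symbol.empty() && (symbol.back() == ' ' || symbol.back() == '\r'))
            symbol.pop_back();
        ++counts[symbol];
    }
    return counts;
}
```

If one symbol dominates the tally, that points at a specific code path; a flat spread across unrelated functions, as in the three traces in this thread, is more consistent with memory being corrupted from elsewhere.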

Have you tried running that OS image in a VM (such as VirtualBox) to experiment with different RAM sizes? Maybe that can confirm it's just the amount of memory...

jefro · 04-11-2012, 03:48 PM

The VM would also introduce different hardware. Unless this board is certified to run Wind River Linux, there will always be doubt.