Kernel Panic, MCE messages, and an Error Code

tvynr · 06-01-2005, 02:56 PM

My firewall is running on a Linux 2.6.9 kernel and has been functioning just fine for months. This morning, I found the machine having some unusual problems, none of which I'd seen exhibited on any machine before.

First, the machine couldn't detect a dial tone from the modem, despite the fact that the modem was cleanly initialized and there was definitely a dial tone on the line. After reboot, it worked just fine.

Next, I found the following message in /var/log/messages several times:

Code:

Jun  1 15:29:39 dib kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Jun  1 15:29:39 dib kernel: Bank 1: 9400000000000151
Jun  1 15:32:09 dib kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Jun  1 15:32:09 dib kernel: Bank 1: d400000000000151

Finally, the machine dies every five to ten minutes or so with a really cryptic kernel panic. Unfortunately, I can only see the last 25 lines:

Code:

[<c037f040>] ip_rcv_finish+0x0/0x2c0
[<c0360084>] nf_hook_slow+0xe4/0x120
[<c037f040>] ip_rcv_finish+0x0/0x2c0
[<c037ed69>] ip_rcv+0x439/0x500
[<c037f040>] ip_rcv_finish+0x0/0x2c0
[<c0355807>] netif_receive_skb+0x117/0x1d0
[<c034d4a7>] alloc_skb+0x47/0xe0
[<c02d3779>] rtl8139_rx+0x199/0x340
[<c02d3b0a>] rtl8139_poll+0x5a/0xe0
[<c0355a53>] net_rx_action+0x83/0x110
[<c0123d3a>] __do_softirq+0xba/0xd0
[<c010892c>] do_softirq+0x4c/0x60
=======================
[<c0108045>] do_IRQ+0x165/0x1b0
[<c0105be8>] common_interrupt+0x18/0x20
[<c0103030>] default_idle+0x0/0x40
[<c010305c>] default_idle+0x2c/0x40
[<c01030f2>] cpu_idle+0x42/0x60
[<c051d937>] start_kernel+0x167/0x190
[<c051d3a0>] unknown_bootoption+0x0/0x160
Code: 8b 44 24 24 89 44 24 04 e8 85 7d ff ff 8b 5c 24 18 83 c4 1c c3 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 31 ed 57 56 53 83 ec 34 <8b> 54 24 48 8b 4c 24 48 0f b6 42 0e 8b 04 85 00 e7 5a c0 89 44
 <0>Kernel panic - not syncing: Fatal exception in interrupt

It looks to me like a stack trace or something. I should note that the RealTek 8139 Ethernet driver is compiled into the kernel and the machine's network card uses a RealTek 8139 chipset.
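For anyone reading the trace above: each line has the form [<address>] symbol+offset/size, where the offset is how far into the function the return address lies and the size is the function's total length. The following is a minimal Python sketch that splits such lines into fields. The names parse_trace_line and function_start are made up for illustration and are not part of any kernel tool.

```python
import re

# Matches trace lines such as "[<c037ed69>] ip_rcv+0x439/0x500".
TRACE_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{8})>\]\s+"
    r"(?P<sym>\w+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_trace_line(line):
    """Return the fields of one trace line, or None if it does not match."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    return {
        "address": int(m.group("addr"), 16),
        "symbol": m.group("sym"),
        "offset": int(m.group("off"), 16),
        "size": int(m.group("size"), 16),
    }


def function_start(entry):
    """Address where the named function begins (address minus offset)."""
    return entry["address"] - entry["offset"]


if __name__ == "__main__":
    sample = [
        "[<c037ed69>] ip_rcv+0x439/0x500",
        "[<c02d3779>] rtl8139_rx+0x199/0x340",
        "=======================",
    ]
    for line in sample:
        entry = parse_trace_line(line)
        if entry is None:
            continue  # separator or unrelated line
        print(entry["symbol"], hex(function_start(entry)), hex(entry["size"]))
```

Read from the bottom up, the trace shows the idle loop being interrupted, the rtl8139 receive path running in a softirq, and the packet being handed to the IP and netfilter code. The lines that would name the faulting function itself have scrolled off the top.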

I did a little research and ran into a program called parsemce. I parsed the first dump in the /var/log/messages file and got:

Code:

parsebank(1): 9400000000000151 @ 0
        External tag parity error
        Address in addr register valid
        Error enabled in control register
        Memory heirarchy error
        Request: Generic error
        Transaction type : Instruction
        Memory/IO : Reserved

Unfortunately, I have no clue at all what I'm looking at. The kernel panic message seems to be some kind of stack trace, but I don't have all of it and wouldn't know what to do with it anyway.
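As a rough aid for reading the two "Bank 1" values, here is a small Python sketch that pulls the flag bits out of a 64-bit machine-check status word. It assumes the CPU follows the architectural IA-32 bank status layout (the CPU model is not stated here, so treat that as an assumption), and decode_status is a made-up helper name. It deliberately leaves the low 16 bits and the model-specific bits as raw numbers, since their interpretation depends on the processor.

```python
# Flag bits at the top of a 64-bit machine-check bank status word.
# Assumption: the architectural IA-32 status layout applies to this CPU.
STATUS_FLAGS = [
    (63, "VAL", "status register contents are valid"),
    (62, "OVER", "overflow: another error arrived before this one was cleared"),
    (61, "UC", "error was not corrected by hardware"),
    (60, "EN", "error reporting enabled in the control register"),
    (59, "MISCV", "misc register holds extra information"),
    (58, "ADDRV", "address register holds a valid address"),
    (57, "PCC", "processor context may be corrupt"),
]


def decode_status(word):
    """Split a bank status word into flag bits and raw error-code fields."""
    flags = {name: bool((word >> bit) & 1) for bit, name, _ in STATUS_FLAGS}
    return {
        "flags": flags,
        "mca_error_code": word & 0xFFFF,               # bits 15..0
        "model_specific_code": (word >> 16) & 0xFFFF,  # bits 31..16
    }


if __name__ == "__main__":
    for raw in (0x9400000000000151, 0xD400000000000151):
        decoded = decode_status(raw)
        set_flags = [name for name, on in decoded["flags"].items() if on]
        print(hex(raw), set_flags, hex(decoded["mca_error_code"]))
```

Under that assumption, the two logged values differ only in bit 62: the second report has the overflow flag set, meaning another error was recorded before the earlier one had been cleared. Both have the "uncorrected" bit clear, which agrees with the kernel calling them correctable.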

Does anyone have any guesses as to what could've gone wrong?

tvynr · 06-02-2005, 01:57 AM

Okay, so I fired up the machine to give it another go a little while ago and the hard drive was making noises that sounded like heavy construction equipment. I pulled the drive, had a small episode with trying to make the ESCD check the hardware from scratch (bloody cryptic BIOS), and popped in a 20 GB hard drive that I salvaged from a broken machine long ago.

The new hard drive seems to be working perfectly. However, I popped the Slackware 10 CD into the drive and started up... and I got a segmentation fault from the USB scan. Okay, this is worrisome, but it kept moving. Deal with that later... we'll run badblocks for now to make sure the hard drive is okay.

Well, badblocks froze about halfway through writing the first pass. No message, no anything. I rebooted. USB check ran fine this time. However, the moment badblocks started, I got the following:

Code:

Checking for bad blocks (read-only test): Unable to handle kernel paging request at virtual address 09a657a5
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c01ce270>]     Not tainted
EFLAGS: 00010283
eax: c12d2968   ebx: cf7ee390   ecx: cf7ee390   edx: 000000d2
esi: 00000400   edi: cf7ee350   ebp: 000000d6   esp: cedb9c48
ds: 0018   es: 0018   ss: 0018
Process badblocks (pid: 185, stackpage=cedb9000)
Stack: c01ce447 cf7ee390 c12d2968 00000002 cedb9c6c cedb9c6c fffffffe c011c5b8
       cf7ee350 c03a2780 c03a2780 c0390d80 c0388a30 c011f8ca c0320afc c011c4f2
       c011c404 00000001 00000001 c011c213 c0388a30 c0388900 00000000 c031f650
Call Trace:    [<c01ce447>] [<c011c5b8>] [<c011f8ca>] [<c011c4f2>] [<c011c404>]
  [<c011c213>] [<c010a09d>] [<c010c488>] [<c02028f9>] [<c021406a>] [<c020b9c5>]
  [<c020bb08>] [<c020bc3f>] [<c01e5074>] [<c011c5b8>] [<c014d59a>] [<c013b223>]
  [<c013aeb6>] [<c0127cda>] [<c0125a30>] [<c0125c2c>] [<c013d3c3>] [<c013d360>]
  [<c012a08b>] [<c012a2c7>] [<c0144b8d>] [<c01376c8>] [<c0108d73>]

Code: 55 57 56 53 8b 74 24 14 8b 6c 24 1c 8b 7e 10 4d 4f 83 fd ff
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing

I ran memtest86+ v1.55 on this box only an hour ago and it said the RAM was fine. Is the CPU on this machine toast? Does anyone have any idea what the heck is going on here? I'd love to keep using this machine if possible... a computer is a terrible thing to waste.
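One piece of the badblocks dump can be read without symbol tables: the number after "Oops:". For an "Unable to handle kernel paging request" oops on i386, that number is the page-fault error code, and its low three bits say what kind of access failed. A minimal Python sketch of that decoding follows; decode_page_fault_code is a made-up name, and the sketch assumes the i386 bit meanings.

```python
def decode_page_fault_code(code):
    """Interpret the low three bits of an i386 page-fault error code."""
    return {
        # bit 0: clear = page not present, set = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: clear = read access, set = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: clear = fault happened in kernel mode, set = user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


if __name__ == "__main__":
    # Value taken from the "Oops: 0002" line in the dump above.
    print(decode_page_fault_code(0x0002))
```

So "Oops: 0002" describes kernel code trying to write to an address with no page mapped behind it, which fits the "*pde = 00000000" line. It says nothing by itself about whether the bad pointer came from a software bug or from faulty hardware.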

You know things are bad when you get a silly-looking message from the kernel instead of a vaguely professional-looking one. I'm recalling the "food fight!" message at this point...

Cheers and Thanks