LinuxQuestions.org


Eupator 04-15-2014 01:44 PM

Kernel Panic on Slackware64 14.1
 
Hi all,

Recently, I've been getting kernel panics from my machine, and I'm at a loss for how to fix them.

I'm running Slackware64 14.1 on an AMD Phenom 9600 quad-core.

Panics most often occur during bootup, after LILO loads the kernel and before I get to the login prompt. My machine will also freeze now and again when starting X, failing to send any signal to my monitor or to respond to keyboard input. Most recently, I had a panic while I was in X, debugging some JavaScript.

Here's a screenshot of the message that appeared on my most recent kernel panic:
http://i.imgur.com/BQIXon9.png
(I apologize for the blur; my camera is not excellent.)

I have installed the multilib packages; is it possible that these have introduced this error?

Thanks in advance for any advice or suggestions.

Mark Pettit 04-15-2014 01:49 PM

To check that it's not your Slackware install, perhaps boot from a live CD (e.g. Ubuntu) and see how that goes. If that does the same, then it is most likely a hardware issue. If not, do come back to us and we can continue with suggestions.

TobiSGD 04-15-2014 02:13 PM

Your system triggered an MCE (Machine Check Exception), which most likely points to a hardware problem.
Clean the cooling system (in case this is overheating) and use Memtest86+ (available from the boot screen of your Slackware DVD) to check the RAM.

Eupator 04-15-2014 03:27 PM

Mark, I have used LiveCDs on this computer with no issues.

TobiSGD, I have not yet run a memtest, but I did just now open up my computer and dust off the cooling system. I discovered that my rear case fan (not the CPU or PSU fan) has lost a blade. At the very least, this explains some of the noise from my machine.

Since then, I booted up the machine and had a crash twice on booting X. I'll post again as soon as I have a chance to memtest.

Thanks to you both!

Eupator 04-16-2014 02:05 PM

I ran memtest86+ and found no errors.

I've had no trouble booting from the Slackware liveUSB, nor from the SLAX liveCD.

The replacement fan is in the mail. Until then, are there any other diagnostics I can run?

mancha 04-16-2014 02:20 PM

So live CDs run fine, eh? For starters, can you put together a pastebin with the output from:
  • dmidecode
  • lsmod
  • lspci -v
Also, what machine is this?
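
If it helps, the three outputs above can be gathered into one file before pasting. This is only a rough sketch: collect_diag is a made-up helper name and /tmp/hwinfo.txt is an arbitrary path, neither comes from this thread.

```shell
#!/bin/sh
# Sketch: gather diagnostic command output into a single text file.
# collect_diag is a hypothetical helper name, not an existing tool.
collect_diag() {
    out=$1
    shift
    : > "$out"                       # create or truncate the report file
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd" >> "$out"
        # Run each command through sh so options such as "-v" survive;
        # record a failure instead of aborting the whole report.
        if ! sh -c "$cmd" >> "$out" 2>&1; then
            printf '(command failed or is not installed)\n' >> "$out"
        fi
        printf '\n' >> "$out"
    done
}

# Example (run as root; dmidecode needs root to read the DMI tables):
#   collect_diag /tmp/hwinfo.txt dmidecode lsmod 'lspci -v'
```

A command that fails still gets its own section in the file, so a missing tool does not hide the rest of the output.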

--mancha

-----

Edit:

You can also try running mcelog to get some more verbosity. Not sure why Slackware doesn't have a hook for this, but you can add the following code block to /etc/rc.d/rc.local:

Code:

# Start mcelog daemon
if [ -x /etc/rc.d/rc.mcelog ]; then
    /etc/rc.d/rc.mcelog start
fi

You can place your mcelog settings in /etc/mcelog.conf

Eupator 04-16-2014 02:42 PM

dmidecode

lsmod

lspci -v

I got this machine second-hand, but uname -a returns:

Quote:

Linux sigmund 3.10.17 #2 SMP Wed Oct 23 16:34:38 CDT 2013 x86_64 AMD Phenom(tm) 9600 Quad-Core Processor AuthenticAMD GNU/Linux

Eupator 04-29-2014 06:40 PM

Hi again,

I've replaced the broken fan, but I'm still getting crashes during X startup. Any ideas?

metaschima 04-29-2014 07:42 PM

Run Prime95 (mprime on Linux):
http://www.mersenne.org/download/index.php#source
Use mode 1 to check whether the CPU is working properly. The error clearly states that there is an MCE on the CPU, meaning that the CPU may be faulty. Let it go for 13 runs and see if it prints an error.

j_v 04-29-2014 07:48 PM

Quote:

Originally Posted by Eupator (Post 5161570)
Hi again,

I've replaced the broken fan, but I'm still getting crashes during X startup. Any ideas?

More specifics would help narrow this down. I know your original post mentions a kernel panic; is that still the issue? Going on what you've mentioned so far, my gut reaction is a faulty CPU core, but that is really just a guess. If it were my machine, I might look into disabling the third core (core 2 being the one that shows the fault in the picture you linked to), but I don't know your BIOS or whether it even supports disabling cores.

Either of these next two suggestions would allow you to temporarily disable suspected cores, to test whether running without them improves matters. These might be better to try first, rather than messing with the bios, because these are fairly simple and can be easily discarded if proved useless:
  1. You could disable an individual core via sysfs:
    Code:

    echo "0" > /sys/bus/cpu/devices/cpu2/online
  2. You could boot with only the first two cores by adding 'maxcpus=2' to the kernel command line.
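
For what it's worth, the sysfs route in option 1 could be wrapped so the result is checked after the write. This is a sketch only: set_cpu_online is a hypothetical helper, and its optional third argument (an alternate sysfs directory) exists purely so it can be tried without touching the real system.

```shell
#!/bin/sh
# Sketch: take one CPU core offline (0) or bring it back (1) via sysfs.
# set_cpu_online is a hypothetical helper, not an existing command.
set_cpu_online() {
    cpu=$1
    state=$2
    base=${3:-/sys/bus/cpu/devices}   # same path as in the echo example
    ctl="$base/cpu$cpu/online"
    case $state in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac
    if [ ! -w "$ctl" ]; then
        # Writing needs root, and cpu0 often has no "online" file at all.
        echo "cannot write $ctl (missing file or not root?)" >&2
        return 1
    fi
    echo "$state" > "$ctl" || return 1
    # Read the value back so the caller sees what the kernel accepted.
    printf 'cpu%s online=%s\n' "$cpu" "$(cat "$ctl")"
}

# Example (as root): set_cpu_online 2 0
```

The change lasts only until the next reboot, which suits a quick test of a suspect core.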

Bear in mind that I am going on a hunch here. My suggestions may only be a blind alley and no help at all.

Regards

EDIT:
@metaschima: You beat me to the punch. Good idea on the prime95 test.

ReaperX7 04-29-2014 09:43 PM

It could be several hardware failures.

1. Memtest86+ will see if your RAM may have problems. This can be anything from modules going bad to total failures.

2. When you format a disk, try using the SLOW format to check for bad blocks. If your hard drive has a lot of errors you may need to replace it. A slow format will tell you if there are bad sectors. On large capacity disks this will take considerable time, but it's worth it.

3. Check your cables for breaks, clean the air flow paths, and look for discoloration and burn marks on hardware. Any of these could mean it's time to start replacing hardware.

Eupator 04-29-2014 09:50 PM

metaschima, I ran mprime in single-user mode, as you suggested, and it got through six tests before, surprise! Kernel panic.

Here's the output

More interestingly, dmesg threw a couple of these at me:

Quote:

[11701.927635] [Hardware Error]: MC0 Error: Data/Tag DWR error.
[11701.928255] [Hardware Error]: Error Status: Uncorrected, software restartable error.
[11701.928731] [Hardware Error]: CPU:2 (10:2:2) MC0_STATUS[-|UE|-|-|AddrV|UECC]: 0xb441200000000145
[11701.929728] [Hardware Error]: MC0_ADDR: 0x00000001086e6100
[11701.930708] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR
[11774.164149] [Hardware Error]: MC0 Error: Data/Tag DWR error.
[11774.164727] [Hardware Error]: Error Status: Uncorrected, software restartable error.
[11774.165107] [Hardware Error]: CPU:2 (10:2:2) MC0_STATUS[-|UE|-|-|AddrV|UECC]: 0xb451a00000000145
[11774.166094] [Hardware Error]: MC0_ADDR: 0x000000011e36b090
[11774.166984] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR
Both errors were reported on CPU 2, which seems to support the CPU core theory.
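
To tally which core the errors land on over a longer run, something along these lines might do. It is a sketch that assumes the "[Hardware Error]: CPU:N" line format shown in the quote above; mce_by_cpu is a made-up name.

```shell
#!/bin/sh
# Sketch: count "[Hardware Error]" reports per CPU from kernel log text
# read on stdin. mce_by_cpu is a hypothetical helper name.
mce_by_cpu() {
    # Keep only lines of the form "... [Hardware Error]: CPU:2 (10:2:2) ..."
    # and reduce each one to the CPU number.
    sed -n 's/.*\[Hardware Error\]: CPU:\([0-9][0-9]*\) .*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "CPU %s: %s error(s)\n", $2, $1 }'
}

# Example: dmesg | mce_by_cpu
```

If every report names the same core, that points at one bad core rather than something shared such as RAM or the power supply, though it is not proof.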

j_v, I will try your idea next, and report back.

ReaperX7, I have already (1) run memtest86+, (2) checked my hard drive for bad blocks, and (3) cleaned and inspected my computer's internals.

Thanks to all of you for your input!

enorbet 04-30-2014 02:22 AM

Hello
I'm sorry to say that if that fan has been broken for long enough, hardware damage may have occurred. OTOH, just as often the thermal grease may simply have "caked up" from overheating and need to be replaced. It would probably be wise to use some monitoring software like Conky to keep a close watch on temperatures. Of course you could just run sensors (from lm_sensors) in a terminal, but IMHO constant desktop meters are extremely valuable. Also, you might check in the BIOS whether your fans have been set to some "quiet mode" that gives silence preference over temperature. Heat is the enemy of electronics.

metaschima 04-30-2014 03:01 PM

Check the CPU temperatures and make sure they are below the critical threshold. If they are, then it is very likely that the CPU is faulty.
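
One possible way to compare readings against the critical limit without extra tools is to read the kernel's hwmon files directly. This is a sketch under assumptions: it expects a loaded sensor driver that exposes temp*_input and temp*_crit files (values in millidegrees Celsius), it looks in both hwmonN/ and hwmonN/device/ because the location varies by kernel and driver, and check_temps is a made-up name.

```shell
#!/bin/sh
# Sketch: print each hwmon temperature next to its critical limit, if any.
# check_temps is a hypothetical helper; the optional argument overrides
# the hwmon directory and exists only for testing.
check_temps() {
    base=${1:-/sys/class/hwmon}
    for input in "$base"/hwmon*/temp*_input "$base"/hwmon*/device/temp*_input; do
        [ -r "$input" ] || continue            # skip globs that matched nothing
        temp=$(cat "$input" 2>/dev/null) || continue
        crit="${input%_input}_crit"            # matching limit file, if present
        if [ -r "$crit" ] && limit=$(cat "$crit" 2>/dev/null); then
            if [ "$temp" -ge "$limit" ]; then
                verdict="AT OR ABOVE CRITICAL"
            else
                verdict="below critical"
            fi
            # sysfs values are millidegrees Celsius; print whole degrees.
            printf '%s: %s C (critical %s C) %s\n' \
                "$input" $((temp / 1000)) $((limit / 1000)) "$verdict"
        else
            printf '%s: %s C (no critical value exposed)\n' \
                "$input" $((temp / 1000))
        fi
    done
}

# Example: run check_temps while the stress test is going
```

Running it in a loop during the mprime test would show whether the temperature climbs toward the limit before a crash.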

Eupator 04-30-2014 08:21 PM

After passing 'maxcpus=2' to the kernel at boot, mprime appears to run without error, and I have had no kernel panics.
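
For anyone repeating this, a quick check that the limit actually took effect might look like the following. It is a sketch only: boot_cpu_summary is a hypothetical helper, and its two optional arguments (alternate paths for /proc/cmdline and /proc/cpuinfo) are there just for testing.

```shell
#!/bin/sh
# Sketch: report the maxcpus= boot parameter and how many CPUs are in use.
# boot_cpu_summary is a hypothetical helper name.
boot_cpu_summary() {
    cmdline=${1:-/proc/cmdline}
    cpuinfo=${2:-/proc/cpuinfo}
    # Split the kernel command line into words and pick out maxcpus=N.
    limit=$(tr ' ' '\n' < "$cmdline" |
        sed -n 's/^maxcpus=\([0-9][0-9]*\)$/\1/p' | tail -n 1)
    # /proc/cpuinfo has one "processor" entry per CPU that is online.
    online=$(grep -c '^processor' "$cpuinfo")
    printf 'maxcpus=%s cpus_in_use=%s\n' "${limit:-unset}" "$online"
}

# Example: boot_cpu_summary
```

To keep the setting across reboots with LILO, the usual route would be an append="maxcpus=2" line in the image section of /etc/lilo.conf, followed by running lilo as root; treat that as a suggestion to adapt rather than a tested recipe for this machine.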

Thanks to you all for helping me pinpoint the problem.

Now to figure out a replacement CPU . . .


All times are GMT -5.