LinuxQuestions.org
Old 03-04-2015, 07:14 AM   #1
aikempshall
Member
 
Registered: Nov 2003
Location: Bristol, Britain
Distribution: Slackware
Posts: 900

Rep: Reputation: 153
HARDWARE ERROR. This is *NOT* a software problem!


I had a feeling that something's not quite right with my machine in respect of KMail; see http://www.linuxquestions.org/questi...er-4175535602/

Anyway, this morning KMail seemed to lock up completely, so I had a look in /var/log/messages and found this:

Code:
Mar  4 12:52:06 office kernel: [ 3706.202568] mce: [Hardware Error]: Machine check events logged
I ran

Code:
 /usr/sbin/mcelog > mcelog.out
to see if I could get a bit more information. It gave these results:


Code:
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 1 BANK 0 
TIME 1425459338 Wed Mar  4 08:55:38 2015
MCG status:
MCi status:
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 60
I've run a memory test and all appears OK.

I ran e2fsck on my root partition and that appears OK too.

Any suggestions on what I should investigate next? Looking back through /var/log/messages, the problem's been about since at least 24th December 2014.

Alex
 
Old 03-04-2015, 08:51 AM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,632

Rep: Reputation: 7965
If you look at what mcelog returned, you'd see the "MCA: Unknown Error 5" message, and a CPUID. Based on those things, it appears to be a CPU parity issue, which could very well be transient, with any one of a number of causes. Have you overclocked or tinkered with the BIOS settings on that system?
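
The parity reading can be cross-checked by decoding the STATUS word from the mcelog output by hand. Below is a minimal Python sketch; the bit positions are my understanding of the IA32_MCi_STATUS layout in the Intel architecture manuals (an assumption, so verify against the manual for your CPU), and `decode_mci_status` is a hypothetical helper name, not part of mcelog.

```python
# Hypothetical helper: split an Intel MCi_STATUS value into its main fields.
# Bit positions are assumed from the Intel SDM machine-check chapter.
STATUS = 0x90000040000F0005  # value reported by mcelog in this thread

def decode_mci_status(status):
    return {
        "valid": bool((status >> 63) & 1),          # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),       # OVER: an earlier error was lost
        "uncorrected": bool((status >> 61) & 1),    # UC: 0 means hardware corrected it
        "enabled": bool((status >> 60) & 1),        # EN: reporting enabled
        "corrected_count": (status >> 38) & 0x7FFF, # bits 52:38, if the CPU supports it
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,                # architectural MCA error code
    }

fields = decode_mci_status(STATUS)
for name, value in fields.items():
    print(f"{name}: {value}")
```

For this value the uncorrected bit is clear and the MCA error code is 5, which is the code a newer mcelog (see later in this thread) labels "Internal parity error".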
 
Old 03-04-2015, 09:22 AM   #3
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,140

Rep: Reputation: 1263
If you have multiple occurrences in the log, try replacing the DIMM. Memory tests can't try all possible patterns.
 
Old 03-04-2015, 11:29 AM   #4
aikempshall
Member
 
Registered: Nov 2003
Location: Bristol, Britain
Distribution: Slackware
Posts: 900

Original Poster
Rep: Reputation: 153
Hi TB0ne

I've had 102 occurrences in the last 3 weeks and 290 since 24th December 2014. My machine is in use almost all day, every day. These are the numbers of occurrences per day for February and March:

Code:
 
  Count     Date
  -----     ----   
     21 20150201 
      1 20150202 
      5 20150203 
      9 20150204 
      5 20150205 
      7 20150206 
      9 20150207 
      6 20150208 
      3 20150209 
      3 20150211 
      1 20150212 
      5 20150213 
      3 20150215 
      4 20150216 
      1 20150217 
     10 20150218 
      5 20150219 
      3 20150220 
      1 20150221 
      2 20150223 
      5 20150224 
      2 20150225 
      2 20150226 
      7 20150227 
     15 20150228 
     27 20150301 
      1 20150303 
      9 20150304
Some days it doesn't occur at all; on other days the count varies, so I'm not sure this is transient. I've not tinkered with the BIOS in respect of the CPU or memory, or at least I can't recall doing so. I did tinker with the BIOS to change UEFI to legacy and might have changed some USB settings.
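
A per-day tally like the table above could be produced along these lines. This is only a sketch, not necessarily how the counts in this post were made: it assumes the classic syslog line format shown earlier in the thread, an English locale for month names, and a caller-supplied year (syslog timestamps carry none). `count_mce_per_day` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

# Kernel message that marks a logged machine check event (taken from this thread).
MARKER = "mce: [Hardware Error]: Machine check events logged"

def count_mce_per_day(lines, year):
    """Count marker lines per day, keyed as YYYYMMDD."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        # Syslog timestamps look like "Mar  4 12:52:06"; the year is assumed.
        month, day = line.split()[:2]
        date = datetime.strptime(f"{year} {month} {day}", "%Y %b %d")
        counts[date.strftime("%Y%m%d")] += 1
    return counts

sample = [
    "Mar  4 12:52:06 office kernel: [ 3706.202568] mce: [Hardware Error]: Machine check events logged",
    "Mar  8 07:04:42 office kernel: [ 1451.992598] CPU4: Core temperature/speed normal",
    "Mar  8 07:05:31 office kernel: [ 1501.326203] mce: [Hardware Error]: Machine check events logged",
]
per_day = count_mce_per_day(sample, 2015)
for day, n in sorted(per_day.items()):
    print(f"{n:7d} {day}")
```

To run it against a real log, pass an open file instead of the sample list, for example `count_mce_per_day(open("/var/log/messages"), 2015)`.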

The machine's about 11 months old. Built by myself!

I've got some backups so I will have a look to see whether the problem existed before December 2014.

If it did, I shall try re-seating the memory and checking the fans.

The core temperatures as I write this are 35/35/34/30.

Alex
 
Old 03-07-2015, 05:25 AM   #5
aikempshall
Member
 
Registered: Nov 2003
Location: Bristol, Britain
Distribution: Slackware
Posts: 900

Original Poster
Rep: Reputation: 153
I've gone through my backups and the earliest message.log I have is for August 2014.

The error message is in that log. So I suspect that I've always had the error.

The next time I open the box I shall remove what I consider the most appropriate memory stick and see whether the problem disappears. If it doesn't I shall move on from there and try with the other stick.

If it is a memory problem, it would appear to be BANK 0; below is an extract from the mcelog output:

Code:
CPU 0 BANK 0 
CPU 1 BANK 0 
CPU 0 BANK 0 
CPU 1 BANK 0 
CPU 2 BANK 0 
CPU 2 BANK 0 
CPU 1 BANK 0 
CPU 0 BANK 0
Alex
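
The extract above can be summarised by tallying the CPU/BANK pairs. The sketch below assumes mcelog prints them as `CPU <n> BANK <m>` at the start of a line, as in this thread; `tally_cpu_bank` is a hypothetical helper name. As far as I know, BANK in mcelog output is the number of a machine-check register bank inside the CPU rather than a DIMM slot, so the tally shows where the CPU reported the error and does not by itself point at a memory stick.

```python
import re
from collections import Counter

def tally_cpu_bank(mcelog_text):
    """Count how often each (CPU, BANK) pair appears in mcelog output."""
    pairs = re.findall(r"^CPU (\d+) BANK (\d+)", mcelog_text, flags=re.MULTILINE)
    return Counter((int(cpu), int(bank)) for cpu, bank in pairs)

# The extract posted above.
extract = """\
CPU 0 BANK 0
CPU 1 BANK 0
CPU 0 BANK 0
CPU 1 BANK 0
CPU 2 BANK 0
CPU 2 BANK 0
CPU 1 BANK 0
CPU 0 BANK 0
"""
tally = tally_cpu_bank(extract)
for (cpu, bank), n in sorted(tally.items()):
    print(f"CPU {cpu} BANK {bank}: {n}")
```

Every event in the extract lands in bank 0, spread across CPUs 0, 1 and 2.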
 
Old 03-07-2015, 10:52 AM   #6
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 492
Try running memtest for 3 runs. If they all pass, try running GIMPS in mode 1 to test the CPU:
http://www.mersenne.org/download/#source
 
Old 03-08-2015, 02:37 PM   #7
EDDY1
LQ Addict
 
Registered: Mar 2010
Location: Oakland,Ca
Distribution: wins7, Debian wheezy
Posts: 6,841

Rep: Reputation: 649
Asus has a 2- or 3-year warranty; if you're having problems with it, you can RMA it. That will at least rule out a motherboard problem.
 
Old 03-09-2015, 10:26 AM   #8
Soadyheid
Senior Member
 
Registered: Aug 2010
Location: Near Edinburgh, Scotland
Distribution: Cinnamon Mint 20.1 (Laptop) and 20.2 (Desktop)
Posts: 1,672

Rep: Reputation: 486
Quote:
The next time I open the box I shall remove what I consider the most appropriate memory stick and see whether the problem disappears.
From this I'm going to assume you have 2 x 4GB DIMMs, as you don't say what your 8GB is made up of.
Have you tried swapping them round? The error, if it's memory, should then move to slot 1.
If you DO have 2 x 4GB DIMMs, are they identical and from the same manufacturer? If they're not a matched pair, you could get problems.

Play Bonny!

 
Old 03-09-2015, 02:33 PM   #9
aikempshall
Member
 
Registered: Nov 2003
Location: Bristol, Britain
Distribution: Slackware
Posts: 900

Original Poster
Rep: Reputation: 153
Hi Soadyheid

Yes, I have 2 x 4GB DIMMs, both from the same manufacturer and the same model. I will take your advice and swap them around.

Alex
 
Old 03-09-2015, 03:54 PM   #10
aikempshall
Member
 
Registered: Nov 2003
Location: Bristol, Britain
Distribution: Slackware
Posts: 900

Original Poster
Rep: Reputation: 153
Hi metaschima

Did a quick run at GIMPS in torture mode. I could hear some fans speed up significantly and the temperature shot up to what I thought was an unsatisfactory level. So I cancelled the test.

In the messages log I saw this for the duration of the test -

Code:
Mar  8 07:04:42 office kernel: [ 1451.992598] CPU4: Core temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992599] CPU2: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992600] CPU3: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992600] CPU0: Core temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992601] CPU6: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992602] CPU7: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992603] CPU5: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992603] CPU1: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992604] CPU0: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992615] CPU4: Package temperature/speed normal
Mar  8 07:05:31 office kernel: [ 1501.326203] mce: [Hardware Error]: Machine check events logged
I'm going to have a look at the cooling for the box. Currently the temperature is 31/31/31/27 with no fan noise.

Alex
 
Old 03-09-2015, 05:06 PM   #11
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 492
Could be a cooling problem or maybe a CPU issue. If you cannot fix it, RMA it.
 
Old 03-09-2015, 11:42 PM   #12
EDDY1
LQ Addict
 
Registered: Mar 2010
Location: Oakland,Ca
Distribution: wins7, Debian wheezy
Posts: 6,841

Rep: Reputation: 649
Have you taken the side of the case off to see if all the fans are running?
 
Old 03-13-2015, 03:25 PM   #13
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,278

Rep: Reputation: 2322
I had something a bit like that. In my case, it would say

hda: not ready
hdb: not ready

and just sit there. I could not even shut the thing down. It most probably has to do with logic levels being borderline. Having gone through things with an oscilloscope, I am used to seeing it in other people's equipment. It's usually lows not low enough, although sometimes it's highs not high enough.

Unload the data bus (fewer DIMMs). One of them could be sinking current. And clean the CPU heatsink with a paintbrush or something that will get in between the fins.
 
Old 04-13-2015, 03:37 AM   #14
aikempshall
Member
 
Registered: Nov 2003
Location: Bristol, Britain
Distribution: Slackware
Posts: 900

Original Poster
Rep: Reputation: 153
I swapped out one of the memory sticks and still had errors. I put it back in and took the other one out, and still got the errors.

I then realized that I was using an old version of mcelog. The latest version results in this:

Code:
Hardware event. This is not a software error.
MCE 0
CPU 2 BANK 0 
TIME 1428909602 Mon Apr 13 08:20:02 2015
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 60

As it's a "Corrected error" and my machine is not exhibiting any problems, other than these messages, I'm going to ignore them until something more significant happens.

Alex
 
Old 04-13-2015, 08:45 AM   #15
Soadyheid
Senior Member
 
Registered: Aug 2010
Location: Near Edinburgh, Scotland
Distribution: Cinnamon Mint 20.1 (Laptop) and 20.2 (Desktop)
Posts: 1,672

Rep: Reputation: 486
As I mentioned before, swap the two DIMMs round. If there is some sort of intermittent memory problem, it should then show up as affecting BANK 1.

Meanwhile.. carry on as usual!

Play Bonny!

 