HARDWARE ERROR. This is *NOT* a software problem!

aikempshall · 03-04-2015, 07:14 AM

Had a feeling that something's not quite right with my machine in respect of KMail; see http://www.linuxquestions.org/questi...er-4175535602/

Anyway, this morning KMail seemed to lock up completely, so I had a look in /var/log/messages and found this:

Code:

Mar  4 12:52:06 office kernel: [ 3706.202568] mce: [Hardware Error]: Machine check events logged

I ran

Code:

 /usr/sbin/mcelog > mcelog.out

to see if I could get a bit more information. It gave these results:

Code:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 1 BANK 0 
TIME 1425459338 Wed Mar  4 08:55:38 2015
MCG status:
MCi status:
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 60

I've run a memory test and all appears OK.

I ran e2fsck on my root partition and that appears OK.

Any suggestions on what I should investigate next? Looking back through /var/log/messages, the problem's been about since at least 24th December 2014.

Alex

TB0ne · 03-04-2015, 08:51 AM

If you look at what mcelog returned, you'd see the "MCA: Unknown Error 5" message, and a CPUID. Based on those things, it appears to be a CPU parity issue, which could very well be transient, with any one of a number of causes. Have you overclocked or tinkered with the BIOS settings on that system?

smallpond · 03-04-2015, 09:22 AM

If you have multiple occurrences in the log, try replacing the DIMM. Memory tests can't try all possible patterns.

aikempshall · 03-04-2015, 11:29 AM

Hi TB0ne

I've had 102 occurrences in the last 3 weeks and 290 since 24th December 2014. My machine is in use almost all day every day. These are the numbers of occurrences per day for February and March:

Code:

 
  Count     Date
  -----     ----   
     21 20150201 
      1 20150202 
      5 20150203 
      9 20150204 
      5 20150205 
      7 20150206 
      9 20150207 
      6 20150208 
      3 20150209 
      3 20150211 
      1 20150212 
      5 20150213 
      3 20150215 
      4 20150216 
      1 20150217 
     10 20150218 
      5 20150219 
      3 20150220 
      1 20150221 
      2 20150223 
      5 20150224 
      2 20150225 
      2 20150226 
      7 20150227 
     15 20150228 
     27 20150301 
      1 20150303 
      9 20150304

Some days it doesn't occur at all; on other days the number of times varies. So I'm not sure if this is transient. I've not tinkered with the BIOS in respect of the CPU or memory, or at least I can't recall doing so. I did tinker with the BIOS to change UEFI to legacy and might have changed some USB settings.
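For anyone wanting to build a per-day count like the one above, here is a minimal Python sketch. It assumes the classic syslog line layout seen in this thread ("Mar  4 12:52:06 office kernel: ...") and simply counts lines containing "Machine check events logged". The function name `count_mce_per_day` is made up for illustration. Note that these syslog lines carry no year, so the key is only (month, day).

```python
import re
from collections import Counter

# Start of a classic syslog line: month name, then day of month.
# Assumption: the log uses the same layout as the lines quoted in this thread.
STAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s")
MARKER = "Machine check events logged"


def count_mce_per_day(lines):
    """Return a Counter mapping (month, day) to the number of MCE log lines."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue  # not a machine check notification
        match = STAMP.match(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two lines taken from this thread, plus one unrelated kernel line.
sample = [
    "Mar  4 12:52:06 office kernel: [ 3706.202568] mce: [Hardware Error]: Machine check events logged",
    "Mar  8 07:04:42 office kernel: [ 1451.992598] CPU4: Core temperature/speed normal",
    "Mar  8 07:05:31 office kernel: [ 1501.326203] mce: [Hardware Error]: Machine check events logged",
]
for (month, day), n in count_mce_per_day(sample).items():
    print(f"{n:7d} {month} {day:2d}")
```

To run it against a real log, pass an open file instead of the sample list, e.g. `count_mce_per_day(open("/var/log/messages", errors="replace"))`.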

The machine's about 11 months old. Built by myself!

I've got some backups so I will have a look to see whether the problem existed before December 2014.

If it did, I shall try re-seating the memory and check the fans.

The core temperatures as I write this are 35/35/34/30.

Alex

aikempshall · 03-07-2015, 05:25 AM

I've gone through my backups and the earliest message.log I have is for August 2014.

The error message is in that log. So I suspect that I've always had the error.

The next time I open the box I shall remove what I consider the most appropriate memory stick and see whether the problem disappears. If it doesn't I shall move on from there and try with the other stick.

If it is a memory problem, it would appear it's BANK 0. Below is an extract from the mcelog output:

Code:

CPU 0 BANK 0 
CPU 1 BANK 0 
CPU 0 BANK 0 
CPU 1 BANK 0 
CPU 2 BANK 0 
CPU 2 BANK 0 
CPU 1 BANK 0 
CPU 0 BANK 0
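A small Python sketch for tallying the CPU/BANK lines in an mcelog dump, in case it helps to see the spread. The helper name `tally_cpu_bank` is made up, and the input is just the extract above. As far as I understand it, BANK in mcelog output is the index of the machine-check register bank on the reporting CPU rather than a DIMM slot, so a tally like this shows which cores report the error, not which memory stick is involved.

```python
import re
from collections import Counter

# Matches mcelog lines of the form "CPU 1 BANK 0".
PAIR = re.compile(r"^CPU\s+(\d+)\s+BANK\s+(\d+)")


def tally_cpu_bank(lines):
    """Count how often each (cpu, bank) pair appears in mcelog output."""
    counts = Counter()
    for line in lines:
        match = PAIR.match(line.strip())
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


# The extract posted above.
EXTRACT = """\
CPU 0 BANK 0
CPU 1 BANK 0
CPU 0 BANK 0
CPU 1 BANK 0
CPU 2 BANK 0
CPU 2 BANK 0
CPU 1 BANK 0
CPU 0 BANK 0
"""

for (cpu, bank), n in sorted(tally_cpu_bank(EXTRACT.splitlines()).items()):
    print(f"CPU {cpu} BANK {bank}: {n}")
```

For this extract the tally is three events each on CPU 0 and CPU 1 and two on CPU 2, all in bank 0.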

Alex

metaschima · 03-07-2015, 10:52 AM

Try running memtest for 3 runs. If they all pass, try running GIMPS in mode 1 to test the CPU:
http://www.mersenne.org/download/#source

EDDY1 · 03-08-2015, 02:37 PM

Asus has a 2 or 3 year warranty; if you're having problems with it you can RMA it. That will at least rule out a motherboard problem.

Soadyheid · 03-09-2015, 10:26 AM

Quote:

The next time I open the box I shall remove what I consider the most appropriate memory stick and see whether the problem disappears.

From this I'm going to assume you have 2 x 4GB DIMMs, as you don't say what your 8GB is made up of.
Have you tried swapping them round? The error, if it's memory, should then move to slot 1.
If you DO have 2 x 4GB DIMMs, are they identical and from the same manufacturer? If they're not a matched pair, you could get problems.

Play Bonny!

aikempshall · 03-09-2015, 02:33 PM

Hi Soadyheid

Yes, I have 2 x 4GB DIMMs, both from the same manufacturer and the same model. I will take your advice and swap them around.

Alex

aikempshall · 03-09-2015, 03:54 PM

Hi metaschima

Did a quick run at GIMPS in torture mode. I could hear some fans speed up significantly and the temperature shot up to what I thought was an unsatisfactory level. So I cancelled the test.

In the messages log I saw this for the duration of the test -

Code:

Mar  8 07:04:42 office kernel: [ 1451.992598] CPU4: Core temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992599] CPU2: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992600] CPU3: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992600] CPU0: Core temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992601] CPU6: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992602] CPU7: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992603] CPU5: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992603] CPU1: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992604] CPU0: Package temperature/speed normal
Mar  8 07:04:42 office kernel: [ 1451.992615] CPU4: Package temperature/speed normal
Mar  8 07:05:31 office kernel: [ 1501.326203] mce: [Hardware Error]: Machine check events logged

I'm going to have a look at the cooling for the box. Currently the temperature is 31/31/31/27 with no fan noise.

Alex

metaschima · 03-09-2015, 05:06 PM

Could be a cooling problem or maybe a CPU issue. If you cannot fix it, RMA it.

EDDY1 · 03-09-2015, 11:42 PM

Have you taken the side of the case off to see if all the fans are running?

business_kid · 03-13-2015, 03:25 PM

I had something a bit like that. In my case, it would say

hda: not ready
hdb: not ready

and just sit there. I could not even shut the thing down. It most probably has to do with logic levels being borderline. Having gone through things with an oscilloscope, I am used to seeing it in other people's equipment. It's usually lows not low enough, although sometimes it's highs not high enough.

Unload the data bus (fewer DIMMs). One of them could be sinking current. And clean the CPU heatsink with a paintbrush or something that will get in between the fins.

aikempshall · 04-13-2015, 03:37 AM

I took out one of the memory sticks and still had errors; I put it back in and took the other one out, and still got the errors.

I then realized that I was using an old version of mcelog. The latest version results in this:

Code:

Hardware event. This is not a software error.
MCE 0
CPU 2 BANK 0 
TIME 1428909602 Mon Apr 13 08:20:02 2015
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 60

As it's a "Corrected error" and my machine is not exhibiting any problems, other than these messages, I'm going to ignore them until something more significant happens.
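For reference, the STATUS value can be picked apart by hand. Below is a minimal Python sketch assuming the architectural IA32_MCi_STATUS layout from Intel's manuals (VAL at bit 63, OVER 62, UC 61, EN 60, MCA error code in the low 16 bits). The function name `decode_mci_status` is made up, and the corrected-error count field (bits 52:38) is only meaningful when the CPU advertises it in MCGCAP bit 10, which c09 appears to do. The only error-code names included are ones I'm sure of; this is not a full decoder like mcelog.

```python
# Names for a couple of "simple" MCA error codes. 0x0005 matches what the
# newer mcelog printed above; other codes are deliberately left undecoded.
SIMPLE_CODES = {
    0x0000: "No error",
    0x0005: "Internal parity error",
}


def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into a few architectural fields."""
    mca_code = status & 0xFFFF
    return {
        "valid": bool((status >> 63) & 1),        # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),     # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),  # UC: clear means the error was corrected
        "enabled": bool((status >> 60) & 1),      # EN: reporting was enabled
        "corrected_count": (status >> 38) & 0x7FFF,    # assumes MCGCAP bit 10 is set
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
        "description": SIMPLE_CODES.get(mca_code, "not decoded by this sketch"),
    }


# STATUS value from the mcelog output above.
fields = decode_mci_status(0x90000040000F0005)
for name, value in fields.items():
    print(f"{name}: {value}")
```

With the value from this thread, UC is clear (a corrected error), EN is set, and the error code is 0x0005, which agrees with the "Corrected error", "Error enabled" and "Internal parity error" lines from the newer mcelog.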

Alex

Soadyheid · 04-13-2015, 08:45 AM

As I mentioned before, swap the two DIMMs round. If there is some sort of intermittent memory problem, it should then show up as affecting BANK 1.

Meanwhile.. carry on as usual!

Play Bonny!