Linux - Hardware: This forum is for Hardware issues.
Having trouble installing a piece of hardware? Want to know if that peripheral is compatible with Linux?
to see if I could get a bit more information, it gave these results
Code:
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 1 BANK 0
TIME 1425459338 Wed Mar 4 08:55:38 2015
MCG status:
MCi status:
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
I've run a memory test and all appears OK.
I ran e2fsck on my root partition and that appears OK.
Any suggestions on what I should investigate next? Looking back through /var/log/messages, the problem's been around since at least 24th December 2014.
If you look at what mcelog returned, you'd see the "MCA: Unknown Error 5" message, and a CPUID. Based on those things, it appears to be a CPU parity issue, which could very well be transient, with any one of a number of causes. Have you overclocked or tinkered with the BIOS settings on that system?
I've had 102 occurrences in the last 3 weeks and 290 since 24th December 2014. My machine is in use almost all day every day. These are the number of occurrences for February and March
Some days it doesn't occur at all; on other days the number of occurrences varies. So I'm not sure whether this is transient. I've not tinkered with the BIOS in respect of the CPU or memory, or at least I can't recall doing so. I did tinker with the BIOS to change UEFI to legacy and might have changed some USB settings.
The machine's about 11 months old. Built by myself!
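For anyone wanting to produce a similar per-day tally, here is a minimal sketch in Python. It assumes the log uses the classic syslog date prefix and that each event shows up as the "Machine check events logged" line seen elsewhere in this thread; the function name `count_mce_per_day` is just a hypothetical helper, not part of any tool.

```python
import re
from collections import Counter

# Assumed syslog prefix, e.g. "Mar 8 07:05:31 office kernel: ..."
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s"
)
# Text the kernel writes when it logs a machine check event
MARKER = "Machine check events logged"


def count_mce_per_day(lines):
    """Return a Counter mapping (month, day) to the number of MCE lines."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("month"), int(match.group("day")))] += 1
    return counts


# One matching and one non-matching line, both copied from this thread
sample = [
    "Mar 8 07:04:42 office kernel: [ 1451.992598] CPU4: Core temperature/speed normal",
    "Mar 8 07:05:31 office kernel: [ 1501.326203] mce: [Hardware Error]: Machine check events logged",
]
for (month, day), n in count_mce_per_day(sample).items():
    print(f"{month} {day:2d}: {n}")
```

To run it against a real log, pass an open file instead of the sample list, for example `count_mce_per_day(open("/var/log/messages", errors="replace"))`. The syslog prefix carries no year, so logs spanning more than twelve months would need splitting first.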
I've got some backups, so I will have a look to see whether the problem existed before December 2014.
If it did, I shall try re-seating the memory and checking the fans.
The core temperatures as I write this are 35/35/34/30.
I've gone through my backups and the earliest message.log I have is for August 2014.
The error message is in that log. So I suspect that I've always had the error.
The next time I open the box I shall remove what I consider the most appropriate memory stick and see whether the problem disappears. If it doesn't I shall move on from there and try with the other stick.
If it is a memory problem, it would appear to be BANK 0; below is an extract from the mcelog output:
Code:
CPU 0 BANK 0
CPU 1 BANK 0
CPU 0 BANK 0
CPU 1 BANK 0
CPU 2 BANK 0
CPU 2 BANK 0
CPU 1 BANK 0
CPU 0 BANK 0
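A quick way to summarise an extract like this is to tally the CPU/BANK pairs. The sketch below is only an illustration with a hypothetical helper name (`tally_cpu_bank`). One caveat, as far as I understand it: the BANK number in mcelog output identifies a machine-check register bank inside the CPU, not a DIMM slot, so a constant BANK 0 does not by itself point at a particular memory stick.

```python
import re
from collections import Counter

# Matches lines of the form "CPU <n> BANK <m>" as printed by mcelog
PAIR_RE = re.compile(r"^CPU (\d+) BANK (\d+)$")


def tally_cpu_bank(lines):
    """Count how often each (cpu, bank) pair appears."""
    tally = Counter()
    for line in lines:
        match = PAIR_RE.match(line.strip())
        if match:
            tally[(int(match.group(1)), int(match.group(2)))] += 1
    return tally


# The eight lines from the extract above
extract = """\
CPU 0 BANK 0
CPU 1 BANK 0
CPU 0 BANK 0
CPU 1 BANK 0
CPU 2 BANK 0
CPU 2 BANK 0
CPU 1 BANK 0
CPU 0 BANK 0
""".splitlines()

tally = tally_cpu_bank(extract)
for (cpu, bank), n in sorted(tally.items()):
    print(f"CPU {cpu} BANK {bank}: {n}")
```

On this extract the events are spread across CPUs 0, 1 and 2 while the bank stays at 0.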
Alex
Asus has a 2 or 3 year warranty; if you're having problems with the board you can RMA it. That will at least rule out a motherboard problem.
Quote:
The next time I open the box I shall remove what I consider the most appropriate memory stick and see whether the problem disappears.
From this I'm going to assume you have 2 x 4GB DIMMs, as you don't say what your 8GB is made up of.
Have you tried swapping them round? The error, if it's memory, should then move to slot 1.
If you DO have 2 x 4GB DIMMs, are they identical and from the same manufacturer? If they're not a matched pair, you could get problems.
Did a quick run at GIMPS in torture mode. I could hear some fans speed up significantly and the temperature shot up to what I thought was an unsatisfactory level. So I cancelled the test.
In the messages log I saw this for the duration of the test -
Code:
Mar 8 07:04:42 office kernel: [ 1451.992598] CPU4: Core temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992599] CPU2: Package temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992600] CPU3: Package temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992600] CPU0: Core temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992601] CPU6: Package temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992602] CPU7: Package temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992603] CPU5: Package temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992603] CPU1: Package temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992604] CPU0: Package temperature/speed normal
Mar 8 07:04:42 office kernel: [ 1451.992615] CPU4: Package temperature/speed normal
Mar 8 07:05:31 office kernel: [ 1501.326203] mce: [Hardware Error]: Machine check events logged
I'm going to have a look at the cooling for the box. Currently the temperature is 31/31/31/27 with no fan noise.
I had something a bit like that. In my case, it would say
hda: not ready
hdb: not ready
and just sit there. I could not even shut the thing down. It most probably has to do with logic levels being borderline. Having gone through things with an oscilloscope, I am used to seeing it in other people's equipment. It's usually lows not low enough, although sometimes it's highs not high enough.
Unload the data bus (fewer DIMMs). One of them could be sinking current. And clean the CPU heatsink with a paintbrush or something that will get in between the fins.
I swapped out one of the memory sticks and still had errors. I put it back in and took the other one out, and I still get the errors.
I then realized that I was using an old version of mcelog. The latest version results in this -
Code:
Hardware event. This is not a software error.
MCE 0
CPU 2 BANK 0
TIME 1428909602 Mon Apr 13 08:20:02 2015
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
As it's a "Corrected error" and my machine is not exhibiting any problems, other than these messages, I'm going to ignore them until something more significant happens.
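For the curious, the STATUS value can be picked apart by hand. The sketch below assumes the general MCi_STATUS layout described in Intel's Software Developer's Manual (bit 63 valid, 62 overflow, 61 uncorrected, 60 enabled, bits 52:38 corrected error count, bits 31:16 model-specific code, bits 15:0 MCA error code); `decode_mci_status` is a hypothetical helper, not an mcelog function, and it ignores the other fields in the register.

```python
# STATUS value copied from the mcelog output above
STATUS = 0x90000040000F0005


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into a few of its fields."""
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        # Bits 52:38; only meaningful when the CPU reports corrected counts
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


fields = decode_mci_status(STATUS)
for name, value in fields.items():
    print(f"{name}: {value}")
```

With this value the uncorrected bit is clear and the enabled bit is set, which lines up with the "Corrected error" and "Error enabled" lines, and the MCA error code comes out as 5, the code the newer mcelog labels "Internal parity error".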