LinuxQuestions.org


epg 01-27-2017 10:02 AM

[Hardware Error]: System Fatal error
 
Hi,

Yesterday, while my PC was starting up, it rebooted on its own before the boot process had completed (i.e. before I got the login prompt). I'm not sure how far it got, but the second startup completed successfully and it has been working normally since then. When I checked the syslogs, I found these error messages:

Jan 26 08:46:10 epg-hp kernel: [ 3.839827] [Hardware Error]: System Fatal error.
Jan 26 08:46:10 epg-hp kernel: [ 3.839938] [Hardware Error]: CPU:0 (15:60:1) MC4_STATUS[Over|UE|MiscV|PCC|AddrV|-|-]: 0xfe00000000070f0f
Jan 26 08:46:10 epg-hp kernel: [ 3.840214] [Hardware Error]: MC4 Error Address: 0x00000000d0d00e50
Jan 26 08:46:10 epg-hp kernel: [ 3.840314] [Hardware Error]: MC4 Error (node 0): Watchdog timeout due to lack of progress.
Jan 26 08:46:10 epg-hp kernel: [ 3.840510] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

I Googled a bit and found some people saying this could be RAM errors; however, after ~8 hours of running, memtest didn't find any errors.

So... What next? Any ideas of what could have caused this error??

Running Slackware64-14.2, kernel 4.4.38

Thank you!

BW-userx 01-27-2017 11:46 AM

It might be one of those things that only happen when you're not looking.
I'd keep an eye on it and perhaps get a new set of RAM chips in case the current ones are failing. Or at least stash away some just-in-case money for it.

this may help give you a little more info on this

How to identify defective DIMM from EDAC error on Linux

epg 01-30-2017 09:38 AM

Thank you BW-userx for your reply and for the link; very useful...

I just ran memtest over the weekend (48+ hours) and still no errors were found. So I guess I can only wait and monitor if it'll come again.

BW-userx 01-30-2017 09:42 AM

Quote:

Originally Posted by epg (Post 5662318)
Thank you BW-userx for your reply and for the link; very useful...

I just ran memtest over the weekend (48+ hours) and still no errors were found. So I guess I can only wait and monitor if it'll come again.

You're welcome, happy to have helped.

Ilgar 02-01-2017 08:40 AM

I'm no expert on the subject, but aren't these errors related to the CPU cache and not the RAM? I agree with BW-userx that it could be a one-time thing.
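For what it's worth, the status word from the log can be picked apart by hand. Below is a rough sketch, assuming the generic x86 machine-check status layout (flag bits 63 down to 57, extended error code in bits 20:16, error code in the low 16 bits) and the usual "bus error" pattern for the low word; the function name and dictionary keys are made up for illustration. As far as I know, bank 4 (MC4) is the northbridge bank on these AMD parts, which would fit a bus/timeout error rather than a plain DRAM bit error, but treat that as an assumption.

```python
# Value copied from the MC4_STATUS line in the first post.
STATUS = 0xfe00000000070f0f

# Flag bits of the generic MCA status register (assumed layout).
FLAGS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (the kernel prints this as "UE")
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register is valid
    "AddrV": 58,  # address register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_status(status):
    """Split an MCA status value into flags and error-code fields."""
    flags = [name for name, bit in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF        # low 16 bits: error code
    xec = (status >> 16) & 0x1F   # bits 20:16: extended error code
    result = {"flags": flags, "error_code": code, "ext_error_code": xec}
    # Bus-error pattern: 0000 1PPT RRRR IILL (in the 2-bit fields, 3 = generic)
    if code & 0xF800 == 0x0800:
        result["bus"] = {
            "participation": (code >> 9) & 0x3,
            "timeout": bool((code >> 8) & 1),
            "mem_transaction": (code >> 4) & 0xF,
            "mem_io": (code >> 2) & 0x3,
            "cache_level": code & 0x3,
        }
    return result


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(key, value)
```

Every flag bit is set, the extended error code comes out as 7 (which the kernel apparently renders as the watchdog timeout message), and the bus fields are all "generic" with the timeout bit set. That lines up with the "GEN ... (timed out)" line in the log.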

epg 02-01-2017 02:42 PM

Thanks for the feedback... Yeah, you could be right. Anyway, I tried to run mcelog to capture proper logs if this issue happens again, but unfortunately AMD CPUs are not supported. :-(
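Without mcelog, one fallback is simply to pull the tagged lines out of the kernel log after each boot. A minimal sketch, assuming the kernel messages end up in a plain-text file such as /var/log/syslog (the path and both function names are assumptions for illustration, not something confirmed for this machine):

```python
MARKER = "[Hardware Error]"


def hardware_errors(lines):
    """Return only the lines that carry the kernel's hardware-error tag."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


def scan_log(path="/var/log/syslog"):
    """Read a log file and return its hardware-error lines.

    The default path is an assumption; pass whichever file your
    syslog daemon writes kernel messages to.
    """
    with open(path, errors="replace") as handle:
        return hardware_errors(handle)
```

Calling `scan_log()` after a suspicious reboot and saving the result somewhere safe would at least keep a record of each occurrence.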

epg 02-08-2017 09:05 AM

It just happened again:

[ 3.839717] [Hardware Error]: System Fatal error.
[ 3.839828] [Hardware Error]: CPU:0 (15:60:1) MC4_STATUS[Over|UE|MiscV|PCC|AddrV|-|-]: 0xfe00000000070f0f
[ 3.840069] [Hardware Error]: MC4 Error Address: 0x00000000d0d00e50
[ 3.840208] [Hardware Error]: MC4 Error (node 0): Watchdog timeout due to lack of progress.
[ 3.840406] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

Same error message, same address... Is it fair to say it's a hardware issue? Any suggestions on how to troubleshoot this further??
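To double-check the "same status, same address" observation without eyeballing hex strings, the two values can be extracted and compared. A small sketch using only the log lines already posted in this thread; the regular expressions and the `extract` helper are illustrative assumptions about the message format, not a general parser:

```python
import re

# Status and address lines copied from the first report (Jan 26).
FIRST = """\
[ 3.839938] [Hardware Error]: CPU:0 (15:60:1) MC4_STATUS[Over|UE|MiscV|PCC|AddrV|-|-]: 0xfe00000000070f0f
[ 3.840214] [Hardware Error]: MC4 Error Address: 0x00000000d0d00e50
"""

# The same two lines from the second report.
SECOND = """\
[ 3.839828] [Hardware Error]: CPU:0 (15:60:1) MC4_STATUS[Over|UE|MiscV|PCC|AddrV|-|-]: 0xfe00000000070f0f
[ 3.840069] [Hardware Error]: MC4 Error Address: 0x00000000d0d00e50
"""

STATUS_RE = re.compile(r"MC4_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)")
ADDR_RE = re.compile(r"MC4 Error Address: (0x[0-9a-fA-F]+)")


def extract(text):
    """Return (status, address) as hex strings, or None where not found."""
    status = STATUS_RE.search(text)
    addr = ADDR_RE.search(text)
    return (
        status.group(1) if status else None,
        addr.group(1) if addr else None,
    )


if __name__ == "__main__":
    print("identical:", extract(FIRST) == extract(SECOND))
```

Both reports yield the same status and address, and the timestamps are also within a fraction of a millisecond of each other at about 3.84 s into boot.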

Thank you

BW-userx 02-08-2017 09:18 AM

Quote:

Originally Posted by epg (Post 5667172)
It just happened again:

Same error message, same address... Is it fair to say it's a hardware issue? Any suggestions on how to troubleshoot this further??

Thank you

Go to a store that takes returns, buy some hardware, swap one part out, and see if the problem persists. If it does, put the old part back, then swap out another piece of hardware with a new one and do the same.

Repeat until the problem is no longer there.

Take back everything that did not fix the problem and get your money back.

Skaendo 02-08-2017 10:39 AM

Quote:

Originally Posted by epg (Post 5667172)
It just happened again:

Same error message, same address... Is it fair to say it's a hardware issue? Any suggestions on how to troubleshoot this further??

Thank you

Are you overclocking in any way?

epg 02-08-2017 11:22 AM

Not at all, no overclocking...

And the idea of changing HW until the problem disappears, I'm afraid it's not going to work for me. First, this is a company-owned laptop so I can't/shouldn't change the parts myself. And second, it's still under warranty so I'm gonna void it if I open the laptop.

I could just call warranty and see what they're gonna say, but I wanted to be sure this is indeed a HW issue...

Skaendo 02-08-2017 11:50 AM

Quote:

Originally Posted by epg (Post 5667251)
Not at all, no overclocking...

And the idea of changing HW until the problem disappears, I'm afraid it's not going to work for me. First, this is a company-owned laptop so I can't/shouldn't change the parts myself. And second, it's still under warranty so I'm gonna void it if I open the laptop.

I could just call warranty and see what they're gonna say, but I wanted to be sure this is indeed a HW issue...

Are you using an AMD CPU?

Is input–output memory management unit (IOMMU) available in the BIOS, and is it on?

epg 02-08-2017 12:00 PM

Yes, it's an AMD CPU on an HP 745 G3. I didn't see any iommu option in bios, don't think my pc supports that.

Skaendo 02-08-2017 04:33 PM

Quote:

Originally Posted by epg (Post 5667273)
Yes, it's an AMD CPU on an HP 745 G3. I didn't see any iommu option in bios, don't think my pc supports that.

I am at a loss. I kind of think that it's a software or driver issue.

glorsplitz 02-08-2017 06:23 PM

How long have you had the company-owned laptop? How long did it work before this error started happening? Did you change anything on the system since it was installed?

Check out this LINK and the link in the answer; it seems to be a CPU problem.

epg 02-08-2017 06:44 PM

Thank you for replying!

It's a brand new PC, got it just a couple of months ago. First time I noticed this error was around two weeks ago, when I started this thread. Yesterday it happened again... And no, no changes were made since I installed Slackware.

And I had seen that link you shared, but unfortunately I couldn't run mcelog; it seems (correct me if I'm wrong) that it doesn't support AMD CPUs.


All times are GMT -5.