System completely freezes CPU Stuck / Hardware Error

business_kid · 03-20-2021, 11:21 AM

Have you tried lowering the CPU frequency? Reseating the RAM?

What the logs show us is errors related to one particular bank of RAM, and some (probably kernel) process going out to lunch. You need to approach this logically and eliminate what you can. Do you have RAM in Bank 5? If not, ignore it. If so, swap it.
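To see at a glance whether the errors really do cluster on one bank, something like the following rough sketch could tally them. It assumes the kernel log lines carry "[Hardware Error]" and a "Bank <n>" field; the sample lines are made up for illustration and are not taken from this thread, so adjust the pattern to whatever your own log actually prints.

```python
import re
from collections import Counter

# Assumed line shape: "[Hardware Error]" followed somewhere by "Bank <n>".
BANK_RE = re.compile(r"\[Hardware Error\].*?\bBank\s+(\d+)", re.IGNORECASE)


def count_errors_by_bank(log_text):
    """Return a Counter mapping bank number -> number of matching log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = BANK_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = """\
[  120.001] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
[  120.002] mce: [Hardware Error]: TSC 0 ADDR 1ffffb1a8 MISC d012000100000000
[  300.417] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
[  512.900] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 1: 9c2040000000017a
"""

# Banks listed from most to fewest matching lines.
print(count_errors_by_bank(sample).most_common())
```

If one bank number dominates the tally, that is the one to chase first.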

This sort of thing is solved by "divide & conquer" techniques.

It's a while since I had hardware issues, but there's a pile of things you can do with the special kernel debugging keys. There's also a good choice of boot options to nobble various functions. Stop running the processes you always run, and see if dropping any of them eases your troubles. Add them back and test again.
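The "divide & conquer" idea can be sketched as a plain bisection. This is only an illustration under two big assumptions: exactly one item is the culprit, and the fault reproduces reliably whenever that item is active. `find_culprit`, `freezes_with` and the suspect names are all made up for the example; in real life `freezes_with` is you booting with only those items enabled and waiting to see if the box hangs.

```python
def find_culprit(suspects, freezes_with):
    """Bisect `suspects` down to the single item that triggers the fault.

    `freezes_with(subset)` must return True if the fault shows up when only
    `subset` is enabled. Assumes exactly one culprit and a reproducible fault.
    """
    candidates = list(suspects)
    if not candidates:
        raise ValueError("need at least one suspect")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if freezes_with(half):
            candidates = half                     # culprit is in the first half
        else:
            candidates = candidates[len(half):]   # culprit is in the rest
    return candidates[0]


# Hypothetical suspects; the lambda stands in for a real boot-and-wait test.
suspects = ["wifi-driver", "compositor", "game-launcher", "backup-job", "vpn"]
print(find_culprit(suspects, lambda enabled: "backup-job" in enabled))
```

With five suspects that is three test runs instead of up to five, and the saving grows with the list. With an intermittent freeze each "test" may take hours, so treat a clean run as weak evidence only.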

EDIT: We did clarify that the kernel does not directly support your wifi, didn't we? Did you use the Realtek source code to compile your wifi driver? And remember GNU/Linux ≠ Darwin 64. Which one is the driver for?

Also, somebody else has probably had this fault. Read up on their experience, if it's out there.

jsbjsb001 · 03-20-2021, 11:44 AM

Your posted kernel log output mentions the amdgpu driver having problems, and from my research and own experience, it's down to bugs in the amdgpu kernel driver. This would explain the freezing you're having, although I can't say if amdgpu is the only problem with your system.

In relation to your particular error messages from amdgpu, you might see if the following, quoted from a Gentoo forum thread, helps. You might also want to check out this other Gentoo forum thread.

Quote:

Since I installed a Radeon RX570 graphics card and got the AMDGPU driver working, I've had annoying 10-second delays when booting and shutting down, and when switching sessions between tty7 and tty8. Dmesg shows the driver trying to send a series of amdgpu powerplay commands and failing, along the following lines:

Code:

[13842.569209] amdgpu: [powerplay]
                failed to send message 171 ret is 0

I found the hangs and messages disappear if I set the amdgpu module parameter dpm=0, either with an /etc/modprobe.d entry along the lines of:

Code:

options amdgpu dpm=0

or the kernel command line parameter

Code:

amdgpu.dpm=0

I don't know if it's disabling something important or not. Xorg.0.log shows

Code:

[    22.232] (II) AMDGPU(0): DPMS capabilities: Off
[    22.535] (==) AMDGPU(0): DPMS enabled
[    22.547] (II) Initializing extension DPMS

but I'm not sure if DPMS (Energy Star power saving) is the same thing as dpm.

I found an intriguing reference that said on old cards and kernels dpm=1 enabled the new dpm; then when AMD power play came out, they swapped its definition and dpm=0 would select power play and dpm=1 would still select the old power management, which might explain the problem.
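Since the setting can come from two places, a small sketch like this can show which value each one is asking for. It only parses text in the two forms quoted above (`options amdgpu dpm=0` and `amdgpu.dpm=0`); the function names and the sample command line are made up for illustration. On a running system the value the module actually loaded with is normally readable from /sys/module/amdgpu/parameters/dpm, and the boot parameters from /proc/cmdline.

```python
def dpm_from_modprobe(conf_text):
    """Return the last dpm value set on an `options amdgpu ...` line, or None."""
    value = None
    for raw in conf_text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comment lines are ignored
        fields = stripped.split()
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == "amdgpu":
            for opt in fields[2:]:
                key, _, val = opt.partition("=")
                if key == "dpm" and val:
                    value = int(val)
    return value


def dpm_from_cmdline(cmdline):
    """Return the amdgpu.dpm value from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key == "amdgpu.dpm" and val:
            value = int(val)
    return value


# Hypothetical inputs, for illustration only.
print(dpm_from_modprobe("# power management off\noptions amdgpu dpm=0\n"))
print(dpm_from_cmdline("root=/dev/sda2 ro amdgpu.dpm=0 quiet"))
```

A result of None from both simply means neither place sets it, so the driver's default applies.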

dosensuppe · 03-31-2021, 09:26 AM

Quote:

Originally Posted by jsbjsb001

Your posted kernel log output mentions the amdgpu driver having problems, and from my research and own experience, it's down to bugs in the amdgpu kernel driver. This would explain the freezing you're having, although I can't say if amdgpu is the only problem with your system.

In relation to your particular error messages from amdgpu, you might see if the following, quoted from a Gentoo forum thread, helps. You might also want to check out this other Gentoo forum thread.

Thanks. I have now disabled dpm in /etc/modprobe.d/amdgpu.conf with

Code:

options amdgpu dpm=0

Let's see if this fixes something.
But unfortunately I think it's an independent problem from the total reboot freezes I get.

dosensuppe · 04-06-2021, 03:06 PM

Quote:

Originally Posted by dosensuppe

Thanks. I have now disabled dpm in /etc/modprobe.d/amdgpu.conf with

Code:

options amdgpu dpm=0

Let's see if this fixes something.
But unfortunately I think it's an independent problem from the total reboot freezes I get.

I had to delete the parameter again.
It causes terrible performance slowdowns in video games.

business_kid · 04-07-2021, 03:24 AM

At 33 posts on this thread, it's not something simple.

Time, surely, to brace yourself, download and build the latest stable kernel, and repeat the tests. If the problem still exists, prepare for an exchange with a moody developer.

I had one bug where I had to take on a hardware manufacturer, followed by a kernel dev. It turned out the hardware was at fault throwing spurious warnings, so the kernel dev altered his code to ignore it. He had been looking for this bug for years.

But things get sorted, because yours is the sort of feedback they need to make things better. Just develop a thick skin and you'll be fine.

dosensuppe · 04-14-2021, 10:12 AM

Quote:

Originally Posted by business_kid

At 33 posts on this thread, it's not something simple.

Time, surely, to brace yourself, download and build the latest stable kernel, and repeat the tests. If the problem still exists, prepare for an exchange with a moody developer.

I had one bug where I had to take on a hardware manufacturer, followed by a kernel dev. It turned out the hardware was at fault throwing spurious warnings, so the kernel dev altered his code to ignore it. He had been looking for this bug for years.

But things get sorted, because yours is the sort of feedback they need to make things better. Just develop a thick skin and you'll be fine.

Thank you. I guess sometimes you just get unlucky with specific hardware combinations. At this point I don't even know if I have any guarantee left for the mainboard, which I'm starting to believe is the culprit here.

dosensuppe · 04-14-2021, 12:07 PM

Do you think I can report this to a kernel dev directly?

business_kid · 04-15-2021, 06:19 AM

I have done so more than once over the years, and got results too. The deal is, you've got the problem hardware, you're seeing the bug, and they want to fix the bug.

Download the latest stable kernel and build it using your current config. If it pukes, file the bug against that kernel and the previous ones. Don't get shirty if whatever dev is dealing with you seems rude; they don't hold your hand. After admitting I am no programmer, I corrected a very basic C syntax error in one patch, and that caused a storm. But I had built a kernel with his faulty patch, tested it, and reported the results; I repeated that for his patch with my fix, which worked. He didn't like it one bit, but my syntax fix went in.
IIRC it was "if <condition>" --> "elif <condition>"
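For anyone who has not built a kernel with their current config before, here is a rough sketch of the usual sequence, written as a dry run that only prints the commands. The source directory name `linux-stable` and the config location under /boot are assumptions (some distros expose the running config as /proc/config.gz instead), and `kernel_build_plan` is just an illustrative helper, so check each step against your distro's own instructions before running anything.

```python
import os
import shlex


def kernel_build_plan(src_dir, current_config, jobs=None):
    """Return the commands for a build that reuses an existing config."""
    jobs = jobs or os.cpu_count() or 1
    return [
        # Start from the config of the kernel you are running now.
        ["cp", current_config, os.path.join(src_dir, ".config")],
        # Accept defaults for any options that are new in this kernel version.
        ["make", "-C", src_dir, "olddefconfig"],
        # Build the kernel and modules in parallel.
        ["make", "-C", src_dir, f"-j{jobs}"],
        # Install modules and kernel (needs root).
        ["sudo", "make", "-C", src_dir, "modules_install"],
        ["sudo", "make", "-C", src_dir, "install"],
    ]


# Dry run: print the plan instead of executing it. Paths are assumptions.
running = os.uname().release
for cmd in kernel_build_plan("linux-stable", f"/boot/config-{running}"):
    print(shlex.join(cmd))
```

Keeping the old kernel in the boot menu means a failed build costs a reboot, nothing more.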