System completely freezes CPU Stuck / Hardware Error
Have you tried slowing the CPU frequency? Reseating the RAM?
What the logs show us is errors related to one particular bank of RAM, and some (probably kernel) process going out to lunch. You need to approach this logically and eliminate what you can. Do you have RAM in Bank 5? If not, ignore it. If so, swap it.
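If you want to check whether the errors really do keep pointing at the same bank, a rough tally along these lines might help. This is only a sketch: it assumes each relevant log line contains the text "Bank <number>", and the sample lines are invented for illustration, not taken from the logs in this thread.

```python
import re
from collections import Counter

# Assumption: each relevant log line contains the text "Bank <number>".
BANK_RE = re.compile(r"\bBank\s+(\d+)\b", re.IGNORECASE)


def count_banks(lines):
    """Return a Counter mapping bank number -> number of lines mentioning it."""
    counts = Counter()
    for line in lines:
        match = BANK_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Invented sample lines, for illustration only (not real log output).
SAMPLE_LINES = [
    "hardware error: Bank 5: status reported",
    "hardware error: Bank 5: status reported",
    "unrelated message",
    "hardware error: Bank 2: status reported",
]

if __name__ == "__main__":
    for bank, hits in count_banks(SAMPLE_LINES).most_common():
        print(f"Bank {bank}: {hits} line(s)")
```

To use it on real output, feed it the lines of a saved log file instead of SAMPLE_LINES. If one bank dominates the tally, that is the one to look at first.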
This sort of thing is solved by "divide & conquer" techniques.
It's a while since I had hardware issues, but there's a pile of things you can do with special debugging kernel keys. There's also a good choice of boot options to nobble various functions. Stop running the processes you always run, and see if stopping any of them eases your troubles. Add them back, and test again.
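The "divide & conquer" idea can be sketched as a simple bisection over the list of things you can switch off. This is a minimal sketch under strong assumptions: exactly one item causes the fault, and the fault shows up reliably whenever that item is enabled (an intermittent freeze would need several test runs per step). The item names and the simulated_test function are hypothetical stand-ins for "boot with only these enabled and wait for a freeze".

```python
def find_culprit(candidates, fails_with):
    """Bisect `candidates` to find the single item that triggers the fault.

    `fails_with(subset)` must return True when the fault appears with only
    `subset` enabled. Assumes exactly one culprit and a reproducible fault.
    """
    suspects = list(candidates)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        # If the fault shows up with the first half enabled, the culprit is
        # in that half; otherwise it must be in the remainder.
        suspects = half if fails_with(half) else suspects[len(half):]
    return suspects[0] if suspects else None


# Hypothetical names, purely for illustration.
ITEMS = ["wifi driver", "gpu power management", "cpu frequency scaling", "bluetooth"]


def simulated_test(enabled):
    # Stand-in for a real test run; here the culprit is fixed in advance.
    return "gpu power management" in enabled


if __name__ == "__main__":
    print(find_culprit(ITEMS, simulated_test))
```

With four candidates this needs two test runs instead of four, which matters when each run means waiting hours for a freeze.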
EDIT: We did clarify that the kernel does not directly support your wifi, didn't we? Did you use the Realtek source code to compile your wifi driver? And remember GNU/Linux ≠ Darwin 64. Which is the driver for?
Also, somebody else probably had this fault, etc. Read his experience, if it's out there.
Last edited by business_kid; 03-20-2021 at 11:29 AM.
Your posted kernel log output mentions the amdgpu driver having problems, and from my research and own experience, these are bugs in the amdgpu kernel driver. This would explain the freezing you're having, although I can't say whether amdgpu is the only problem with your system.
As for your particular amdgpu error messages, you might see if the following, quoted from this Gentoo forum thread, helps. You might also want to check out this other Gentoo forum thread.
Quote:
Since I installed a Radeon RX570 graphics card and got the AMDGPU driver working, I've had annoying 10 sec delays, when booting and shutdown, and when switching sessions between tty7 and tty 8. Dmesg shows the driver trying to send a series of amdgpu powerplay commands and failing, along the following lines:
Code:
[13842.569209] amdgpu: [powerplay]
failed to send message 171 ret is 0
I found the hangs and messages disappear if I set the amdgpu module parameter dpm=0, either with an /etc/modprobe.d entry along the lines:
Code:
options amdgpu dpm=0
or the command line parameter
Code:
amdgpu.dpm=0
I don't know if it's disabling something important or not. Xorg.0.log shows
Code:
[ 22.232] (II) AMDGPU(0): DPMS capabilities: Off
[ 22.535] (==) AMDGPU(0): DPMS enabled
[ 22.547] (II) Initializing extension DPMS
but I'm not sure if DPMS (Energy Star power saving) is the same thing as dpm.
I found an intriguing reference that said on old cards and kernels dpm=1 enabled the new dpm; then when AMD power play came out, they swapped its definition and dpm=0 would select power play and dpm=1 would still select the old power management, which might explain the problem.
Last edited by jsbjsb001; 03-20-2021 at 11:45 AM.
Reason: grammer fix
Thanks. I have now disabled dpm in /etc/modprobe.d/amdgpu.conf with
Code:
options amdgpu dpm=0
Let's see if this fixes something.
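After a reboot, it may be worth confirming that the option was actually picked up before drawing conclusions. A minimal sketch, assuming the loaded amdgpu module exposes its parameters under /sys/module/amdgpu/parameters/ (the read_module_param helper is just an illustrative name):

```python
from pathlib import Path

# Assumption: the loaded amdgpu module exposes its parameters under this
# sysfs directory; adjust the path if your system differs.
DEFAULT_PARAM = Path("/sys/module/amdgpu/parameters/dpm")


def read_module_param(path=DEFAULT_PARAM):
    """Return the parameter value as a stripped string, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param()
    if value is None:
        print("dpm parameter not readable (module not loaded or path differs)")
    elif value == "0":
        print("amdgpu dpm=0 is in effect")
    else:
        print(f"amdgpu dpm is currently {value}")
```

If the value is not 0 after rebooting, the modprobe.d entry was not applied when the module loaded, and any test results would say nothing about dpm.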
But unfortunately I think it's an independent problem from the total reboot freezes I get.
At 33 posts on this thread, it's not something simple.
Time surely to brace yourself, download and build the latest stable kernel, & repeat tests. If the problem still exists, file a bug and prepare for an exchange with a moody developer.
I had one bug where I had to take on a hardware manufacturer, followed by a kernel dev. It turned out the hardware was at fault throwing spurious warnings, so the kernel dev altered his code to ignore it. He had been looking for this bug for years.
But things get sorted, because yours is the sort of feedback they need to make things better. Just develop a thick skin and you'll be fine.
Thank you. I guess sometimes you just get unlucky with specific hardware combinations. At this point I don't even know if I have any guarantee left for the mainboard, which I'm starting to believe is the culprit here.
I have done so more than once over the years, and got results too. The deal is, you've got the problem hardware, you're seeing the bug, and they want to fix the bug.
Download the latest stable kernel, build it using your current config. If it pukes, file the bug against that kernel, & previous ones. Don't get shirty if whatever dev is dealing with you seems rude - they don't hold your hand. After admitting I am no programmer, I corrected a very basic C syntax error on one patch and that caused a storm. But I had built a kernel with his faulty patch, tested it, & reported results; I repeated that for his patch with my fix, which worked. He didn't like it one bit, but my syntax fix went in.
IIRC it was "if <condition>" --> "elif <condition>"