I'm really glad this community exists, because I'm having trouble with one of my systems and I can't understand what's going on.
Initially I had two systems, and both worked great. I never had a single issue with either of them. Then I decided to swap the motherboards between the two systems. After the swap, one system worked absolutely fine, while the other started having problems. Both motherboards are ASUS boards: one is a ROG Strix B450-F and the other is a TUF X570 Plus WIFI. The TUF X570 is the one that started having problems.
I have two drives in this system: an NVMe drive with Debian 11 and a SATA SSD with the newest Kali image. The SATA SSD has the worst issues, although I have noticed the problem on the NVMe drive as well.
# PROBLEMS:
1. Intermittent stuttering
2. Sudden freezing that follows the stuttering and leads to a reboot.
After these reboots I cannot get the system back to normal until I shut down the computer and turn it on again with the case's power button.
This incident might occur instantly at the login screen, 1 minute after logging in, or 20 minutes after logging in, but it always happens when I boot from the SATA SSD. The NVMe drive is more stable, but it has happened there as well.
After one of these incidents, where the system stutters, freezes, and then reboots, I see the following error message (or a variation of it) on the Debian 11 screen that prompts me for my full-disk-encryption password.
Code:
[ 1.249302] mce: [Hardware Error]: Machine check events logged
[ 1.249303] mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000000000108
[ 1.249381] mce: [Hardware Error]: TSC 0 ADDR 1ffffc1657028 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[ 1.249462] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1651494778 SOCKET 0 APIC c microcode 8701021
I'm not sure what to make of this. I've tried to install and run mcelog, but to no avail: it says that my processor isn't supported by the program.
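Since mcelog won't run here, a rough way to read the status value by hand is to decode its flag bits. The following is only a minimal Python sketch: the function names are made up for illustration, bits 57 to 63 follow the long-standing x86 machine-check status layout, and the lower AMD-specific bit names are an assumption taken from the Linux kernel's definitions rather than something confirmed for this exact CPU. It also assumes the TIME field is a Unix timestamp in seconds.

```python
from datetime import datetime, timezone

# Bit positions in the machine-check status register. Bits 57-63 are the
# long-standing x86 layout; the lower entries are AMD-specific names from the
# Linux kernel's definitions and are an assumption for this particular CPU.
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
    55: "TCC (task context corrupt, AMD)",
    53: "SYNDV (syndrome register valid, AMD)",
    44: "DEFERRED (deferred error, AMD)",
    43: "POISON (poisoned data consumed, AMD)",
}


def decode_status(status_hex):
    """Return the names of the flag bits set in a hex status value."""
    status = int(status_hex, 16)
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


def error_code(status_hex):
    """Return the low 16 bits, which hold the machine-check error code."""
    return int(status_hex, 16) & 0xFFFF


def mce_time(epoch_seconds):
    """Convert the TIME field (assumed Unix seconds) to a UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    status = "bea0000000000108"  # value from the "Bank 5" line above
    for flag in decode_status(status):
        print(flag)
    print(f"error code: {error_code(status):#06x}")
    print(mce_time(1651494778).isoformat())
```

For the value in the log, this reports VAL, UC, EN, MISCV, ADDRV, PCC, TCC and SYNDV, with error code 0x0108. UC and PCC together mean the error was uncorrected and the processor context was flagged as corrupt. The sketch does not identify which hardware unit sits behind bank 5.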
# The troubleshooting steps that I have taken
1. I sent the TUF X570 back to ASUS on an RMA. They replaced the board, stating that there was some sort of hardware error on it, so I'm now working with a brand-new motherboard. But the new board started having the same problem as soon as I got it back.
2. I swapped the CPU for another CPU that I have, but the same thing happened again. When I put the first CPU into my other system, it ran fine with no problems.
3. I replaced the PSU with a 1000 W unit (upgraded from 750 W), with no improvement.
4. I tried different RAM, with no improvement.
5. I tried a different AMD GPU, with no improvement.
6. I've monitored the system under load and I'm sure that it's not a CPU overheating problem, because the temps never get above 60 °C even under heavy load (I have a liquid-cooled CPU).
I've noticed that a couple of times, just before the system freezes and reboots when running from the SATA SSD, htop shows one or more of my CPU cores maxed out at 100% and colored red. I'm not sure what to make of that. I've disabled Global C-state Control in my BIOS (which did seem to help), but I'm not sure where to go from here. It does seem to me that the CPU is at least part of the problem, but I can't figure it out. I don't have a ton of experience in hardware/software troubleshooting.
Debian 11 on the NVMe drive is much more stable than the SATA SSD install. I'm writing this from the NVMe drive right now, but whether it works or not is very unreliable. Could it have something to do with the CPU voltages in the BIOS?
What do you guys think?
Could the problem be the X570 chipset? I'm sort of lost here.