LinuxQuestions.org
Old 05-04-2022, 08:02 PM   #1
Arct1c_f0x
Member
 
Registered: Feb 2020
Posts: 124

Rep: Reputation: 24
System stutters, freezes, and reboots randomly even when not under load...


Really glad this community exists, because I'm really having trouble with one of my systems, and I can't understand what's going on.

Initially I had two systems. Both worked great. Never had even one issue with either of them. Then I decided that I wanted to swap the motherboards between the two systems. So I swapped them, and while one system worked absolutely fine, the other started having problems. The two motherboards I swapped were both ASUS boards: one was a ROG Strix B450-F and the other was a TUF x570 Plus WIFI. The TUF x570 was the one that started having problems.

I have two drives on this system: one is an NVMe drive that has Debian 11 and the other is a SATA SSD that has the newest Kali image. The SATA SSD is the one that has the greatest issues, although I have noticed the problem on the NVMe as well.


# PROBLEMS:
1. Intermittent stuttering
2. Sudden freezing that follows the stuttering and leads to a reboot.

After these reboots I cannot get the system back to normal until I shut down the computer and turn it on again with the case's power button.

This incident might occur instantly at the login screen, 1 min after logging in or 20 minutes after logging in but it always happens when I try to boot to the SATA SSD, and though the NVME is more stable it has occurred with it as well.

After one of these incidents where the system stutters, freezes and then reboots, I see the following error message (or a variation of the same message) on the Debian 11 screen where it prompts me to enter my full-disk-encryption password.

Code:
    [    1.249302] mce: [Hardware Error]: Machine check events logged
    [    1.249303] mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000000000108
    [    1.249381] mce: [Hardware Error]: TSC 0 ADDR 1ffffc1657028 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
    [    1.249462] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1651494778 SOCKET 0 APIC c microcode 8701021
I'm not sure what to make of this and I've tried to install and run mcelog but to no avail... It says that my processor isn't supported by the program.
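
As a rough aid for reading the log above, here is a minimal Python sketch that pulls the CPU number, bank number and status word out of the `Machine Check` line and names the architectural flag bits of the status register (`MCi_STATUS`). The helper names `decode_status` and `parse_mce_line` are made up for this illustration. Vendor-specific bits and the meaning of the low 16-bit error code are deliberately left uninterpreted, since those depend on the processor family.

```python
import re

# Architectural MCi_STATUS flag bits common to x86 machine-check banks.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}

# Matches the "CPU n: Machine Check: m Bank b: <hex status>" part of the line.
LINE_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)")


def decode_status(status):
    """Return the set flag names and the low 16-bit error code of a status word."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


def parse_mce_line(line):
    """Extract CPU, bank and decoded status from one kernel log line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("not a machine-check bank line")
    cpu, bank, status_hex = match.groups()
    info = decode_status(int(status_hex, 16))
    info.update(cpu=int(cpu), bank=int(bank))
    return info


# The line is copied from the log posted above.
sample = ("[    1.249303] mce: [Hardware Error]: CPU 5: "
          "Machine Check: 0 Bank 5: bea0000000000108")
print(parse_mce_line(sample))
```

For the status word in this log the sketch reports VAL, UC, EN, MISCV, ADDRV and PCC, that is, a valid, uncorrected error with the processor context marked corrupt, which would be consistent with the hard freeze and reboot described.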



# The troubleshooting steps that I have taken
1. I sent the TUF x570 back to ASUS on an RMA and they replaced the board, stating that there was some sort of hardware error on it, so I'm working with a brand new motherboard. But the replacement board started having the same problem as soon as I got it back.
2. I swapped the CPU for another CPU that I have, but it did the same thing again. When I put the first CPU into my other system, it ran fine with no problems.
3. I replaced the PSU with a 1000 W unit (upgraded from 750 W); still no improvement.
4. I tried different RAM; no improvement.
5. I tried a different AMD GPU; no improvement.
6. I've monitored the system under load and I'm sure that it's not a CPU overheating problem, because the temps never get above 60 °C even under heavy load (I have a liquid-cooled CPU).
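
For anyone who wants to log temperatures rather than watch them, a minimal sketch follows. It assumes a Linux system that exposes sensors through the kernel's hwmon sysfs interface, where each `temp*_input` file holds millidegrees Celsius. The function name `read_hwmon_temps` is made up for this illustration, and which chips and sensors appear depends entirely on the hardware and loaded drivers.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip_name, sensor_file): degrees_C} for each readable temp input."""
    base = Path(root)
    temps = {}
    if not base.is_dir():
        return temps  # no hwmon interface on this system
    for chip in sorted(base.glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the driver exposes no name file.
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read or parsed
            temps[(chip_name, sensor.name)] = millidegrees / 1000.0
    return temps


for (chip, sensor), celsius in read_hwmon_temps().items():
    print(f"{chip} {sensor}: {celsius:.1f} C")
```

Calling this in a loop with a timestamp and appending to a file would leave a record of the last readings before a freeze, which is hard to get from an interactive monitor.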


I've noticed a couple of times that, before the SATA SSD system freezes and reboots, htop shows one or more of my CPU cores maxed out at 100% and colored red. Not sure what to make of that. I've disabled Global C-state Control in my BIOS (which did seem to help), but I'm not sure where to go from here. It does seem to me that the CPU is at least part of the problem, but I can't figure it out. I don't have a ton of experience in hardware/software troubleshooting.

The Debian 11 install on the NVMe is much more stable than the SSD; I'm writing this from the NVMe right now, but it's very unreliable whether it works or not. Could it have something to do with the BIOS cpu voltages?

What do you guys think?
Is the problem the x570 chipset perhaps?? I'm sort of lost here.
 
Old 05-04-2022, 08:38 PM   #2
mrmazda
LQ Guru
 
Registered: Aug 2016
Location: SE USA
Distribution: openSUSE 24/7; Debian, Knoppix, Mageia, Fedora, OS/2, others
Posts: 6,336
Blog Entries: 1

Rep: Reputation: 2196
Quote:
Originally Posted by Arct1c_f0x View Post
Could it have something to do with the BIOS cpu voltages?
Not likely. RAM voltage is more likely. Does running with only one RAM stick make any difference? Is your RAM on the explicitly supported list for your 570 on Asus' web site? Is the latest 570 firmware (BIOS) installed?

Are you sure none of the case standoffs are contacting any trace or solder spike on the bottom of the motherboard? Are all standoffs at the same elevation, so that the board isn't unnecessarily flexed? If you put the motherboard back in the original case, does the problem remain?
 
1 members found this post helpful.
Old 05-04-2022, 08:41 PM   #3
Arct1c_f0x
Member
 
Registered: Feb 2020
Posts: 124

Original Poster
Rep: Reputation: 24
Quote:
Originally Posted by mrmazda View Post
Not likely. RAM voltage is more likely. Does running with only one RAM stick make any difference? Is your RAM on the explicitly supported list for your 570 on Asus' web site? Is the latest 570 firmware (BIOS) installed?

Are you sure none of the case standoffs are contacting any trace or solder spike on the bottom of the motherboard? Are all standoffs at the same elevation, so that the board isn't unnecessarily flexed? If you put the motherboard back in the original case, does the problem remain?
Thanks Mr. Mazda! These are great suggestions; I'm going to try them and get back to you.
 
Old 05-04-2022, 09:28 PM   #4
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 3,345

Rep: Reputation: 1486
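That machine check tells you it was logged by CPU 5 in machine-check bank 5. (The "bank" here is one of the CPU's machine-check reporting banks, not a RAM bank, so by itself it does not point at a particular DIMM.)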

Note that an x570 board is different from a b450 board. It seems likely that they may support different RAM chips. It also seems possible that a bad connection between the DIMM and the RAM socket could affect it. Additionally, RAM often must be plugged into the slots in the order given in the mobo specs. Plugging into the bank A slots when the book says to use bank B first (or vice versa) can sometimes cause problems.

Are you certain the RAM is supported on that board?
Are you certain that the BIOS supports that RAM and CPU?
Is the BIOS properly configured for the RAM installed?
A slight error in timing settings or RAM voltages could cause issues. Maybe reset the BIOS to defaults, then try again.

I have an ASRock B450 mobo and the RAM works at default settings, but to get the performance it is designed for I have to set it manually per the RAM specs.
 
1 members found this post helpful.
Old 05-05-2022, 05:08 PM   #5
Arct1c_f0x
Member
 
Registered: Feb 2020
Posts: 124

Original Poster
Rep: Reputation: 24
Alright guys so I tried most of the things that you suggested. Thanks for the suggestions by the way.

First I went to ASUS's support website and found my RAM in the supported RAM list. Nothing changed, same issues experienced.
Then I checked that all the RAM (4 sticks) was securely seated in the board; it was.
Then I went to the manufacturer of my RAM and they said the DRAM voltage should be 1.35 volts so I set it to that in the BIOS. Nothing changed, same issues experienced.
Then I checked the case to make sure no solder joints were inadvertently contacting the case. Nothing changed, same issues experienced.
Then I visually inspected the motherboard to see if there were any red flags; didn't find anything.
Then I checked the BIOS to see if I had the latest version (I did). Remember, this was a board that ASUS completely replaced (they sent me a brand new one).


Everything works now, and what was my final solution? I switched the boards back. All the other components stayed the same and I simply switched the MOBOs. Now both systems have been running a long time with no apparent issues. But what did I learn from this? Seriously, help me derive a lesson from this.


I can't understand what the issue was with the x570 motherboard. Sometimes computers seem like such crazy black magic. There are just so many levels of abstraction to a computer that it is difficult to understand how to accurately diagnose and troubleshoot problems.

In this case I think there must have been some really subtle hardware or software incompatibility. Now I have that x570 board running like a champ with Ubuntu (5.4 kernel) on my other system, like it was before.

The thing I hate the most about this is that I'm going to come away not really knowing what I learned from this whole ordeal. Feels bad man. I guess going with a less sophisticated chipset was what resolved my problem? I guess don't boot from a SATA SSD on the x570 chipset when booting to Debian 11 (5.10 kernel) or Kali (5.16 kernel).

Last edited by Arct1c_f0x; 05-05-2022 at 05:09 PM. Reason: left out a word for clarification
 
Old 05-05-2022, 08:57 PM   #6
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 3,345

Rep: Reputation: 1486
I did not see you mention if there was a difference in the CPUs or if the CPU remained mounted on the board when you did the swap.

I did see you note that you had changed a power supply.

If the ONLY difference is the SATA SSD vs the NVMe drive, then it is very easy to test by simply swapping the drives. If there are other differences, then swapping one device at a time should help narrow down what triggered the fault.

When a major change alters several items at once, an error is difficult to track down; changing one item at a time makes it easy.
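
The one-at-a-time idea can be written down as a small test plan. The sketch below is only an illustration: `single_swap_plan` is a made-up helper, and the baseline and alternative parts are labels borrowed from this thread, not a record of how either machine was actually configured.

```python
def single_swap_plan(baseline, alternatives):
    """Return (changed_part, configuration) pairs that each differ from the
    baseline in exactly one component."""
    plan = []
    for part, replacement in alternatives.items():
        if part not in baseline:
            raise KeyError(f"{part!r} is not part of the baseline")
        config = dict(baseline)  # copy so the baseline itself stays untouched
        config[part] = replacement
        plan.append((part, config))
    return plan


# Hypothetical baseline (the failing build) and one alternative per part.
baseline = {
    "motherboard": "TUF x570 Plus WIFI",
    "psu": "1000 W",
    "boot drive": "SATA SSD",
}
alternatives = {
    "motherboard": "ROG Strix B450-F",
    "psu": "750 W",
    "boot drive": "NVMe",
}

for part, config in single_swap_plan(baseline, alternatives):
    print(f"Change only the {part}: {config}")
```

If the fault disappears in exactly one of these configurations, the changed part in that configuration is the first suspect.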

I did not think of it before, but it also could be affected by temps if the CPU cooler was disturbed or the case air flow allowed the RAM to overheat.
 
1 members found this post helpful.
Old 05-05-2022, 10:04 PM   #7
mrmazda
LQ Guru
 
Registered: Aug 2016
Location: SE USA
Distribution: openSUSE 24/7; Debian, Knoppix, Mageia, Fedora, OS/2, others
Posts: 6,336
Blog Entries: 1

Rep: Reputation: 2196
Quote:
Originally Posted by Arct1c_f0x View Post
Then I went to the manufacturer of my RAM and they said the DRAM voltage should be 1.35 volts so I set it to that in the BIOS.
The trouble with this is that there is no guarantee the RAM voltage sensor is spot-on accurate, or that the RAM is OK "at" the spec voltage. The same applies to the rails from the PSU. The standard reference voltage for DDR4 is 1.2 V. The 1.35 V from the manufacturer is probably only a supported voltage, not the ideal. These are reasons why many non-gamer motherboards allow manually setting a non-spec voltage. I would have tested with the RAM voltage at least as high as 1.39 V, as well as set manually to 1.20 V, and at several in-between increments, to try to rule out or affirm RAM as an issue.
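
The sweep described above can be listed out before touching the BIOS. This is a minimal sketch: the 1.20 V and 1.39 V endpoints come from the post, while the 0.05 V step and the helper name `voltage_steps` are assumptions made for illustration, since the post only says "several in-between increments".

```python
from decimal import Decimal


def voltage_steps(low, high, step):
    """List test voltages from low to high inclusive, spaced by step.

    Decimal is used so that values such as 1.35 are not distorted by
    binary floating-point rounding.
    """
    lo, hi, inc = Decimal(low), Decimal(high), Decimal(step)
    if inc <= 0 or lo > hi:
        raise ValueError("need a positive step and low <= high")
    steps = []
    value = lo
    while value < hi:
        steps.append(value)
        value += inc
    steps.append(hi)  # always finish on the upper bound
    return steps


# Assumed step of 0.05 V between the endpoints given in the post.
for volts in voltage_steps("1.20", "1.39", "0.05"):
    print(f"DRAM voltage {volts} V: run a memory test, note any freeze or MCE")
```

Recording the outcome at each step makes it possible to say whether the fault tracks the DRAM voltage at all, which is the point of the sweep.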
 
1 members found this post helpful.
Old 05-05-2022, 10:33 PM   #8
Arct1c_f0x
Member
 
Registered: Feb 2020
Posts: 124

Original Poster
Rep: Reputation: 24
I had no idea how many problems improperly configured RAM can cause in a system.


By now all systems have been running for about 6 hours and I'm confident that the problem is resolved. I just stuck a Ryzen 5900X in one of the systems and flashed the BIOS to the newest version and all is running excellently.

The lesson I'm taking away is 'mo sophisticated chipset, mo problems'. Although I'm willing to concede that it may well have been a CPU/DRAM voltage issue.
 
Old 05-06-2022, 03:03 AM   #9
Arnulf
Member
 
Registered: Jan 2022
Location: Hanover, Germany
Distribution: Slackware
Posts: 305

Rep: Reputation: 111
Quote:
Originally Posted by Arct1c_f0x View Post
But what did I learn from this?
Never change a running system.
 
1 members found this post helpful.
Old 05-06-2022, 10:27 AM   #10
Arct1c_f0x
Member
 
Registered: Feb 2020
Posts: 124

Original Poster
Rep: Reputation: 24
Quote:
Originally Posted by Arnulf View Post
Never change a running system.
Sometimes you have to take the plunge! Headaches, misery and all
 
  


Tags
cpu, freezecrash, motherboard, rebooting

