LinuxQuestions.org
12-03-2021, 05:05 PM   #1
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 155

Rep: Reputation: 1
Graphics card suddenly causes boot crash with mce error


Something strange and unsettling happened to me today. I woke up to my screen no longer powering back on after moving the mouse, which has happened before. I restarted and was surprised to see that right before the login screen the monitor would power itself off, and this time I was unable to do a clean shutdown by pressing the power button. It soon became apparent the computer would stay frozen for roughly a minute, then restart itself and repeat the cycle. After one restart I was able to catch the following error message in the console:

https://i.imgur.com/zNK01Vs.jpg

I realized it must be hardware related, since I hadn't installed any updates or changed the system configuration for over a week, and this didn't happen yesterday on the exact same system. To confirm, I booted a live image and got the exact same behavior. I pulled out the memory modules and tried them in sets, disconnected all hard drives, tried two different screens (over HDMI and DisplayPort cables), booted two kernels (5.14 and 5.15), tried radeon vs amdgpu, and reset the CMOS via the pins. In the end the only thing that worked was removing my video card and plugging in an older one.

What makes this extremely bizarre is that I get an image up until boot time: I can enter the BIOS just fine and see GRUB, and there are no GPU freezes or graphical corruption. This seems to be all Linux detecting an error and freaking out over it. All error messages are prefixed with "mce" and, oddly enough, reference a CPU issue. The rest of my hardware works just fine, so it's not the processor, thank god.

Does anyone know what could break in a video card that would make Linux do this? I saw a reference to an `mcelog` command for these errors, but like I said, the machine becomes completely inoperable after that's printed, so I can't issue any commands. If you can suggest further tests I'll take a look, but please mention everything I could test up front, as I don't feel comfortable plugging and pulling the video card so often and risking damage to the motherboard (I did it twice today). If this is a hardware issue that can't be solved from the kernel, I have no choice but to spend a large sum of money I didn't want to spend. I figured I'd ask for help here first so I know I tried everything else.
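
Since the machine boots with the older card installed, one option is to read the failed boot's kernel log after the fact instead of at the moment of the crash. The sketch below assumes the messages were actually saved (for example, that `journalctl -k -b -1` shows the previous boot, which needs a persistent journal and may not hold if the freeze hit before anything was written). It pulls out the "mce: [Hardware Error]" lines and lists the bank numbers they mention. The function names and the sample log are made up for illustration; the sample is not the text from the screenshot above. As far as I know, machine checks are reported by the CPU even when the fault originates elsewhere, which would be consistent with the messages naming a CPU.

```python
import re

# Matches kernel log lines such as "mce: [Hardware Error]: ...",
# with or without a timestamp in front of them.
MCE_LINE = re.compile(r"mce: \[Hardware Error\]: (?P<detail>.*)")


def extract_mce_events(log_text):
    """Return the detail part of every 'mce: [Hardware Error]' line."""
    events = []
    for line in log_text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            events.append(match.group("detail").strip())
    return events


def banks_reported(events):
    """Collect the distinct machine-check bank numbers named in the events."""
    banks = set()
    for event in events:
        for number in re.findall(r"\bBank (\d+)", event):
            banks.add(int(number))
    return sorted(banks)


# Illustrative sample only (hypothetical values), NOT taken from this thread.
SAMPLE_LOG = """\
[    2.101] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: b200000000070f0f
[    2.102] mce: [Hardware Error]: TSC 0 ADDR fef1ff00
[    2.103] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

if __name__ == "__main__":
    events = extract_mce_events(SAMPLE_LOG)
    for event in events:
        print(event)
    print("banks:", banks_reported(events))
```

To use it on a real log, save the kernel messages of the failed boot to a text file and pass the file's contents to `extract_mce_events` in place of `SAMPLE_LOG`. What a given bank means depends on the CPU model, so the numbers are only a starting point for a search.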
 
12-05-2021, 05:03 AM   #2
elcore
Senior Member
 
Registered: Sep 2014
Distribution: Slackware
Posts: 1,753

Rep: Reputation: Disabled
Quote:
Originally Posted by MirceaKitsune
Does anyone know what could break in a video card that would make Linux do this?
Could be it's just bent because of the heat. Is there another machine where you could test the GPU?
 
12-05-2021, 09:02 AM   #3
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 155

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by elcore
Could be it's just bent because of the heat. Is there another machine where you could test the GPU?
The temperature sensor showed no overheating last time. This happens on boot, when the card is at its coolest. In the past, overheating would cause square corruption; I repasted the card and those issues went away, until whatever happened this week.
 
12-05-2021, 09:24 AM   #4
elcore
Senior Member
 
Registered: Sep 2014
Distribution: Slackware
Posts: 1,753

Rep: Reputation: Disabled
Well, if it's bent, it's clearly visible; you can't miss it. I've seen one that got melted and bent because the cable weight pulled it down.
It does not bend back when cooled down; it is stuck in that position until heated and straightened back up.
Have you checked the copper contacts for mold and that sort of thing? I had one that oxidized somehow; I cleaned it with a rubber pencil eraser and that fixed it.
So there is no other machine where you could test it?
 
12-05-2021, 09:55 AM   #5
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 155

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by elcore
Well, if it's bent, it's clearly visible; you can't miss it. I've seen one that got melted and bent because the cable weight pulled it down.
It does not bend back when cooled down; it is stuck in that position until heated and straightened back up.
Have you checked the copper contacts for mold and that sort of thing? I had one that oxidized somehow; I cleaned it with a rubber pencil eraser and that fixed it.
So there is no other machine where you could test it?
No hardware defects that I can tell; it looks pristine from the outside. I cleared the dust from it a while ago when I repasted it. The only other machine is my mother's computer, and unfortunately I can't test it there, as neither the case nor its PSU allows connecting it (the cable for the additional power pins doesn't reach from the motherboard).
 
12-05-2021, 10:10 AM   #6
elcore
Senior Member
 
Registered: Sep 2014
Distribution: Slackware
Posts: 1,753

Rep: Reputation: Disabled
Additional pins, so it requires an auxiliary power source. And the working (old) GPU does not require that?
It could mean the GPU burned out, the power connector on the motherboard burned out, or there is a PSU and/or cable fault.
Possibly just a capacitor, but I'm no electronics expert. I'd test it on another machine to make sure it's not another component's fault.
If all the other parts are working, including the additional power connector, then I'd say the GPU should go to a repair shop.
Some folks use the oven; sometimes all it takes is reflowing the solder. Most just buy a new GPU.
 
12-05-2021, 10:26 AM   #7
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 155

Original Poster
Rep: Reputation: 1
The old (broken) card has two additional power connectors, a 6-pin plus an 8-pin; the older (fallback) card has only one 6-pin connector and works fine. The PSU's cables can be used as either 6-pin or 8-pin, so last time I tried swapping which cable is plugged into which connector. That had no effect, so I don't suspect a bad socket. A new card is supposed to arrive soon (I needed an upgrade anyway), so I'll see how it goes.
 
  



Tags
crash, freeze, graphics, hardware, video



All times are GMT -5.
