LinuxQuestions.org > Forums > Linux Forums > Linux - Hardware
Old 09-28-2020, 08:45 PM   #1
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 156

Rep: Reputation: 1
amdgpu is overheating the graphics card and crashing the system (Radeon XFX R9 390)


First some background on the relevant hardware and software: My OS is openSUSE Tumbleweed x64, Kernel 5.8.10, Mesa 20.1.8, xf86-video-amdgpu 19.1.0, amdgpu module in use. My graphics card is a Radeon XFX R9 390.

Starting roughly one or two months ago, my video card has been causing the system to crash due to overheating. Once the edge sensor reads around 90°C I start seeing little flashing squares corrupting my screen. Soon afterward, if I don't quickly reduce the load, the system crashes and reboots on its own, then refuses to start up and reach POST for several minutes (I get three PC-speaker beeps and the machine won't boot once powered on). This makes anything that stresses the GPU dangerous to run, including most 3D engines. The only workaround I found is the command:

Code:
echo low | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
By forcing the performance level to the lowest frequencies I'm able to safely run most games, but of course they run extremely slowly, so this is not a real solution. It does, however, confirm that the way GPU / VRAM frequencies are set is causing the overheating and crashes.
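From what I've read, the same sysfs directory also exposes per-state clock controls via pp_dpm_sclk, which might allow something between "low" and full speed. A sketch of what I mean (untested on my exact card: the device path and state indices vary per system, and the writes need root):

```shell
# Sketch: allow only SCLK states 0..max_state instead of forcing "low".
# Check `cat pp_dpm_sclk` first to see which states your card exposes.
cap_sclk() {
    dev="$1"        # e.g. /sys/class/drm/card0/device
    max_state="$2"  # highest DPM state index to keep enabled
    echo manual > "$dev/power_dpm_force_performance_level"
    # pp_dpm_sclk accepts a space-separated list of allowed state indices
    echo $(seq 0 "$max_state") > "$dev/pp_dpm_sclk"
}
# cap_sclk /sys/class/drm/card0/device 4
```

Writing "auto" back to power_dpm_force_performance_level should restore the default behavior.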

I'm filing a bug report with FreeDesktop, as I believe this is a driver issue rather than a hardware failure: it didn't happen until several weeks ago, and I remember that a year prior my card used to reach 94°C at times without any graphical corruption or crashing. I can also verify that the two GPU fans are working well, though it takes a very long time for them to come on at full power (which I know is also controlled by the power management module). I get the impression the default parameters might not be configured correctly for the latest versions of the modules, causing the card to get overclocked and reach a dangerous temperature very quickly.

Please let me know how I can offer more info to help pin down where this issue resides, such as which part of the driver is overestimating the safe frequencies for my graphics card model and pushing it too far. Also, is there a way to tell amdgpu to cap the clocks at a particular (lower) frequency so it never reaches the point where it gets dangerously hot? Just as importantly, how do I tell it to start the fans at full speed sooner?
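On the fan question, from what I've read amdgpu exposes a standard hwmon interface next to the device directory. A sketch of forcing a duty cycle by hand (the hwmon index is an assumption for my system, pwm1 takes 0-255, the writes need root, and manual mode overrides the automatic curve, so use with care):

```shell
# Sketch: force the GPU fan to a fixed duty cycle via hwmon.
# pwm1_enable: 1 = manual, 2 = automatic; pwm1 is 0-255.
set_gpu_fan() {
    hwmon="$1"  # e.g. /sys/class/drm/card0/device/hwmon/hwmon3
    pct="$2"    # duty cycle in percent
    echo 1 > "$hwmon/pwm1_enable"
    echo $(( pct * 255 / 100 )) > "$hwmon/pwm1"
}
# set_gpu_fan /sys/class/drm/card0/device/hwmon/hwmon3 80
```

echo 2 > pwm1_enable should hand control back to the driver afterwards.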
 
Old 09-29-2020, 01:43 AM   #2
Roman Dyaba
Member
 
Registered: Sep 2020
Location: Vladivostok, Russia
Distribution: Slackware, UbuntuStudio, FreeBSD, GhostBSD
Posts: 317

Rep: Reputation: 40
https://www.linuxquestions.org/quest...rs-4175682814/
https://www.linuxquestions.org/quest...4/#post6170652

firefox -> " file:///usr/share/doc/amdgpu-* " has console instructions; I read there about the temperature regulator for the console.

see also : " file:///usr/share/doc/amdgpu-doc/index.html "

Last edited by Roman Dyaba; 09-29-2020 at 01:57 AM. Reason: corrections
 
Old 09-29-2020, 09:27 AM   #3
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 156

Original Poster
Rep: Reputation: 1
I don't use the (now outdated) AMD Catalyst proprietary driver, only the official amdgpu (not amdgpu-pro) provided by the OS. IIRC that's the latest and most stable driver now, but I'm suspecting there is an issue with it in the latest version interfering with this card model.

For starters I was curious if I could check what the maximum frequencies defined by the driver for my video card are, then modify them and see how that changes things. There must be some setting or component telling the system how high to allow stressing the GPU / VRAM at maximum load, which I'm assuming is where the range isn't quite right. Maybe the amdgpu team happened to get the specs wrong and overestimated the safe clock speeds, or another bug is causing them to get pushed a bit higher?
 
Old 10-01-2020, 05:20 PM   #4
obobskivich
Member
 
Registered: Jun 2020
Posts: 610

Rep: Reputation: Disabled
Quote:
Originally Posted by MirceaKitsune View Post
I don't use the (now outdated) AMD Catalyst proprietary driver, only the official amdgpu (not amdgpu-pro) provided by the OS. IIRC that's the latest and most stable driver now, but I'm suspecting there is an issue with it in the latest version interfering with this card model.

For starters I was curious if I could check what the maximum frequencies defined by the driver for my video card are, then modify them and see how that changes things. There must be some setting or component telling the system how high to allow stressing the GPU / VRAM at maximum load, which I'm assuming is where the range isn't quite right. Maybe the amdgpu team happened to get the specs wrong and overestimated the safe clock speeds, or another bug is causing them to get pushed a bit higher?
Have you tried using 'radeon' instead of 'amdgpu'? The 390X is just a re-badge of the 290X (the actual GPU is called 'Hawaii' - note that all of the 300 series are rebadges of something else) and sits right on the line between the cards that use radeon and the newer Fury/Vega+ cards that amdgpu is designed for (Hawaii is listed as 'experimental' for amdgpu). Every Linux system I've ever installed on my 290X (Slackware, Ubuntu, Xubuntu, etc.) has defaulted to 'radeon', and it has never had issues with that. On the Windows side, the newer AMD proprietary drivers are 'less good' than the drivers closer to the card's age, and I suspect that trend may exist on the Linux side as well.

Also of note: these cards had a reputation for running very hot - the 'stock' configuration will run at 90+°C under load (and AMD says this is just fine - the real question is 'for how long?'), but board partner cards with better heatsinks shouldn't be quite so toasty, although you've got a factory overclocked model there. My 290X (also an XFX, with a similar-looking cooler) will idle around 45-50°C in most settings, but under 'heavy load' (which is rare for it these days) it can climb past 70°C pretty quickly.

Some sources:
https://wiki.archlinux.org/index.php/Xorg#AMD
https://www.x.org/wiki/RadeonFeature/
https://www.x.org/wiki/RadeonFeature/#index5h2 (Hawaii is part of 'Sea Islands')

What you're describing on-screen sounds like artefacting, and what you're describing with temperatures implies you may have a hardware problem. 94°C is the official throttle point from AMD, and the card's onboard firmware will take over if you exceed it (it will do basically anything it can to keep you from going beyond 94°C). If you're artefacting at 90°C, that may be indicative of premature failure due to the constant, repeated heat stress (this is where the 'for how long?' question comes back to bite us - I'm not saying it's dead/dying, just that it's a contingency to consider).

I would try radeon if it is available, and see if that manages the card better or solves your concerns. As far as the fan control and clock control, this guide may help, but it again assumes amdgpu: https://www.maketecheasier.com/overclock-amd-gpu-linux/

You may be able to get by with the fan/sensor control regardless of driver (that seems reasonable to me, since lm-sensors doesn't depend on amdgpu or radeon).
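A crude safeguard in that spirit, driver permitting - purely a sketch, with the threshold, the sysfs path, and the polling loop left as assumptions for you to adapt:

```shell
# Sketch: drop to the "low" performance level once the GPU gets too hot.
# temp and threshold are in whole degrees C; dev is the card's sysfs
# device directory (e.g. /sys/class/drm/card0/device).
throttle_if_hot() {
    temp="$1" threshold="$2" dev="$3"
    if [ "$temp" -ge "$threshold" ]; then
        echo low > "$dev/power_dpm_force_performance_level"
    fi
}
# call it every few seconds with the current sensor reading, e.g. from
# the hwmon temp*_input files (note those report millidegrees)
```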

This may also be worth looking at: https://gitlab.com/corectrl/corectrl/
 
Old 10-01-2020, 07:29 PM   #5
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 156

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by obobskivich View Post
Have you tried using 'radeon' instead of 'amdgpu'? [...]
I can try the radeon driver by simply removing the kernel parameters I added manually (radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1). I use amdgpu since it's newer and faster; IIRC it's not enabled by default only because of minor, unrelated issues. Will try this later.

That's correct: I remember the GPU fans used to turn on at full power when the card reached 94°C, which only happened very briefly over a year ago, hence why I noticed. There were no artifacts or crashes then, from what I can remember. Now the problem seems to occur once the card reaches 90°C, which is high enough to justify it. There are no issues at any point if the temperature stays under 85°C, so it seems to be strictly heat related.

I'm hoping this is due to a driver update, but who can know for sure. Even if it were a hardware issue, though, the same solution would appear to apply: there should be a way to cap the maximum clocks the driver will push the card to at full capacity. I have no idea where the frequency settings are stored or loaded from, however.
 
Old 10-02-2020, 01:20 PM   #6
biker_rat
Member
 
Registered: Feb 2010
Posts: 413

Rep: Reputation: 246
I believe the radeon driver has no Vulkan support, whereas amdgpu does. If you think a recent update did this and you recently updated the kernel, revert the kernel?

Last edited by biker_rat; 10-02-2020 at 01:30 PM.
 
1 member found this post helpful.
Old 10-02-2020, 01:56 PM   #7
obobskivich
Member
 
Registered: Jun 2020
Posts: 610

Rep: Reputation: Disabled
Quote:
Originally Posted by MirceaKitsune View Post
I can try the radeon driver by simply removing the amdgpu kernel parameters I added manually (radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1). I use amdgpu since it's newer and faster, IIRC it not being enabled by default is a matter of minor and unrelated issues. Will try this later.
Right - and your understanding of amdgpu largely mirrors what I've read, but I also know radeon 'works well enough' on my 290X for what I need that machine to do (spoiler alert: it isn't playing the latest games).

Quote:
That's correct: I remember the GPU fans used to turn on at full power when the card reached 94°C, which only happened very briefly over a year ago, hence why I noticed. There were no artifacts or crashes then, from what I can remember. Now the problem seems to occur once the card reaches 90°C, which is high enough to justify it. There are no issues at any point if the temperature stays under 85°C, so it seems to be strictly heat related.
Honestly, all of those temperatures 'seem high' to me - I know AMD said the 290X/390X can do 94°C in operation, but they also said that was A-OK for the 4870X2 'back in the day', and mine died within 4 years running like that. Since then, I take manufacturer 'specs' on max thermals with a grain of salt (sure, the hardware can withstand it for some time, but I wouldn't take that to mean 'for many years'), and try to run hardware as cool as is reasonably possible - idle should be more like 30-50°C depending on case ventilation, number of attached monitors, ambient temperature, etc., and ideally load temperatures aren't going up into the 80-90 range. I've run Half-Life 2 (I know, so demanding...) on my 290X and it doesn't get that hot, and I've booted Windows on that machine and tried a few newer games (Fallout 4, for example) and it got into the mid-70s, which seemed okay enough.

What you're describing still doesn't entirely rule out hardware problems, but if it's 'stable' when the temperature is low enough, that's at least something. Have you looked at either of my other links? Both purport to give you clock/fan control over the card.

Something else I just thought of, since there's a time difference here and we're only talking about a few degrees - have you cleaned your case of dust/debris/etc. recently? Maybe it's just gotten clogged up with [whatever], and that's enough to get you from 90 to 94 and push the card into the threshold of instability.
 
Old 10-02-2020, 02:17 PM   #8
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 156

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by obobskivich View Post
Have you looked at either of my other links - both purport to give you clock/fan control over the card.
I've actually been looking at that today, but so far there's no clear solution. I'm aware I need to edit some files in "/sys/class/drm/card0/device", but which ones and how is rather complex... there's also an amdgpu.ppfeaturemask kernel parameter whose functionality I'm trying to understand, but I could find no explanation yet.
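The most I've found so far is that ppfeaturemask is a bitmask of powerplay features, and the current value can be read from /sys/module/amdgpu/parameters/ppfeaturemask. A sketch of flipping one bit (the OverDrive bit position, 0x4000, is taken from the kernel's PP_OVERDRIVE_MASK define and should be verified against your kernel sources):

```shell
# Sketch: compute a ppfeaturemask with the OverDrive bit (bit 14) enabled.
# Pass the current mask, e.g. $(cat /sys/module/amdgpu/parameters/ppfeaturemask).
with_overdrive() {
    printf '0x%x\n' $(( $1 | (1 << 14) ))
}
# with_overdrive 0xfffd3fff  ->  boot with amdgpu.ppfeaturemask=0xfffd7fff
```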

Quote:
Originally Posted by obobskivich View Post
Something else I just thought of, since there's a time difference here, and we're only talking about a few degrees - have you cleaned your case for dust/debris/etc recently? Maybe it's just gotten clogged up with [whatever] and that's enough to get you from 90 -> 94 and run the card into the threshold of unstable.
I blow the dust from the heatsink every once in a while; it's as clean as it gets. I also wanted to take the heatsink off to apply new thermal paste just in case, but it seems to be held in place by special screws I can't remove: I'm planning to call a friend and see if he can help with that next week.
 
Old 10-02-2020, 03:29 PM   #9
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 156

Original Poster
Rep: Reputation: 1
Some important news: it appears this may in fact be an issue with amdgpu specifically. I booted my system on the radeon module by temporarily removing the kernel parameters "radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1". The console output and the renamed temperature sensor confirmed the switch.

Code:
mircea@linux-qz0r:~> /sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] [1002:67b1] (rev 80)
        Subsystem: XFX Pine Group Inc. Device [1682:9390]
        Kernel driver in use: radeon
        Kernel modules: radeon, amdgpu
Versus (on amdgpu):

Code:
mircea@linux-qz0r:~> /sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] [1002:67b1] (rev 80)
        Subsystem: XFX Pine Group Inc. Device [1682:9390]
        Kernel driver in use: amdgpu
        Kernel modules: radeon, amdgpu
I then opened Blender and loaded an Eevee scene, one that I know would overheat my GPU within seconds with the viewport set to rendered view. This time, however, there were no issues! I got some stretched vertices, likely caused by another unrelated bug, but no crashes or the square glitches caused by overheating as I moved the view around.

Watching the sensors in a console explains why: the GPU was never allowed to exceed 84°C on radeon, unlike the 94°C it will reach on amdgpu... precisely the safe range I noticed for my card, since the square glitches start occurring from 88°C. Just as importantly, the secondary fan on the card started spinning soon after 80°C; on amdgpu it doesn't spin until over 90°C, which is extremely high.

Already I can see something out of the ordinary in the outputs. Here's a snapshot of the "watch sensors" command while on the radeon module (under load by Blender):

Code:
Every 2.0s: sensors                                                                                      linux-qz0r: Fri Oct  2 22:47:34 2020

k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.32 V
Vsoc:          1.09 V
Tctl:         +61.2°C
Tdie:         +61.2°C
Tccd1:        +52.5°C
Icore:         7.00 A
Isoc:          8.00 A

nvme-pci-0100
Adapter: PCI adapter
Composite:    +54.9°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +54.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +54.9°C  (low  = -273.1°C, high = +65261.8°C)

radeon-pci-0a00
Adapter: PCI adapter
temp1:        +82.0°C  (crit = +120.0°C, hyst = +90.0°C)
Now here's what the same command looks like while on amdgpu:

Code:
Every 2.0s: sensors                                                                                      linux-qz0r: Fri Oct  2 23:14:22 2020

k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.32 V
Vsoc:          1.09 V
Tctl:         +44.8°C
Tdie:         +44.8°C
Tccd1:        +46.2°C
Icore:        10.00 A
Isoc:          8.00 A

nvme-pci-0100
Adapter: PCI adapter
Composite:    +47.9°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +49.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0a00
Adapter: PCI adapter
vddgfx:        1.04 V
fan1:             N/A  (min =    0 RPM, max = 6000 RPM)
edge:         +58.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       59.15 W  (cap = 208.00 W)
Notice the GPU sensor (named temp1 on radeon and edge on amdgpu): (crit = +120.0°C, hyst = +90.0°C) in the former, versus (crit = +104000.0°C, hyst = -273.1°C) in the latter. The thresholds in the second output look like broken values! Could this be a clue?
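For anyone wanting to watch just the GPU reading during tests like these, a small filter works with either driver (a sketch; the temp1/edge label names are taken from the outputs above):

```shell
# Sketch: pull the GPU die temperature out of `sensors` output.
# radeon labels the reading temp1, amdgpu labels it edge.
gpu_temp() {
    grep -E '^(edge|temp1):' | head -n 1 | sed -E 's/^[^+]*\+([0-9.]+)°C.*/\1/'
}
# sensors | gpu_temp
```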

Until that is solved I can use the legacy radeon module as a workaround if need be. However, I don't wish to do so for too long: it's an older, slower driver, and the lack of performance improvements and its outdated architecture will likely show in modern games. amdgpu is the normal driver going forward, even if it's still not enabled by default on GCN 1.0 / 2.0 cards, and apart from this issue it's working perfectly.
 
Old 10-02-2020, 04:15 PM   #10
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 156

Original Poster
Rep: Reputation: 1
Just ruled out DisplayCore as well, since I know it's a usual suspect: booting with the additional parameter "amdgpu.dc=0" does not affect the issue; the card will still overheat on amdgpu but not on radeon.
 
Old 10-03-2020, 01:57 PM   #11
obobskivich
Member
 
Registered: Jun 2020
Posts: 610

Rep: Reputation: Disabled
Quote:
Originally Posted by MirceaKitsune View Post

I blow the dust from the heatsink every once in a while, it's as clean as it gets. I also wanted to take the heatsink off to apply new thermal paste just in case, but it seems to be held in place by special screws I can't take off: I'm planning to call a friend and see if he can help out with that next week.
Having taken apart a few of those XFX 'DD' coolers, if memory serves they've got a few (illegal) 'warranty void' stickers over some screw heads, which may obscure them, but I recall it being a very straightforward disassembly - the heatsink itself should only be the four screws on the back, and there's also a baseplate under it held on by other screws (it's a two-part affair - technically three, since the plastic shroud comes off too, but only once the heatsink is off). I don't know if they've since moved to obnoxious screw heads just to further impede your efforts at repair.

I do know, however, that on the few newer graphics cards I have (nVidia 600-series and newer), the factory TIM is generally good enough that switching to fancy aftermarket paste makes little to no difference, especially at load - and I haven't seen pink slime in probably a decade.


Quote:
Originally Posted by MirceaKitsune View Post
Notice the GPU sensor (named temp1 on radeon and edge on amdgpu): (crit = +120.0°C, hyst = +90.0°C) in the former, versus (crit = +104000.0°C, hyst = -273.1°C) in the latter. The thresholds in the second output look like broken values! Could this be a clue?

Until that is solved I can use the legacy radeon module as a workaround if need be. However I don't wish to do so for too long: It's an older driver, slower, and the lack in performance improvements and outdated architecture will likely show in modern games. amdgpu is the normal driver even if it's still not enabled by default on GCN 1.0 / 2.0 cards, and apart from this issue it's working perfectly otherwise.
Glad radeon works - it looks like what they basically did was hatchet-job the power/thermal management rather than spend the time to set it up right. Hawaii was one of the first 'modern' GPUs in terms of power/clock management: it doesn't (technically) have a nominal clock speed - it just algorithmically picks a speed, although it does have an enforced maximum clock (which isn't dictated purely by TDP). If you disable all the power/thermal limits you can usually get some more performance out of the card, IF you can cool it (and that generally requires liquid cooling).

I'm not sure the statement 'amdgpu is the normal driver' is accurate here - it's listed (and has always been listed, as far as I'm aware) as 'experimental' for early GCN cards like Hawaii, and I doubt there's any push to change that status, because such cards are largely considered 'abandoned' or 'obsolete' (or at least obscure) - I think most folks with AMD GPUs have Polaris or Vega these days. If you can figure out a way to force better power/thermal parameters in amdgpu's configuration, that would probably be the ticket. From digging around on the ArchWiki, this may be useful: https://github.com/sibradzic/amdgpu-clocks (although note that weasel word 'recent' in the description).
 
Old 10-05-2020, 05:25 PM   #12
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 156

Original Poster
Rep: Reputation: 1
Since this seemed relevant, here's the output of "cat /sys/kernel/debug/dri/0/amdgpu_pm_info", which shows clock-related video card settings. I ran the command while stressing the GPU with Blender, in case that gives more info.

Code:
Clock Gating Flags Mask: 0x0
        Graphics Medium Grain Clock Gating: Off
        Graphics Medium Grain memory Light Sleep: Off
        Graphics Coarse Grain Clock Gating: Off
        Graphics Coarse Grain memory Light Sleep: Off
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: Off
        Graphics Run List Controller Light Sleep: Off
        Graphics 3D Coarse Grain Clock Gating: Off
        Graphics 3D Coarse Grain memory Light Sleep: Off
        Memory Controller Light Sleep: Off
        Memory Controller Medium Grain Clock Gating: Off
        System Direct Memory Access Light Sleep: Off
        System Direct Memory Access Medium Grain Clock Gating: Off
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: Off
        Unified Video Decoder Medium Grain Clock Gating: Off
        Video Compression Engine Medium Grain Clock Gating: Off
        Host Data Path Light Sleep: Off
        Host Data Path Medium Grain Clock Gating: Off
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: Off
        Rom Medium Grain Clock Gating: Off
        Data Fabric Medium Grain Clock Gating: Off
        Address Translation Hub Medium Grain Clock Gating: Off
        Address Translation Hub Light Sleep: Off

GFX Clocks and Power:
        1500 MHz (MCLK)
        999 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        150 MHz (PSTATE_MCLK)
        1206 mV (VDDGFX)
        206.147 W (average GPU)

GPU Temperature: 88 C
GPU Load: 100 %
MEM Load: 12 %

UVD: Disabled

VCE: Disabled
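To watch just the clocks and temperature from that file over time, something like this helps (a sketch; the field names are taken from the output above, and reading the debugfs file needs root):

```shell
# Sketch: reduce amdgpu_pm_info to a one-line summary (SCLK, MCLK, temp).
pm_summary() {
    awk '/\(SCLK\)$/ { s=$1 } /\(MCLK\)$/ { m=$1 }
         /^GPU Temperature:/ { t=$3 }
         END { print "sclk=" s "MHz mclk=" m "MHz temp=" t "C" }'
}
# watch -n2 'sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info | pm_summary'
```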
 
Old 10-11-2020, 05:59 PM   #13
MirceaKitsune
Member
 
Registered: May 2009
Distribution: Manjaro
Posts: 156

Original Poster
Rep: Reputation: 1
Some nice news: I was able to improve the situation on the hardware side by removing the heatsink and applying new thermal paste to the GPU. The old paste had dried out and solidified in places... on top of that, some heatsink screws were worryingly loose, so I tightened them. A positive outcome is already noticeable: the idle temperature seems a tiny bit lower, the card takes longer to heat up, and although the "edge" sensor will still cap out at 94°C, I no longer seem to get the square corruption and system crash right away. I only ran a brief test and sustained heating may still cause issues, but the new thermal paste definitely helped for now.

It remains arguable whether amdgpu is still at fault for not picking up on the problem and doing something to prevent it: letting the card heat up to 94°C still feels like too much in my opinion, even if it's probably allowed to reach this temperature by design. It feels especially risky if the driver has no way to detect when this heat is about to cause a system failure, putting other cards with worn thermal paste or lowered heat tolerance in danger. What do you think?

I do believe the amdgpu PWM module has at least one problem: it takes far too long to turn on the secondary fan. Even after the sensor reached the maximum temperature mentioned, I only heard the back fan briefly come on for a few seconds after the card had sat at that temperature for a while. What decides when the fans run at full power, and could the driver be tweaked to make this happen a little sooner, for good measure?
 
Old 10-15-2020, 04:44 PM   #14
obobskivich
Member
 
Registered: Jun 2020
Posts: 610

Rep: Reputation: Disabled
Quote:
Originally Posted by MirceaKitsune View Post
Some nice news: I was able to improve the situation on the hardware side by removing the heatsink and applying new thermal paste to the GPU. The old paste had dried out and solidified in places... on top of that, some heatsink screws were worryingly loose, so I tightened them. A positive outcome is already noticeable: the idle temperature seems a tiny bit lower, the card takes longer to heat up, and although the "edge" sensor will still cap out at 94°C, I no longer seem to get the square corruption and system crash right away. I only ran a brief test and sustained heating may still cause issues, but the new thermal paste definitely helped for now.
Glad to hear this worked. I've had heatsinks in the past that worked their screws loose - usually that's the end of the line for that heatsink in my book (I won't put something like Loctite on there and make it impossible to take back apart, but that's just me).

Quote:
It remains arguable whether amdgpu still has a fault for not picking up on the problem and doing something to prevent it: Letting the card heat up to 94C* still feels too much in my opinion, even if it's probably allowed to reach this temperature by design. That still feels too hot especially if the driver doesn't have a way to detect when this heat is about to cause a system failure, putting other cards with worn thermal paste or lowered heat tolerance in danger. What do you think?
According to AMD, 94°C is perfectly fine and you should not complain; in my view it's too hot for long-term reliability, and I'd love to see hardware manufacturers come back from the ledge with temperatures, because it does impact reliability. Running these chips at (or near enough to) 100°C 24x7 is a really bad idea long-term, but I guess if it keeps you buying a new laptop/machine/whatever every 2-3 years, it's good for someone's performance bonus...

Quote:
I do believe the amdgpu PWM module has at least one problem: It takes far too long to turn on the secondary fan. Even after the sensor reached the maximum temperature mentioned, I only heard the back fan briefly come on for a few seconds after it stood at that temperature for a while. What decides when the fans will run at full power, and could the driver be tweaked to make this happen a little sooner for safe measure?
What do you mean by the 'secondary fan'? Is this one of those stupid after-market cards that ignores the reference fan curve/design, has multiple fans that come on/off at different points, and assumes you'll install a 300-600 MB+ bloatware Windows-only package to manage this 'feature' effectively? On my XFX DD cards, all the fans are hardwired together and never turn off entirely - they slow down, but they never stop. I have two newer nVidia cards that shut their fans off at idle, but that is apparently standard behavior now (as in, all newer GeForce cards do this), and they never exceed 50°C in this idle state (and if they somehow managed it, the fans come on at 50°C).

Something really crude you might consider here: just remove the driver/software/board from the equation - run the fans directly off the PSU, or off a separate controller you control (if you don't like the performance/noise at a fixed voltage). I run a lot of my systems like this specifically because I don't trust software fan control, for the reasons you're largely running into here. ('Firmware control' like what the newer nVidia cards do I'm largely undecided on - neither of my cards has died yet, neither runs particularly hot, and the older of the two is around 3 years old at this point. At low RPM their fans are still inaudible and they'd run cooler that way, but maybe being 'off' a lot extends the fan's life - realistically the fan on either board is only on for about 10-20% of the machine's overall duty cycle, so that's probably hundreds or thousands of hours of runtime a year they're not racking up.)
 
Old 10-27-2020, 11:18 AM   #15
obobskivich
Member
 
Registered: Jun 2020
Posts: 610

Rep: Reputation: Disabled
I know this thread 'ended' a while ago, but I saw this announcement today and thought it would be relevant: https://linuxreviews.org/Linux_5.9_B..._Kernel_Driver

It looks like Kernel 5.9 may hold some improvement(s) for your R9 390.
 
  


Tags
amd, amdgpu, crash, gpu, hardware


