Ryzen system, occasional lockups still in 2018?

asheshambasta · 03-31-2020, 10:12 AM

True for both. I'm beginning to regret falling for the Threadripper brand name. I'm now stuck with a very expensive desktop system that needs to be hacked to run normally, and other people report that disabling C6 in the BIOS may or may not fix the issue completely. There are other microcode-related bugs floating around, and AMD is not even close to fixing them, or even admitting to these issues so they can be addressed (for example the rdrand bug that was patched last year or so: https://arstechnica.com/gadgets/2019...ed-my-weekend/)

It turns out that my processor was also affected by this at the time of buying, but a BIOS update patched the problem. So now I'm no longer experiencing issues because of this bug but because of something else. Who knows how many more of these issues I will need to fix before this system becomes reliable, if at all?

There is a really expensive lesson to be learnt here for me: stay away from new hardware and put in months of research before investing this much in hardware.

I have since also replied to the AMD support email, making it clear that their advice cannot really be termed a fix, let alone an acceptable response to a customer who spent a good chunk of money on their product. I'll keep posting here as this progresses.

At this time, I'd happily trade the horsepower of my Threadripper system for stability.

bassmadrigal · 03-31-2020, 10:37 AM

Quote:

Originally Posted by asheshambasta

So AMD was decently prompt in responding to my support request but far from helpful. Their fix to this issue is to disable C6 in your BIOS. This is 2020, and for a high-end processor that cost close to $1000 (at the time of buying), the vendor advises disabling power management to make a system this expensive behave acceptably.
Needless to say, at this point I'm desperate enough to try anything to make the desktop work decently, so I'll take their advice. But I'm keeping away from AMD for the time being for all new hardware investment for the foreseeable future.

I would highly recommend trying the rcu-nocbs kernel option on your CPU. It has completely solved my lockups for over 2 years. I used to get them every few days or maybe 1-2 weeks max, and since adding it, I've had my computer running for over 90 days before a power outage forced me to shut it down.

To use it, add rcu-nocbs=0-15 to your kernel appends (replacing the 15 with the number of hardware threads your CPU has minus one; mine is a Ryzen 7 1800X, which is 8c/16t, so mine is 0-15). NOTE: You do need to have a few things enabled in the kernel to use this option, but if you're using Pat's -current kernel or his config from -current, they are already enabled.

Code:

k/kernel-source-4.14.35-noarch-1.txz:  Upgraded.
   RCU_EXPERT n -> y
  +RCU_FANOUT 32
  +RCU_FANOUT_LEAF 16
  +RCU_FAST_NO_HZ y
  +RCU_NOCB_CPU y
  Thanks to alienBOB.
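
For anyone unsure about the range arithmetic described above, here is a minimal sketch. It is only an illustration: the helper name nocb_range is made up, and it assumes os.cpu_count() reports the same logical CPU (thread) count the kernel sees. Note that the kernel documentation spells the parameter rcu_nocbs; the sketch prints the spelling used in this thread.

```python
import os


def nocb_range(logical_cpus: int) -> str:
    """Return the CPU range covering every logical CPU, e.g. 16 -> '0-15'."""
    if logical_cpus < 1:
        raise ValueError("need at least one logical CPU")
    # A single-CPU system has no range to speak of, so return just "0".
    return "0" if logical_cpus == 1 else f"0-{logical_cpus - 1}"


if __name__ == "__main__":
    # os.cpu_count() can return None on unusual platforms; fall back to 1.
    cpus = os.cpu_count() or 1
    print(f"rcu-nocbs={nocb_range(cpus)}")
```

On an 8c/16t CPU this should give the same 0-15 range quoted above.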

slackerDude · 03-31-2020, 01:19 PM

My Ryzen 1700 (BIOS recent as of 11/2019 or so, IIRC) has been much better recently (MSI B350 Gaming Plus board), but just the other day I tried to run Folding@Home with 16 threads. It ran fine overnight. I came back, turned on the screen, opened a new tab in the browser, read a few things, and poof. Blank screen. This is WITH rcu-nocbs=0-15. However, I don't know if I have all the right kernel options enabled. No messages of any kind in /var/log indicated that anything was wrong or had been detected.

I have an RMAed 1700 (i.e. not the original) because the original failed the gcc stress test. I haven't bothered to run it on this one, but I'm pretty sure they didn't actually properly stress test the one they sent me.

It's my desktop, so I just refrain from running heavy loads for long periods of time (short compiles of SlackBuilds packages are fine), but no way do I trust this machine 100%. I've wondered if getting a cheap 1600 AF for $85 would solve the problem. I would gladly trade 2 cores/4 threads for 100% stability. Or maybe I should try playing with voltage levels..

This is in contrast to my Q9550, which ran for months without a fan (too much dust stalled the fan; it was a file server). Thermal throttling kept it running at 90-95C the whole time and I didn't even notice any performance degradation. Obviously, file server loads were not that high. I only noticed when I was playing around with lm_sensors, decided to try it on my file server box, and thought something was wrong because it kept reading 90 or 95C.

Mind you, bang-for-buck is CRAZY in favor of AMD right now. Plus, Ryzen 4 seems like it's going to put more hurt onto Intel (Ryzen 4 mobile has some eye-popping numbers!) - hopefully they have worked out all the bugs..
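
On the question of whether the right kernel options are enabled: a rough way to check is to look for the two options from the Slackware changelog in the running kernel's config. This sketch assumes the kernel exposes /proc/config.gz (not every build does), and the function names are made up for illustration. It only checks for "=y" and does not know about options that later kernels renamed or removed.

```python
import gzip
from pathlib import Path

# Options named in the Slackware changelog quoted earlier in the thread.
WANTED = ("CONFIG_RCU_EXPERT", "CONFIG_RCU_NOCB_CPU")


def missing_options(config_text: str, wanted=WANTED) -> list[str]:
    """Return the options from `wanted` that are not set to 'y'."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in wanted if opt not in enabled]


def read_running_config() -> str:
    """Read /proc/config.gz, which only exists if the kernel exposes it."""
    path = Path("/proc/config.gz")
    if not path.exists():
        raise FileNotFoundError("/proc/config.gz not available on this kernel")
    with gzip.open(path, "rt") as fh:
        return fh.read()


if __name__ == "__main__":
    try:
        print("missing:", missing_options(read_running_config()) or "none")
    except FileNotFoundError as exc:
        print(exc)
```

If /proc/config.gz is not there, the same missing_options() function can be fed the text of whatever config file the kernel was built from.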

Timothy Miller · 03-31-2020, 01:31 PM

Quote:

Originally Posted by slackerDude

My Ryzen 1700 (BIOS recent as of 11/2019 or so, IIRC) has been much better recently (MSI B350 Gaming Plus board), but just the other day I tried to run Folding@Home with 16 threads. It ran fine overnight. I came back, turned on the screen, opened a new tab in the browser, read a few things, and poof. Blank screen. This is WITH rcu-nocbs=0-15. However, I don't know if I have all the right kernel options enabled. No messages of any kind in /var/log indicated that anything was wrong or had been detected.

I have an RMAed 1700 (i.e. not the original) because the original failed the gcc stress test. I haven't bothered to run it on this one, but I'm pretty sure they didn't actually properly stress test the one they sent me.

It's my desktop, so I just refrain from running heavy loads for long periods of time (short compiles of SlackBuilds packages are fine), but no way do I trust this machine 100%. I've wondered if getting a cheap 1600 AF for $85 would solve the problem. I would gladly trade 2 cores/4 threads for 100% stability. Or maybe I should try playing with voltage levels..

This is in contrast to my Q9550, which ran for months without a fan (too much dust stalled the fan; it was a file server). Thermal throttling kept it running at 90-95C the whole time and I didn't even notice any performance degradation. Obviously, file server loads were not that high. I only noticed when I was playing around with lm_sensors, decided to try it on my file server box, and thought something was wrong because it kept reading 90 or 95C.

Mind you, bang-for-buck is CRAZY in favor of AMD right now. Plus, Ryzen 4 seems like it's going to put more hurt onto Intel (Ryzen 4 mobile has some eye-popping numbers!) - hopefully they have worked out all the bugs..

The thing to remember about the 1600 AF compared to a 1700, though, is that it's Zen+ vs. Zen. So the 1600 AF has some IPC advantage too, which helps offset its disadvantage of having 2 fewer cores.

slackerDude · 03-31-2020, 01:34 PM

Digging into it a bit more - if I have to recompile my kernel anyway (the RCU adjustments were disabled), any reason to pick the "append" model vs just picking +RCU_NOCB_CPU_ALL y in the kernel config? Seems more absolute. Sure, may mess up a non-Ryzen CPU, but it's not like I'm going to re-use this specific kernel on anything else..

slackerDude · 03-31-2020, 02:31 PM

Rebuilt 4.12.2 with RCU_NOCB_CPU_ALL (and the append, which gets ignored) and so far, so good. Updated to latest BIOS for good measure.

Realized I lied about running at stock clocks: I had set the CPU to the 3.6 GHz stock clock with no boost. Temps seem fine, even under load. The board also doesn't do a good job detecting my 3200 MHz DDR4; it only detects it at 2133 MHz, so I had to override it to 2933, which is the best I can get.

Not sure it will make much of a difference, but worth a shot. Anxiously waiting for 15.0 for a full re-install..

bassmadrigal · 03-31-2020, 03:03 PM

Quote:

Originally Posted by slackerDude

Digging into it a bit more - if I have to recompile my kernel anyway (the RCU adjustments were disabled), any reason to pick the "append" model vs just picking +RCU_NOCB_CPU_ALL y in the kernel config? Seems more absolute. Sure, may mess up a non-Ryzen CPU, but it's not like I'm going to re-use this specific kernel on anything else..

I honestly don't know for sure. I just remember needing to use these options with the early kernel I was working with, I think 4.8 or 4.9. Since it's worked for me, I've continued to use the append line.

And in my dmesg for my 5.4.25 kernel, I do see the following:

Code:

[    0.000000] rcu:     Offload RCU callbacks from CPUs: 0-15.

I'm not sure if that means it's detecting my append line and applying it or if this output happens regardless because of the kernel config.
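
One way to narrow that down might be to compare the dmesg line with what is actually on the kernel command line. The sketch below is only illustrative: the function names are invented, and the idea that "option absent from /proc/cmdline but dmesg line still present" would point at the kernel config is my assumption, not something confirmed in this thread.

```python
import re
from typing import Optional


def offloaded_cpus(dmesg_text: str) -> Optional[str]:
    """Return the CPU list from the 'Offload RCU callbacks' line, if any."""
    match = re.search(r"Offload RCU callbacks from CPUs: ([0-9,\-]+)", dmesg_text)
    return match.group(1) if match else None


def cmdline_nocbs(cmdline: str) -> Optional[str]:
    """Return the rcu_nocbs value from a kernel command line, if present."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Accept both spellings seen in the wild: rcu-nocbs and rcu_nocbs.
        if sep and key.replace("-", "_") == "rcu_nocbs":
            return value
    return None


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            print("cmdline rcu_nocbs:", cmdline_nocbs(fh.read()))
    except OSError as exc:
        print("could not read /proc/cmdline:", exc)
```

Feeding the saved output of dmesg to offloaded_cpus() and comparing it with the value from cmdline_nocbs() would at least show whether the two agree.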

Timothy Miller · 03-31-2020, 03:12 PM

So, not an expert at this, but using 0-15 I would think SHOULD fail. That covers EVERY THREAD of the CPU, and you need to be able to do callbacks SOMEWHERE. 0-14/1-15 would be (I would think) the max you could disable callbacks on.

slackerDude · 03-31-2020, 03:30 PM

Quote:

Originally Posted by Timothy Miller

So, not an expert at this, but using 0-15 I would think SHOULD fail. That covers EVERY THREAD of the CPU, and you need to be able to do callbacks SOMEWHERE. 0-14/1-15 would be (I would think) the max you could disable callbacks on.

I don't know much either. But, if that's true, why is there an "RCU_NOCB_CPU_ALL" option? Does it make sense to specifically add a setting that will never work / should never be used?

slackerDude · 03-31-2020, 03:35 PM

Hmm. From here (not sure it's 100% applicable for non-virtualized kernel): https://docs.windriver.com/bundle/Wi...756922111.html

rcu-nocbs

Use this option to prevent RCU callback routines from being executed in the targeted CPUs. Valid parameters are a list, range, or combination of CPU identifiers.

RCU callbacks are functions that perform cleanup work after an RCU grace period passes. With some workloads, the number of callbacks can get quite large. For example, to request that CPUs 1, 2, 3, and 6 not be used to execute RCU callbacks, you use rcu-nocbs=1-3,6 on the host's boot line. You cannot assign every CPU in the system to the no-callback list; at least one processor, CPU 0, must remain in traditional mode or RCU grace period processing will not function properly.

The rcu-nocbs range will typically match the isolcpus parameters in order to further improve the isolation status of the targeted CPUs.

isolcpus

Use this option to specify one or more CPUs to isolate from the general SMP balancing and scheduling algorithms. You can move a process onto or off an isolated CPU with the CPU affinity system calls or the taskset and cpuset commands.
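
To make the "list, range, or combination" syntax from the quoted text concrete, here is a small sketch that expands a value like 1-3,6 and checks whether a given list covers every CPU. The function names are hypothetical, and it only handles the simple forms shown in the quote; the kernel's own parser accepts more elaborate forms that this ignores.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a CPU list such as '1-3,6' into [1, 2, 3, 6]."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"bad range: {part}")
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def covers_all(spec: str, logical_cpus: int) -> bool:
    """True if the list names every CPU from 0 to logical_cpus - 1."""
    return set(range(logical_cpus)) <= set(parse_cpu_list(spec))
```

With this, covers_all("0-15", 16) is true while covers_all("1-15", 16) is false, which is exactly the difference being debated here.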

Timothy Miller · 03-31-2020, 03:36 PM

Quote:

Originally Posted by slackerDude

I don't know much either. But, if that's true, why is there an "RCU_NOCB_CPU_ALL" option? Does it make sense to specifically add a setting that will never work / should never be used?

Good point. No clue...the only thing I have found is that apparently the RCU_NOCB_CPU_ALL option has been deprecated after 4.13, so you'd have to use the append in order to accomplish it if still needed in anything newer. Which answers exactly nothing.

andrew.46 · 03-31-2020, 03:52 PM

Quote:

Originally Posted by asheshambasta

I'm not sure that my issue is the same as the issues reported here but I'm still experiencing random lockups on the latest kernels and BIOS versions.

I see from your post on superuser that you are running a 2950X with some difficulties. I have also been running a 2950X for about 8 months so far with Slackware -current and, perhaps it is just luck, but I have had a rock-solid system. I see that we have different motherboards (I am running an MSI MEG X399 Creation board); we are running the same RAM, but different video cards and different brands of SSD.

When you say 'latest kernels' which kernel are you running now? I am on 5.5.10 at the moment but I usually track the latest 'stable' kernels...

slackerDude · 03-31-2020, 03:56 PM

Dang, lost my reply.

Based on this: https://forums.gentoo.org/viewtopic-...8-start-0.html
and this:
https://community.amd.com/thread/225795

There is more info. RCU_NOCB_CPU_ALL being deprecated apparently TRIGGERED lots of Ryzen issues. I think it was brought back because of it. (I'm on 4.12.2 for now)

Also, there is an "idle=nomwait" append option, as well as "pci=msi" for some.

AND, there is a "zenstates" Python script on GitHub to disable sleep state C6 (and maybe also some BIOS options in later revisions) if you get a lot of idle-related hangs. Fun reading :-)

I don't think that Intel is entirely free of these issues, just that their issues were found and fixed quicker, and that they don't make wholesale architecture changes as big as the one from Bulldozer to Ryzen.

bassmadrigal · 03-31-2020, 04:27 PM

Quote:

Originally Posted by Timothy Miller

So, not an expert at this, but using 0-15 I would think SHOULD fail. That covers EVERY THREAD of the CPU, and you need to be able to do callbacks SOMEWHERE. 0-14/1-15 would be (I would think) the max you could disable callbacks on.

I read that too earlier today, but I've been using this on my appends since late 2017 (I was thinking it was 2018, but I found a post from me stating I did it Nov or Dec of 2017).

Here's the kernel bug report where I first found that fix documented.

szo · 04-01-2020, 12:00 AM

Ran into this bug as well a long time ago. What fixed it for me was changing the BIOS option from "low current idle" to "typical current idle". However, this may be board-specific. It worked on my Asus motherboard.

I do have the previously suggested RCU options modified as well in my kernel build, except for the "CONFIG_RCU_NOCB_CPU" option, which was no longer necessary after the BIOS fix mentioned above.