LinuxQuestions.org
Old 12-31-2021, 01:58 AM   #1
adrian_stephens
Member
 
Registered: Apr 2005
Location: Cambridge, UK
Distribution: KDE Neon, Proxmox
Posts: 37
Blog Entries: 2

Rep: Reputation: 1
Ubuntu 21.10 Western Digital & Samsung nvme unreliability on H310 chipset


Hardware: Intel “Coffee Lake” (8th Gen) i7-8700 @ 3.2 GHz
Asustek Prime H310M-A R2.0 motherboard
Corsair 32 GB DDR4 2166 MHz RAM
WD 1TB 750 (firmware 111130WD) & 850 (firmware 614900WD) nvme disks
Samsung 1TB SSD 980 nvme disk (Firmware 1B4QFX07)
Samsung Evo 1TB SATA disk

Software: Kubuntu (Ubuntu 21.10 & KDE/Plasma desktop), Linux 5.13.0
ZFS filesystem for root

I have been chasing down unreliability of my Kubuntu install for almost a month. My nvme drives are all unreliable (pci io errors reported), but an attached SATA SSD is reliable. At least one of the nvme drives is solidly reliable running Windows 10.

Things I tried, without resolving the issue:
1. Setting nvme_core.default_ps_max_latency_us to 0, 6000, 12000 in the kernel command line
2. Setting pcie_aspm=off in the kernel command line
3. Updating BIOS to latest version
4. Updating WD disks to latest firmware version
5. Running ext4 rather than zfs on root.
6. Zorin 16 (Ubuntu 20.04-based) and Debian 11 (Bullseye) distros
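For anyone wanting to try the same kernel command-line settings from items 1 and 2, here is a minimal Python sketch of merging them into an existing parameter string without leaving duplicates behind. The helper name `merge_kernel_params` and the starting value "quiet splash" are assumptions for illustration only. On Ubuntu-family systems the resulting line would normally go into /etc/default/grub, followed by `sudo update-grub` and a reboot.

```python
def merge_kernel_params(existing: str, wanted: dict) -> str:
    """Return `existing` with each wanted parameter present exactly once.

    Any earlier setting of the same parameter is dropped, so stale values
    (for example an older latency setting) cannot linger on the line.
    Simplification: values containing spaces or quotes are not handled.
    """
    kept = [tok for tok in existing.split()
            if tok.partition("=")[0] not in wanted]
    added = [f"{name}={value}" for name, value in wanted.items()]
    return " ".join(kept + added)


# Values taken from the list above; "quiet splash" is an assumed starting point.
wanted = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}
line = merge_kernel_params("quiet splash", wanted)
print(f'GRUB_CMDLINE_LINUX_DEFAULT="{line}"')
```

Swapping "0" for 6000 or 12000 reproduces the other latency values tried in item 1.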

I couldn’t find similar reports online that are specific to the H310 chipset, but I suspect the chipset is somehow the cause, or perhaps my motherboard is faulty. I have a replacement motherboard with a better chipset on its way to me, courtesy of Aliexpress. I’ll post an update when it arrives.

I wanted to keep this post short, but I also wanted to give more detail about what I did and what I observed. You can read some more detail in my related blog post https://www.linuxquestions.org/quest...chipset-38727/.

Greetings from unreliable Cambridge, UK
Adrian
 
Old 12-31-2021, 06:09 PM   #2
mrmazda
LQ Guru
 
Registered: Aug 2016
Location: SE USA
Distribution: openSUSE 24/7; Debian, Knoppix, Mageia, Fedora, OS/2, others
Posts: 6,425
Blog Entries: 1

Rep: Reputation: 2219
I wouldn't hastily blame the H310 before ruling out Asus' BIOS. I have its B560M-A and an 11th gen i5 that has never booted past the loading of the i915 graphics driver if I have more than one connected display. If I want to use more than one display, I must either limit performance by disabling KMS, or boot first, then attach the other display cables. Simply powering the displays down before attempting to boot does not help. I reported a bug on gitlab.freedesktop.org which generated a comment that blames the Asus BIOS. Meanwhile I've been going back & forth with Asus' least competent support staff, who keep asking for information I gave them originally and telling me to try things I already tried. I'm betting the BIOS needs fixing.
 
Old 01-01-2022, 01:19 AM   #3
adrian_stephens
Member
 
Registered: Apr 2005
Location: Cambridge, UK
Distribution: KDE Neon, Proxmox
Posts: 37

Original Poster
Blog Entries: 2

Rep: Reputation: 1
Quote:
Originally Posted by mrmazda
I wouldn't hastily blame the H310 before ruling out Asus' BIOS. I have its B560M-A and an 11th gen i5 that has never booted past the loading of the i915 graphics driver if I have more than one connected display. If I want to use more than one display, I must either limit performance by disabling KMS, or boot first, then attach the other display cables. Simply powering the displays down before attempting to boot does not help. I reported a bug on gitlab.freedesktop.org which generated a comment that blames the Asus BIOS. Meanwhile I've been going back & forth with Asus' least competent support staff, who keep asking for information I gave them originally and telling me to try things I already tried. I'm betting the BIOS needs fixing.
Thank you, Mr Mazda, for your reply. Yes, it could be a problem in the BIOS, but I can't imagine what kind of BIOS problem would let Windows 10 run fine and single out Linux. The board would have been developed to run Windows 10, so you'd expect that to work reliably. Perhaps Linux is using some hardware feature that Windows 10 is not.

I have no real way of localizing the issue except by swapping out components.
Regards,
Adrian
 
Old 01-02-2022, 01:27 AM   #4
mrmazda
LQ Guru
 
Registered: Aug 2016
Location: SE USA
Distribution: openSUSE 24/7; Debian, Knoppix, Mageia, Fedora, OS/2, others
Posts: 6,425
Blog Entries: 1

Rep: Reputation: 2219
Manufacturers work with M$ to have the required drivers in place when a product is released. A choice may be made to "solve" a problem the BIOS causes with a driver change, to kludge instead of providing the best fix.
 
Old 01-02-2022, 03:11 AM   #5
adrian_stephens
Member
 
Registered: Apr 2005
Location: Cambridge, UK
Distribution: KDE Neon, Proxmox
Posts: 37

Original Poster
Blog Entries: 2

Rep: Reputation: 1
At the moment a combination of nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off on the command line appears to have resolved this issue. I don't know why it has fixed it now, as I'm pretty sure I tried this combination some time ago without success.
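Since it is unclear whether this exact combination was really in effect on the earlier attempt, one way to remove the doubt is to check what the running kernel was actually booted with. Below is a minimal Python sketch that reads /proc/cmdline and reports which of the two settings are missing or set to a different value. The names `WANTED`, `parse_cmdline` and `missing_workarounds` are hypothetical, and the parser is a simplification that ignores quoted values containing spaces.

```python
from pathlib import Path

# The two settings named in the post above.
WANTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_workarounds(cmdline: str) -> list:
    """Return the names in WANTED that are absent or have a different value."""
    params = parse_cmdline(cmdline)
    return [name for name, value in WANTED.items() if params.get(name) != value]


if __name__ == "__main__":
    proc = Path("/proc/cmdline")  # command line of the currently running kernel
    if proc.exists():
        missing = missing_workarounds(proc.read_text())
        if missing:
            print("Not in effect:", ", ".join(missing))
        else:
            print("Both workaround parameters are active.")
```

Running this after each reboot gives a quick confirmation that an edit to the boot configuration actually reached the kernel.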

I did place an order for a motherboard, with a better chipset on it, so hopefully I can avoid the workaround when it arrives. I'll post an update with my findings.

And, Mr Mazda, I certainly agree that hardware/bios bugs might be resolved in the driver. I used to write firmware for the 802.11 chips our company designed. You would not believe some of the ugly kludges/fixes/workarounds we had to use to avoid re-spinning the hardware.
 
Old 01-02-2022, 01:37 PM   #6
obobskivich
Member
 
Registered: Jun 2020
Posts: 610

Rep: Reputation: Disabled
Just a thought/comment: you are aware that Samsung SSDs have known, documented, and pervasive flaws in their proprietary controllers that Samsung has consistently failed (refused?)* to fix, which leads to incompatibilities/instabilities in both Linux and macOS, right? Specifically, TRIM functionality is more or less broken, and the presence of the Samsung controllers can also lead to IO stalls. This has been reported as far back as the -40 series Samsung drives, and is/was still an issue with the -70 series drives, and I'd imagine that hasn't improved with the -80 series. These issues do not seem to be reported (or reported as significantly) in Windows, and my own (admittedly 'quick and dirty') testing supports that: the two Samsung -70 series drives I have will work just fine in Windows 7 x64, but will both cause system-breaking instability in various Linux distros. (I haven't felt like testing this on the Mac, but the OpenCore documentation points to similar issues with Samsung drives as have been documented for Linux, and TRIM is blacklisted by macOS for Samsung drives.)

If you remove the Samsung drives, does everything tidy up? If so, I'd blame them as the known problem child and replace them with something that isn't a known problem child - more 'generic' devices (that often use Phison or Silicon Motion (SMI) controllers) tend to have no problems.

* Why 'refused'? Early on, with the initial reports on the 840, it seemed that Samsung was open to the issue being a firmware problem, but as the issues were reported with later generations they appear to have changed their tactic to just declaring Linux broken.

EDIT
Here's an example article I found from a quick search: https://www.neowin.net/news/linux-pa...d-amd-systems/

I know there are some Bugzilla threads about this from over the years, and I've both experienced this and seen it documented/discussed on the NVMe (900 series) models as well, but this was what I could find with a quick search. From first-hand experience, I spent the better part of 3 months chasing random lockups, hangs, file-system gremlins, and so forth across multiple distros, motherboards, CPUs, memory, etc. ('lots of hardware' was involved, suffice to say), before finally turning on the supposedly 'gold standard' Samsung SSDs. I tore those out, and everything went back to working... they work just dandy in a Windows box, though.

Last edited by obobskivich; 01-02-2022 at 01:43 PM.
 
Old 01-03-2022, 12:31 AM   #7
adrian_stephens
Member
 
Registered: Apr 2005
Location: Cambridge, UK
Distribution: KDE Neon, Proxmox
Posts: 37

Original Poster
Blog Entries: 2

Rep: Reputation: 1
Quote:
Originally Posted by obobskivich
If you remove the Samsung drives, does everything tidy up?
Thank you for your comment.

No. I saw the fault first on the WD Black nvmes. I tried two different types and two sizes. I thought to get a Samsung to compare to, as it was at the top of a random list of "nvme drives recommended for linux" I found. The WD Blacks were third on that list.

I do have a workaround, as I noted yesterday.
 
Old 01-03-2022, 02:05 PM   #8
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 17,428

Rep: Reputation: 2591
I felt a sense of déjà vu. A few decades back, I had a Via chipset with the infamous "Hardware fault."

In fact, it wasn't a hardware fault. There's a whole pile of settings the motherboard supplies to the BIOS for correct operation; Via had tweaked these so it could use the Creative Soundblaster piece of junk in m$ windoze. Hence the problem, which was fixable using a utility and instructions from Via to tweak them back. "This setting reads 60; adjust it to 40" sort of thing.

Too late for Via, of course. They bit the dust or got sold for small money. It could well be something similar in the drive, firmware, or anywhere. Why do you think I got out of servicing Electronics?

Last edited by business_kid; 01-03-2022 at 02:07 PM.
 
Old 01-03-2022, 02:18 PM   #9
Timothy Miller
Moderator
 
Registered: Feb 2003
Location: Arizona, USA
Distribution: Debian, EndeavourOS, OpenSUSE, KDE Neon
Posts: 4,028
Blog Entries: 27

Rep: Reputation: 1524
Quote:
Originally Posted by business_kid
I felt a sense of déjà vu. A few decades back, I had a Via chipset with the infamous "Hardware fault."

In fact, it wasn't a hardware fault. There's a whole pile of settings the motherboard supplies to the BIOS for correct operation; Via had tweaked these so it could use the Creative Soundblaster piece of junk in m$ windoze. Hence the problem, which was fixable using a utility and instructions from Via to tweak them back. "This setting reads 60; adjust it to 40" sort of thing.

Too late for Via, of course. They bit the dust or got sold for small money. It could well be something similar in the drive, firmware, or anywhere. Why do you think I got out of servicing Electronics?

Not related to the OP, but relevant to your post: Via actually is still around. Just a couple of months ago they completed a deal that sold off their Centaur Technology (CPU design) employees to Intel. They have also sold off all their production capability in the US (also to Intel if I recall, but I'm not sure), but are still actively developing and producing silicon in conjunction with Zhaoxin in China.
 
Old 01-04-2022, 06:21 AM   #10
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 17,428

Rep: Reputation: 2591
Quote:
Originally Posted by Timothy Miller
Not related to the OP, but relevant to your post, Via actually is still around.
Interesting. My last dealing with them was a USB issue generating massive log spam. The EHCI maintainer knew of it, but had nothing to go on. So I found some Via forum and went OTT in sounding off, looking for notice. In typical Chinese fashion, my post was instantly moved, and I was assigned to some programmer dweeb on a private forum. He wrote, and I tested, a patch to spit certain registers to the log. I had to patch and compile the bleeding-edge kernel, and debug his code. Finally I tested, extracted about a meg of log snippets where I had inserted/removed devices, and sent that off over dialup modem to the ehci_hcd maintainer. At last he had sight of his fault, and patched appropriately. We later found that Via's programmer dweeb and I had the same chipset. The 2 ports paying no heed to registers had been found by Via, who had disabled them without informing him!

I've since discovered that a lot of these buyouts are for the technical staff - designers/programmers and to get their guys up to speed on the tech they have bought. It must have been the chipset division they sold.
 
  

