Gigabyte Ethernet RTL8168 broken on new Kernel releases
The driver change broke the driver for existing hardware, as mentioned in the existing kernel bug report for this problem.
The change added a list of accepted PHY codes, which the previous 4.4 driver did not have.
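To make the "list of accepted PHY codes" idea concrete, here is a minimal sketch of how an ID table with masks typically rejects an unexpected value. The two IDs (0x001cc912 and 0xc1071002) come from this thread; the table layout, the mask value, and the function name are assumptions for illustration and are not copied from the kernel source.

```python
# Illustrative sketch only. The table and the mask are assumptions,
# not the kernel's actual Realtek PHY table.
KNOWN_PHYS = [
    # (name, phy_id, mask)
    ("RTL8211B", 0x001CC912, 0x001FFFFF),
]


def match_phy(phy_id, table=KNOWN_PHYS):
    """Return the name of the first table entry matching phy_id, or None."""
    for name, known_id, mask in table:
        # Compare only the bits selected by the mask.
        if (phy_id & mask) == (known_id & mask):
            return name
    return None


print(match_phy(0x001CC912))  # ID the PHY reports when read correctly
print(match_phy(0xC1071002))  # bad ID reported on the affected board
```

With a table like this, the bad ID finds no entry, so no dedicated PHY driver binds. A driver without such a table never makes that check.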
Once that driver is patched to recognize the PHY, the driver works again. I just used it to configure another piece of hardware over Ethernet. Now, I will not likely use it again for months.
I have no need to throw money or hardware at this problem. Once I debug the problem enough to "re-discover" the patch I have sitting in the /usr/src directory, it does not cost much extra time.
The pain is in the Slackware kernel upgrade breaking it again, and that it may be months before I discover that the Ethernet is broken, again.
I am getting too old to remember this among all the other things that constantly need fixing.
Most traffic goes through a WIFI internet board. The Ethernet is necessary for Lab work, modems, and anything that needs a hard connection for setup.
Just how do you go about escalating these things? I seem not to have the right connections, as I always have trouble getting an account set up just to access the bug report system.
I agree. The end user should never have been required to patch kernels themselves, or been told to replace equipment in critical systems that need to keep running and have been running for 14-15 years, on the network all that time. In the past, kernel devs have been receptive to preventing regressions and keeping critical systems running.
I would first go get the motherboard rev # and read the writing on the realtek chip that is on the motherboard edge behind the pcie slot next to the atx back panel. My chip is an rtl8111c. On my gigabyte board at the board's website, the official specs say it was equipped with a rtl8111c/d. Your board specs say it was equipped with a rtl8111d/e -- I guess that means yours is slightly newer, but at times, they could have had the exact same chips on different manufacturing runs.
I looked at the pinout for rtl8111c & rtl8111d, and it makes sense. They share a package, which means these chips should be drop-in replacements for each other. So they intended to switch when the old chip supply dried up, I would guess. Same for your board model.
However, I have no idea if internally these revisions are compatible. After all, their BIOS engineers seem to be out of sync and must have been dropping in the wrong BIOS code for a number of these boards, maybe having to do with the realtek revision swapping, or going from board design to board design with different chip manufacturers. Whatever the case, they seem to have seen the problem, but why in the world is it ok to only rely on the BIOS in this way??? I thought Linux was being designed to do things right, and it seems no one noticed the wrong phy id for a number of these boards, maybe because the chip/driver manufacturers never bothered really using it in their code? So this new method seems to be wrong to rely on as the only way. Fixes some, and breaks some.
So anyway, then I would go to the maintainer for realtek.ko, Johnson Leung, or the phylib maintainer, Andy Fleming, and ask them to add a kernel param to allow you to set a phy id. I would use email to contact them. If they keep ignoring this, I would then go to the subsystem maintainer (basically where they pipe all their patches through) and tell them there is a regression that you have to keep patching for, and that you would like a kernel param to address the situation, which you can then describe to him as well. Hopefully they don't just tell you to send in the patch yourself, but even that would at least tell you they are open to accepting a patch that adds a param.
I have had success many years ago getting patches in by sending them directly to subsystem maintainers and getting the attention of very important individuals, but this was many years ago, before git or them using bugzilla. It was a much tighter community back then, quite open. I don't know today... but it should work if they haven't gone to the dark side.
Oh, if you do get a patch, or if you want to update your patch, I would try to use the methods defined for my phy_id 0x001cc912. I can't imagine it's gonna be that much different rev to rev, and also, I can't figure out why it's called a "rtl8211b" in the kernel, but somehow my network chip just works... But you can use whatever it was being detected as previously.
Last edited by the3dfxdude; 03-20-2024 at 02:08 PM.
The OP's board has an RTL8168d NIC; the version number mainly refers to the MAC layer of the NIC.
The integrated PHY is a derivative of the RTL8211b PHY (identifying as 0x001cc912, same as yours), which was also available standalone.
It's not the case that the kernel expects the BIOS to initialize the PHY in a specific way, it just expects the BIOS
not to break detection. Also on the OP's board the PHY later identifies as 0x001cc912, BIOS just brings it to an invalid
state initially, resulting in the PHY reporting a more or less random PHY ID value that doesn't match the Realtek numbering
scheme. It seems that a certain later access to the PHY makes it recover from the initial invalid state. Hard to say
in detail because neither Realtek nor Gigabyte release errata information.
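For reference, the 32-bit PHY ID is the concatenation of two 16-bit MII registers (register 2 in the upper half, register 3 in the lower half), and the low bits of register 3 carry a model number and a revision. The sketch below splits the two IDs mentioned in this thread into those fields so they can be compared. The function name and dictionary keys are made up for illustration.

```python
def split_phy_id(phy_id):
    """Split a 32-bit PHY ID into its standard MII register fields."""
    reg2 = (phy_id >> 16) & 0xFFFF  # MII register 2 (upper ID half)
    reg3 = phy_id & 0xFFFF          # MII register 3 (lower ID half)
    return {
        "reg2": reg2,
        "reg3": reg3,
        "oui_bits_in_reg3": reg3 >> 10,  # top 6 bits of register 3
        "model": (reg3 >> 4) & 0x3F,     # 6-bit model number
        "revision": reg3 & 0xF,          # 4-bit revision
    }


for phy_id in (0x001CC912, 0xC1071002):
    fields = split_phy_id(phy_id)
    print(f"{phy_id:#010x}:", {k: hex(v) for k, v in fields.items()})
```

The good ID has 0x001c in register 2. The bad one has 0xc107 there, which shares nothing with it, consistent with the read having returned something other than the ID registers.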
On a side note: If the system is so critical to the OP, and he has also faced other issues due to kernel upgrades:
Why not stick to a specific LTS kernel version?
He is using an LTS version stream provided by Slackware.
Oh ok. So it's not even a BIOS bug. It's just that the kernel relies on some BIOSes initializing the hardware early to get the phy id, and some don't do it as expected. But then that would suggest the kernel could do the same, except the kernel devs don't know the expected proper sequence. I've seen stuff like this before, and of course it is hurt by lacking documentation. This also explains the "use the BIOS option to enable boot rom" advice, which forces the initialization of the hardware. If he can trigger the proper initialization later, then that means the kernel has done it, and someone just needs to track down which bits were sent and when, and then rework the driver to do it. Do you even get the proper phy_id when reading it yourself after initial boot?
You got me slightly wrong. Normally every PHY after power-on reset is in a state where you can properly read the PHY ID (w/o any BIOS initialization).
Maybe the NIC version we talk about here has some silicon bug that requires a fix or workaround in software (as part of BIOS code).
We don't know because, as I said, the involved companies don't publish errata information.
Ok? If it's a bug in silicon, that some BIOS patch over, then he needs a patch in the kernel to set which phy to use when realtek.ko is loaded. So whatever, either there is a magic sequence that can be triggered, or he is left with needing a patch in the kernel.
Would force-unloading the module and reloading it do the same thing?
I currently am running a patched kernel, and that recognizes the PHY, so it would not be a valid test for what you want (I think).
I will have to think how to get to a version of the kernel that does not work, probably by installing the Slackware huge kernel. But then I cannot edit that source code.
I would have to make a special version of the kernel just to test that.
I'm not 100% sure, for now I'd say that both tests are independent.
The module reload test can be done with the stock Slackware kernel that doesn't work out of the box.
Alternatively you could remove your patch from the code base you used to build your own kernel, and rebuild.
Then again you should have a kernel that doesn't work out of the box.
Next step would be to apply the patch proposed in #37 and rebuild the kernel. Maybe it works w/o reloading
the module (if reloading the module helps for you at all).
I have made linux-5.15.117t, which has the nons1 patch, and disabled the other GA-880 patch. Otherwise it is a copy of my existing linux-5.15.117 source, with same config.
My previous kernel was compiled with gcc 11.2.
This kernel was compiled with gcc 12.3.
Selected portions of dmesg:
There are 2 internet devices, one eth0, and other wlan0.
Code:
[ 4.893974] r8169 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
[ 4.898762] input: HDA ATI HDMI HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:05.1/sound/card1/input3
[ 4.907145] r8169 0000:04:00.0 eth0: RTL8168d/8111d, 1c:6f:65:4a:XX:XX, XID 283, IRQ 35
[ 4.910805] r8169 0000:04:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[ 4.967166] rtl8192ee: Using firmware rtlwifi/rtl8192eefw.bin
[ 4.968042] ieee80211 phy0: Selected rate control algorithm 'rtl_rc'
[ 4.968416] rtlwifi: rtlwifi: wireless switch is on
[ 31.988183] NET: Registered PF_INET6 protocol family
[ 31.989518] Segment Routing with IPv6
[ 31.989537] In-situ OAM (IOAM) with IPv6
[ 41.341589] RTL8211B Gigabit Ethernet r8169-0-400:00: attached PHY driver (mii_bus:phy_addr=r8169-0-400:00, irq=MAC)
[ 41.481363] r8169 0000:04:00.0 eth0: Link is Down
[ 671.713611] r8169 0000:04:00.0 eth0: Link is Up - 100Mbps/Full - flow control off
[ 671.713642] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 692.768985] fuse: init (API version 7.34)
[ 1951.049538] wlan0: authenticate with 04:95:e6:17:XX:XX
[ 1951.059891] wlan0: send auth to 04:95:e6:17:XX:XX (try 1/3)
[ 1951.097488] wlan0: authenticated
[ 1951.098546] wlan0: associate with 04:95:e6:17:XX:XX (try 1/3)
[ 1951.118179] wlan0: RX AssocResp from 04:95:e6:17:XX:XX (capab=0x411 status=0 aid=7)
[ 1951.123698] wlan0: associated
[ 1951.169621] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
I brought the eth0 network up and used it to login to a modem device config page.
This seems to have been a success.
I believe that this ought to be reported to dmesg as a quirk detected, so that the user knows, and then it can be documented that there is also a possible BIOS update.
That may not be practical if this is just going to be a "Reset the thing because SOME BIOS cannot be trusted to get it done right". In which case I believe a comment in code would be needed to protect it against those who discover this later and wonder why that is there, and may decide to take it out again because they do not know of any reason for it.
Last edited by selfprogrammed; 03-28-2024 at 07:55 AM.
business_kid asked about chip ids: I went over the board with a magnifier to get numbers before I installed it. I did not get much because of heatsinks, and few identifiable chips.
This is some of what I have found (from my hardware detect file).
Code:
*** Motherboard
Gigabyte Technology Co., Ltd.
Product Name: GA-880GA-UD3H (rev 2.1)
UUID: 31433646-3635-3441-3832-4336FFFFFFFF
*** CPU
Athlon II X4 640
Family K10
(in hex 0x10, Family 16 in lscpu)
LABEL:
ADX640WFK42GM
NADHC AD 1030CPGW
9H81653H00185
From Datasheets:
(OEM/tray) processor (not boxed)
3000 MHz, clock mult 15.
667 MHz memory controller
938 pin, socket AM3 (or AM24)
Wattage: 95W
core: PROPUS
arch: K10
stepping: BL-C3
64 bit, 4 cpu cores, 4 threads
cache 1:
instruction cache: 4 x 64 KB, 2 way assoc
data cache: 4 x 64 KB, 2 way assoc
cache 2: 4 x 512 KB, 16 way assoc
cache 3: NONE
Capabilities:
MMX, MMXext
3DNow, 3DNowext
SSE, SSE2, SSE3, SSE4a
AMD64, AMD-V
Cool'n'Quiet 3.0
Dual Dyn Power Mngmt, Cool Core
States: S0 S1 S3 S4 S5
Core: C1 C1E
This CPU is supported in SMBios since F3.
** DMI
Version: AMD Athlon(tm) II X4 640 Processor
ID: 53 0F 10 00 FF FB 8B 17
Signature: Family 16, Model 5, Stepping 3
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
HTT (Multi-threading)
Max Speed: 3000 MHz
External Clock: 200 MHz
*** North Bridge
AMD 880G
*** South Bridge
AMD SB850
*** LAN
10/100/1000 Mb, 1 port, on RealTek RTL811D
back panel: 1
Speed LED: off 1 MB/s
orange 10 MB/s
green 100 MB/s
Activity LED: blinking when active
driver: r8169
Note: driver wants to disable ASPM control
PHY ID: 0xc1071002
This is a bad PHY_ID reported by the Gigabyte BIOS.
maybe RTL8211B
bug reports indicate to use REALTEK_PHYLIB
huge kernel loads:
module: libphy
libphy 118784 3 r8169,realtek,mdio_devres, Live 0xf8463000
?? r8169 auto-selects Realtec RTL821x PHY
Realtek PHY support is broken in kernel 5.9.2, fixed in kernel 5.9.3
02:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03) (prog-if 30 [XHCI])
Subsystem: Gigabyte uPD720200 USB 3.0 Host Controller
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 03)
Subsystem: Gigabyte Motherboard
Kernel driver in use: r8169
Kernel modules: r8169
04:0e.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link) (prog-if 10 [OHCI])
Subsystem: Gigabyte Motherboard
Kernel driver in use: firewire_ohci
Kernel modules: firewire_ohci
*** Bios
Vendor: Award Software International, Inc.
Version: F4
Release Date: 07/28/2010
Characteristics:
ISA is supported
PCI is supported
PNP is supported
APM is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/360 kB floppy services are supported (int 13h)
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
CGA/mono video services are supported (int 10h)
ACPI is supported
USB legacy is supported
AGP is supported
LS-120 boot is supported
ATAPI Zip drive boot is supported
BIOS boot specification is supported
Targeted content distribution is supported
EFI(UEFI) NOT SUPPORTED
I grab everything I ever see on this board, so I already have copies of the newer BIOS versions. As everything seems to require 2 or more tries before it goes right, I do not dare try to install them as I do not have a way to back up and try a second time, and this machine is critical. I have already read the BIOS update information. It is better than some others but I can sense the danger.
There are safer ways to deal with this.
Thorough listing, but it's difficult to tell where your chips leave off and your comments begin. On the 811D realtek nic, these comments stood out to me.
Note: driver wants to disable ASPM control
bug reports indicate to use REALTEK_PHYLIB
PHY ID: 0xc1071002
bug reports indicate to use REALTEK_PHYLIB
1. Are you disabling ASPM?
2&4. I very much imagine that Realtek_phylib ≠ libphy. Libphy sounds generic, and part of the kernel. Realtek_phylib sounds like a pile of fixes cobbled together to cope with the inadequacies of Realtek phys. The realtek component of libphy starts with rtl 820x and the numbers go up. There's no mention of 81xx devices.
3. The phy ID gives you a search term for targeted online searches.
I would be thinking of buying a pcie or usb3 nic and blacklisting all realtek modules. If the equation "time=money" makes sense to you, that's the way to go. OTOH, if you have your teeth in this and want to see it through, I totally understand, and confess to propping up sh***y hardware myself in the past.
Lastly, I had read your comments to be about an RTL8110 nic, but in your last post you had it as an RTL 811D part. Which actually is it? If it's rtl811D, we could be loading the wrong module. You see, Realtek have more different components than imagination, so there's bound to be overlap. Half a dozen modules might work badly.
Last edited by business_kid; 03-28-2024 at 09:55 AM.
The hardware notes were added to, as more information was discovered. Due to the concerns, I have applied a magnifying glass to that part of the board more than once, and could not discover anything more than what is in the notes.
Note: The hardware was working fine before the kernel update. I did get the Ethernet working after the kernel update, and have not had any eth0 network failures other than the kernel changing the driver behavior regarding PHY id codes.
There is nothing wrong with the hardware. This problem was caused by changing the kernel driver in a way that made it dependent upon BIOS in a way that was not previously tested. Linux developers have known for a long time to not trust the BIOS on any details that Windows might not rely upon.
What I know is what the drivers and BIOS report.
As it is working, when patched, the disabling ASPM comment has been ignored. There are so many odd comments in the dmesg, that I don't have time to investigate every odd thing that does not appear to be broken.
I have not found a good explanation of what phylib is, or who supplies it. I do note that there is a libphy module and that it got used by huge kernel.
Note: /proc/modules shows that libphy is still being loaded. I do not know which PHY driver within it is actually being used.
I will have to see what PHY ID is reported now, supposing I find out how to access it.
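One possible way to read the PHY ID from a running system is through sysfs. The sketch below assumes the kernel exposes a `phydev/phy_id` attribute under `/sys/class/net/<iface>/`; whether that path exists depends on the kernel version and on a PHY being attached, so treat the path as an assumption to verify on your system. The function name is made up for illustration.

```python
from pathlib import Path


def read_sysfs_phy_id(iface="eth0", sysfs_root="/sys"):
    """Return the PHY ID for iface as an int, or None if it is not exposed.

    Assumes the attribute lives at <sysfs_root>/class/net/<iface>/phydev/phy_id
    and holds a hexadecimal string such as 0x001cc912.
    """
    path = Path(sysfs_root) / "class" / "net" / iface / "phydev" / "phy_id"
    try:
        text = path.read_text().strip()
    except OSError:
        # Missing interface, no attached PHY, or attribute not provided.
        return None
    return int(text, 16)


if __name__ == "__main__":
    phy_id = read_sysfs_phy_id("eth0")
    if phy_id is None:
        print("PHY ID not available through sysfs")
    else:
        print(f"PHY ID: {phy_id:#010x}")
```

Note that on a patched kernel this would show the ID after the driver has done its work, so it cannot reproduce the bad value seen at boot on an unpatched kernel.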
Any hardware and financial suggestions are irrelevant, for so many reasons. I prefer to treat every question on LinuxQuestions like that. There is value in restoring the driver beyond one user's considerations, because there are multiple users who are affected by the driver changes, and they all benefit from fixing that.
Last edited by selfprogrammed; 03-29-2024 at 05:03 AM.
Thanks for testing the patch. I removed irrelevant lines from the dmesg log snippet. The NIC-related information is ok now.
Because the proposed patch fixes the issue, I think the following is the root cause of the issue:
The PHY has many more registers than the 32 which can be directly addressed on the MDIO bus.
Many Realtek PHYs (including this one) solve this by grouping registers in banks, and a write to register 0x1f selects a specific bank.
Presumably the faulty BIOS programs something in the PHY and fails to reset the bank selector to its default of 0.
Therefore reading the PHY ID accesses registers in a different bank, returning a more or less random value.
The proposed patch resets the bank selector before reading the PHY ID.
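The banking explanation above can be illustrated with a small simulation. This is a toy model, not driver code: the class, the choice of bank 2, and the values stored in that bank are made up so that the wrong read reproduces the bad ID reported in this thread. Only the page-select register number (0x1f), the ID register numbers (2 and 3), and the two ID values are taken from the discussion or from the standard MII register layout.

```python
PAGE_SELECT_REG = 0x1F  # write selects the active register bank
MII_PHYSID1 = 2         # upper 16 bits of the PHY ID (in bank 0)
MII_PHYSID2 = 3         # lower 16 bits of the PHY ID (in bank 0)


class FakeBankedPhy:
    """Toy PHY whose registers 0..0x1e are banked by register 0x1f."""

    def __init__(self, page=0):
        self.page = page
        self.banks = {
            0: {MII_PHYSID1: 0x001C, MII_PHYSID2: 0xC912},
            # Hypothetical bank contents, chosen to mimic the bad ID.
            2: {MII_PHYSID1: 0xC107, MII_PHYSID2: 0x1002},
        }

    def read(self, reg):
        if reg == PAGE_SELECT_REG:
            return self.page
        return self.banks.get(self.page, {}).get(reg, 0xFFFF)

    def write(self, reg, value):
        if reg == PAGE_SELECT_REG:
            self.page = value
        else:
            self.banks.setdefault(self.page, {})[reg] = value


def read_phy_id(phy, reset_bank_first):
    """Read the 32-bit PHY ID, optionally selecting bank 0 first."""
    if reset_bank_first:
        phy.write(PAGE_SELECT_REG, 0)
    return (phy.read(MII_PHYSID1) << 16) | phy.read(MII_PHYSID2)


phy = FakeBankedPhy(page=2)  # as if firmware left a non-default bank selected
print(f"without reset: {read_phy_id(phy, False):#010x}")
print(f"with reset:    {read_phy_id(phy, True):#010x}")
```

In this model, the only difference between the bad and good ID is one write of 0 to register 0x1f before the ID registers are read.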
Regarding the NIC version numbers:
RTL8111D (I think one 1 was missing) is the version of the MAC + PHY combination. The integrated PHY is derived from standalone PHY RTL8211B.
Therefore dmesg shows different version numbers.
I have not found a good explanation of what phylib is, or who supplies it. I do note that there is a libphy module and that it got used by huge kernel.
I can illuminate that a little. I grokked the source a bit. There is a drivers/net/phy directory where the manufacturers all seem to have at least one file. Broadcom has many files, realtek just one. I looked at it. Its (wired) nics are grouped to narrow the number of different scenarios it caters for. The same, I suppose, goes for every manufacturer.
Leaving the digit out on the part number makes sense to me, btw. And it's easily done.