[SOLVED] Slackware-current: kernel panic with 5.4.1 on x86
I have two older x86 machines (Acer Aspire One & Asus EEEPC 1201HA) with Intel Atom CPUs. On both of them the new 5.4.1_smp kernel will panic/oops during early boot.
The 4.19.* kernels ran just fine through every kernel upgrade.
Also, I have *no* such problem with the 5.4.1_smp kernel on x86_64.
In order to investigate I ran the new (from 2019-November-30) x86 `usbboot.img' with `qemu-system-i386'. When I did *not* specify the CPU (using default) the image booted. However, when setting the cpu to n270 (similar/same to what I have) I got `oops 9', just like on my real HW.
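For anyone who wants to repeat the QEMU test, here is a rough sketch of the two runs. The memory size and the raw-drive syntax are my assumptions, not the exact command line used above. By default the script only prints the commands (DRY_RUN=1) so they can be checked before starting QEMU.

```shell
#!/bin/sh
# Boot the Slackware x86 usbboot.img under QEMU, with or without an explicit CPU model.
# Assumptions (not from the post above): 1024 MB of RAM, image attached as a raw drive.
IMG=${IMG:-usbboot.img}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the command, 0 = really start QEMU

boot_img() {
    # $1 = QEMU CPU model; leave empty to use QEMU's default CPU
    if [ -n "$1" ]; then
        cmd="qemu-system-i386 -cpu $1 -m 1024 -drive file=$IMG,format=raw"
    else
        cmd="qemu-system-i386 -m 1024 -drive file=$IMG,format=raw"
    fi
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd    # unquoted on purpose so the string splits into arguments (no spaces in IMG)
    fi
}

boot_img ""      # default CPU: the run where the image booted
boot_img n270    # Atom N270 model: the run that ended in the oops
```

Run it with DRY_RUN=0 to actually start the two virtual machines one after the other.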
I can confirm this on my Acer Aspire One - Atom N270 running Slackware 14.2 (stable) 32bit. Just tried the kernel-huge-smp-5.4.1_smp-i686-1.txz & kernel-modules-smp-5.4.1_smp-i686-1.txz packages from -current and got the same "BUG: unable to handle page fault for address: XXXXXXXX" error, and the subsequent kernel panic.
There are many reports on the net about the error above and apparently it started with 5.2.x. The error reports are usually linked with a specific module, like in this case: https://lkml.org/lkml/2019/4/25/1138
On this system I have my own 5.3.1 compilation running without any issues, but it's tailored for the system, efi & pcmcia & co disabled in the config.
Regarding my previous observation that "apparently it started with 5.2.x": I realized that I was searching for "unable to handle page fault for address" and "reserved bit violation", both of which were introduced in recent kernels: https://lore.kernel.org/patchwork/patch/1022776/ https://lore.kernel.org/patchwork/patch/1064269/
In other words, the issue doesn't seem to be connected with 5.2.x at all, only with the recent rewording of the kernel error messages.
Don't know why memremap is failing at "unable to handle page fault for address: XXXXXXX", and why that address is protected (reserved).
Since this dumb Atom N270 has only one core but shows up as having two because of Hyper-Threading (which cannot be disabled in the BIOS), I suspected a race condition and tried booting the 5.4.1 kernel with the boot parameters: maxcpus=1 nosmt
Didn't help...
I'm currently busy recompiling 5.4.1 myself. I have already rebuilt it three times, each time disabling options that are not enabled in the kernel defconfig and that I suspected of causing the crash in the Slackware-provided kernel.
First I focused on the last two drivers that were loaded before the crash, respectively zswap and btrfs, disabled them and it didn't help.
Then I went on playing with the MTRR options, disabling the MTRR cleanup support -> CONFIG_MTRR_SANITIZER, didn't help either. https://wiki.gentoo.org/wiki/MTRR_and_PAT https://www.kernel.org/doc/Documentation/x86/mtrr.txt
I looked over your crash report again (it is the same as mine, except that I didn't capture or save mine) and noticed that efi_rci2 is listed in the "Call Trace" section. It turns out that it is enabled (CONFIG_EFI_RCI2_TABLE=y) in the Slackware-provided kernel. In my first successful test, detailed in the kernel thread, I disabled EFI, which automatically disabled efi_rci2 as well. https://lore.kernel.org/patchwork/patch/861224/
I'm now re-compiling the kernel with config-huge-smp-5.4.1-smp and the option "# CONFIG_EFI_RCI2_TABLE is not set". It will take a while on this lazy Atom N270 and I'll report once done and tested (booted).
The EFI Runtime Configuration Interface Table Version 2 Support, to be found in the kernel config:
Code:
.config - Linux/x86 5.4.1 Kernel Configuration
> Firmware Drivers > EFI (Extensible Firmware Interface) Support
[ ] EFI Runtime Configuration Interface Table Version 2 Support
With the config doc:
Code:
Displays the content of the Runtime Configuration Interface
Table version 2 on Dell EMC PowerEdge systems as a binary
attribute 'rci2' under /sys/firmware/efi/tables directory.
RCI2 table contains BIOS HII in XML format and is used to populate
BIOS setup page in Dell EMC OpenManage Server Administrator tool.
The BIOS setup page contains BIOS tokens which can be configured.
Say Y here for Dell EMC PowerEdge systems.
Symbol: EFI_RCI2_TABLE [=n]
Type : bool
Prompt: EFI Runtime Configuration Interface Table Version 2 Support
Location:
-> Firmware Drivers
-> EFI (Extensible Firmware Interface) Support
Defined at drivers/firmware/efi/Kconfig:183
Depends on: EFI [=y] && (X86 [=y] || COMPILE_TEST [=n])
When building kernels, you can search the kernel options from the ncurses config window; vendor-specific options can go a long way. I gave an N270 Asus Eee netbook to my son, and there is a kernel option for Eee netbooks that, once enabled, allowed it to boot. It is far easier to find such an option by searching for "eee" than by perusing the thousands of kernel options. The initial 14.2 kernel boots the Eee, but if I apply the patches and upgrade to a fully patched 14.2 kernel, the Eee no longer boots. Somewhere along the way Pat must have disabled the Eee option, and perhaps he disabled similar options for your Acer. Although irrelevant here, while upgrading last night I noticed that 14.2 is at kernel 4.4.2 and current is at kernel 5.4.2, so Pat must like numerical symmetries. I built a 4.20 kernel for the Eee and another for the IdeaPad, and have the kernels blacklisted in /etc/slackpkg/blacklist, so I can upgrade and patch without overwriting these custom 4.20 kernels. So when building kernels it never hurts to search the kernel options for the vendor name and enable anything relevant; likewise for CPU and GPU.
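Outside the ncurses window (where "/" opens the search), the same vendor search can be run against a saved config file. A small sketch follows; the keyword and the file name in the usage comment are only examples, and option names such as CONFIG_EEEPC_LAPTOP or CONFIG_ACER_WMI are mentioned as typical hits, not as the options that matter for the machines in this thread.

```shell
#!/bin/sh
# List kernel options whose name contains a vendor/model keyword,
# whether they are set (=y/=m) or explicitly unset ("is not set").
# Usage: find_opts <keyword> <config-file>
find_opts() {
    grep -i -E "^(# )?CONFIG_[A-Z0-9_]*$1[A-Z0-9_]*(=| is not set)" "$2"
}

# Example (path is an assumption about where the Slackware config lives):
#   find_opts eee  /boot/config-huge-smp-5.4.1-smp
#   find_opts acer /boot/config-huge-smp-5.4.1-smp
```

A line starting with "# CONFIG_" in the output means the option exists for this kernel but is switched off, which is the case worth looking at when a machine stops booting after an upgrade.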
After a long, long, LONG! compilation time, natively on the Atom N270 - Slackware 14.2 32 bit, 2-3 hours for the kernel and ~6-7 hours for the modules, I got the 5.4.1 smp 32 bit kernel ready and it works well. Again, the only .config change I made to the Slackware provided config-huge-smp-5.4.1-smp was the option "# CONFIG_EFI_RCI2_TABLE is not set".
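The single config change described above can be scripted. This is only a sketch with a hypothetical helper name (disable_opt); inside a kernel source tree, scripts/config --disable EFI_RCI2_TABLE does the same job. The /boot path in the usage comment is my assumption about where the Slackware config is installed.

```shell
#!/bin/sh
# Turn "CONFIG_<NAME>=y" or "=m" into "# CONFIG_<NAME> is not set" in a kernel .config.
# Usage: disable_opt <NAME> <config-file>
disable_opt() {
    sed "s/^CONFIG_$1=[ym]\$/# CONFIG_$1 is not set/" "$2" > "$2.new" &&
        mv "$2.new" "$2"
}

# Typical use inside the unpacked 5.4.1 source tree:
#   cp /boot/config-huge-smp-5.4.1-smp .config
#   disable_opt EFI_RCI2_TABLE .config
#   make olddefconfig    # let Kconfig re-resolve anything that depended on the option
```

Only the exact option name is touched, so CONFIG_EFI itself stays enabled.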
Conclusion: the EFI Runtime Configuration Interface Table Version 2 Support for Dell EMC PowerEdge (kernel config option CONFIG_EFI_RCI2_TABLE=y) was the culprit. I'll write a post in the "Requests for -current (14.2-->15.0)" thread asking for it to be disabled.
Here is the 5.4.1 successful boot dmesg on the Acer Aspire One: https://pastebin.com/yta2XLfi
I was also testing the graphics under X (i915) and couldn't make it crash/hang. Played extensively with Firefox and GIMP.
At least for the Acer Aspire One all the system specific modules are built in the Slackware 5.4.1 kernel.
In the meantime I also successfully compiled 5.4.2-smp and it boots OK. I took the `config-generic-smp-5.4.2-smp' from current and applied changes along the lines of those you described in the kernel thread (e.g. HIGHMEM, MTRR...).
I definitely left `CONFIG_EFI_RCI2_TABLE=y', as you can see from the attached config of the currently running kernel (cat /proc/config.gz). (I had to name it `.log', otherwise LQ would not let me upload it.)
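To double-check a single option in the running kernel without attaching the whole file, something like the following can be used. It is a sketch with a hypothetical helper name (opt_state) and assumes /proc/config.gz is available, as it evidently is on the kernel described above.

```shell
#!/bin/sh
# Report how one option is set in a gzip-compressed kernel config.
# Usage: opt_state <NAME> [config-file]    (defaults to /proc/config.gz)
opt_state() {
    gzip -cd "${2:-/proc/config.gz}" | grep -E "^(# )?CONFIG_$1(=| is not set)"
}

# Example on the running kernel:
#   opt_state EFI_RCI2_TABLE
```

The pattern is anchored on the full option name, so asking for EFI does not also match EFI_RCI2_TABLE.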
Please note that I took the *generic* config rather than the *huge* one you used.
Since I made several changes, I would still like to build a kernel whose config differs as little as possible from the stock one, or even try your suggestion with CONFIG_EFI_RCI2_TABLE.
However, given the compile times on the real HW, I would probably set up a VM first...
Last edited by Chalapticus; 12-07-2019 at 04:07 AM.
Reason: typo O -> OK
Interesting!
Given the large number of modules that config-huge-smp-5.4.1-smp builds, I initially rebuilt and tested only the bzImage (vmlinuz, the actual kernel) with zswap, btrfs and CONFIG_MTRR_SANITIZER disabled, one image per disabled option. I adopted this "cheating approach" because the original crash occurred before the root partition was mounted and the modules were accessed, so there was no point in building the modules in the first place.
Since I didn't save the dmesg but only watched the laptop screen, I might have missed the real cause of the crash with CONFIG_MTRR_SANITIZER disabled: the kernel could have booted OK and then crashed because of the modules (all of them were missing).
Both CONFIG_MTRR_SANITIZER and CONFIG_EFI_RCI2_TABLE come disabled by default in 5.4.1 (make defconfig), and I believe they should stay like that, at least CONFIG_EFI_RCI2_TABLE should definitely only be enabled on appropriate HW (Dell EMC PowerEdge systems).
MTRR itself, on the other hand, is enabled by default:
Code:
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
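The comparison against defconfig can be made mechanical. A small sketch, assuming both configs are plain-text files (the file names in the usage comment are placeholders, and cfg_diff is a hypothetical helper); the kernel tree also ships scripts/diffconfig for the same purpose.

```shell
#!/bin/sh
# Show only the options that differ between two kernel configs.
# Lines unique to the first file are printed as-is, lines unique to the
# second file are printed with a leading tab (standard comm output).
# Usage: cfg_diff <config-a> <config-b>
cfg_diff() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    grep -E '^(# )?CONFIG_' "$1" | LC_ALL=C sort > "$tmp_a"
    grep -E '^(# )?CONFIG_' "$2" | LC_ALL=C sort > "$tmp_b"
    LC_ALL=C comm -3 "$tmp_a" "$tmp_b"    # -3 hides the lines common to both
    rm -f "$tmp_a" "$tmp_b"
}

# Example: Slackware's config versus a freshly generated defconfig
#   cfg_diff config-huge-smp-5.4.1-smp defconfig.out
```

On a full Slackware config the output will be long, so piping it through grep for EFI or MTRR is a reasonable first filter.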
Can you please provide the dmesg log on your latest try with the config-5.4.2-on-n270.gz.log?
You can use https://pastebin.com/ for it.
I can confirm that the 5.4.x kernels fail on my two x86 (32-bit) laptops. One is an Asus EeePC 900 from 2009, the other a Gateway MX6214 from 2006.
I tried both 5.4.0 when it was in -testing, and 5.4.1 from -current. 5.4.x-huge panicked on both machines; no root filesystem mount was attempted.
NOTE: for me this is not an installation issue. I run -current on all four of my x86 machines, two 64-bit and two 32-bit. The 5.4.x kernel issue showed up in the -current update; see the ChangeLog.txt for the timeline.
Thanks Chalapticus!
I was inspecting the log you attached and couldn't find any records about "resource sanity check".
Based on the log you provided in the first post and on my own investigation and tests, my understanding of the issue was, well, partially wrong:
- the Interface Table Version 2 Support for the Dell EMC PowerEdge (kernel config option CONFIG_EFI_RCI2_TABLE=y) is reserving some memory (not properly done?) and then the MTRR_SANITIZER is "sneezing" while trying to put some order in the memory management.
After studying the kernel code a little, it turns out that the memory check is not done by MTRR_SANITIZER. It is triggered (I still don't know why) in kernel/resource.c and apparently called by memremap (mm/memremap.c):
- dmesg log snippet:
Code:
resource sanity check: requesting [mem 0xffffffff-0x10000001c], which spans more than Reserved [mem 0xfffc0000-0xffffffff]
caller memremap+0x10b/0x1c0 mapping multiple BARs
BUG: unable to handle page fault for address: f7d97005
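Working through the addresses in that message makes the complaint easier to read. The sketch below only does arithmetic on the numbers copied from the log; whether 0xffffffff is a genuine table address or some kind of "invalid" marker is not something the log itself tells us.

```shell
#!/bin/sh
# Addresses copied from the "resource sanity check" line above.
req_start=$((0xffffffff)); req_end=$((0x10000001c))
res_end=$((0xffffffff))    # last byte of Reserved [mem 0xfffc0000-0xffffffff]

req_len=$((req_end - req_start + 1))    # size of the requested mapping
inside=$((res_end - req_start + 1))     # bytes that fall inside the reserved region
past=$((req_end - res_end))             # bytes that lie beyond it, above 4 GiB

echo "requested: $req_len bytes, inside reserved: $inside, past 4 GiB: $past"
# prints "requested: 30 bytes, inside reserved: 1, past 4 GiB: 29"
```

So the request starts on the very last byte of the reserved region and runs 29 bytes past the 4 GiB boundary, which is what "spans more than Reserved" is reporting.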
enable_mtrr_cleanup [X86]
The kernel tries to adjust MTRR layout from continuous
to discrete, to make X server driver able to add WB
entry later. This parameter enables that.
- the rest is left unmodified (incl. CONFIG_EFI_RCI2_TABLE=y)
- will pay more attention to the boot messages, and check whether it loads OK or fails, and why it fails (unavailable modules?)
Setup is 17692 bytes (padded to 17920 bytes).
System is 8854 kB
CRC 1939fd0d
Kernel: arch/x86/boot/bzImage is ready (#1)
Booted it on the HW (Acer Aspire One) and it crashed exactly like before - screen capture: https://www120.zippyshare.com/v/oIWdTD0H/file.html
(Imgur wasn't working)
I don't have a serial console on this little netbook, and I'm not sure I can use netconsole to send and save the kernel boot log over the network. CONFIG_NETCONSOLE is built as a module; I can change that and build it into the kernel, but I'm not sure the networking stack is (completely) up before the crash, and I don't really have the time to set up other "kernel crash dump" methods... https://www.kernel.org/doc/Documenta...netconsole.txt
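If netconsole does get built in, the boot parameter would look roughly like the output of this sketch. The helper name, IP addresses and interface name are placeholders, and the open question from above remains: the log only gets out if the network driver is built in and up before the crash happens.

```shell
#!/bin/sh
# Build a netconsole= boot parameter in the form documented for the kernel:
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# Source port and target MAC are left empty so the kernel defaults apply.
# Usage: netconsole_param <src-ip> <dev> <tgt-ip> [tgt-port]
netconsole_param() {
    printf 'netconsole=@%s/%s,%s@%s/\n' "$1" "$2" "${4:-6666}" "$3"
}

# Placeholder addresses: the netbook is 192.168.1.20, the log receiver 192.168.1.10.
netconsole_param 192.168.1.20 eth0 192.168.1.10

# On the receiving machine, a UDP listener such as netcat can capture the output,
# e.g. "nc -u -l -p 6666" or "nc -u -l 6666" depending on the netcat variant.
```

The printed string would be appended to the kernel command line of the test boot entry.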
@Chalapticus
It looks like I was right with the report in post #6, didn't miss the cause, it's still "CONFIG_EFI_RCI2_TABLE=y" (efi_rci2)