LinuxQuestions.org


Chalapticus 12-02-2019 04:14 AM

Slackware-current: kernel panic with 5.4.1 on x86
 
1 Attachment(s)
Hi Forum,

I have two older x86 machines (Acer Aspire One & Asus EEEPC 1201HA) with Intel Atom CPUs. On both of them the new 5.4.1_smp kernel will panic/oops during early boot.

The 4.19.* kernels ran just fine across all kernel upgrades.

Also, I have *no* such problem with the 5.4.1_smp kernel on x86_64.

In order to investigate I ran the new (from 2019-November-30) x86 `usbboot.img' with `qemu-system-i386'. When I did *not* specify the CPU (using the default) the image booted. However, when setting the CPU to n270 (similar or identical to what I have) I got `oops 9', just like on my real HW.

Running the following command

Code:

qemu-system-i386 -enable-kvm -cpu n270 -m 1G -hda usbboot.img -serial stdio | tee failed-x86-boot.log

and passing `console=ttyS0' to the kernel, I captured the boot output.

Please find the log attached.
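
For anyone reproducing this, a minimal sketch of how a log captured with the qemu command above could be scanned for the failure. `scan_boot_log' is a hypothetical helper name. The first two patterns are the messages quoted in this thread; "Kernel panic" and "Oops" are an assumption about what else such a log contains.

```shell
#!/bin/sh
# scan_boot_log FILE: print the panic-related lines of a serial boot log,
# prefixed with their line numbers. Exit status is 0 if at least one
# marker was found and 1 if none was found.
# NOTE: the pattern list is illustrative, not exhaustive.
scan_boot_log() {
    grep -n -E 'BUG: unable to handle page fault|resource sanity check|Kernel panic|Oops' "$1"
}

# Example (log file name taken from the qemu command above):
#   scan_boot_log failed-x86-boot.log
```

The exit status makes it easy to loop over several `-cpu' models and report only the ones that fail.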

For me the line `BUG: unable to handle page fault for address: f7d97005' seems peculiar, but I have no clue about kernel development.

Any insight would be appreciated.

Thank you in advance.

abga 12-03-2019 04:07 PM

I can confirm this on my Acer Aspire One - Atom N270 running Slackware 14.2 (stable) 32bit. Just tried the kernel-huge-smp-5.4.1_smp-i686-1.txz & kernel-modules-smp-5.4.1_smp-i686-1.txz packages from -current and got the same "BUG: unable to handle page fault for address: XXXXXXXX" error, and the subsequent kernel panic.
There are many reports on the net about the error above and apparently it started with 5.2.x. The error reports are usually linked with a specific module, like in this case:
https://lkml.org/lkml/2019/4/25/1138

On this system I have my own 5.3.1 compilation running without any issues, but it's tailored for the system, efi & pcmcia & co disabled in the config.

abga 12-03-2019 09:45 PM

Regarding my previous observation "apparently it started with 5.2.x": I realized that I was searching for "unable to handle page fault for address" and "reserved bit violation", both of which were introduced in recent kernels:
https://lore.kernel.org/patchwork/patch/1022776/
https://lore.kernel.org/patchwork/patch/1064269/
meaning that the issue doesn't seem to have any connection with 5.2.x, only with the recent rewording of the kernel error messages.

Don't know why memremap is failing at "unable to handle page fault for address: XXXXXXX", and why that address is protected (reserved).
Since this dumb Atom N270 has only one core but shows up as having two because of Hyper-Threading (which cannot be disabled in the BIOS), I suspected a race condition and tried booting the 5.4.1 kernel with the boot parameters: maxcpus=1 nosmt
Didn't help...

abga 12-04-2019 10:56 PM

@ Chalapticus

I did my own 5.4.1 smp 32 bit kernel compilation and it booted successfully.
More details in the kernel thread (it has more focus & visibility):
https://www.linuxquestions.org/quest...ml#post6064808

Chalapticus 12-05-2019 05:11 PM

@abga

Thank you very much for your answers.

Unfortunately (but, as more-or-less expected) 5.4.2-smp has the same problem.

As time permits during the weekend I will try to compile 5.4.2 based on your settings from the kernel thread.

I will report back with the results.

abga 12-05-2019 05:53 PM

I'm actually busy recompiling 5.4.1 on my own. I have already recompiled it 3 times, disabling options that are not enabled in the kernel defconfig and that I suspected of causing the crash in the Slackware-provided kernel.
First I focused on the last two drivers loaded before the crash, namely zswap and btrfs. Disabling them didn't help.
Then I moved on to the MTRR options and disabled the MTRR cleanup support (CONFIG_MTRR_SANITIZER). That didn't help either.
https://wiki.gentoo.org/wiki/MTRR_and_PAT
https://www.kernel.org/doc/Documentation/x86/mtrr.txt

I looked again at your crash report (which is the same as mine, except that I didn't save mine) and noticed that efi_rci2 is listed in the "Call Trace" section. It turns out that it's enabled (CONFIG_EFI_RCI2_TABLE=y) in the Slackware-provided kernel. In my first successful test, detailed in the kernel thread, I disabled EFI, and that automatically disabled efi_rci2 as well.
https://lore.kernel.org/patchwork/patch/861224/
I'm now re-compiling the kernel with config-huge-smp-5.4.1-smp and the option "# CONFIG_EFI_RCI2_TABLE is not set". It will take a while on this lazy Atom N270 and I'll report once done and tested (booted).

The EFI Runtime Configuration Interface Table Version 2 Support, to be found in the kernel config:
Code:

.config - Linux/x86 5.4.1 Kernel Configuration
 > Firmware Drivers > EFI (Extensible Firmware Interface) Support
 
 [ ] EFI Runtime Configuration Interface Table Version 2 Support

With the config doc:
Code:

Displays the content of the Runtime Configuration Interface
Table version 2 on Dell EMC PowerEdge systems as a binary
attribute 'rci2' under /sys/firmware/efi/tables directory.

RCI2 table contains BIOS HII in XML format and is used to populate
BIOS setup page in Dell EMC OpenManage Server Administrator tool.
The BIOS setup page contains BIOS tokens which can be configured.

Say Y here for Dell EMC PowerEdge systems.

Symbol: EFI_RCI2_TABLE [=n]
Type  : bool
Prompt: EFI Runtime Configuration Interface Table Version 2 Support
  Location:
    -> Firmware Drivers
      -> EFI (Extensible Firmware Interface) Support
  Defined at drivers/firmware/efi/Kconfig:183
  Depends on: EFI [=y] && (X86 [=y] || COMPILE_TEST [=n])

Shouldn't be enabled and I don't know why it was...
This could explain its activation:
https://lore.kernel.org/linux-efi/20...el@linaro.org/

The lovely "make oldconfig" crap.

slac-in-the-box 12-06-2019 10:01 AM

When building kernels, in the kernel config ncurses window, one can search the kernel options; there can be vendor specific kernel options that can go a long way--I gave an N270 Asus Eee netbook to my son, and there is a kernel option for EEE netbooks, that when enabled, allowed booting: it is way easier to find the option to enable by searching for eee, than trying to peruse the thousands of kernel options. The initial 14.2 kernel boots the eee... but if I apply the patches, and upgrade it to a fully patched 14.2 kernel, then the eee no longer boots... somewhere along the way, Pat must have disabled the EEE option, and perhaps he disabled similar options for your Acer. Although irrelevant while upgrading last night, I noticed that 14.2 is at kernel 4.4.2, and current is at kernel 5.4.2, so Pat must like numerical symmetries :). I built a 4.20 kernel for the eee, and another for the ideapad, and have the kernels blacklisted in /etc/slackpkg/blacklist, so I can upgrade and patch, without overwriting these custom 4.20 kernels. Thus, when building kernels, it never hurts to search the kernel options for vendor name, and enable anything relevant... likewise with cpu and gpu.
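
Outside the ncurses interface (where `/' opens the search prompt), the same vendor search can be run against a config file. A minimal sketch, assuming a plain-text kernel config; `find_config_options' is a hypothetical helper name, not part of the kernel tree.

```shell
#!/bin/sh
# find_config_options FILE KEYWORD: list every option line, whether set
# ("CONFIG_FOO=m") or unset ("# CONFIG_FOO is not set"), whose symbol
# name contains KEYWORD. Matching is case-insensitive.
find_config_options() {
    grep -i -E "^(# )?CONFIG_[A-Z0-9_]*${2}[A-Z0-9_]*(=| is not set)" "$1"
}

# Example (the config path is an assumption; adjust to your system):
#   find_config_options /boot/config-huge-smp-5.4.1-smp eee
```

Unset options are included on purpose: an option that shows up as "is not set" is exactly the kind of candidate worth enabling when a vendor-specific feature stops working.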

abga 12-06-2019 05:08 PM

Quote:

Originally Posted by abga (Post 6065126)
I'm now re-compiling the kernel with config-huge-smp-5.4.1-smp and the option "# CONFIG_EFI_RCI2_TABLE is not set". It will take a while on this lazy Atom N270 and I'll report once done and tested (booted).

After a long, long, LONG! compilation time, natively on the Atom N270 - Slackware 14.2 32 bit, 2-3 hours for the kernel and ~6-7 hours for the modules, I got the 5.4.1 smp 32 bit kernel ready and it works well. Again, the only .config change I made to the Slackware provided config-huge-smp-5.4.1-smp was the option "# CONFIG_EFI_RCI2_TABLE is not set".
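
A minimal sketch of that single .config change, for anyone who wants to repeat the test without going through menuconfig. `disable_config_option' is a hypothetical helper, `sed -i' assumes GNU sed, and the /boot path in the example is an assumption about where the Slackware config is installed.

```shell
#!/bin/sh
# disable_config_option FILE SYMBOL: rewrite "CONFIG_SYMBOL=<value>" as
# "# CONFIG_SYMBOL is not set" in a kernel .config, editing FILE in place.
# Lines for other symbols (including ones that merely share a prefix,
# such as CONFIG_EFI) are left untouched.
disable_config_option() {
    file=$1
    sym=$2
    sed -i "s/^CONFIG_${sym}=.*/# CONFIG_${sym} is not set/" "$file"
}

# Example, run from the top of the kernel source tree:
#   cp /boot/config-huge-smp-5.4.1-smp .config   # path is an assumption
#   disable_config_option .config EFI_RCI2_TABLE
#   make olddefconfig    # let Kconfig settle any dependent options
```

The kernel tree also ships scripts/config, where `scripts/config --disable EFI_RCI2_TABLE' makes the same edit.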

Conclusion: the EFI Runtime Configuration Interface Table Version 2 Support for the Dell EMC PowerEdge (kernel config option CONFIG_EFI_RCI2_TABLE=y) was the bugger, and I'll write a post in the "Requests for -current (14.2-->15.0)" thread, asking to disable it.

Here is the 5.4.1 successful boot dmesg on the Acer Aspire One:
https://pastebin.com/yta2XLfi
I also tested the graphics under X (i915) and couldn't make it crash or hang, even after playing extensively with Firefox and GIMP.

abga 12-06-2019 05:23 PM

Quote:

Originally Posted by slac-in-the-box (Post 6065324)
When building kernels, in the kernel config ncurses window, one can search the kernel options; there can be vendor specific kernel options that can go a long way--I gave an N270 Asus Eee netbook to my son, and there is a kernel option for EEE netbooks, that when enabled, allowed booting: it is way easier to find the option to enable by searching for eee, than trying to peruse the thousands of kernel options. The initial 14.2 kernel boots the eee... but if I apply the patches, and upgrade it to a fully patched 14.2 kernel, then the eee no longer boots... somewhere along the way, Pat must have disabled the EEE option, and perhaps he disabled similar options for your Acer. Although irrelevant while upgrading last night, I noticed that 14.2 is at kernel 4.4.2, and current is at kernel 5.4.2, so Pat must like numerical symmetries :). I built a 4.20 kernel for the eee, and another for the ideapad, and have the kernels blacklisted in /etc/slackpkg/blacklist, so I can upgrade and patch, without overwriting these custom 4.20 kernels. Thus, when building kernels, it never hurts to search the kernel options for vendor name, and enable anything relevant... likewise with cpu and gpu.

At least for the Acer Aspire One all the system specific modules are built in the Slackware 5.4.1 kernel.

Chalapticus 12-07-2019 03:06 AM

1 Attachment(s)
In the meantime I also successfully compiled 5.4.2-smp and it boots OK. I took the `config-generic-smp-5.4.2-smp' from current and applied changes along the lines of those you described in the kernel thread (e.g. HIGHMEM, MTRR...).

I definitely left `CONFIG_EFI_RCI2_TABLE=y', as you can see from the attached config of the currently running kernel (cat /proc/config.gz). (I had to name it `.log', otherwise LQ somehow did not allow me to upload it.)
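
A quick way to confirm what a running kernel was built with, as a minimal sketch. It assumes the kernel exposes /proc/config.gz (which requires CONFIG_IKCONFIG_PROC); `config_state' is a hypothetical helper name.

```shell
#!/bin/sh
# config_state SYMBOL: read a kernel config on stdin and print the value
# of CONFIG_SYMBOL (typically "y" or "m"). Prints "n" when the option is
# unset ("# CONFIG_SYMBOL is not set") or does not appear at all.
config_state() {
    val=$(sed -n "s/^CONFIG_${1}=//p" | head -n 1)
    printf '%s\n' "${val:-n}"
}

# Example on the running system:
#   zcat /proc/config.gz | config_state EFI_RCI2_TABLE
```

Reading from stdin keeps the helper usable both with the compressed /proc file and with a plain config file from /boot.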

Please note that I took the *generic* config instead of the *huge* one, as you did.

Since I made several changes at once, I would still like to build a kernel whose config differs only minimally from the stock one - or even try your suggestion with CONFIG_EFI_RCI2_TABLE.

However, given the compile times on the real HW, I will probably set up a VM first...

abga 12-07-2019 04:04 AM

Interesting!
Given the large amount of modules config-huge-smp-5.4.1-smp is building, initially I was only rebuilding and testing the bzImage (vmlinuz, the actual kernel) with zswap, btrfs and CONFIG_MTRR_SANITIZER disabled, one image for each disabled option. I also adopted this "cheating approach", because the original crash occurred before mounting the root partition and accessing the modules, thus, no use to build the modules in the first place.
Since I didn't save the dmesg, but just observed the screen on the laptop, I might have missed the cause of the crash with CONFIG_MTRR_SANITIZER disabled, meaning, the kernel could have booted OK and crashed due to some modules (all of them were missing).
Both CONFIG_MTRR_SANITIZER and CONFIG_EFI_RCI2_TABLE come disabled by default in 5.4.1 (make defconfig), and I believe they should stay like that, at least CONFIG_EFI_RCI2_TABLE should definitely only be enabled on appropriate HW (Dell EMC PowerEdge systems).
MTRR, on the other hand, comes enabled by default:
Code:

CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
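
To see at a glance which options a distribution config enables beyond the defconfig, the two files can be compared. A minimal sketch with a hypothetical helper `config_only_in'; the file names in the example are assumptions.

```shell
#!/bin/sh
# config_only_in A B: print the "CONFIG_*=..." lines that are set in
# config A but are not set to the same value in config B.
# Unset-option comment lines are ignored on both sides.
config_only_in() {
    a=$(mktemp) && b=$(mktemp) || return 1
    grep '^CONFIG_' "$1" | sort > "$a"
    grep '^CONFIG_' "$2" | sort > "$b"
    comm -23 "$a" "$b"      # lines unique to the first (sorted) file
    rm -f "$a" "$b"
}

# Example, after "make defconfig" has written .config in the source tree:
#   config_only_in /boot/config-huge-smp-5.4.1-smp .config | grep -E 'EFI|MTRR'
```

The kernel tree also ships scripts/diffconfig for a fuller comparison of two configs.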

Can you please provide the dmesg log on your latest try with the config-5.4.2-on-n270.gz.log?
You can use https://pastebin.com/ for it.

Chalapticus 12-07-2019 11:33 AM

1 Attachment(s)
Please find attached the dmesg for the successful boot until the initrd /init starts - the boot process does not get that far if the Oops is present.

It's an xz-compressed text file with the compulsory `.log' extension.

gus3 12-07-2019 06:58 PM

I can confirm 5.4.x kernel(s) fail on my two x86 (32-bit) laptops. One is Asus EeePC 900 from 2009, other is Gateway MX6214 from 2006.

I tried both 5.4.0 when it was in -testing, and 5.4.1 from -current. 5.4.x-huge panicked on both machines; no root filesystem mount was attempted.

NOTE: this is not for me an installation issue. I run -current on all four of my x86 machines, two 64-bit and two 32-bit. The 5.4.x kernel issue showed up in the -current update; see the ChangeLog.txt for the timeline.

abga 12-07-2019 07:26 PM

Thanks Chalapticus!
I was inspecting the log you attached and couldn't find any records about "resource sanity check".

Based on the log you provided in the first post and my own investigation and tests, my understanding of the issue was, well, partially wrong:
- the Interface Table Version 2 Support for the Dell EMC PowerEdge (kernel config option CONFIG_EFI_RCI2_TABLE=y) is reserving some memory (not properly done?) and then the MTRR_SANITIZER is "sneezing" while trying to put some order in the memory management.

After studying the kernel code a little, it turns out that it's not the MTRR_SANITIZER doing the memory check. The check is triggered (I still don't know why) in kernel/resource.c and is apparently called by memremap (mm/memremap.c):
- dmesg log snippet:
Code:

resource sanity check: requesting [mem 0xffffffff-0x10000001c], which spans more than Reserved [mem 0xfffc0000-0xffffffff]
caller memremap+0x10b/0x1c0 mapping multiple BARs
BUG: unable to handle page fault for address: f7d97005

- kernel/resource.c
https://patchwork.kernel.org/patch/4090071/
- mm/memremap.c
https://patchwork.kernel.org/patch/6863601/
https://lwn.net/Articles/652964/
https://lwn.net/Articles/653585/
https://lwn.net/Articles/654119/
... etc (TL;DR)
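
The two address ranges in the "resource sanity check" line of the dmesg snippet can be checked with plain shell arithmetic. A minimal sketch, assuming a shell with 64-bit arithmetic (bash or dash on Linux); `range_info' is a hypothetical helper name.

```shell
#!/bin/sh
# range_info START END: print the length in bytes of the inclusive
# address range START..END and whether END lies beyond 0xffffffff,
# the highest address reachable with 32 bits.
range_info() {
    start=$(( $1 ))
    end=$(( $2 ))
    len=$(( end - start + 1 ))
    if [ "$end" -gt $(( 0xffffffff )) ]; then
        printf 'length=%d bytes, crosses 4 GiB\n' "$len"
    else
        printf 'length=%d bytes, below 4 GiB\n' "$len"
    fi
}

range_info 0xffffffff 0x10000001c   # requested: length=30 bytes, crosses 4 GiB
range_info 0xfffc0000 0xffffffff    # reserved:  length=262144 bytes, below 4 GiB
```

So the request is for only 30 bytes, but it starts at the very last byte of the 32-bit address space and runs past it. What that says about the caller is not established in this thread.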
Now, I still believe the "CONFIG_EFI_RCI2_TABLE=y" is the bugger but cannot blame the MTRR_SANITIZER for the crash and I don't know why this happens only on 32bit systems.
"CONFIG_EFI_RCI2_TABLE=y" -> rci2-table.c is freshly adopted and already had an issue resolved recently:
https://github.com/torvalds/linux/co...i/rci2-table.c

Enabling the MTRR_SANITIZER looks beneficial, that's according to:
http://my-fuzzy-logic.de/blog/index....-problems.html
https://superuser.com/questions/2653...-mtrrs-at-boot
And my understanding is that even if it's compiled and available it needs activation with a kernel boot parameter:
Code:

enable_mtrr_cleanup [X86]
                The kernel tries to adjust MTRR layout from continuous
                to discrete, to make X server driver able to add WB
                entry later. This parameter enables that.

https://www.kernel.org/doc/html/late...arameters.html
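
Whether a given parameter was actually passed to the running kernel can be verified from /proc/cmdline. A minimal sketch; `has_boot_param' is a hypothetical helper, and the optional second argument exists only so the check can be tried on an arbitrary string.

```shell
#!/bin/sh
# has_boot_param PARAM [CMDLINE]: succeed (exit status 0) if PARAM
# appears as a whole space-separated word on the kernel command line.
# CMDLINE defaults to the contents of /proc/cmdline.
has_boot_param() {
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example on the running system:
#   has_boot_param enable_mtrr_cleanup && echo "MTRR cleanup requested"
```

The same check works for the `maxcpus=1' and `nosmt' parameters tried earlier in this thread.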

I'm recompiling (again) now just the kernel with config-huge-smp-5.4.1-smp having:
Code:

CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1

substituted with:
Code:

# CONFIG_MTRR_SANITIZER is not set

- the rest is left unmodified (incl. CONFIG_EFI_RCI2_TABLE=y)
- will pay more attention to the boot messages, check if it loads OK / fails & why it fails (unavailable modules?)

...
Side note on the memory management, the only good doc resource I could find (TL;DR):
https://www.xml.com/ldd/chapter/book/ch13.html

abga 12-07-2019 09:32 PM

Done compiling natively (Acer Aspire One) the 5.4.1 kernel with config-huge-smp-5.4.1-smp, only the MTRR_SANITIZER disabled:
Code:

CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1

substituted with:
Code:

# CONFIG_MTRR_SANITIZER is not set

Result:
Code:

Setup is 17692 bytes (padded to 17920 bytes).
System is 8854 kB
CRC 1939fd0d
Kernel: arch/x86/boot/bzImage is ready  (#1)

Booted it on the HW (Acer Aspire One) and it crashed exactly like before - screen capture:
https://www120.zippyshare.com/v/oIWdTD0H/file.html
(Imgur wasn't working)
I don't have a serial console on this little netbook and I'm not sure I can use netconsole to send and save the kernel boot log over the network. CONFIG_NETCONSOLE is built as a module; I can change that and build it into the kernel, but I'm not sure the networking stack is (completely) loaded before the crash, and I don't really have the time to set up other "kernel crash dump" methods...
https://www.kernel.org/doc/Documenta...netconsole.txt

@Chalapticus
It looks like I was right with the report in post #6, didn't miss the cause, it's still "CONFIG_EFI_RCI2_TABLE=y" (efi_rci2)


All times are GMT -5.