Debugging a kernel panic on Ubuntu 14.04 LTS with Kernel 3.13

prl242 · 04-10-2019, 09:55 PM

Hi, we are experiencing a kernel panic on a system we are building. It runs the XUbuntu 14.04 LTS operating system. The computer is a single board computer with a PC104 stack.

We have additional third-party hardware with device drivers installed. We suspect that something conflicting among these causes the crash.

When we try to operate our firmware, the kernel panics and the computer freezes completely. We do get a dump of the kernel registers on the screen (we can't copy the text, though).

We have also installed linux-crashdump. It generated a vmcore file in /var/crash. However, we cannot find a way to use this file.

The question: can anyone point us to a good way to debug this kernel panic and determine the cause?

Thanks
Paul

rtmistler · 04-11-2019, 09:18 AM

This is hard to say because you've given fragments of information.

It is an SBC, so my assumption is that you're building your own kernel. Please confirm or correct this.

You have firmware of some type you are running. What exactly do you mean by this? Is there separate firmware on the board on another processor? Is there a custom driver you are using? Is this just an application of some type which may or may not use the kernel space?

Well, before all that: is there a kernel configuration you can run without seeing this problem? For instance, this SBC should come in a standard form, with no modifications. Have you modified the hardware at all? You are adding at least one peripheral to it, the one you mentioned. Can you bring it back to its original form and run the standard (first) kernel you got running on it, meaning a kernel with only the modifications needed to operate on that SBC and none of the custom functions you are adding?

Next, build it up. If you are adding hardware to some interface, which interface is it? PCI, UART, GPIO, I2C, SPI? Note that serial interfaces (UART, SPI, I2C, RS485, RS232, etc.), and networking for that matter, should have little effect on the kernel, and you should not need to change the kernel much for them. The same goes for GPIO: everything should default to input, and thus high impedance, for now, so it should have little effect on the board, except that code reading the GPIO pins could tell you their states. A network, wireless, or other communications module that sits on PCI rather than on a serial link should not negatively affect your kernel either. It may be detected but not understood, and it should otherwise behave as a benign addition until you add driver software to control it.

If none of that is true, then you are likely making a mistake as you add hardware to it. Consider rechecking your schematics.

Bottom line is that we need a lot more information about exactly what you're doing.

Standard SBC, custom kernel.
Standard SBC, standard kernel, custom application.
Modified SBC (describe how), kernel?, drivers?, application?

prl242 · 04-11-2019, 11:28 AM

Hi rtmistler,

Thank you for your response. Below is more information on our system:

We are building an instrument that uses a WinSystems PPM-C407 PC104 SBC. It has a 4-core Atom E3800 processor and 4 GB of RAM. Also on the PC104 stack are an RTD Embedded Technologies DM7820 High Speed Digital I/O board, an RTD Embedded Technologies LAN17222 Gigabit Ethernet module and a WinSystems PCM-MIO-G-1 DAC module.

We are not compiling our own kernel. We are using the kernel from the XUbuntu 14.04 LTS distribution: 3.13.0-168-generic #218-Ubuntu SMP Thu Mar 14 16:56:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

We use the software provided by each company to compile, build and install the device drivers for the PCM-MIO board and the DM7820 board. The LAN17222 chipset is an Intel 82574 and is recognized by the operating system. There is an additional API, the Pleora eBUS SDK, that we use to communicate with our camera over Ethernet.

What I've called our firmware is software written by our group (generally C++) to control and communicate with the instrument. This is software we compile and install on the SBC itself.

The computer boots and runs nominally, including with the device drivers loaded. The crash occurs when we run our software. It does not happen at the same spot or time, but it is frequent (it usually takes only a few minutes to occur). We suspect a conflict between device drivers sharing the PCI bus, but we do not have the experience to track down the root cause of a kernel panic that freezes the computer.

I installed linux-crashdump. The system crashed and rebooted to a login screen. However, the keyboard and mouse (USB) did not work, so I could not log in. When we rebooted again, there was a dated directory in /var/crash containing a file vmcore.(thedate). None of the information I could find online seemed to indicate what to do with this file. The file is no longer there (not sure why).

I've attached a picture of the kernel panic.

Thanks for the help.

Cheers
Paul

rtmistler · 04-11-2019, 12:59 PM

What you can do is look at the PCI bus and all the information on the IRQ, I/O port, and DMA address assignments, and see whether your suspicion of a conflict holds up.

Otherwise, it would seem that you may need to build the kernel yourself, change its configuration to add debug capabilities, and also have the source code for the drivers, so that you can debug this problem a bit further.

Has WinSystems claimed that this kernel will run on that board and has been tested on it?
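
One way to start on the IRQ part of that check is to list the interrupt lines that have more than one handler attached. The sketch below is in Python and parses text in the layout of /proc/interrupts. The function name shared_irqs and the handler name rtd-dm7820 in the sample are made up for illustration, not taken from the real drivers. A shared line is normal on PCI, so a hit only tells you where to look, not that something is wrong.

```python
import os


def shared_irqs(interrupts_text):
    """Return {irq_number: [handler names]} for IRQs with more than one handler."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR
        fields = rest.split(None, ncpus)  # per-CPU counts, then the description
        if len(fields) <= ncpus:
            continue
        description = fields[ncpus]
        if "," not in description:
            continue  # only one handler on this line
        names = [part.strip() for part in description.split(",")]
        # The first part still carries the interrupt-controller column(s);
        # the handler name is assumed to be its last word.
        names[0] = names[0].split()[-1]
        shared[int(label)] = names
    return shared


# Hypothetical sample; the handler names are placeholders.
SAMPLE = """\
           CPU0       CPU1
  0:         46          0   IO-APIC-edge      timer
 16:        100        200   IO-APIC-fasteoi   ehci_hcd:usb1, rtd-dm7820
NMI:          0          0   Non-maskable interrupts
"""

if __name__ == "__main__":
    path = "/proc/interrupts"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE
    for irq, names in sorted(shared_irqs(text).items()):
        print(irq, ", ".join(names))
```

If the DAQ driver and the Ethernet driver turn up on the same line, that is a reasonable place to focus, for example by moving a card to another slot and seeing whether the crash rate changes.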

smallpond · 04-11-2019, 01:37 PM

The stack trace says you took an interrupt on the E1000 ethernet driver and executed an illegal instruction. It tells you the line in the source code where this occurred: skbuff.h line 1486, which is probably skb_pull_inline in your kernel. Since there's nothing weird in the code there, it is most likely that your driver (possibly some other kernel bug but Occam's Razor is very sharp) overwrote part of kernel memory so that it faulted when it tried to execute that victim location. You can use gdb to examine the core dump and maybe learn a little more. Does your device do DMA? Are you mapping bus addresses correctly?

https://elixir.bootlin.com/linux/v3....linux/skbuff.h
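
If the vmcore from linux-crashdump shows up again, the crash utility is usually an easier way into a kdump file than plain gdb. Here is a rough sketch in Python that picks the newest dump and builds the command line. The helper names are hypothetical, the vmlinux path assumes the Ubuntu debug-symbol (dbgsym) kernel package for the same release is installed, and the vmcore file name pattern is the one described earlier in this thread.

```python
import glob
import os
import platform


def newest_vmcore(crash_dir="/var/crash"):
    """Return the most recently modified vmcore* file one level below crash_dir, or None."""
    candidates = glob.glob(os.path.join(crash_dir, "*", "vmcore*"))
    files = [path for path in candidates if os.path.isfile(path)]
    if not files:
        return None
    return max(files, key=os.path.getmtime)


def crash_command(vmcore, release):
    """Build the argument list for the crash utility.

    Assumption: the debug vmlinux lives where the Ubuntu dbgsym package
    puts it, /usr/lib/debug/boot/vmlinux-<release>.
    """
    vmlinux = "/usr/lib/debug/boot/vmlinux-" + release
    return ["crash", vmlinux, vmcore]


if __name__ == "__main__":
    dump = newest_vmcore()
    if dump is None:
        print("no vmcore found under /var/crash")
    else:
        # platform.release() gives the running kernel, e.g. 3.13.0-168-generic;
        # it must match the kernel that produced the dump.
        print(" ".join(crash_command(dump, platform.release())))
```

Once crash has loaded the dump, bt prints the backtrace of the panicking task, log prints the kernel message buffer, and mod lists the loaded modules, which should be enough to get the same information as the screen photo in text form.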

prl242 · 04-11-2019, 04:42 PM

Hi rtmistler, yes, the kernel is supported on the board.

We looked at the PCI bus info (lspci) and did not see any conflicts.
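
For anyone repeating that check, here is a minimal Python sketch of how the memory regions could be compared automatically. It assumes text in the layout printed by lspci -v ("Memory at ... [size=...]" lines under each device). The function names are hypothetical, and bridge windows and I/O port ranges are deliberately ignored.

```python
import re

_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
_REGION = re.compile(r"Memory at ([0-9a-f]+) .*\[size=(\d+)([KMG]?)\]")


def memory_regions(lspci_text):
    """Return a list of (slot, start, end) for each assigned memory region."""
    regions = []
    slot = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            slot = line.split(" ", 1)[0]  # device header, e.g. "01:00.0 ..."
            continue
        match = _REGION.search(line)
        if match and slot:
            start = int(match.group(1), 16)
            size = int(match.group(2)) * _UNITS[match.group(3)]
            regions.append((slot, start, start + size))
    return regions


def overlaps(regions):
    """Return pairs of slots whose memory regions intersect."""
    found = []
    for index, (slot_a, start_a, end_a) in enumerate(regions):
        for slot_b, start_b, end_b in regions[index + 1:]:
            if slot_a != slot_b and start_a < end_b and start_b < end_a:
                found.append((slot_a, slot_b))
    return found


if __name__ == "__main__":
    # Hypothetical sample text; real input would be the output of `lspci -v`.
    sample = (
        "01:00.0 Ethernet controller: example\n"
        "\tMemory at f7c00000 (32-bit, non-prefetchable) [size=128K]\n"
        "02:00.0 Unassigned class: example\n"
        "\tMemory at f7d00000 (32-bit, non-prefetchable) [size=4K]\n"
    )
    print(overlaps(memory_regions(sample)))
```

An empty result only says the address windows do not collide. It says nothing about a driver writing outside its own buffers.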

The system now seems to be working, though. We swapped the Ethernet controller for the Pleora boards, so the camera packets are now directly on the SBC bus (i.e., we are no longer using the LAN17222 PCI expansion board). Our current guess is that traffic from the camera packets and from the RTD DAQ board was swamping, or conflicting on, the PCI bus.

Thanks again for the help.

Cheers
Paul

prl242 · 04-11-2019, 04:50 PM

Hi smallpond,

Thanks for pointing out how to read the stack trace. Much appreciated.

The RTD DAQ board is using DMA. This is a valid point. We will need to investigate this.

Thanks for the help.

Cheers
Paul
