LinuxQuestions.org
Old 04-10-2019, 09:55 PM   #1
prl242
LQ Newbie
 
Registered: Apr 2019
Posts: 7

Rep: Reputation: Disabled
Debugging a kernel panic on Ubuntu 14.04 LTS with Kernel 3.13


Hi, we are experiencing a kernel panic on a system we are building. It runs Xubuntu 14.04 LTS on a single-board computer with a PC104 stack.

We have additional third-party hardware with its device drivers installed. We suspect something is conflicting with these and causing the crash.

When we run our firmware, the kernel panics and the computer freezes completely. We do get a dump of the kernel registers on the screen (we can't copy the text, though).

We have also installed linux-crashdump. It generated a vmcore file in /var/crash. However, we cannot find a way to use this file.

Our question: can anyone point us to a good way to debug this kernel panic and determine the cause?

Thanks
Paul
 
Old 04-11-2019, 09:18 AM   #2
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
This is hard to say because you've given fragments of information.

It is an SBC, so my assumption is that you're building your own kernel. Please confirm or correct this.

You have firmware of some type you are running. What exactly do you mean by this? Is there separate firmware on the board on another processor? Is there a custom driver you are using? Is this just an application of some type which may or may not use the kernel space?

Before all that, though: is there a configuration of the kernel that you can run without seeing this problem? For instance, this SBC should come in a standard form, with no modifications. Have you modified the hardware at all? You are adding at least one peripheral to it, the one you mentioned. Can you bring it back to its original form and run the standard (first) kernel you got running on it, that is, a kernel modified only to operate on that SBC and not to perform the custom functions you are adding?

Next, build it up. If you are adding hardware to some interface, which interface is it: PCI, UART, GPIO, I2C, SPI? Note that serial interfaces (UART, SPI, I2C, RS485, RS232, etc.), and networking for that matter, should have little effect on the kernel, and you should not need to change the kernel much for them. The same goes for GPIO: everything should default to input, and thus high impedance, for now, so it should have little effect on the board, except that code reading the GPIO pins could tell their states. A network, wireless, or other communications module that sits on PCI rather than a serial link should not negatively affect your kernel either. It may be detected but not understood, but it should otherwise behave as a benign addition until you add driver software to control it.

If none of that is true, then you are likely making a mistake as you add hardware to it. Consider rechecking your schematics.

Bottom line is that we need a lot more information about exactly what you're doing.

Standard SBC, custom kernel.
Standard SBC, standard kernel, custom application.
Modified SBC (describe how), kernel?, drivers?, application?
 
Old 04-11-2019, 11:28 AM   #3
prl242
LQ Newbie
 
Registered: Apr 2019
Posts: 7

Original Poster
Rep: Reputation: Disabled
Hi rtmistler,

Thank you for your response. Below is more information on our system:

We are building an instrument that uses a WinSystems PPM-C407 PC104 SBC. It has a quad-core Atom E3800 processor and 4 GB RAM. Also on the PC104 stack are an RTD Embedded Technologies DM7820 High Speed Digital I/O board, an RTD Embedded Technologies LAN17222 Gigabit Ethernet module, and a WinSystems PCM-MIO-G-1 DAC module.

We are not compiling our own kernel. We are using the stock kernel from the Xubuntu 14.04 LTS distribution: 3.13.0-168-generic #218-Ubuntu SMP Thu Mar 14 16:56:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

We use the company-provided software to compile, build and install the device drivers for the PCM-MIO board and the DM7820 board. The LAN17222 chipset is an Intel 82574 and is recognized by the operating system. There is an additional API, the Pleora eBUS SDK, that we use to communicate over Ethernet with our camera.

What I've called our firmware is software written by our group (generally C++) to control and communicate with the instrument. This is software we compile and install on the SBC itself.

The computer boots and runs nominally, including with the device drivers loaded. The crash occurs when we run our software. The crash does not happen at the same spot or time but it is frequent (usually only takes a few minutes for it to occur). We suspect it is a conflict with device drivers sharing the PCI bus but we do not have the experience to go about finding the root cause from a kernel panic that freezes the computer.

I installed linux-crashdump. The system crashed and rebooted to a login screen. However, the keyboard and mouse (USB) did not work, so I could not log in. When we booted again, there was a dated directory in /var/crash containing a file vmcore.(thedate). None of the information I could find online explained what to do with this file. The file is no longer there (not sure why).

I've attached a picture of the kernel panic.

Thanks for the help.

Cheers
Paul


Attachment: image.jpg (153.5 KB)
 
Old 04-11-2019, 12:59 PM   #4
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
What you can do is look at the PCI bus and all the information related to the assignments for IRQ, I/O ports, and DMA address and see if your thought that there is a conflict has any bearing.
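
As a starting point for the IRQ part of that check, here is a minimal Python sketch that lists interrupt lines with more than one handler registered, based on the text of /proc/interrupts. The function name shared_irqs is made up for illustration, and the parsing assumes handler names contain no spaces. Note that a shared line is normal for PCI devices and does not prove a conflict; it only narrows down which drivers interact on the same interrupt.

```python
def shared_irqs(interrupts_text):
    """Return {irq_number: [handler names]} for IRQ lines with several handlers."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip NMI, LOC, ERR and other non-numeric rows
        # One count per CPU, then the controller type and handler names.
        fields = rest.split(None, ncpus)
        if len(fields) <= ncpus:
            continue
        names = [n.strip() for n in fields[ncpus].split(",")]
        # The first entry still carries the controller type; keep its last token.
        names[0] = names[0].split()[-1]
        if len(names) > 1:
            shared[int(label)] = names
    return shared


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as fh:
            text = fh.read()
    except OSError:
        text = ""  # not on Linux, or /proc not mounted
    for irq, names in sorted(shared_irqs(text).items()):
        print(f"IRQ {irq}: {', '.join(names)}")
```

I/O port and memory ranges can be compared in a similar way from /proc/ioports and /proc/iomem, or read per device from lspci -v.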

Otherwise, it would seem that you may need to build the kernel, change the configuration to add debug capabilities and also have the code for the drivers to be able to debug this problem a bit further.
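
Before rebuilding, it may help to see which debug-related options the running kernel already has. The sketch below reads the kernel config that Ubuntu installs as /boot/config-<release> and reports a handful of options. The option list is an illustrative selection, not a complete or recommended set, and the function names are hypothetical.

```python
import os
import platform

# Options often relevant to crash analysis and DMA debugging (illustrative only).
DEBUG_OPTIONS = [
    "CONFIG_DEBUG_INFO",
    "CONFIG_KEXEC",
    "CONFIG_CRASH_DUMP",
    "CONFIG_DMA_API_DEBUG",
    "CONFIG_DEBUG_PAGEALLOC",
]


def parse_kernel_config(text):
    """Map option names to values; '# CONFIG_X is not set' is recorded as 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def report(options, wanted=DEBUG_OPTIONS):
    """Return the value of each wanted option, or 'absent' if it is not listed."""
    return {name: options.get(name, "absent") for name in wanted}


if __name__ == "__main__":
    path = "/boot/config-" + platform.release()
    if os.path.exists(path):
        with open(path) as fh:
            for name, value in report(parse_kernel_config(fh.read())).items():
                print(f"{name}: {value}")
    else:
        print("no kernel config found at", path)
```

Any option reported as n or absent would have to be enabled in a custom build before it could help.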

Has WinSystems claimed that this kernel will run and has been tested on that board?
 
Old 04-11-2019, 01:37 PM   #5
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,140

Rep: Reputation: 1263
The stack trace says you took an interrupt on the E1000 ethernet driver and executed an illegal instruction. It tells you the line in the source code where this occurred: skbuff.h line 1486, which is probably skb_pull_inline in your kernel. Since there's nothing weird in the code there, it is most likely that your driver (possibly some other kernel bug but Occam's Razor is very sharp) overwrote part of kernel memory so that it faulted when it tried to execute that victim location. You can use gdb to examine the core dump and maybe learn a little more. Does your device do DMA? Are you mapping bus addresses correctly?

https://elixir.bootlin.com/linux/v3....linux/skbuff.h
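
For opening the dump itself, the following Python sketch finds vmcore files under /var/crash and prints a command line for the crash utility, which wraps gdb and also reads compressed kdump files. Two assumptions are built in: the dump sits in a dated subdirectory with a name starting with vmcore, as described earlier in the thread, and the kernel debug symbols are installed at /usr/lib/debug/boot/vmlinux-<release>, the usual location for Ubuntu's dbgsym packages. The helper names are hypothetical.

```python
import glob
import os
import platform


def find_vmcores(crash_dir="/var/crash"):
    """Return vmcore files found one level below crash_dir, newest first."""
    pattern = os.path.join(crash_dir, "*", "vmcore*")
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


def crash_command(vmcore, release=None, debug_dir="/usr/lib/debug/boot"):
    """Build the argument list for 'crash <vmlinux> <vmcore>'."""
    # The release must be that of the kernel that crashed, which is only
    # the running kernel if no kernel update happened in between.
    release = release or platform.release()
    vmlinux = os.path.join(debug_dir, "vmlinux-" + release)
    return ["crash", vmlinux, vmcore]


if __name__ == "__main__":
    cores = find_vmcores()
    if not cores:
        print("no vmcore files found under /var/crash")
    for core in cores:
        print(" ".join(crash_command(core)))
```

Once crash has loaded the dump, its bt and log commands show the panicking task's backtrace and the kernel message buffer.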
 
Old 04-11-2019, 04:42 PM   #6
prl242
LQ Newbie
 
Registered: Apr 2019
Posts: 7

Original Poster
Rep: Reputation: Disabled
Hi rtmistler, yes the kernel is supported for the board.

We looked at the PCI bus info (lspci) and did not see any conflicts.

We now seem to be working, though. We swapped the Ethernet controller for the Pleora boards, so the camera packets are now directly on the SBC bus (i.e., we are no longer using the LAN17222 PCI expansion board). Our current guess is that the camera packets and the RTD DAQ board were swamping, or conflicting on, the PCI bus.

Thanks again for the help.

Cheers
Paul

 
Old 04-11-2019, 04:50 PM   #7
prl242
LQ Newbie
 
Registered: Apr 2019
Posts: 7

Original Poster
Rep: Reputation: Disabled
Hi smallpond,

Thanks for pointing out how to read the stack trace. Much appreciated.

The RTD DAQ board is using DMA. This is a valid point. We will need to investigate this.

Thanks for the help.

Cheers
Paul


 
  

