LinuxQuestions.org
Old 03-24-2016, 04:28 PM   #1
FarmerSez
console tty stops working and random page faults on OMAP3 DM3730


Hey,

----------- Preface -----------

At my workplace we have a custom board built around an OMAP DM3730. The CPU's peripherals are as follows:
  • 2*256MB LPDDR1 SDRAM chips
  • 2*128MB SPI NOR - Each NOR holds one software version: the first is read-only and the second is upgradable. The CPU boots from the NOR chip, which holds U-Boot, the Linux kernel and the rootfs.
  • 2*2GB ULPI NAND flashes for storage.

An important detail: the CPU is accompanied by a Xilinx FPGA that is connected to it via GPMC. The flashes aren't connected directly to the CPU; they are routed through the FPGA.

We have about 20 of these custom boards, and they all run the same software version, which has been working for about 6-7 months with no real errors.

----------- The plot thickens -----------

We have one board that has some issues. After boot, the console (in our case, ttyO2) sometimes stops working. When we connect to the board via Ethernet (over GPMC) and run `echo "blabla" > /dev/ttyO2`, no bytes show up in the PuTTY session attached to the console, and the write sometimes even blocks, even though we are using a UART line without flow control. Why would a non flow-controlled tty ever block?

Another thing that happens is that random errors sometimes occur in well-tested processes. For example, we have a Python script that creates a DB on bootup. On our last boot, we got a segfault that was traced to a C `return` statement within python2.7's C source code, which probably means we had some sort of stack overflow that overwrote the return address. This happened in a pure Python script, so I find it hard to believe it is a real software bug. Running the script again worked.

We had a hunch that we have a RAM malfunction, but we couldn't prove it. We ran RAM tests that checked the data + address bus thoroughly. We even had a test that generated a PN15 sequence over the whole RAM 200 times and verified it to be exact.
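For readers unfamiliar with the pattern, here is a minimal sketch of a PN15 fill-and-verify pass. It is not the test we actually ran: the function names are made up for illustration, the LFSR uses the usual PRBS15 taps (bits 15 and 14), and it works on an ordinary user-space buffer, so it goes through the caches and says nothing about which physical RAM cells are touched.

```python
def _step(state):
    """Advance a 15-bit Fibonacci LFSR by one bit (taps at bits 15 and 14)."""
    new_bit = ((state >> 14) ^ (state >> 13)) & 1
    return ((state << 1) | new_bit) & 0x7FFF, new_bit


def pn15_bytes(length, seed=0x7FFF):
    """Return `length` bytes of the PN15 sequence, packed MSB first."""
    state = seed & 0x7FFF
    if state == 0:
        raise ValueError("seed must be non-zero, an all-zero LFSR never advances")
    out = bytearray(length)
    for i in range(length):
        value = 0
        for _ in range(8):
            state, bit = _step(state)
            value = (value << 1) | bit
        out[i] = value
    return out


def first_mismatch(buffer, seed=0x7FFF):
    """Return the offset of the first byte that differs from PN15, or None."""
    expected = pn15_bytes(len(buffer), seed)
    for offset, (got, want) in enumerate(zip(buffer, expected)):
        if got != want:
            return offset
    return None


def pn15_period(seed=0x7FFF):
    """Count how many steps it takes for the LFSR state to repeat."""
    start = seed & 0x7FFF
    state, _ = _step(start)
    steps = 1
    while state != start:
        state, _ = _step(state)
        steps += 1
    return steps
```

A pass like this only proves the pattern survived at the moment of the readback. It does not exercise refresh under temperature, bus contention with the GPMC, or instruction fetches, which is one reason a clean result does not clear the RAM.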

Another clue is that this condition usually starts on the second NOR (the writable one), and it still happens after a warm reset into the read-only NOR.

----------- The begging part -----------

We have no clue what the hell is going on. The software guys (moi) are blaming the hardware, the hardware guys are blaming the software.... as per usual.
The exact same version has been running for more than 7 months on over 20 boards, which to me kind of rules out a software bug.

Any idea why a thing like this could happen? Please advise, I've been working over 14 hrs a day on this thing for more than a week >.>

Thanks.

Last edited by FarmerSez; 03-24-2016 at 05:20 PM.
 
Old 03-28-2016, 02:54 PM   #2
blue_z
Quote:
Originally Posted by FarmerSez
Why would a non flow-controlled tty ever block?
Two possible failures come to mind:
A) symptom: the UART Transmit Buffer doesn't empty -- no baudrate generator to shift the bits out.
B) symptom: no UART interrupt -- the TransmitBufferEmpty interrupt is not firing or is blocked/disabled.
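One way to gather evidence for or against (B) is to compare the UART's interrupt count in /proc/interrupts before and after a non-blocking write. The following is only a sketch under assumptions: the helper names are invented for illustration, and the label used to find the UART line (`name_fragment`) is a guess that has to be checked against what /proc/interrupts actually prints on the board.

```python
import os
import time


def irq_counts(proc_interrupts_text, name_fragment):
    """Map IRQ label -> total count for lines whose description matches."""
    lines = proc_interrupts_text.splitlines()
    if not lines:
        return {}
    n_cpus = len(lines[0].split())  # header row has one column per CPU
    result = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = fields[:n_cpus]
        description = " ".join(fields[n_cpus:])
        if name_fragment in description and all(c.isdigit() for c in counts):
            result[label.strip()] = sum(int(c) for c in counts)
    return result


def irq_delta(before, after):
    """Per-IRQ difference between two irq_counts() snapshots."""
    return {irq: after[irq] - before.get(irq, 0) for irq in after}


def probe_tty(tty_path="/dev/ttyO2", name_fragment="UART", settle=0.5):
    """Write a short string without blocking; return (bytes accepted, IRQ deltas)."""
    with open("/proc/interrupts") as f:
        before = irq_counts(f.read(), name_fragment)
    fd = os.open(tty_path, os.O_WRONLY | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        try:
            written = os.write(fd, b"blabla\r\n")
        except BlockingIOError:
            written = 0  # the tty output buffer is already full
        time.sleep(settle)
        # Sample the counters before closing: close() may itself wait
        # for pending output to drain.
        with open("/proc/interrupts") as f:
            after = irq_counts(f.read(), name_fragment)
    finally:
        os.close(fd)
    return written, irq_delta(before, after)
```

Calling `probe_tty()` on a good board and on the bad one gives a baseline to compare. If the write is accepted but the count never moves on the bad board, that points toward the interrupt path; if no bytes are accepted at all, the buffer is not draining. Either way a scope on the TX pin is still needed to confirm.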

Quote:
Originally Posted by FarmerSez
We had a hunch that we have a RAM malfunction, but we couldn't prove it. We ran RAM tests that checked the data + address bus thoroughly. We even had a test that generated a PN15 sequence over the whole RAM 200 times and verified it to be exact.
Negative results from memory "tests" should probably be taken with a grain of salt; try another test program before ruling the RAM out. I've gotten reliable results from the extended memory test in U-Boot.

Quote:
Originally Posted by FarmerSez
The software guys (moi) are blaming the hardware, the hardware guys are blaming the software.... as per usual.
That's an unfortunate situation.
I've worked with everyone from very agreeable guys (they would listen, and then start probing with a scope) to an incompetent who argued that his schematics had no errors when told that the new board wasn't operating to spec (and he did that for three different HW bugs). One of the three board problems was a disconnected baudrate generator (just like symptom A above), and IIRC there was also a UART interrupt issue.
Of course it helps to never "cry wolf" (i.e. you have evidence and logical reasoning, not just a hunch, when you start pointing fingers).

Quote:
Originally Posted by FarmerSez
Any idea why a thing like this could happen?
Hard to say.
But from many years of personal experience of seeing intermittent issues on some boards but not on others:
a) bizarre SW failures were due to a processor control signal that was not terminated.
b) random interrupts were caused by an input that was floating.
c) peripheral misbehavior was due to a floating input on the peripheral.
d) random kernel panics were caused by DRAM issues; damping resistors were needed on the memory lines.
e) one case involved bad timing, and the HW error was exacerbated by a SW flaw that intentionally ignored that (error) condition. Both HW and SW fixes were applied, but the root cause was HW.
So of those five instances, each board-related problem was caused by hardware rather than software.

Failure analysis is crucial. Determine exactly what has gone wrong, and then come up with a hypothesis. Case (a) was solved in large part by having internal code checks (e.g. validating the arguments passed to a function). Using that pre-failure data, I came up with the hypothesis that occasionally a return instruction was being executed as a NOP. In spite of skepticism, an afternoon of focused testing proved that correct, and produced a HW fix.
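As an illustration of what such internal checks can look like, here is a small Python sketch. It is not the code from that project; `checked`, `CheckFailed`, `RECENT_CALLS` and `read_block` are all hypothetical names. The idea is simply to validate arguments on entry and keep a short history of recent calls, so that when something impossible shows up there is pre-failure data to reason from.

```python
import collections
import functools

# Ring buffer of the most recent checked calls (the pre-failure data).
RECENT_CALLS = collections.deque(maxlen=32)


class CheckFailed(Exception):
    """Raised when a function receives arguments that should be impossible."""


def checked(validator):
    """Decorator: record each call, then reject arguments the validator refuses."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            RECENT_CALLS.append((func.__name__, args, kwargs))
            if not validator(*args, **kwargs):
                raise CheckFailed(
                    "%s got impossible arguments %r %r; recent calls: %r"
                    % (func.__name__, args, kwargs, list(RECENT_CALLS))
                )
            return func(*args, **kwargs)
        return wrapper
    return decorate


# Hypothetical example function: offsets are non-negative, blocks are 1..4096 bytes.
@checked(lambda offset, length: offset >= 0 and 0 < length <= 4096)
def read_block(offset, length):
    return bytes(length)  # placeholder body for the sketch
```

If a caller that is known to pass sane values trips a check like this, the bug is upstream of the function, and on an intermittent board that is a strong hint to start looking at the hardware.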

If you have an environmental chamber, you could try to turn an intermittent HW condition into a hard failure.

Last edited by blue_z; 03-28-2016 at 03:06 PM.
 
  

