Quote:
Originally Posted by FarmerSez
Why would a non flow-controlled tty ever block?
|
Two possible failures come to mind:
A) symptom: the UART Transmit Buffer doesn't empty -- no baudrate generator to shift the bits out.
B) symptom: no UART interrupt -- the TransmitBufferEmpty interrupt is not firing or is blocked/disabled.
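To check from userspace whether either symptom is present, one option is to sample the tty output queue twice and see whether it drains. The sketch below assumes Linux and its TIOCOUTQ ioctl; the helper names tty_pending_tx and tx_looks_stalled are made up for illustration. The comparison is only meaningful if nothing writes to the port between the two samples, and a stalled queue cannot by itself tell A from B, since both leave the data sitting in the buffer.

```c
#include <sys/ioctl.h>

/* Bytes still waiting in the tty output queue, or -1 on error.
 * Uses the Linux TIOCOUTQ ioctl; fd is an already-opened tty. */
static int tty_pending_tx(int fd)
{
    int pending = 0;

    if (ioctl(fd, TIOCOUTQ, &pending) < 0)
        return -1;
    return pending;
}

/* Hypothetical helper: compare two samples taken some time apart
 * (long enough for at least a few characters at the configured baudrate).
 * Data was queued and none of it left: the transmitter looks stalled. */
static int tx_looks_stalled(int before, int after)
{
    return before > 0 && after >= before;
}
```

If the queue is stalled, watching the UART's IRQ count in /proc/interrupts over the same interval is a reasonable next step before reaching for a scope.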
Quote:
Originally Posted by FarmerSez
We had a hunch that we have a RAM malfunction, but we couldn't prove it. We ran RAM tests that checked the data + address bus thoroughly. We even had a test that generated a PN15 series on the whole RAM 200 times and verified it to be exact.
|
Negative results from memory "tests" should probably be taken with a grain of salt: a test that finds no errors does not prove the RAM is good, so try another test program as well. I've gotten reliable results from the extended memory test in U-Boot.
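For reference, a PN15 pattern test of the kind described in the quote might look roughly like the sketch below. This is not your test and not U-Boot's; it assumes the usual PRBS15 polynomial x^15 + x^14 + 1, and the function names are invented for the example. The seed must be nonzero, or the sequence stays at zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Advance a 15-bit LFSR (x^15 + x^14 + 1) by one bit. */
static uint16_t pn15_step(uint16_t s)
{
    uint16_t bit = (uint16_t)(((s >> 14) ^ (s >> 13)) & 1u);

    return (uint16_t)(((s << 1) | bit) & 0x7FFFu);
}

/* Collect 16 successive output bits into one test word. */
static uint16_t pn15_word(uint16_t *state)
{
    uint16_t w = 0;

    for (int i = 0; i < 16; i++) {
        *state = pn15_step(*state);
        w = (uint16_t)((w << 1) | (*state & 1u));
    }
    return w;
}

/* Write the pattern over n 16-bit words. */
static void pn15_fill(volatile uint16_t *mem, size_t n, uint16_t seed)
{
    uint16_t state = seed;

    for (size_t i = 0; i < n; i++)
        mem[i] = pn15_word(&state);
}

/* Regenerate the pattern and compare.
 * Returns the index of the first mismatch, or n if everything matched. */
static size_t pn15_verify(const volatile uint16_t *mem, size_t n, uint16_t seed)
{
    uint16_t state = seed;

    for (size_t i = 0; i < n; i++) {
        if (mem[i] != pn15_word(&state))
            return i;
    }
    return n;
}
```

A clean pass only shows that the cells held this pattern under the access pattern and conditions of the test, which is one reason to run a second, different test as well.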
Quote:
Originally Posted by FarmerSez
The software guys (moi) are blaming the hardware, the hardware guys are blaming the software.... as per usual.
|
That's an unfortunate situation.
I've worked with everyone from very agreeable guys (they would listen, and then start probing with a scope) to an incompetent who argued that his schematics had no errors when told that the new board wasn't operating to spec (and he did that for three different HW bugs). One of the three board problems was a disconnected baudrate generator (just like symptom A above), and IIRC there was also a UART interrupt issue.
Of course it helps to never "cry wolf" (i.e. you have evidence and logical reasoning, not just a hunch, when you start pointing fingers).
Quote:
Originally Posted by FarmerSez
Any idea why could a thing like this happen?
|
Hard to say.
But from many years of personal experience of seeing intermittent issues on some boards but not on others:
a) bizarre SW failures were due to a processor control signal that was not terminated.
b) random interrupts were caused by an input that was floating.
c) peripheral misbehavior was due to a floating input pin on the peripheral.
d) random kernel panics were caused by DRAM issues; damping resistors were needed on the memory lines.
e) one case involved bad timing, and the HW error was exacerbated by a SW flaw that intentionally ignored that (error) condition. Both HW and SW fixes were applied, but the root cause was HW.
So of those five instances, each board-related problem was caused by hardware rather than software.
Failure analysis is crucial. Determine exactly what has gone wrong, and then come up with a hypothesis. Case (a) was solved in large part by having internal code checks (e.g. validating the arguments passed to a function). Using that pre-failure data, I came up with the hypothesis that occasionally a return instruction was being executed as a NOP. In spite of skepticism, an afternoon of focused testing proved that correct, and produced a HW fix.
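As an illustration of what I mean by internal code checks, here is a minimal sketch. It is not the code from that project; CHECK_ARG, check_fail and buf_put are hypothetical names. The idea is that every failed check is counted and logged with its location, so you have a record of what went wrong before the crash rather than only after it.

```c
#include <stddef.h>
#include <stdio.h>

/* Running count of failed checks; a nonzero value is pre-failure evidence. */
static unsigned long check_failures;

/* Record where a check failed and which condition did not hold. */
static void check_fail(const char *func, const char *expr, int line)
{
    check_failures++;
    fprintf(stderr, "CHECK failed in %s() line %d: %s\n", func, line, expr);
}

/* Evaluates to 1 if cond holds, otherwise logs the failure and evaluates to 0. */
#define CHECK_ARG(cond) \
    ((cond) ? 1 : (check_fail(__func__, #cond, __LINE__), 0))

/* Example function that validates its arguments before touching memory. */
static int buf_put(unsigned char *buf, size_t size, size_t idx, unsigned char v)
{
    if (!CHECK_ARG(buf != NULL) || !CHECK_ARG(idx < size))
        return -1;

    buf[idx] = v;
    return 0;
}
```

If a check trips with arguments that the caller could not possibly have passed, that points away from an ordinary software bug and toward something like corrupted state or a misexecuted instruction.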
If you have an environmental chamber, you could try to turn an intermittent HW condition into a hard failure.