Line card freezes after 30 to 45 days of runtime

hiho888 · 06-15-2015, 05:54 AM

In our telco system, older line cards run Linux kernel 2.4.20_mvl31-wds-mips_fp_be. In the field these cards freeze after 30 to 45 days of runtime, straight out of normal processing and without any error message. The central card revives them via I2C once it detects the outage (the periodic temperature reports stop arriving). We cannot reproduce this behavior in the lab. The serial consoles of several cards in the field are currently wired up, but no kernel Oops or panic shows up before the freeze. I also installed a script that periodically monitors the processing load and the free memory; its output shows no abnormalities before the freeze. The issue seems to be independent of the current system load, because it has also appeared during periods of low traffic. At the current state of the analysis, a hardware defect can be excluded.

Do you have any additional ideas on how to trace an embedded Linux system in the field to narrow down such an issue without causing too much additional load?
E.g., is there a lightweight method to record the process context history at runtime?
Are there any particular Linux resources to observe (apart from /proc/slabinfo, which my script already checks periodically)?
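
For the /proc/slabinfo checks mentioned above, one low-overhead option is to only copy the raw file off the card at intervals and do the comparison on a host. The following is a minimal Python sketch of such a comparison, not the script actually used on the cards. It assumes that each data line starts with the cache name followed by the active and total object counts; the function names, the `min_growth` threshold, and the sample numbers are made up for illustration.

```python
def parse_slabinfo(text):
    """Return {cache_name: (active_objs, total_objs)} from /proc/slabinfo text."""
    caches = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and header lines ("slabinfo - version: ..." or "# name ...").
        if len(fields) < 3 or fields[0] in ("slabinfo", "#"):
            continue
        try:
            caches[fields[0]] = (int(fields[1]), int(fields[2]))
        except ValueError:
            continue  # not a data line in the assumed layout
    return caches


def growing_caches(old_text, new_text, min_growth=100):
    """List (name, growth) for caches whose active object count rose by at least min_growth."""
    old, new = parse_slabinfo(old_text), parse_slabinfo(new_text)
    grown = []
    for name, (active, _total) in new.items():
        delta = active - old.get(name, (0, 0))[0]
        if delta >= min_growth:
            grown.append((name, delta))
    return sorted(grown, key=lambda item: item[1], reverse=True)


# Invented sample snapshots, for illustration only.
DAY_1 = """slabinfo - version: 1.1
kmem_cache            59     78    100    2    2    1
skbuff_head_cache    120    160    160    7    8    1
"""
DAY_20 = """slabinfo - version: 1.1
kmem_cache            59     78    100    2    2    1
skbuff_head_cache   4120   4160    160  172  174    1
"""

print(growing_caches(DAY_1, DAY_20))  # [('skbuff_head_cache', 4000)]
```

A cache that keeps growing across weeks of snapshots would be a candidate for a slow leak; a flat picture up to the freeze would point away from kernel memory exhaustion.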

rtmistler · 06-15-2015, 06:44 AM

Couple of thoughts here:

Telco cards, 2.4 kernel, MVL - MontaVista Linux, on MIPS ...

#@$%!!! It's probably something I worked on about 10+ years ago!!!!! (Cringing under desk)

It's a very old kernel, and MontaVista usually customizes their kernels a lot. Not in a bad way; I'm just saying that you'd need their kernel source to debug this properly. Either you don't have it, or if you do, you should probably seek some assistance from them.

More substantively: you have serial consoles connected and they report nothing. Am I right to assume that the serial consoles are unresponsive once things have "locked up"?

My conclusion here is that something has happened with the processor, be that a large enough memory fault, a file system fault, a bus error, or a plain old CPU halt. Are you SURE that NOTHING has occurred on the serial console prior to all this? Is the last known operation always the same? Or is there never any particular output of any relevance prior to these halts?
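
If console captures from several freezes are available, the question of whether the last output is always the same can be checked mechanically. Here is a minimal Python sketch, assuming each capture is plain text ending at the freeze; the function names and the sample messages are hypothetical, and digits are masked so that counters or timestamps do not hide a match.

```python
import re
from collections import Counter


def last_message(log_text):
    """Return the last non-empty console line with digit runs replaced by 'N'."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    if not lines:
        return None
    return re.sub(r"\d+", "N", lines[-1])


def tally_last_messages(logs):
    """Count how often each (masked) final line occurs across all captures."""
    return Counter(last_message(log) for log in logs).most_common()


# Invented captures, for illustration only.
captures = [
    "boot\nkeep alive 1021\n",
    "msg from central id=7\nkeep alive 88\n",
    "keep alive 5\nmsg from CPE id=3\n",
]
print(tally_last_messages(captures))
```

If one final message dominates the tally, that code path deserves a closer look; an even spread suggests the freeze is not tied to a particular operation.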

I had an embedded card which had random lock-ups, and we never really figured out what was up. Some small percentage of the lock-ups occurred near an update to the time/date on the RTC. We never could trust that particular board, and never could diagnose it, so we discarded it. I realize that this may not be an option given the likely age of the equipment. But hardware is a factor to consider. The CPU depends on the RAM, and on the flash or whatever NV memory it uses, working reliably. If things get marginal, then stuff like system faults can occur.

The other point to consider is that you presumably have a number of these cards, likely identical, which do not have this problem. And I don't understand why you feel that hardware is ruled out as a possible fault point; this looks like exactly a hardware problem.

If it's critical, put a logic analyzer or an emulator on the MIPS and have it keep the most recent trace info continuously until the card fails. Or find a way to determine whether the CPU is still operating. No operating system can take any action if the CPU is halted.

hiho888 · 06-15-2015, 08:26 AM

Thanks for the fast response!
I do see dumps on the serial console before the lock-up, but they relate to normal processing only (e.g. periodic keep-alives, processing of messages from the central card or from the CPE side), and there is no recurring pattern. As mentioned before, there are also cases with nearly no external traffic at all.

The reason for not blaming the hardware in the first place is that we retrieved an affected system from the customer, brought it to our lab, and ran the customer's traffic scenario there for several weeks without seeing the lock-up. We also checked temperature, humidity, and supply voltage at the customer site (radiated interference has not been checked yet).

There was an idea to add a hardware watchdog to the line card, but there is no suitable spare component on the card (e.g. a CPLD). For the latest debug load I added a keep-alive in the device driver area that prints the uptime to the serial console every 10 seconds; I am still waiting for the results.
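
Once the keep-alive output is captured, the serial log can be evaluated on the host side. The Python sketch below is only an illustration: the line format `KEEPALIVE uptime=<seconds>` is an assumption (the real debug load may print something different), and the function name and the `slack` tolerance are hypothetical. It reports the last uptime seen before the log ends and any spots where two consecutive keep-alives are further apart than expected.

```python
import re

# Hypothetical keep-alive line format; adjust to what the debug load really prints.
HEARTBEAT = re.compile(r"KEEPALIVE uptime=(\d+)")


def heartbeat_report(log_lines, interval=10, slack=5):
    """Return (last_uptime, gaps).

    gaps lists (previous, next) uptime pairs that lie more than
    interval + slack seconds apart. A drop in uptime (reboot) is not
    reported as a gap.
    """
    uptimes = []
    for line in log_lines:
        match = HEARTBEAT.search(line)
        if match:
            uptimes.append(int(match.group(1)))
    gaps = [(a, b) for a, b in zip(uptimes, uptimes[1:]) if b - a > interval + slack]
    last = uptimes[-1] if uptimes else None
    return last, gaps


# Invented log excerpt, for illustration only.
sample = [
    "KEEPALIVE uptime=100",
    "msg from central card",
    "KEEPALIVE uptime=110",
    "KEEPALIVE uptime=120",
    "KEEPALIVE uptime=160",
    "KEEPALIVE uptime=170",
]
print(heartbeat_report(sample))  # (170, [(120, 160)])
```

A log that ends on a regular 10-second beat with no gaps suggests an abrupt stop, while growing gaps before the end would suggest the card was slowing down first. Dividing the last uptime by 86400 gives the runtime in days at the freeze, which can be compared across cards.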

Do you have any idea how to verify a CPU-halt without additional hardware?

rtmistler · 06-15-2015, 11:54 AM

Quote:

Originally Posted by hiho888

Do you have any idea how to verify a CPU-halt without additional hardware?

No, apart from a debug line saying "CPU will now halt."