Linux - Embedded & Single-board computer
This forum is for the discussion of Linux on both embedded devices and single-board computers (such as the Raspberry Pi, BeagleBoard and PandaBoard). Discussions involving Arduino, plug computers and other micro-controller like devices are also welcome.
In our telco system, older line cards are running Linux kernel 2.4.20_mvl31-wds-mips_fp_be. In the field these cards freeze after 30 to 45 days of runtime, straight out of normal processing and without any error message. The central card restarts them via I2C once it detects the outage (the periodic temperature reports stop arriving). In the lab this behavior is not reproducible. The serial consoles of several cards in the field are currently wired up, but no kernel oops or panic shows up before the freeze. I also installed a script that periodically records the processing load and the free memory. Its output shows no abnormalities before the freeze. The issue seems to be independent of the current load on the system, because it has also appeared during periods of low traffic. Furthermore, a hardware defect can be excluded at the current state of the analysis.
Do you have any additional ideas on how to trace an embedded Linux system in the field to narrow down such an issue without causing too much additional load?
For example, is there a lightweight method to record the process context history at runtime?
Are there any special Linux resources that should be observed (apart from /proc/slabinfo, which my script already checks periodically)?
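As an illustration of the periodic /proc/slabinfo check mentioned above, here is a minimal sketch of how saved snapshots could be compared off-line to spot a slab cache that only ever grows. It is assumed to run on the host that collects the logs, not on the line card. The function names `parse_slabinfo` and `growing_caches` are hypothetical, and the parser assumes each data line starts with the cache name followed by the active object count.

```python
def parse_slabinfo(text):
    """Map cache name -> active object count for one /proc/slabinfo snapshot."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the version banner, comment lines, and anything too short.
        if len(fields) < 3 or line.startswith(("slabinfo", "#")):
            continue
        # Assumption: the second column is the active object count.
        if not fields[1].isdigit():
            continue
        counts[fields[0]] = int(fields[1])
    return counts


def growing_caches(snapshots, min_growth=1):
    """Return {cache: growth} for caches that never shrink and grow overall.

    snapshots: list of slabinfo texts, oldest first.
    """
    parsed = [parse_slabinfo(s) for s in snapshots]
    if len(parsed) < 2:
        return {}
    result = {}
    for name in parsed[0]:
        # Only consider caches that are present in every snapshot.
        series = [p[name] for p in parsed if name in p]
        if len(series) != len(parsed):
            continue
        monotonic = all(b >= a for a, b in zip(series, series[1:]))
        growth = series[-1] - series[0]
        if monotonic and growth >= min_growth:
            result[name] = growth
    return result
```

A cache that shows up here across weeks of snapshots is only a candidate for a slow leak, not proof of one; busy caches can also grow for legitimate reasons.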
#@$%!!! It's probably something I worked on about 10+ years ago!!!!! (Cringing under desk)
That is a very old kernel, and MontaVista usually customizes their kernels a lot. Not in a bad way, but you'd need their kernel source to debug this properly, and you either don't have it, or if you do, you should probably seek some assistance from them.
More substantively: You have serial consoles in use and there are no reports, my assumption is that the serial consoles are inoperable once things have "locked up"?
My conclusion here is that something has happened with the processor, be that a large enough memory fault, a file system fault, a bus error, or a plain old CPU halt. Are you SURE that NOTHING has occurred on the serial console prior to all this? Is the last known operation always the same? Or is there never any particular output of any relevance prior to these halts?
I had an embedded card which had random lock-ups, and we never really concluded what was up. Some small percentage of the lock-ups occurred near an update to the time/date on the RTC. We never could trust that particular board, and never could diagnose it, so we discarded it. I realize that this may not be an option given the likely age of the equipment. But hardware is a factor to consider. The CPU depends on the memory, the flash, or whatever NV memory it is using, working sufficiently well. If things get marginal, then stuff like system faults can occur.
The other point to consider is that you presumably have a number of cards which are likely identical and which do not have this problem. And I don't understand why you feel that hardware is ruled out as a possible fault point; this seems to be exactly hardware.
If it's critical, put a logic analyzer or an emulator on the MIPS and have it keep the most recent trace info continuously until the card fails. Or find a way to determine if the CPU is still operating. No operating system can take any action if the CPU is halted.
Thanks for the fast response!
I do see dumps on the serial console before the lock-up. But these relate to normal processing only (e.g. the periodic keep-alive, or processing of messages from the central card or from the CPE side) and show no recurring pattern. As mentioned before, there are also cases with nearly no external traffic at all.
The reason for not blaming the hardware in the first place is that we retrieved an affected system from the customer, brought it to our lab, and ran the customer's traffic scenario there for several weeks without seeing the lock-up. We also checked temperature, humidity, and supply voltage at the customer site (interference radiation hasn't been checked yet).
There was an idea to set up a HW watchdog on the line card, but there is no suitable additional component on the card (e.g. a CPLD). For the latest debug load I added a keep-alive in the device driver area which prints the uptime on the serial console every 10 sec. I am still waiting for the results.
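Once console captures with that 10-second keep-alive are available, something like the following sketch could evaluate them on the logging host. It reports heartbeats that arrived late (which would hint at the system slowing down before the freeze rather than stopping abruptly) and the last uptime seen before each restart. The line format `KEEPALIVE uptime=<seconds>` and the name `analyze_heartbeats` are assumptions for illustration; the real debug output may look different.

```python
import re

# Hypothetical heartbeat format; adjust to the actual debug print.
HEARTBEAT = re.compile(r"KEEPALIVE uptime=(\d+)")


def analyze_heartbeats(lines, interval=10, slack=5):
    """Scan a console capture and report late heartbeats and restarts.

    Returns (late, restarts):
      late     - (prev_uptime, uptime) pairs spaced more than interval + slack
      restarts - the last uptime seen before each restart
    """
    late, restarts = [], []
    prev = None
    for line in lines:
        m = HEARTBEAT.search(line)
        if not m:
            continue  # ordinary console output, not a heartbeat
        uptime = int(m.group(1))
        if prev is not None:
            if uptime < prev:
                # Uptime went backwards: the card was restarted in between.
                restarts.append(prev)
            elif uptime - prev > interval + slack:
                late.append((prev, uptime))
        prev = uptime
    return late, restarts
```

If every freeze shows an empty `late` list right up to the last heartbeat, that would be consistent with an abrupt stop; the last uptime before each restart also shows whether the freezes cluster around a particular runtime.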
Do you have any idea how to verify a CPU-halt without additional hardware?