LinuxQuestions.org
06-15-2015, 05:54 AM   #1
hiho888
LQ Newbie
 
Registered: Jun 2015
Posts: 2

Rep: Reputation: Disabled
Line card freezes after 30 to 45 days of runtime


In our telco system, older line cards run Linux kernel 2.4.20_mvl31-wds-mips_fp_be. In the field these cards freeze after 30 to 45 days of runtime, out of the normal processing state and without any error message. They are revived by the central card via I2C once it detects the outage (no more periodic temperature reports). In the lab this behavior is not reproducible. The serial consoles of several cards are currently wired up in the field, but no kernel Oops or panic is seen before the freeze. I also installed a script that periodically monitors the processing load and the free memory. Its output shows no abnormalities before the freeze. The issue seems to be independent of the current load on the system, because it has also appeared during periods of low traffic. Furthermore, at the current state of analysis a hardware defect can be excluded.

Do you have any additional ideas on how to trace an embedded Linux system in the field to narrow down such an issue without causing too much additional load?
E.g. is there a lightweight method to record the process context history during runtime?
Are there any special Linux resources to observe (apart from /proc/slabinfo, which my script already checks periodically)?
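
For illustration only, here is a minimal C sketch of the kind of low-overhead periodic sampling described above. It is not the poster's script: the helper names (`meminfo_kb`, `read_small_file`, `log_sample`) and the output format are made up for this example, and it assumes that /proc/uptime, /proc/loadavg and a "MemFree:" line in /proc/meminfo are available on the target.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the value of a "Key:   12345 kB" line in /proc/meminfo text,
 * or -1 if no line starts with that key. */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *p = text;

    while (p && *p) {
        /* Match only at the start of a line, directly followed by ':'. */
        if (strncmp(p, key, klen) == 0 && p[klen] == ':')
            return strtol(p + klen + 1, NULL, 10);
        p = strchr(p, '\n');
        if (p)
            p++;
    }
    return -1;
}

/* Read a small /proc file into buf and NUL-terminate it.
 * Returns the number of bytes read, or -1 if the file cannot be opened. */
int read_small_file(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    size_t n;

    if (!f)
        return -1;
    n = fread(buf, 1, len - 1, f);
    buf[n] = '\0';
    fclose(f);
    return (int)n;
}

/* Write one sample line (uptime, 1-minute load, MemFree) to out.
 * The line format is hypothetical. Returns 0 on success, -1 on error. */
int log_sample(FILE *out)
{
    char buf[2048];
    double up = 0.0, load1 = 0.0;

    if (read_small_file("/proc/uptime", buf, sizeof buf) < 0)
        return -1;
    sscanf(buf, "%lf", &up);
    if (read_small_file("/proc/loadavg", buf, sizeof buf) < 0)
        return -1;
    sscanf(buf, "%lf", &load1);
    if (read_small_file("/proc/meminfo", buf, sizeof buf) < 0)
        return -1;
    fprintf(out, "up=%.0f load1=%.2f memfree_kb=%ld\n",
            up, load1, meminfo_kb(buf, "MemFree"));
    fflush(out);   /* push the line out before the next sleep */
    return 0;
}
```

A caller would invoke `log_sample(stdout)` (or pass a stream opened on the console) from a loop with a long `sleep()` between samples, so the added load stays negligible.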
 
06-15-2015, 06:44 AM   #2
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Couple of thoughts here:

Telco cards, 2.4 kernel, MVL - MontaVista Linux, on MIPS ...

#@$%!!! It's probably something I worked on about 10+ years ago!!!!! (Cringing under desk)

Very old kernel, MontaVista usually customizes their kernels a lot. Not in a bad way, but just saying that you'd need their kernel source to debug this properly, and you either don't have it, or if you do, you should probably seek some assistance from them.

More substantively: you have serial consoles in use and there are no reports. Am I right to assume that the serial consoles are inoperable once things have "locked up"?

My conclusion here is that something has happened with the processor, be that a large enough memory fault, a file system fault, a bus error, or a plain old CPU halt. Are you SURE that NOTHING has occurred on the serial console prior to all this? Is the last known operation always the same? Or is there never any particular output of any relevance prior to these halts?

I had an embedded card which had random lock-ups, and we never really figured out what was up. Some small percentage of the lock-ups occurred near an update to the time/date on the RTC. We never could trust that particular board, and never could diagnose it, so we discarded it. I realize that this may not be an option for you, given the likely age of the equipment. But hardware is a factor to consider. The CPU depends on the memory, the flash, or whatever NV memory it is using working well enough. If things get marginal, then stuff like system faults can occur.

The other point to consider is that you have a certain number of these cards, likely identical, which do not have this problem. And I don't understand why you feel that hardware is ruled out as a possible fault point; this looks exactly like hardware.

If it's critical, put a logic analyzer or an emulator on the MIPS and keep saving the most recent trace info until it fails. Or find a way to determine whether the CPU is still operating. No operating system can take any action if the CPU is halted.
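
The "keep only the most recent trace info" idea can also be sketched in software as a fixed-size ring buffer. This is only an illustration in C, not something from the MontaVista kernel: all names and sizes are hypothetical, and where such a buffer would have to live so that its contents survive the restart by the central card is an open question this sketch does not answer.

```c
#include <stdio.h>
#include <string.h>

#define TRACE_SLOTS   8    /* hypothetical number of events to keep */
#define TRACE_MSG_LEN 32   /* hypothetical maximum message length */

struct trace_entry {
    unsigned long seq;             /* event number, counts up from 0 */
    char msg[TRACE_MSG_LEN];
};

struct trace_ring {
    struct trace_entry slot[TRACE_SLOTS];
    unsigned long next_seq;        /* sequence number of the next event */
};

void trace_init(struct trace_ring *r)
{
    memset(r, 0, sizeof *r);
}

/* Record one event, overwriting the oldest one when the ring is full. */
void trace_add(struct trace_ring *r, const char *msg)
{
    struct trace_entry *e = &r->slot[r->next_seq % TRACE_SLOTS];

    e->seq = r->next_seq++;
    strncpy(e->msg, msg, TRACE_MSG_LEN - 1);
    e->msg[TRACE_MSG_LEN - 1] = '\0';
}

/* Copy up to max of the most recent entries into out, oldest first.
 * Returns the number of entries copied. */
size_t trace_snapshot(const struct trace_ring *r,
                      struct trace_entry *out, size_t max)
{
    unsigned long have = r->next_seq < TRACE_SLOTS ? r->next_seq : TRACE_SLOTS;
    unsigned long first, i;

    if (have > max)
        have = max;
    first = r->next_seq - have;
    for (i = 0; i < have; i++)
        out[i] = r->slot[(first + i) % TRACE_SLOTS];
    return have;
}
```

Each `trace_add()` is a couple of stores and a short copy, so it adds very little load; `trace_snapshot()` would only be called when the history is dumped.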
 
06-15-2015, 08:26 AM   #3
hiho888
LQ Newbie
 
Registered: Jun 2015
Posts: 2

Original Poster
Rep: Reputation: Disabled
Thanks for the fast response!
I do see dumps on the serial console before the lock-up, but these relate to normal processing only (e.g. periodic keep-alives, processing of messages from the central card or from the CPE side), and there is no recurring pattern. As mentioned before, there are also cases with nearly no external traffic at all.

The reason for not blaming the hardware in the first place is that we retrieved an affected system from the customer, brought it to our lab, and ran the customer's traffic scenario there for several weeks without seeing the lock-up. We also checked temperature, humidity, and supply voltage at the customer site (radiated interference has not been checked yet).

There was an idea to set up a HW watchdog on the line card, but there is no suitable additional component on the card (e.g. a CPLD). For the latest debug load I added a keep-alive in the device driver area which prints the uptime to the serial console every 10 sec; I am still waiting for the results.
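
Once such keep-alive lines are captured, the console log can be evaluated off the card. The following C sketch assumes a hypothetical line format "HB uptime=<seconds>" (the real format of the debug print is not given here) and treats every drop in the uptime value as a restart, so the value just before the drop is the last sign of life, accurate to one keep-alive interval. If the keep-alive stops together with all other output, that is consistent with the CPU or kernel having stopped, but it does not prove a CPU halt.

```c
#include <stdio.h>
#include <string.h>

/* Extract the uptime from a console line containing "HB uptime=<seconds>".
 * The line format is an assumption made for this sketch.
 * Returns 1 and stores the value in *up on success, 0 otherwise. */
int parse_heartbeat(const char *line, unsigned long *up)
{
    const char *p = strstr(line, "HB uptime=");

    return p != NULL && sscanf(p, "HB uptime=%lu", up) == 1;
}

/* Scan keep-alive uptimes (seconds) in log order. Whenever the value drops,
 * the card was restarted; the previous value is the last sign of life.
 * Stores up to max such values in last_alive[] and returns how many
 * restarts were found (capped at max). */
size_t find_restarts(const unsigned long *uptime, size_t n,
                     unsigned long *last_alive, size_t max)
{
    size_t i, found = 0;

    for (i = 1; i < n && found < max; i++) {
        if (uptime[i] < uptime[i - 1])
            last_alive[found++] = uptime[i - 1];
    }
    return found;
}
```

Feeding every captured console line through `parse_heartbeat()` and the resulting values through `find_restarts()` gives one "ran for N seconds before going silent" figure per freeze, which can then be compared across cards.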

Do you have any idea how to verify a CPU-halt without additional hardware?
 
06-15-2015, 11:54 AM   #4
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Quote:
Originally Posted by hiho888
Do you have any idea how to verify a CPU-halt without additional hardware?
No, besides a debug line saying "CPU will now halt."
 
  


All times are GMT -5.
