LinuxQuestions.org


icav 04-10-2019 11:25 AM

System halt after 25 days
 
Hi
I'm facing a strange system failure.
About 25 days after boot (more precisely 24 days, 20 hours and about 30 minutes), the system halts.
Wall-clock time is unrelated; only the time since boot seems to matter.

The nearest significant value to this uptime is (2^31-1) milliseconds, but the system doesn't halt exactly when CLOCK_MONOTONIC reaches 2147483 seconds: it runs for a handful of minutes more (~10), then it stops. Until then, the system runs smoothly.
It seems that some kernel activity, scheduled for later processing, doesn't properly handle the wrap of this counter, and it crashes.
I suspect something related to disk cache flushing.
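
The arithmetic behind the (2^31-1) millisecond observation can be checked with a small helper. This is only a sketch; the names `uptime_parts` and `split_uptime_ms` are made up for illustration and do not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Uptime broken into human-readable fields. */
struct uptime_parts {
    int64_t days;
    int hours;
    int minutes;
    int seconds;
    int millis;
};

/* Split a non-negative millisecond count into days/hours/minutes/seconds. */
struct uptime_parts split_uptime_ms(int64_t ms)
{
    struct uptime_parts p;

    assert(ms >= 0);
    p.millis = (int)(ms % 1000);
    ms /= 1000;                 /* now whole seconds */
    p.seconds = (int)(ms % 60);
    ms /= 60;                   /* now whole minutes */
    p.minutes = (int)(ms % 60);
    ms /= 60;                   /* now whole hours */
    p.hours = (int)(ms % 24);
    p.days = ms / 24;
    return p;
}
```

Calling `split_uptime_ms(INT32_MAX)` yields 24 days, 20 hours, 31 minutes, 23 seconds and 647 ms, which matches the reported uptime. The extra ~10 minutes before the halt are not explained by this arithmetic alone.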

I searched the kernel tree for anything related to this issue, but found nothing. All the time-related functions use struct timespec/timeval or int64, and I found no reference to milliseconds.
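
For illustration, this is the class of bug suspected above: a deadline kept in a signed 32-bit millisecond counter and compared naively. It is a user-space sketch with hypothetical function names, not code from the kernel tree, and nothing in the thread confirms that such a counter exists on this system. The wrap-safe variant uses the same signed-difference idea as the kernel's `time_after()` macros in include/linux/jiffies.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive check: gives wrong answers once a value crosses INT32_MAX
 * and becomes negative. */
bool deadline_passed_naive(int32_t now_ms, int32_t deadline_ms)
{
    return now_ms >= deadline_ms;
}

/* Wrap-safe check: subtract as unsigned (well-defined modulo 2^32),
 * then read the difference as signed. Valid while now and deadline
 * are less than 2^31 ms apart. The unsigned-to-signed conversion is
 * implementation-defined in C; this assumes the usual two's-complement
 * behaviour. */
bool deadline_passed_wrapsafe(uint32_t now_ms, uint32_t deadline_ms)
{
    return (int32_t)(now_ms - deadline_ms) >= 0;
}
```

With `now` set 5 seconds before the 2^31 ms boundary and a deadline 10 seconds ahead, the naive check reports the deadline as already passed, while the wrap-safe check does not until the 10 seconds have really elapsed.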

Does anyone have a suggestion?
Thanks in advance

----
Linux kernel 2.6.26.8-3
CPU MIPS 4KSd V2.4
System busybox + libuClibc-0.9.30.so
Storage jffs2 / mtd

smallpond 04-10-2019 04:27 PM

What's on the console when it halts? Is there a stack trace? What does "last" say was the reason? Do you have a hardware watchdog timer enabled in the BIOS?

frankbell 04-10-2019 05:14 PM

It's a long shot, but is there anything in the logs?

icav 04-11-2019 01:48 AM

Logs
 
Hi,
Unfortunately the console is not usable because the machine is at a remote site; the only access is via ssh.
After reboot, the previous logs are lost, because they are in tmpfs.

During lab tests, with console access, we never hit this issue.

I tried, unsuccessfully, to reproduce the problem in two ways:

- "Accelerate" time:
jiffies += SOME_LARGE_VALUE in do_timer(),
but it doesn't work: Linux doesn't run at all (there is a document
by Kobayashi/Toshiba about this, which I discovered only *afterwards*)

- "Start" the timer close to the 25-day expiry:
u64 jiffies_64 ... = INITIAL_JIFFIES + 2000000L;
but the system runs flawlessly beyond the critical point
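
One thing worth checking in the second attempt is the unit of the offset: jiffies count timer ticks, so how much uptime 2000000 of them cover depends on the kernel's HZ. The thread does not state HZ for this build, so the values used below (100 and 1000) are assumptions, and the helper names are hypothetical. A minimal sketch of the conversion:

```c
#include <assert.h>
#include <stdint.h>

/* Number of jiffies that elapse in `seconds` of uptime at tick rate `hz`. */
uint64_t jiffies_for_uptime(uint64_t seconds, unsigned int hz)
{
    assert(hz > 0);
    return seconds * (uint64_t)hz;
}

/* Whole seconds of uptime represented by `ticks` jiffies at tick rate `hz`. */
uint64_t uptime_for_jiffies(uint64_t ticks, unsigned int hz)
{
    assert(hz > 0);
    return ticks / hz;
}
```

Under these assumptions, `uptime_for_jiffies(2000000, 100)` is 20000 seconds (about 5.6 hours) and `uptime_for_jiffies(2000000, 1000)` is 2000 seconds, while the observed failure point of 2147483 seconds corresponds to `jiffies_for_uptime(2147483, 100)` = 214748300 or `jiffies_for_uptime(2147483, 1000)` = 2147483000 jiffies. If the offset was meant in jiffies, the test may not have started anywhere near the 25-day mark.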

smallpond 04-11-2019 12:29 PM

Can you check for events at the time of the last crash:

Code:

ipmitool sel list

syg00 04-11-2019 06:05 PM

UPS?

ondoho 04-12-2019 01:36 AM

Quote:

Originally Posted by icav (Post 5983547)
After reboot, the previous logs are lost, because they are in tmpfs.

Is that configurable?
Can you write the logs to a different location, NOT tmpfs?

pan64 04-12-2019 02:00 AM

I don't think anyone can solve it without additional information. So, as suggested in post #7, save the logs (and come back after 25 days).
It could even be something as simple as your tmpfs filling up, but we can only guess...

icav 04-15-2019 04:26 AM

Well
clearly this is not a "known" issue.
We are checking whether it is feasible to connect a remote machine to the console, and hopefully ... But 25 days is a long time :(
Thanks

dc.901 04-15-2019 05:42 AM

Quote:

Originally Posted by icav (Post 5985160)
Well
clearly this is not a "known" issue.
We are checking whether it is feasible to connect a remote machine to the console, and hopefully ... But 25 days is a long time :(
Thanks

I assume you have already checked cron?
And, as mentioned by others, is there anything in the hardware logs:
Code:

ipmitool sel elist
ipmitool sensor

Since you know this happens after 25 days, perhaps you should set up a script to capture some information from the system:
- set up a syslog server and send syslogs to it.
- in a loop, write dmesg output to a file; do the same with other info such as vmstat, iostat, etc.
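
Since the logs on this system live in tmpfs, whatever is captured has to reach persistent storage before the halt. Below is a rough sketch of one way to do that in C on a small busybox system: append a snapshot to a file and fsync it. The function names are hypothetical, /proc/uptime is used only as a stand-in for dmesg/vmstat output, and the log path must point somewhere persistent (for example a file on the flash partition, which is an assumption about this setup). Frequent syncing to flash may cause wear, so the interval should be chosen with care.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append `text` to `path` and force it to storage so that it survives
 * a sudden halt. Returns 0 on success, -1 on error. */
int append_durable(const char *path, const char *text)
{
    size_t len, done = 0;
    int fd;

    assert(path != NULL && text != NULL);
    len = strlen(text);
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    while (done < len) {
        ssize_t n = write(fd, text + done, len - done);
        if (n <= 0) {           /* treat short/failed writes as errors */
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    if (fsync(fd) < 0) {        /* flush data to the backing device */
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Append the current contents of /proc/uptime to the log at `path`. */
int log_uptime(const char *path)
{
    char buf[128];
    ssize_t n;
    int fd = open("/proc/uptime", O_RDONLY);

    if (fd < 0)
        return -1;
    n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    return append_durable(path, buf);
}
```

A small daemon could call `log_uptime()` with a persistent path every minute or so; after the next halt, the last line written shows how far the uptime got.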

Also, which OS? On some systems I have seen boot.olog and boot.log; not sure if you have looked at those?


All times are GMT -5.