LinuxQuestions.org


zombieno7 08-12-2016 05:14 PM

Random Restarts With No Error Messages
 
Over the past couple of weeks, my Gentoo system has been restarting randomly. There are no error messages displayed, and the restarts can happen days apart. I haven't noticed any cause or similar circumstances surrounding the restarts. I checked reboot, shutdown, and error logs, and the reboots are not logged. After the most recent reboot, I checked dmesg, and I saw that the last thing logged was an error saying that a kernel kworker stopped after reaching the stack limit. I'm not entirely sure if that's related. So, I have a couple of ideas, and I just want some help narrowing it down.

1. The Kernel - I'm running a custom-built 4.7 kernel, and there are those kworker messages.

2. The RAM - It's a common cause, but I ran memtest86+ for several hours with no errors.

3. Somewhat old SSD - My /home folder is on an SSD that's over a year old and has some bad sectors.

4. New HDD - I just cloned a dying 2TB HDD onto a new one, and I think there might have been some corruption because of the problems with the old drive.

That's all I can really think of, so if anyone has any ideas, it would be greatly appreciated. Thank you.

Emerson 08-12-2016 05:20 PM

RAM. Memtest can run for days and the RAM can still be bad. The only conclusive result from memtest is when it reports errors; then the RAM really is bad.
Overheating (dust) and an out-of-spec PSU are other common causes.

zombieno7 08-12-2016 06:00 PM

It's definitely not overheating. It's water cooled. The PSU is new but refurbished. The wattage is good, though. I just don't understand why the RAM would randomly become a problem, especially since it's less than a year old. Could it be a bad kernel? I would try to test it, but I can't figure out what's triggering the resets.

hydrurga 08-13-2016 06:48 AM

Have you fsck'd all your filesystems (including checking your disks for badblocks), and checked the disks' SMART status?
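
For the SMART part of that question, here is a minimal sketch of what to look for, assuming smartmontools is installed and that `smartctl -A /dev/sdX` prints its usual ten-column attribute table. The shortlist of attributes and the helper names (`parse_smart_attributes`, `suspicious`, `read_smart`) are illustrative assumptions, not anything taken from this machine.

```python
import subprocess

# Assumed shortlist of attributes that commonly point at failing media.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for watched rows of `smartctl -A` output."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                pass  # skip rows whose raw value is not a plain integer
    return found

def suspicious(attrs):
    """Names of watched attributes whose raw count is above zero."""
    return sorted(name for name, raw in attrs.items() if raw > 0)

def read_smart(device):
    """Run smartctl on a device (usually needs root) and parse the table."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return parse_smart_attributes(result.stdout)
```

Something like `suspicious(read_smart("/dev/sda"))` would then list the attributes worth a closer look. A non-zero count is a hint rather than proof that the drive is failing.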

Shadow_7 08-13-2016 07:46 AM

Ram would have been my first guess, but you covered that.

Power would be my 2nd guess. If it's not getting enough power it could cycle.

Heat, but that's not much of a mystery if you're in the same room as the device. But there could be fans that are not working or clogged heat sinks that cancel out any would-be air movement.

Beyond that, software. Some sort of resource overuse, like running out of RAM with no swap to ease the burden. If you have swap, you might try moving it to another device. It tends to wear out drives, so if you're already having issues, putting it on something fresh, even an SDHC card in a reader, could be the answer. You might also have swap but have swappiness set to 0, so it behaves like it doesn't have swap.
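
A quick way to see where swap stands is to read `/proc/meminfo` and `/proc/sys/vm/swappiness`. The sketch below assumes a Linux procfs at the usual paths; the function names are made up for illustration.

```python
from pathlib import Path

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer} (most fields are in kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def swap_summary(meminfo_text, swappiness_text):
    """Summarise swap size, swap in use, and the current swappiness value."""
    info = parse_meminfo(meminfo_text)
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    return {
        "swap_total_kb": total,
        "swap_used_kb": total - free,
        "swappiness": int(swappiness_text.strip()),
    }

def read_live():
    """Read the real values from procfs (Linux only)."""
    return swap_summary(Path("/proc/meminfo").read_text(),
                        Path("/proc/sys/vm/swappiness").read_text())
```

A `swap_total_kb` of 0 means there is no swap at all, which is the first case described above.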

But generally, if there are no messages, logs and such, it's a hardware issue. That's not to say it can't be triggered by software, like running a 64-bit OS on a 32-bit machine, or a kernel compiled with extended CPU features and run on a CPU without those features. But "random" almost always means hardware, which may not be "your" hardware if the power blinks or the A/C stops working and such.
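
The CPU-feature case can be checked roughly by comparing the `flags` line in `/proc/cpuinfo` with whatever the kernel was built to expect. This is only a sketch for x86, where that line exists and `lm` marks 64-bit long mode; the list of required flags is a hypothetical example, not this kernel's actual configuration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def missing_flags(cpuinfo_text, required):
    """Flags the build expects (hypothetical list) that the CPU does not report."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))
```

For example, `missing_flags(open("/proc/cpuinfo").read(), ["lm", "sse4_2"])` returns an empty list when the CPU reports both features.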

273 08-13-2016 08:04 AM

Perhaps run another kernel or, even better, a live distribution for a while and see whether it happens? I've had USB fans cause a reboot before, and am of a mind to think, as above, that PSU or even mains problems may be the cause (I've seen lots of 2-second power cuts too).

zombieno7 08-13-2016 12:11 PM

Okay, so for an update: I ran SMART short tests on both drives, and they passed. I also ran memtest86+ for 6+ hours through 4 cycles with no errors.

It's not a heat issue with the CPU because it is water cooled, and I have a monitor on the desktop through lm_sensors. It stays in an acceptable range at all times. Could it be the motherboard overheating independently? It just doesn't seem like overheating because it doesn't necessarily happen during peak loads. There is no dust problem either. I keep the machine clean.
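
A desktop monitor only shows the current reading, so a logger that writes every sample to disk can show what the sensors said just before a reset. A minimal sketch, assuming the kernel exposes the sensors under `/sys/class/hwmon` as `temp*_input` files in millidegrees Celsius; the log file name and interval are arbitrary choices.

```python
import os
import time
from pathlib import Path

def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_C} for every readable temp*_input file."""
    temps = {}
    for sensor in sorted(Path(base).glob("hwmon*/temp*_input")):
        try:
            chip = (sensor.parent / "name").read_text().strip()
        except OSError:
            chip = sensor.parent.name  # fall back to hwmonN if there is no name file
        try:
            millideg = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor present but not readable right now
        temps[f"{chip}/{sensor.name}"] = millideg / 1000.0
    return temps

def log_forever(logfile="temps.log", interval=5):
    """Append one line per interval and fsync it so it survives a sudden reset."""
    with open(logfile, "a") as fh:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            readings = " ".join(f"{k}={v:.1f}" for k, v in read_hwmon_temps().items())
            fh.write(f"{stamp} {readings}\n")
            fh.flush()
            os.fsync(fh.fileno())
            time.sleep(interval)
```

After the next restart, the last few lines of the log show whether any sensor, including motherboard ones if the driver exposes them, was climbing beforehand.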

The PSU is a Corsair AX 860i, so I seriously doubt that it isn't getting enough power.

Should I run longer SMART tests on the drives? Are there other tests that I can run? The restarts happen so infrequently that it's very hard to test. The computer can run fine for days without it happening.

Emerson 08-13-2016 12:25 PM

Yes, motherboard components can overheat, the northbridge for instance. Memtest ... you say your computer may run for two days before it reboots ... makes you think memtest needs to run for two days, too? Anyhow, if it reboots then you won't get any errors from memtest, obviously. The PSU can be out of spec: the voltages may fluctuate out of the allowed range or be out of range permanently. Use a real voltmeter to measure. Re-seating all components (memory modules, PCI cards) won't hurt, either.
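
For the voltmeter readings, the tolerance usually quoted from the ATX12V design guide is plus or minus 5% on the +3.3 V, +5 V and +12 V rails. Taking that figure as an assumption, the allowed windows work out to roughly 3.135 to 3.465 V, 4.75 to 5.25 V and 11.40 to 12.60 V. A small sketch of the check:

```python
# Nominal voltage and tolerance per rail (assumed: +/-5% on the positive rails).
RAILS = {"+3.3V": (3.3, 0.05), "+5V": (5.0, 0.05), "+12V": (12.0, 0.05)}

def allowed_range(rail):
    """Return (low, high) limits in volts for a rail."""
    nominal, tol = RAILS[rail]
    return nominal * (1 - tol), nominal * (1 + tol)

def in_spec(rail, measured):
    """True if a measured voltage falls inside the allowed window."""
    low, high = allowed_range(rail)
    return low <= measured <= high
```

A single in-range reading does not rule out brief dips under load, so it helps to measure while the machine is busy as well as idle.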

273 08-13-2016 12:44 PM

The PSU or motherboard may have one dry joint. Just start troubleshooting...

hydrurga 08-13-2016 12:51 PM

Do you have your system set up to auto reboot after a kernel panic?
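
The setting in question is the `kernel.panic` sysctl, readable at `/proc/sys/kernel/panic`: zero means the kernel stays halted after a panic, a positive value is the number of seconds before it reboots, and a negative value means it reboots immediately. A minimal sketch for reading it; the function names are illustrative.

```python
from pathlib import Path

def describe_panic_setting(raw):
    """Interpret the contents of /proc/sys/kernel/panic."""
    seconds = int(raw.strip())
    if seconds == 0:
        return "no automatic reboot after a panic"
    if seconds < 0:
        return "reboot immediately after a panic"
    return f"reboot {seconds} s after a panic"

def read_live():
    """Read the running kernel's setting (Linux only)."""
    return describe_panic_setting(Path("/proc/sys/kernel/panic").read_text())
```

If this reports no automatic reboot and the machine still restarts without a trace, a silent panic followed by an auto-reboot becomes a less likely explanation.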

zombieno7 08-13-2016 01:02 PM

I would love to be able to leave memtest running for two days, but I just don't have that kind of time. This is my only work computer. I might be able to let it go for a long time tonight, but not two days. The PSU is new. It's a Corsair AX860i. I seriously doubt that it's the problem, especially since this didn't start until weeks after it was installed. I could see getting a bad one out of the box, but not having it go bad in a couple of weeks. There are no logs of kernel panics, and the kernel is not set to reboot on a panic.

273 08-13-2016 01:13 PM

Then a live distro.

zombieno7 08-13-2016 01:17 PM

Live distro? Why? Is it at all possible that this is a hard drive problem? I'm running a long SMART test now, and when that finishes, I'll run fsck. They're the oldest parts of the system and dmesg did report bad partitions (not sure how accurate that is).

Emerson 08-13-2016 01:20 PM

HD failure is not likely to reboot the box; you would get some sort of hang/crash. A refurbished PSU can go bad at any time. I wouldn't rule it out, and I'd be all over it with a voltmeter.

273 08-13-2016 01:25 PM

Quote:

Originally Posted by zombieno7 (Post 5590353)
Live distro? Why? Is it at all possible that this is a hard drive problem? I'm running a long SMART test now, and when that finishes, I'll run fsck. They're the oldest parts of the system and dmesg did report bad partitions (not sure how accurate that is).

If a live distro doesn't crash, it may rule that in.
Please think.
Edit: and it rules out the kernel etc.


All times are GMT -5.