Linux - Hardware
This forum is for Hardware issues.
Having trouble installing a piece of hardware? Want to know if that peripheral is compatible with Linux?
My laptop reboots, apparently at random. It seems to me to be due to hard drive overheating. It runs Debian 10 stable. Interestingly, it does not happen with Windows 8 on exactly the same hardware (multi-boot with GRUB).
The easiest way to force the shutdown and reboot is with **dd**, **rsync** or **cp**. The next easiest way is video playback, either Web streaming or from the hard drive. More importantly, any kind of compilation is likely to cause a shutdown too.
Two different hard-drives were tested.
It does not matter if **xorg** is running. Rescue mode is irrelevant.
**smartctl**, **badblocks** and **memtest86** reveal nothing.
Limiting memory at boot with GRUB "mem=1024M" prolongs the working time slightly from 15 minutes to maybe 30.
All partitions are mounted with "noatime" option.
The fan is cleaned, the panel is removed, the battery is removed, and the laptop is elevated by two centimeters. The elevation helps slightly.
**laptop_mode(8)** may be able to help. Once, with **laptop_mode** enabled, the write speed of **dd** dropped to 1/4, from 80 to 20 MB/s, which actually let the process complete and write the image I needed. Since then it has not appeared to help. That slowdown would be acceptable if it could be reproduced.
What can be done to reduce the hard drive overheating and prolong its "life"? Is it possible to forcibly reduce the write speed to hopefully reduce the heat? Are there more relevant **fstab** options?
**thermald** and **laptop-mode-tools** are installed. I also tried **tlp**, but it is incompatible with **laptop-mode-tools**, or so they say.
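On the question of forcing a lower write speed: one generic approach is to pace the copy in user space. The sketch below is only an illustration, not something tried in this thread; `throttled_copy` and its parameters are hypothetical names. It copies a file in chunks, syncs each chunk to disk, and sleeps whenever it is ahead of the target rate. For **rsync** specifically, the existing `--bwlimit` option serves the same purpose.

```python
import os
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=1024 * 1024):
    """Copy src to dst without exceeding max_bytes_per_sec on average.

    Hypothetical helper for illustration; returns the number of bytes copied.
    """
    copied = 0
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            # Push the chunk to the disk now, so the pacing applies to the
            # drive itself and not only to the page cache.
            fout.flush()
            os.fsync(fout.fileno())
            copied += len(chunk)
            # How long this many bytes should have taken at the target rate.
            expected = copied / max_bytes_per_sec
            elapsed = time.monotonic() - start
            if expected > elapsed:
                time.sleep(expected - elapsed)
    return copied
```

For example, `throttled_copy("image.iso", "/mnt/target/image.iso", 20 * 1000 * 1000)` would aim for roughly the 20 MB/s at which the **dd** run completed; both paths here are placeholders.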
Yes, I did clean the interior. I also removed the panel and elevated the laptop to help air circulation, as mentioned. Not enough to prevent hard drive from overheating. I read about how thermal paste might need replacement, but again not sure how it is applicable to the hard drive or how to even determine if it is necessary.
Isn't **thermald** for CPU overheating and not harddrive overheating?
+++ In exactly ten minutes after booting this laptop, during which I ran only **firefox** under **i3-wm** to access this forum and some **apt** installation commands, the hard drive went from 24C to 40C. Its heatsink(??) is hot to the touch.
This just doesn't make any sense - I've never heard of a hard disk thermal event in a consumer laptop. Especially at such low temps. What are your BIOS limits set at - that would probably be for CPU temps, not disk.
Are there any logs that are relevant for the time of interest?
I am not sure if I have the option to set or get thermal limits, not in BIOS menu. When I try to follow a tutorial on how to do that I find no options in the menu that the tutorial suggests are required.
I did suffer the usual CPU overheating due to clogged fan three years ago, and the temperatures for the CPU were up to 78C.
Note, this laptop is around a decade old.
I did view **syslog(3)** and the **xorg** log when the problems started. The **xorg** log ended up being irrelevant. I could not make sense of **syslog**. There were certainly no obvious error messages or warnings. The only thing of note was that the end of the file was filled with null characters, predictably.
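When a log ends in a run of null characters like this, the last readable lines before the padding are usually the ones closest to the shutdown. A minimal sketch for pulling them out follows; the function name is hypothetical, and it assumes the log is small enough to read into memory and that the padding is only at the end of the file.

```python
def last_lines_before_nuls(path, count=50):
    """Return the last `count` text lines that precede trailing NUL padding."""
    with open(path, "rb") as f:
        data = f.read()
    # Drop only the NUL bytes at the very end of the file.
    data = data.rstrip(b"\x00")
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines[-count:]
```

Calling it on a copy of the affected log, for example `last_lines_before_nuls("/var/log/syslog.1", 30)`, prints nothing by itself; loop over the returned list to display the lines. The path is only an example.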
I can try to reproduce shutdown with **dd** or torrent download if needed, and then share the relevant logs, if you tell me which ones are relevant.
Yes, I did try smartctl, memtest86+ and badblocks. I also tried to disable the GPU or limit available RAM.
No, the hard drive isn't dying, because 1) there are two hard drives that behave the same on this machine; 2) it works fine under Windows on this machine.
I mentioned this in the very first message.
How do I check if there is excessive I/O?
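One way to check is to sample `/proc/diskstats` twice and look at how many bytes each device read and wrote in between. The sketch below is an illustration under that assumption; the function names are hypothetical. It relies on the kernel's documented field layout, where the sixth and tenth fields are sectors read and sectors written, counted in 512-byte units.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def parse_diskstats(text):
    """Return {device: (bytes_read, bytes_written)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # not a complete stats line
        name = fields[2]
        sectors_read = int(fields[5])
        sectors_written = int(fields[9])
        stats[name] = (sectors_read * SECTOR_BYTES, sectors_written * SECTOR_BYTES)
    return stats


def throughput(before, after, seconds):
    """Per-device (read, write) rates in bytes/s between two snapshots."""
    rates = {}
    for name, (r1, w1) in after.items():
        if name in before:
            r0, w0 = before[name]
            rates[name] = ((r1 - r0) / seconds, (w1 - w0) / seconds)
    return rates


def sample(interval=5.0, path="/proc/diskstats"):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        first = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_diskstats(f.read())
    return throughput(first, second, interval)
```

Running `sample(5)` while the machine is supposedly idle and again during a **dd** or torrent run gives two sets of numbers to compare; sustained writes while idle would point at something generating I/O in the background.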
+++ This hard drive I wiped completely, then installed Debian 10 stable using whole disk. Therefore I expect no filesystem errors. I might double check with **fsck**, I am guessing I will have to use live media to check entire device. I also have "noatime" property set in **fstab**, again, mentioned previously.
I started torrenting. **top** showed nothing suspicious. Maybe I read it wrong. I also started **dstat**. Eventually the machine shut down as expected, at hddtemp 42C.
If I could just limit hard drive access speed I expect it to get better. I don't know how unfortunately.
I find it strange that you are saying the HDD is causing the shutdown at 42C. Have you monitored the CPU, GPU, and HDD temperatures simultaneously? 42C is only 107F so that is barely warm to the touch. My laptop with an i7-9750H 6 core 12 thread proc running at 98+% load continuously on all cores only gets to 29 on the drive, 49 GPU and 65 CPU. None of those are critical that would force a shutdown.
You said that your laptop is about a decade old. Have you replaced any of the fans internally? All it would take is for one of the fans to stop or slow down, and the CPU or GPU temperature would spike rapidly, triggering a shutdown, and you might not even see that when monitoring. The fan could even be physically temp sensitive. Laptops are designed in BIOS to increase fan speed as temps climb and to also throttle CPU when temps near critical. Does that happen?
You also said that once the speed dropped to 1/4 the previous, which could indicate possible cpu throttling as a result of temp.
Last edited by computersavvy; 11-01-2020 at 09:34 AM.
> Have you monitored the CPU, GPU, and HDD temperatures simultaneously?
Yes, of course the temperatures from lm-sensors and hddtemp are taken almost simultaneously. I used to have scripts that would write log files for those and after shutdown happens I retrieve them. When I realised that CPU and GPU temperatures are always low and only hddtemp is slightly high I focused on that.
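A logging script of this kind can lose its final entries if the machine powers off before the data reaches the disk, so it helps to sync every line. The sketch below is an assumption-laden illustration, not the script actually used here: it reads the kernel's thermal zones from `/sys/class/thermal/thermal_zone*/temp` (values in millidegrees Celsius) and appends each reading with `os.fsync`. The function names and the log path are hypothetical, and the drive temperature would still have to come from **hddtemp** or **smartctl** separately.

```python
import glob
import os
import time

DEFAULT_PATTERN = "/sys/class/thermal/thermal_zone*/temp"


def read_temps(pattern=DEFAULT_PATTERN):
    """Return {path: degrees_C} for every readable file matching pattern."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # sysfs thermal zones report millidegrees Celsius.
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable or returned garbage; skip it
    return temps


def append_durable(log_path, line):
    """Append one line and force it to disk before returning."""
    with open(log_path, "a") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())


def log_once(log_path, pattern=DEFAULT_PATTERN):
    """Write one timestamped line containing all current readings."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    readings = " ".join(f"{p}={t:.1f}" for p, t in read_temps(pattern).items())
    append_durable(log_path, f"{stamp} {readings}")
```

Calling `log_once("/var/tmp/temps.log")` once per second in a loop should leave the last reading before a shutdown intact in the file, which makes a brief CPU or GPU spike harder to miss.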
> If you believe that the temperature reading and shutdown are erroneous
The shutdown may be erroneous, but hddtemp does not control it. I installed it about a month after the shutdowns first started to happen. I suspect that the shutdown is caused by the hardware drivers themselves.
> You also said that once the speed dropped to 1/4 the previous, which could indicate possible cpu throttling as a result of temp.
Yes, that happened once that I noticed, with "dd status=progress". At full speed it would shut down, but when it ran at 1/4 of the previous speed (I don't know how or why it determined that this was necessary), it completed normally as required.
> Have you replaced any of the fans internally?
I cleaned the fans many times over the years, which required unscrewing them and then putting them back, but I did not replace them.
> The fan could even be physically temp sensitive. Laptops are designed in BIOS to increase fan speed as temps climb and to also throttle CPU when temps near critical. Does that happen?
I am not sure. The fan seems to start and stop as it pleases; I do not hear it running continuously. The **fancontrol** documentation warns to use it with caution, so I postponed tinkering with it.
From the feeling of it, the CPU works at full speed almost always, except in that one case I described. If I could configure it to be always limited, and the shutdowns stopped, I would accept that as a solution. However, limiting memory with the kernel parameter mem=1024M did not help; it barely prolonged the time before shutdown.
Also, I ran smartctl a fourth and fifth time, and this time it found some errors. I ran e2fsck on the drive afterwards. I am going to try smartctl again. This is another hard drive; on the previous one smartctl completed without error. __Both__ drives suffer shutdowns equally.
+++ I made sure that CPU throttling and aggressive HDD power management are enabled with laptop_mode. It reduced hddtemp from 40C to 38C but shutdowns still happen (after a minute of video streaming in this case).
+++ **pwmconfig(8)**, the **fancontrol(8)** configuration tool, refused to start and gave me:
Quote:
/usr/sbin/pwmconfig: There are no pwm-capable sensor modules installed
I assume it means that my hardware does not support these tools?
That is an average of more than 2875 read errors per hour of operation with more than 14,630 seek errors per hour of operation.
It shows well over 1069 days (25672 hours) of operation.
I suspect the failure is not specifically temperature related but caused by physical drive failure.
The 20 errors displayed show as occurring since the drive reached 509 hours of operation and repeatedly since (617 total) to the current 25672 hours. With only 20 errors of 617 displayed there is no way to know when it actually started, but you have gotten more than average use out of that drive. The test totally failed at 90% read.
I strongly suggest that drive be replaced.
Please run the long test on the other drive and post the results as well so we can compare performance and condition.
Last edited by computersavvy; 11-02-2020 at 11:24 AM.
Well, that's bad. But reboots happened with the other drive as well, and it didn't show any errors with **smartctl** after I ran it at least two times.
I am going to try to put that healthy drive back I guess.
The drive may work fine with Windows because the partition is in a different area of the drive, and the failing portion of the disk may be only in the Linux partition area.
Since the reboots occurred with both drives I would still like to see the results of smartctl on the other drive after you have run the long test on it.