Hard drive overheats with GNU/Linux specifically

vvbond · 10-30-2020, 02:20 PM

Laptop reboots apparently randomly. It seems to me to be due to hard drive overheating. It runs Debian 10 stable. Interestingly, it does not happen with Windows 8 on exactly the same hardware (multi boot with GRUB).

Easiest way to force the shutdown and reboot is with **dd**, **rsync** or **cp**. The next easiest way is with video playback, either Web streaming or from hard-drive. More importantly, any kind of compilation is likely to cause a shutdown too.

Two different hard-drives were tested.

It does not matter if **xorg** is running. Rescue mode is irrelevant.

**smartctl**, **badblocks** or **memtest86** reveal nothing.

Limiting memory at boot with GRUB "mem=1024M" prolongs the working time slightly from 15 minutes to maybe 30.

All partitions are mounted with "noatime" option.

The fan is cleaned, panel is removed, battery is removed, the laptop is elevated by two centimeters. The elevation helps slightly.

**laptop_mode(8)** may be able to help. It happened once that the write speed of **dd** dropped to 1/4, from 80 to 20 MB/s, with **laptop_mode** enabled, and that actually let the process complete and write the image I needed. Since then it has not appeared to help. That much of a slowdown is acceptable if it can be reproduced.

What can be done to reduce the hard drive overheating and prolong its "life"? Is it possible to force a lower write speed to hopefully reduce the heat? Are there more relevant **fstab** options?
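One way to force a lower write rate, offered as a sketch under assumptions rather than a tested fix for this laptop, is to copy in fixed-size pieces and pause between them. `throttled_copy` and its parameters are hypothetical names; the sketch assumes GNU **dd** and a regular file as the source.

```shell
# throttled_copy SRC DST [CHUNK_MIB] [PAUSE_S]   (hypothetical helper)
# Copies SRC to DST in CHUNK_MIB-sized pieces, flushing each piece to disk
# and sleeping PAUSE_S seconds afterwards, so the average write rate stays
# at or below roughly CHUNK_MIB / PAUSE_S MiB/s.
throttled_copy() {
    src=$1
    dst=$2
    chunk=${3:-16}
    pause=${4:-1}
    size=$(wc -c < "$src") || return 1
    chunk_bytes=$((chunk * 1024 * 1024))
    chunks=$(( (size + chunk_bytes - 1) / chunk_bytes ))
    : > "$dst" || return 1          # start from an empty destination
    i=0
    while [ "$i" -lt "$chunks" ]; do
        # skip/seek are counted in 1 MiB blocks; fsync forces the piece out
        # before the pause (conv=fsync is a GNU dd option).
        dd if="$src" of="$dst" bs=1M count="$chunk" \
           skip=$((i * chunk)) seek=$((i * chunk)) \
           conv=notrunc,fsync 2>/dev/null || return 1
        sleep "$pause"
        i=$((i + 1))
    done
}
```

For example, `throttled_copy image.iso /mnt/backup/image.iso 20 1` (both paths are placeholders) would hold the copy to about 20 MiB/s at most. **rsync** has a `--bwlimit` option and **pv** has `-L` for the same purpose.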

sensors(1):

Quote:

radeon-pci-0100
Adapter: PCI adapter
temp1: +24.0°C

acpitz-acpi-0
Adapter: ACPI interface
temp1: +30.0°C (crit = +126.0°C)
temp2: +40.0°C (crit = +103.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0: +38.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +38.0°C (high = +105.0°C, crit = +105.0°C)

hddtemp(8):

Quote:

/dev/sda: ST9500325AS: 42°C

43C is enough to cause a shutdown. The hottest it got is 48C.

The laptop is an Acer TravelMate 5720G.

lspci:

Quote:

00:00.0 Host bridge: Intel Corporation Mobile PM965/GM965/GL960 Memory Controller Hub (rev 03)
00:01.0 PCI bridge: Intel Corporation Mobile PM965/GM965/GL960 PCI Express Root Port (rev 03)
00:1a.0 USB controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 03)
00:1a.7 USB controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3 (rev 03)
00:1d.0 USB controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev f3)
00:1f.0 ISA bridge: Intel Corporation 82801HM (ICH8M) LPC Interface Controller (rev 03)
00:1f.1 IDE interface: Intel Corporation 82801HM/HEM (ICH8M/ICH8M-E) IDE Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation 82801HM/HEM (ICH8M/ICH8M-E) SATA Controller [AHCI mode] (rev 03)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 03)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV630/M76 [Mobility Radeon HD 2600]
02:00.0 Ethernet controller: Broadcom Limited NetLink BCM5787M Gigabit Ethernet PCI Express (rev 02)
04:00.0 Network controller: Intel Corporation PRO/Wireless 3945ABG [Golan] Network Connection (rev 02)
0f:06.0 CardBus bridge: Texas Instruments PCIxx12 Cardbus Controller
0f:06.1 FireWire (IEEE 1394): Texas Instruments PCIxx12 OHCI Compliant IEEE 1394 Host Controller
0f:06.2 Mass storage controller: Texas Instruments 5-in-1 Multimedia Card Reader (SD/MMC/MS/MS PRO/xD)
0f:06.3 SD Host controller: Texas Instruments PCIxx12 SDA Standard Compliant SD Host Controller

beachboy2 · 10-30-2020, 05:07 PM

vvbond,

Welcome to LQ forums.

You could try installing thermald and see whether that makes any difference:
https://wiki.debian.org/thermald

frankbell · 10-30-2020, 08:16 PM

You may have already done this, but, if not, check that all vents are clear and free of dust. I've had dust build-up sneak up on me a couple of times.

vvbond · 10-31-2020, 03:26 AM

**thermald** and **laptop-mode-tools** are installed. I also tried **tlp**, but it is incompatible with **laptop-mode-tools**, or so they say.

Yes, I did clean the interior. I also removed the panel and elevated the laptop to help air circulation, as mentioned. Not enough to prevent hard drive from overheating. I read about how thermal paste might need replacement, but again not sure how it is applicable to the hard drive or how to even determine if it is necessary.

Isn't **thermald** for CPU overheating and not harddrive overheating?

+++ In exactly ten minutes since I booted this laptop, ran **firefox** under **i3-wm** to access this forum only, and ran some **apt** installation commands, the hard drive went from 24C to 40C. Its heatsink(??) is hot to the touch.

syg00 · 10-31-2020, 04:45 AM

This just doesn't make any sense - I've never heard of a hard disk thermal event in a consumer laptop. Especially at such low temps. What are your BIOS limits set at - that would probably be for CPU temps, not disk.

Are there any logs that are relevant for the time of interest?

vvbond · 10-31-2020, 05:15 AM

I am not sure if I have the option to set or get thermal limits, not in BIOS menu. When I try to follow a tutorial on how to do that I find no options in the menu that the tutorial suggests are required.
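Even without a BIOS menu entry, the kernel usually exposes the ACPI trip points it was handed under `/sys/class/thermal`. A minimal sketch for listing them follows; `list_trip_points` is a hypothetical helper name, and which zones and trip points exist depends on the firmware.

```shell
# list_trip_points [SYSFS_ROOT]   (hypothetical helper)
# Prints "<zone> <trip type> <temperature>C" for every trip point found.
# sysfs reports trip temperatures in millidegrees Celsius.
list_trip_points() {
    root=${1:-/sys/class/thermal}
    for zone in "$root"/thermal_zone*; do
        [ -d "$zone" ] || continue
        for temp_file in "$zone"/trip_point_*_temp; do
            [ -r "$temp_file" ] || continue
            type_file=${temp_file%_temp}_type
            milli=$(cat "$temp_file")
            kind=$(cat "$type_file" 2>/dev/null || echo unknown)
            echo "${zone##*/} $kind $((milli / 1000))C"
        done
    done
}
```

Running `list_trip_points` with no argument reads the live system. The `crit` values that **sensors** prints for `acpitz` should correspond to trip points of type `critical`.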

I did suffer the usual CPU overheating due to clogged fan three years ago, and the temperatures for the CPU were up to 78C.

Note, this laptop is around a decade old.

I did view **syslog(3)** and **xorg** log when the problems started. **xorg** log ended up being irrelevant. I could not make sense of **syslog**. There were no obvious error messages or warnings for sure. The only thing of note was that the end of the file was filled with null-characters, predictably.

I can try to reproduce shutdown with **dd** or torrent download if needed, and then share the relevant logs, if you tell me which ones are relevant.
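As a starting point for the logs, a sketch that pulls temperature-related lines out of the kernel and system logs. The default paths assume Debian's usual rsyslog layout, `thermal_log_lines` is a hypothetical name, and the search pattern is only a guess at useful keywords. An abrupt power-off may leave nothing to find.

```shell
# thermal_log_lines [FILE...]   (hypothetical helper)
# Prints numbered lines mentioning thermal events from the given log files.
thermal_log_lines() {
    if [ "$#" -eq 0 ]; then
        # Assumed default locations on a Debian system running rsyslog.
        set -- /var/log/kern.log /var/log/syslog
    fi
    # -a treats the file as text even if a crash left NUL bytes at the end.
    grep -a -i -n -E 'thermal|temperature|overheat' "$@"
}
```

Reading the rotated copy from before the shutdown, for example `thermal_log_lines /var/log/kern.log.1`, may be more useful than the current file.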

vvbond · 10-31-2020, 05:28 AM

After an hour of browsing the Web I decided to try to torrent Debian DVD images. The machine shut down in five minutes.

hddtemp: 42C

sensors:
```
radeon-pci-0100
Adapter: PCI adapter
temp1: +25.0°C

acpitz-acpi-0
Adapter: ACPI interface
temp1: +34.0°C (crit = +126.0°C)
temp2: +41.0°C (crit = +103.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0: +38.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +38.0°C (high = +105.0°C, crit = +105.0°C)
```

ondoho · 10-31-2020, 06:19 AM

If it really is the hard drive, check if there's excessive I/O happening.
Or maybe the hd is simply dying, have you checked SMART (smartctl)?

vvbond · 10-31-2020, 11:46 PM

Yes, I did try smartctl, memtest86+ and badblocks. I also tried to disable GPU or limit available RAM.

No, the hard drive isn't dying, because 1) there are two hard drives that behave the same on this machine; 2) it works fine under Windows on this machine.

I mentioned this in the very first message.

How do I check if there is excessive I/O?
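One rough way to check, sketched here under assumptions: the kernel keeps per-device counters in `/sys/block/<dev>/stat`, where the seventh field is the number of 512-byte sectors written since boot. `written_bytes` and `write_rate` are hypothetical helper names.

```shell
# written_bytes DEV [SYSFS_BLOCK_ROOT]   (hypothetical helper)
# Total bytes written to DEV since boot, from the kernel's block statistics.
written_bytes() {
    stat_file=${2:-/sys/block}/$1/stat
    # Field 7 is "sectors written"; these sectors are always 512 bytes.
    awk '{ printf "%.0f\n", $7 * 512 }' "$stat_file"
}

# write_rate DEV [INTERVAL_S]   (hypothetical helper)
# Average write rate in KiB/s over INTERVAL_S seconds (default 5).
write_rate() {
    interval=${2:-5}
    before=$(written_bytes "$1") || return 1
    sleep "$interval"
    after=$(written_bytes "$1") || return 1
    echo "$(( (after - before) / 1024 / interval )) KiB/s"
}
```

`write_rate sda 10` would report the average over ten seconds. A tool such as **iotop** can then show which process is responsible.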

+++ This hard drive I wiped completely, then installed Debian 10 stable using whole disk. Therefore I expect no filesystem errors. I might double check with **fsck**, I am guessing I will have to use live media to check entire device. I also have "noatime" property set in **fstab**, again, mentioned previously.

I started torrenting. **top** showed nothing suspicious. Maybe I read it wrong. I also started **dstat**. Eventually the machine shut down as expected, at hddtemp 42C.

If I could just limit hard drive access speed I expect it to get better. I don't know how unfortunately.

Code:

dstat --disk --io > dstat.txt
tail -n32 dstat.txt

Code:

-dsk/total- --io/total-
 read  writ| read  writ
 272k  380k|8.28  3.23
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0    32k|   0  2.00
   0     0 |   0     0
   0     0 |   0     0
//...
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0    16M|   0   122
   0    13M|   0  50.0
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0    48k|   0  2.00
   0    40k|   0  9.00
   0    12k|   0  2.00
   0     0 |   0     0
   0     0 |   0     0
   0    16k|   0  2.00
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0    16k|   0  2.00
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0    16k|   0  2.00
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0
   0     0 |   0     0

https://www.linuxquestions.org/quest...1&d=1604209592

computersavvy · 11-01-2020, 09:32 AM

Quote:

Originally Posted by vvbond

Yes, I did try smartctl, memtest86+ and badblocks. I also tried to disable GPU or limit available RAM.

No, the hard drive isn't dying, because 1) there are two hard drives that behave the same on this machine; 2) it works fine under Windows on this machine.

https://www.linuxquestions.org/quest...1&d=1604209592

I find it strange that you are saying the HDD is causing the shutdown at 42C. Have you monitored the CPU, GPU, and HDD temperatures simultaneously? 42C is only 107F so that is barely warm to the touch. My laptop with an i7-9750H 6 core 12 thread proc running at 98+% load continuously on all cores only gets to 29 on the drive, 49 GPU and 65 CPU. None of those are critical that would force a shutdown.

You said that your laptop is about a decade old. Have you replaced any of the fans internally? All it would take would be one of the fans to stop or slow and the CPU or GPU would spike temp rapidly triggering a shutdown and you might not even see that when monitoring. The fan could even be physically temp sensitive. Laptops are designed in bios to increase fan speed as temps climb and to also throttle CPU when temps near critical; Does that happen?
You also said that once the speed dropped to 1/4 the previous, which could indicate possible cpu throttling as a result of temp.

EdGr · 11-01-2020, 05:26 PM

If you believe that the temperature reading and shutdown are erroneous, you can try removing the hddtemp package (apt-get remove hddtemp).

I am not sure why Debian has hddtemp enabled. It gives systemd one more thing to goof up.
Ed

vvbond · 11-02-2020, 02:43 AM

> Have you monitored the CPU, GPU, and HDD temperatures simultaneously?

Yes, of course the temperatures from lm-sensors and hddtemp are taken almost simultaneously. I used to have scripts that would write log files for those and after shutdown happens I retrieve them. When I realised that CPU and GPU temperatures are always low and only hddtemp is slightly high I focused on that.
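A logger of that kind might look like the sketch below. It is not the original script; `core0_temp` and `log_temps` are hypothetical names, **hddtemp** normally needs root, and the `sync` after each sample is there so the last lines survive an abrupt power-off.

```shell
# core0_temp   (hypothetical helper)
# Reads `sensors` output on stdin and prints the Core 0 temperature as a number.
core0_temp() {
    awk '/^Core 0:/ { gsub(/[^0-9.]/, "", $3); print $3; exit }'
}

# log_temps [DEV] [LOGFILE] [INTERVAL_S]   (hypothetical helper)
# Appends one timestamped sample per interval until interrupted.
log_temps() {
    dev=${1:-/dev/sda}
    out=${2:-temps.log}
    interval=${3:-5}
    while :; do
        hdd=$(hddtemp -n "$dev" 2>/dev/null)    # -n prints only the number
        cpu=$(sensors 2>/dev/null | core0_temp)
        echo "$(date '+%Y-%m-%dT%H:%M:%S') hdd=${hdd:-?} core0=${cpu:-?}" >> "$out"
        sync                                    # flush before the next sample
        sleep "$interval"
    done
}
```

Started as root with `log_temps /dev/sda /root/temps.log 2 &`, it leaves a file whose last line shows the readings a couple of seconds before the machine went down.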

> If you believe that the temperature reading and shutdown are erroneous

Shutdown may be erroneous, but hddtemp does not control it. I installed it about a month after shutdowns first started to happen. I suspect that the shutdown is caused by hardware drivers themselves.

> You also said that once the speed dropped to 1/4 the previous, which could indicate possible cpu throttling as a result of temp.

Yes, that happened once; I noticed it with "dd status=progress". At full speed the machine would shut down, but at 1/4 of the previous speed (I don't know how or why it decided that was necessary) the copy completed normally as required.

> Have you replaced any of the fans internally?

I cleaned the fans many times over the years, which required unscrewing them and then putting them back, but I did not replace them.

> The fan could even be physically temp sensitive. Laptops are designed in bios to increase fan speed as temps climb and to also throttle CPU when temps near critical; Does that happen?

I am not sure. Fan seems to start and stop as it pleases, I am not hearing it running continuously. **fancontrol** documentation warns to use it with caution so I postponed tinkering with it.

From the feeling of it, CPU works at full speed almost always, except that one case I described. If I could configure it to be always limited, and the shutdowns stopped, I would accept it as a solution. However, limiting memory in kernel mem=1024M did not help, it barely prolonged the duration without shutdown.

Also, I ran smartctl a fourth and fifth time, and this time it found some errors. I ran e2fsck afterwards on it. Going to try smartctl again. This is another hard drive, on the previous one smartctl completed without error. __Both__ drives suffer shutdowns equally.

https://www.linuxquestions.org/quest...1&d=1604307322

+++ I made sure that CPU throttling and aggressive HDD power management are enabled with laptop_mode. It reduced hddtemp from 40C to 38C but shutdowns still happen (after a minute of video streaming in this case).

+++ **pwmconfig(8)**, the **fancontrol(8)** configuration tool, refused to start and gave me:

Quote:

/usr/sbin/pwmconfig: There are no pwm-capable sensor modules installed

I assume it means that my hardware does not support these tools?
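A quick way to see what that message refers to, sketched under the assumption that **pwmconfig** looks for `pwm` control files exposed by hwmon drivers in sysfs. `list_pwm_controls` is a hypothetical helper name.

```shell
# list_pwm_controls [HWMON_ROOT]   (hypothetical helper)
# Prints every pwm control file a loaded hwmon driver exposes.
# Returns non-zero when none are found.
list_pwm_controls() {
    root=${1:-/sys/class/hwmon}
    status=1
    for f in "$root"/hwmon*/pwm[0-9] "$root"/hwmon*/device/pwm[0-9]; do
        [ -e "$f" ] || continue     # unmatched glob patterns are skipped
        echo "$f"
        status=0
    done
    return "$status"
}
```

No output means no loaded driver offers software fan control. That does not by itself show whether the hardware lacks it or whether a suitable driver simply is not loaded.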

computersavvy · 11-02-2020, 11:11 AM

That file shows a seriously large number of errors on the disk.

Code:

Raw_Read_Error_Rate           73876326
Hardware_ECC_Recovered        73876326
Seek_Error_Rate              375650036

That is an average of more than 2875 read errors per hour of operation with more than 14,630 seek errors per hour of operation.
It shows well over 1069 days (25672 hours) of operation.
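The arithmetic behind those figures, taking the raw values as plain counts as this post does, can be reproduced as follows. `per_unit` is a hypothetical helper name.

```shell
# per_unit COUNT UNITS   (hypothetical helper)
# Prints COUNT / UNITS rounded to one decimal place.
per_unit() {
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.1f\n", c / u }'
}

per_unit 73876326 25672     # Raw_Read_Error_Rate per power-on hour -> 2877.7
per_unit 375650036 25672    # Seek_Error_Rate per power-on hour     -> 14632.7
per_unit 25672 24           # power-on hours expressed in days      -> 1069.7
```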

I suspect the failure is not specifically temperature related but caused by physical drive failure.
The 20 errors displayed show as occurring since the drive reached 509 hours of operation and repeatedly since (617 total) to the current 25672 hours. With only 20 errors of 617 displayed there is no way to know when it actually started, but you have gotten more than average use out of that drive. The test totally failed at 90% read.

I strongly suggest that drive be replaced.

Please run the long test on the other drive and post the results as well so we can compare performance and condition.
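For reference, a sketch of the **smartctl** calls involved. The wrapper function names are hypothetical and `/dev/sdX` stands for whichever device node the drive gets; all three need root.

```shell
# Start the extended (long) self-test. It runs inside the drive in the
# background, and smartctl prints an estimated completion time.
start_long_test() {
    smartctl -t long "$1"
}

# Show the self-test log once the test has finished.
show_selftest_log() {
    smartctl -l selftest "$1"
}

# Show the vendor attribute table (Raw_Read_Error_Rate and so on).
show_attributes() {
    smartctl -A "$1"
}
```

Typical use would be `start_long_test /dev/sdX`, waiting for the estimated time to pass, then `show_selftest_log /dev/sdX` and `show_attributes /dev/sdX`.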

vvbond · 11-02-2020, 11:16 AM

Well that's bad. But reboots happened to the other drive as well and it didn't have any errors with **smartctl** after I ran it at least two times.

I am going to try to put that healthy drive back I guess.

computersavvy · 11-02-2020, 11:29 AM

Quote:

Originally Posted by vvbond

Well that's bad. But reboots happened to the other drive as well and it didn't have any errors with **smartctl** after I ran it at least two times.

I am going to try to put that healthy drive back I guess.

The drive may work fine with Windows because the partition is in a different area of the drive, and the failing portion of the disk may be only in the Linux partition area.

Since the reboots occurred with both drives I would still like to see the results of smartctl on the other drive after you have run the long test on it.