[SOLVED] My Dell D620 laptop hangs/freezes randomly on slackware64 14.2
I stated that I'm 90% sure the cause of your hang/reboot problems is acpi and/or PM (calls) related. I'm running out of ideas, and we've pretty much exhausted all the possible fixes to tame your system.
I'm also "running out" of knowledge with respect to the acpi system and your next best bet for further investigation would be to engage with the kernel devs -> open a bug report @ kernel.org. You could provide the acpi tables from your /sys/firmware/acpi/tables/ and ask them to inspect/check them.
I believe it's time to consider the other 10%. Going over this whole thread again, I realized that the only period when your system was truly stable was when you disabled acpi for good with the kernel parameter acpi=off. As a result, your video card wasn't recognized by the system, the i915 module wasn't loaded, and the standard vesa frame buffer driver was used instead.
Apparently there are "well known" stability issues with the i915 driver, and I'd suggest focusing on it for the moment.
Remove all the acpi workarounds from your lilo append line and get it back to the default:
append="vt.default_utf8=0"
Then read & try the fixes from the following bug reports, even if some of them relate to newer HW & kernel versions:
- I took a look at the i915 module parameters (modinfo i915) in search of ones that could help with your issue. The following are the only ones that I found useful, and the first two are already enabled by default. The third one could be useful for extra logging, but then your system hangs and there's no logging at all ...
Code:
parm: reset:Attempt GPU resets (default: true) (bool)
parm: enable_hangcheck:Periodically check GPU activity for detecting hangs. WARNING: Disabling this can cause system wide hangs. (default: true) (bool)
parm: verbose_state_checks:Enable verbose logs (ie. WARN_ON()) in case of unexpected hw state conditions. (bool)
In the Slackware provided kernel the i915 driver is modular and the way to pass these options to the module is by creating a /etc/modprobe.d/i915.conf file with the following content:
First, the Arch Linux page.
It mentions intel_idle.max_cstate, but we already knew that intel_idle does not run on my CPU.
I did not try the X server ideas, because I thought that when X crashes/freezes, I could still use Ctrl+Alt+F1 to go to a console. But with my issue, the shortcut does not do anything.
I could not understand most of the i915 options. The only one I could figure out, enable_rc6, is already at 0 according to systool. I could confirm that reset, verbose_state_checks, and enable_hangcheck are all set to "Y" in systool.
The Manjaro user has graphical issues that somehow still log data. So I gave it a go, and my computer still froze. dmesg only complained that some of the i915 parameters didn't exist, and everything else looked as before.
On the Gentoo bug, a file named kernel.log is mentioned. Is /var/log/messages its equivalent on Slackware? Also, I can see kernel oops messages in that report, so unfortunately my system does not behave quite like that.
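On the kernel.log question: as far as I know, a stock Slackware syslog.conf splits messages by priority, with info and notice going to /var/log/messages and warnings and above (where a kernel oops would normally land) going to /var/log/syslog. A minimal sketch for pulling the kernel lines out of both files; the function name and the overridable LOGS variable are only illustrative, not anything Slackware provides:

```shell
# Files to search. LOGS is a hypothetical override for trying this on copies.
LOGS="${LOGS:-/var/log/messages /var/log/syslog}"

kernel_lines() {
    # The syslog daemon tags kernel messages with "kernel:" after the host name.
    # -h drops the file name prefix, -s silences errors about unreadable files.
    # $LOGS is left unquoted on purpose so it splits into separate file names.
    grep -hs ' kernel: ' $LOGS
}
```

After a forced reboot, `kernel_lines | tail -n 50` shows the last kernel messages that made it to disk before the hang, if there were any.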
On the Launchpad bug, unfortunately the TLP suggestions require intel_pstate, and that driver needs a Sandy Bridge or more recent CPU.
I do not understand if intel_pstate and intel_idle mean the same thing.
I learnt here about cpu governors and found that my cpus were in ondemand mode. I have not yet understood how to make these settings persistent across a reboot. Unfortunately, just echoing performance into the governors (without rebooting or anything) didn't stop the computer from freezing. I don't even know if pstates/acpi are related to the selected governors.
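For what it's worth, governors belong to the kernel's cpufreq (frequency scaling) side, which is also where intel_pstate lives, whereas intel_idle is a cpuidle driver dealing with C-states. A minimal sketch of showing and setting the governor through sysfs, assuming the usual /sys/devices/system/cpu layout; the function names and the CPU_ROOT override are made up for illustration:

```shell
# Root of the per-CPU sysfs tree. CPU_ROOT is a hypothetical override
# so the functions can be tried against a scratch directory first.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

show_governors() {
    # Print the current governor of every CPU that exposes cpufreq.
    for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

set_governor() {
    # $1 = governor name, e.g. performance or ondemand (needs root on real sysfs).
    for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue
        echo "$1" > "$f"
    done
}
```

Values written to sysfs are lost at reboot. One way to reapply them is to put the function and a `set_governor performance` call in /etc/rc.d/rc.local, which Slackware runs at the end of startup.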
On the Arch troubleshooting section:
"the X server ideas" describes: "Some issues with X crashing, GPU hanging, or problems with X freezing, "
If your GPU is hanging, Ctrl+Alt+F1 won't be of any help, I guess...
Then you have a workaround for "Kernel crashing w/kernels 4.0+ on Broadwell/Core-M chips", setting i915.enable_execlists=0. Try adding:
Code:
options i915 enable_execlists=0
in /etc/modprobe.d/i915.conf
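When trying several options one after another, it helps to keep only one line per parameter in the file. A rough sketch of a helper that does this; the function name and the CONF override are hypothetical, and it needs root when pointed at the real /etc/modprobe.d/i915.conf:

```shell
# Target file. CONF is a hypothetical override for testing on a scratch file.
CONF="${CONF:-/etc/modprobe.d/i915.conf}"

set_i915_option() {
    # $1 = parameter name, $2 = value
    # Drop any earlier setting of the same parameter, then append the new one.
    if [ -f "$CONF" ]; then
        grep -v "^options i915 $1=" "$CONF" > "$CONF.tmp" || true
        mv "$CONF.tmp" "$CONF"
    fi
    echo "options i915 $1=$2" >> "$CONF"
}
```

For example, `set_i915_option enable_execlists 0` leaves a single `options i915 enable_execlists=0` line in the file. The setting only takes effect the next time the module is loaded.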
I'd suggest trying anything that mentions a GPU/system hang or that you believe could cause a hang. Your issue is peculiar because you don't get anything in the kernel log, no info/clue. Be a little more flexible & creative.
Last edited by abga; 10-25-2019 at 07:46 PM.
Reason: typo
@Richard : I didn't think of doing that. I opened an ssh terminal, then went and made the laptop hang, and I got a PuTTY fatal error saying "network error software caused connection abort", and then I couldn't connect back.
@abga : I do not know if I should try all options together or one at a time.
I tried this in one go and it still hung:
@Richard : I didn't think of doing that. I opened an ssh terminal, then went and made the laptop hang, and I got a PuTTY fatal error saying "network error software caused connection abort", and then I couldn't connect back.
Ok, that's a very hard laptop hang; I've had X go crazy and then respond to neither keyboard nor mouse but the machine was still more-or-less operational via my ssh login. (Normally less)
What Richard Cranium suggested, connecting remotely through secure shell, is useful for troubleshooting a system whose console is dysfunctional (which could also be due to the graphics driver). I hadn't considered this approach, mainly because in your OP you stated "the hardware network switch does nothing", which (together with the other details you provided) led me to believe that the whole system is frozen/crashed.
With my statement "Be a little more flexible&creative", I just wanted to suggest being more flexible in your understanding (situation & implications) and creative in your approaches. Again, it's a really weird situation you have there, and it's worth trying whatever workarounds you find related (even distantly) to your issue. At least as long as you still have time & patience for it
(and don't have one of these handy: https://en.wikipedia.org/wiki/Sledgehammer )
I'd approach the workarounds sequentially, not trying all of them together at once. That way you'd be able to tell the ones that do something from the ones with no effect.
TBH, I had some hopes for i915.enable_execlists=0; sorry to hear that it doesn't help.
Besides, you don't need to provide the driver options as kernel boot parameters if the driver is built as a module. It will work, since kernel boot parameters are passed to both built-in and modular drivers, but there are easier ways to achieve the same thing, and with a modular driver a reboot is not always required.
Now, since the i915 driver is built modular in the Slackware kernel, you could boot clean and then try all these module options:
- statically, with the help of /etc/modprobe.d/i915.conf, like I suggested in #48 & #50 and only unload & reload the module.
Code:
/sbin/rmmod i915
# then add the options you want to try to /etc/modprobe.d/i915.conf and reload the module
/sbin/modprobe i915
- (an even simpler method) manually providing the module parameters. First unloading the i915 module and reloading it with the preferred parameters:
(you should also try that chain of module options from #48 (originally from the manjaro thread))
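A minimal sketch of the manual method, to be run as root from a text console with X stopped. modprobe accepts parameter=value pairs after the module name; the function name and the RUN hook are only illustrative:

```shell
reload_i915() {
    # "$@" = parameter=value pairs, e.g. enable_execlists=0
    # RUN is a hypothetical hook: set RUN=echo to print the commands
    # instead of executing them.
    ${RUN:-} /sbin/rmmod i915 &&
    ${RUN:-} /sbin/modprobe i915 "$@"
}
```

With `RUN=echo` set, `reload_i915 enable_execlists=0` only prints the two commands, which is a cheap way to double check the parameter list before unloading the driver for real.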
To check if/how the module was loaded - inspect dmesg. For identifying what parameters & values are loaded, use:
Code:
# the proper tool
/usr/bin/systool -v -m i915
# a "hack"
grep -H '' /sys/module/i915/parameters/*
I've noticed those and suggested to check the HDD connection through posts #8 & #9. Then in post #16 OP managed to crash the system without a HDD connected (booting from USB).
Are you sure those "* exception Emask * frozen", "soft resetting link" and "HSM Violation" messages are signs of a failing HDD? Maybe they are caused by the fact that the system is not operating in AHCI mode. OP is not able to set the SATA mode, as mentioned in post #12.
@abga - No, I am not sure. But I can confirm from personal experience, on more than one occasion, that random hangs accompanied by log messages about hard drive errors can precede total hard disk drive failure.
As an aside, I have a nephew who is good at the local equivalent of dumpster diving. Resuscitating old laptops has been a bit of a hobby. These laptops have generally been running Windows, and the process of cleaning them out and installing updates is a good disk stress test. Hard disk problems would explain why they were discarded. If the hardware makes it worthwhile, installing a new SSD gets a usable laptop with significant performance gains.
The laptop I own had its original HDD replaced by a used Kingston 64GB SSD in 2012 or 2013 (can't remember). It was fully functional (but slow) before the disk swap, and then didn't have any faults during its time on Windows XP.
The disk may be toast, but I've got nothing of value to lose on it.
Besides, I managed to successfully boot the drive on another PC without issues.
@abga : thanks for helping me understand the module subtleties. When using systool, how do you know if the module is loaded? Refcount, I guess?
In any case, my system didn't like rmmod -f i915 (because the module was in use) and the screen went black. Ctrl+Alt+Backspace did nothing. I could do it that way, but at that point rebooting is less bothersome.
As my modus operandi to reproduce the problem is to open a PDF in okular, I will attempt to strace it through ssh and see if there is a pattern.
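A sketch of how attaching strace to an already running process could look. The function name is made up; the strace options used are -f (follow threads and children), -tt (timestamp each call), -p (attach to a PID) and -o (write to a file):

```shell
trace_by_name() {
    # $1 = process name, $2 = output file
    pid=$(pidof -s "$1" 2>/dev/null) || {
        echo "no running process named $1" >&2
        return 1
    }
    strace -f -tt -p "$pid" -o "$2"
}
```

Run over ssh as `trace_by_name okular /tmp/okular.trace`. If the output file lives on the laptop itself, the last lines may never reach the disk when the machine hangs hard, so keeping the ssh window in view is probably more reliable than reading the file afterwards.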
Here are two strace results on the okular process that I'm interacting with when the laptop hangs : trace 1 trace 2
I would like to gather a trace of the kernel itself, if such a thing is possible.
Using lsmod (lsmod | grep i915) is the easiest way to check if a module is loaded.
Well, if X is started it's obvious that the i915 module is in use. All my suggestions/instructions about the i915 module should be executed on console without X running.
If you want to have dmesg (kernel log) constantly showing updates, you could open a tty (text mode console) and dedicate it for this purpose, run:
Code:
/bin/dmesg -wT
It might be useful to open an SSH session from a remote system and monitor the kernel messages (again with /bin/dmesg -wT); maybe you can catch something interesting during the crash, something that isn't written to the logs.
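A small sketch of the remote monitoring, run on the other machine so that the last messages survive the laptop hanging. The ssh target and log file name are placeholders, and the SSH variable is a hypothetical hook for swapping in another client:

```shell
watch_kernel_log() {
    # $1 = ssh target (e.g. root@laptop), $2 = local file to append to
    # tee shows the messages live and keeps a copy on the monitoring machine.
    ${SSH:-ssh} "$1" '/bin/dmesg -wT' | tee -a "$2"
}
```

Usage would be something like `watch_kernel_log root@laptop d620-kernel.log`. Whatever the kernel manages to print before the connection drops stays in the local file.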
Ctrl+Alt+Backspace should end the X session (shut down the X server) in a normal situation.
If you have an active SSH connection to your system, and the system isn't frozen (totally), you could kill X (as root):
Code:
/usr/bin/killall xinit
Couldn't find anything interesting in your okular traces.
I mentioned earlier that I rarely use a laptop, and the number one reason why is HEAT! I wouldn't have mentioned this until I saw Nille Kungen's post regarding the failure of an nVidia Quadro, which is almost certainly a heat issue. So I web searched and see that Dell D620s are notoriously HOT!, mostly from two issues: 1) a constriction point that collects dust quickly, and 2) a horrible glob of thermal paste installed at the factory. Incidentally, heat can also cause HDD hangs. I've seen posts by people who logged common CPU temps of 100C during very light loads. This is unacceptable and dangerous to hardware, and to software by extension.
I strongly recommend you install lm_sensors and run its setup script, sensors-detect, if you haven't already. Since yours apparently uses integrated graphics rather than a separate graphics chip, it is extremely likely your temps are extreme and quite possibly the cause of the hard hangs.
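Besides the sensors command from lm_sensors, the kernel exposes ACPI thermal zones under /sys/class/thermal, with the temperature in thousandths of a degree Celsius. A minimal sketch that prints them with a timestamp, assuming the D620 exposes at least one such zone; the function name and THERMAL_ROOT override are illustrative:

```shell
# THERMAL_ROOT is a hypothetical override for trying this on a scratch directory.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

print_temps() {
    for z in "$THERMAL_ROOT"/thermal_zone*/temp; do
        [ -r "$z" ] || continue
        milli=$(cat "$z")
        # Convert millidegrees to whole degrees Celsius.
        printf '%s %s: %d C\n' "$(date +%T)" \
            "$(basename "$(dirname "$z")")" "$((milli / 1000))"
    done
}
```

Something like `while sleep 5; do print_temps; done` in an ssh session leaves the last reading on the remote screen when the laptop hangs, which would show whether the temperature was climbing just before.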
It is essential to clear air passages. It is best to have as thin a film of thermal grease as possible: not thick, not baked, a thin greasy film with hard physical contact between source (chips) and heatsink(s). It takes an hour or so, but it isn't "rocket surgery". Here .... https://www.youtube.com/watch?v=Bm7KWt87eT0 ... is a good example of what it takes. Properly done D620s commonly do not exceed 60C with heavy loads like kernel compiling. Heat is not merely uncomfortable. It is the enemy of electronics.