Random crashes -- suspect hardware issues (NVIDIA graphics).
Hello all,
I've been experiencing several possibly-related problems on my machine. I'm running Debian Squeeze (mostly, I've changed it to unstable and back once or twice in the past year and a half) on a Sony Vaio F1. Recently, I've been experiencing random crashes on my previously stable machine. The machine suddenly stops and shuts down without warning or going through the usual shutdown procedure. It usually has to recover the journal when I start it back up. These usually occur when I'm doing something intense (often on the GPU). In /var/log/messages, I usually get this for a normal shutdown: Code:
Apr 23 08:16:31 debian-philip shutdown[7393]: shutting down for system halt Code:
Apr 29 12:59:33 debian-philip shutdown[3896]: shutting down for system halt
All my important stuff was already backed up, but when I started experiencing these issues I figured I might as well back up the rest of my stuff. When doing this (via tar -czf), my first tarball seemed to be corrupted, since I was unable to copy the entire file (the copy stopped about three-quarters of the way through). Looking through /var/log/messages, I believe these are the errors it gave: Code:
Apr 27 20:04:54 debian-philip kernel: [ 434.724033] ata1.00: configured for UDMA/133
Finally, and this may be unrelated, my wireless mouse, which has worked perfectly since I got it about six months ago, stops working randomly until I pull out the receiver and put it back in.
These are the symptoms; here are a few of my theories. One thing worth noting is that the left hinge was damaged some time around when these symptoms started up. Correlation does not prove causation, but, in the words of Randall Munroe, "it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there'." The hinge makes awful cracking noises when I open the lid, and it pulls the bezel away from the screen. Since the cables to the screen pass through the hinges, I wonder if that could cause some kind of hardware malfunction that confused the NVIDIA drivers enough to crash the system. I'm not much of a hardware guy, so I'm a little out of my depth there.
Another (possibly-related) possibility is overheating. By monitoring with lm-sensors, I usually see something like this: Code:
acpitz-virtual-0
A final possibility would be that the filesystem corruption was actually the cause, and not a symptom, and that the random crashes are from this.
So, my question is, then, do you guys think that repairing the hinge would eliminate these problems? Should I just reinstall Debian (I've been planning to clean out my system for a while now)? I don't want to go through that trouble if the problems will still reappear. At any rate, I can't do anything drastic until after finals are done a week from Tuesday. Any other suggestions? Code:
root@debian-philip:/home/philip# uname -a Code:
root@debian-philip:/home# lspci | grep -i vga |
This does indeed look like overheating. Why it didn't happen before most likely has a very simple reason: until now, the cooling system was clean enough to keep the system below the shutdown temperature. The first thing you should do is clean out the cooling system, and you should do it soon, since running an overheating system will degrade your hardware.
After you have done that (and checked the temperatures), I would also recommend running the hard disk manufacturer's diagnostic tool; it seems that your disk is damaged as well. |
Thank you for your response.
As noted, I'm not much of a hardware guy. To clean out the cooling system, do I just unscrew most of the panels on the back of the laptop and blow it out with pressurized air? A couple of things that occurred to me: Someone I've talked to in the past saw that I had a habit of placing my laptop on top of papers. He thought it might be more likely to clog the cooling system. I've since avoided doing that, but is it possible that that was part of the cause? Also, I've noticed that every time except once, it has crashed while I'm actually using it on my lap (rather than on a desk). Is that likely to cause overheating as well? I'm just trying to find out what I can do to stop this from happening again. |
Tip: For using my laptop on soft surfaces (like my lap or my bed), I have a simple piece of wood the size of the laptop that serves as a mini-table for the machine, so the vents don't get blocked. |
The nouveau driver can cause overheating.
On my new GeForce GTX card it does, and it also causes the old GeForce2 card to overheat. |
Okay, I opened up everything I could and blew it all out. Temps are now around 55, going up to around 69 while stress testing, so it looks like that helped a lot.
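For anyone who wants to watch the temperatures during a stress test without reading the full lm-sensors output each time, here is a rough shell sketch. It assumes `sensors` prints reading lines of the form `temp1: +55.0°C (crit = ...)`; the function name `max_temp` is made up for illustration, and sensor labels vary from machine to machine.

```shell
#!/bin/sh
# Sketch only: print the highest current temperature found in
# `sensors`-style text read from stdin. Assumes reading lines look like
#   temp1:        +55.0°C  (crit = +100.0°C)
max_temp() {
    awk '
        # Match "label: <signed number>...C". Voltage lines such as
        # "in0: +1.04 V" do not match because no C follows the number.
        /^[^:]+:[ \t]*[+-][0-9]+(\.[0-9]+)?[^ \t]*C/ {
            value = $0
            sub(/^[^:]+:[ \t]*/, "", value)   # drop the label
            value = value + 0                  # keep the leading number only
            if (!seen || value > max) { max = value; seen = 1 }
        }
        # Exit non-zero if no temperature line was found at all.
        END { if (seen) printf "%.1f\n", max; else exit 1 }
    '
}
```

Typical use would be `sensors | max_temp`, or `while sleep 5; do sensors | max_temp; done` in a second terminal while the stress test runs. Only the first number on each line is used, so the critical limits in parentheses are ignored.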
I've been unable to get Hitachi's tool to work (I have a Hitachi drive), but I've had success (or, rather, failure :) ) with smartmontools. In particular, the selftests are reporting a number of errors: Code:
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
@JohnVV: I'm using the NVIDIA driver, so I don't think that's the problem. |
Okay, thanks for all your help. I'll probably cross my fingers until after finals, and then I'll get a new hard drive. Maybe this is a good opportunity to make the jump to an SSD.
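Until the drive is replaced, the smartmontools self-test log can be rechecked from time to time. The following is a minimal sketch, not a definitive check: it assumes the log rows start with `# <number>` and that failed tests carry a status beginning with `Completed: ` (for example `Completed: read failure`) or `Fatal or unknown error`. The function name `count_failed_selftests` is hypothetical.

```shell
#!/bin/sh
# Sketch only: count failed entries in `smartctl -l selftest` output
# read from stdin. Rows are assumed to start with "# <number>".
count_failed_selftests() {
    awk '
        # "Completed without error" does not match "Completed: ",
        # so clean runs are not counted.
        /^# *[0-9]+ / && /Completed: |Fatal or unknown error/ { n++ }
        END { print n + 0 }
    '
}
```

As root, this could be used as `smartctl -l selftest /dev/sda | count_failed_selftests`, where `/dev/sda` is an assumption about the device name. A new extended test can be started with `smartctl -t long /dev/sda`. If the count grows between runs, that is a good reason not to wait any longer.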
|
Make sure to always have a backup on an external storage device; your internal disk can no longer be trusted.
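Since the first tarball in this thread turned out to be unreadable, it may be worth verifying each archive before relying on it. A minimal sketch follows, assuming gzip-compressed tarballs as produced by tar -czf; the function names `verify_tarball` and `backup_dir` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: check that a .tar.gz can be read from start to finish.
verify_tarball() {
    archive=$1
    # gzip -t checks the compressed stream; tar -tzf walks every member.
    gzip -t "$archive" && tar -tzf "$archive" >/dev/null
}

# Archive a directory, then verify the result before trusting it.
backup_dir() {
    src=$1
    archive=$2
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    verify_tarball "$archive"
}
```

For example, `backup_dir /home/philip /media/external/philip.tar.gz`, where the destination path is only a placeholder. Running `verify_tarball` again on the copy that sits on the external device checks that the copy itself is readable too.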
And good luck with your finals! |