LinuxQuestions.org

pcm 04-29-2013 04:34 PM

Random crashes -- suspect hardware issues (NVIDIA graphics).
 
Hello all,

I've been experiencing several possibly-related problems on my machine. I'm running Debian Squeeze (mostly, I've changed it to unstable and back once or twice in the past year and a half) on a Sony Vaio F1.

Recently, I've been experiencing random crashes on my previously stable machine. The machine suddenly stops and shuts down without warning or going through the usual shutdown procedure. It usually has to recover the journal when I start it back up. These usually occur when I'm doing something intense (often on the GPU). In /var/log/messages, I usually get this for a normal shutdown:

Code:

Apr 23 08:16:31 debian-philip shutdown[7393]: shutting down for system halt
Apr 23 08:16:39 debian-philip kernel: [493260.058706] cfg80211: Calling CRDA to update world regulatory domain
Apr 23 08:16:40 debian-philip kernel: Kernel logging (proc) stopped.
Apr 23 08:16:40 debian-philip rsyslogd: [origin software="rsyslogd" swVersion="4.6.4" x-pid="1422" x-info="http://www.rsyslog.com"] exiting on signal 15.

When it crashes, I get this:

Code:

Apr 29 12:59:33 debian-philip shutdown[3896]: shutting down for system halt
Apr 29 12:59:33 debian-philip shutdown[3899]: shutting down for system halt
Apr 29 12:59:36 debian-philip Failed to open the panel socket
Apr 29 12:59:37 debian-philip Failed to open the panel socket
Apr 29 12:59:37 debian-philip Error creating socket
Apr 29 12:59:37 debian-philip : socket
Apr 29 12:59:37 debian-philip syscall failed
Apr 29 12:59:37 debian-philip :
Apr 29 12:59:37 debian-philip Conseguido
Apr 29 12:59:37 debian-philip
Apr 29 12:59:37 debian-philip Failed to open the panel socket
Apr 29 12:59:38 debian-philip Failed to open the panel socket
Apr 29 12:59:38 debian-philip kernel: [147665.099057] cfg80211: Calling CRDA to update world regulatory domain
Apr 29 12:59:39 debian-philip kernel: Kernel logging (proc) stopped.
Apr 29 12:59:39 debian-philip rsyslogd: [origin software="rsyslogd" swVersion="4.6.4" x-pid="1550" x-info="http://www.rsyslog.com"] exiting on signal 15.

When I start the machine up again, I've sometimes had some problems. Once, the filesystem for my /home partition had multiply-claimed blocks, so I had to manually fix it (I just deleted an unimportant file that claimed the blocks and then rebooted). Incidentally, this was discovered by a regularly scheduled fsck -- I could still mount the partition without it throwing me errors.

All my important stuff was already backed up, but when I started experiencing these issues I figured I might as well back up the rest of my stuff. When doing this (via tar -czf), my first tarball seemed to be corrupted, since I was unable to copy the entire file (it stopped about three-quarters of the way through). Looking through /var/log/messages, I believe these are the errors it gave:

Code:

Apr 27 20:04:54 debian-philip kernel: [  434.724033] ata1.00: configured for UDMA/133
Apr 27 20:04:54 debian-philip kernel: [  434.724049] ata1: EH complete
Apr 27 20:04:57 debian-philip kernel: [  437.864759] ata1.00: configured for UDMA/133
Apr 27 20:04:57 debian-philip kernel: [  437.864784] ata1: EH complete
Apr 27 20:05:01 debian-philip kernel: [  441.074238] ata1.00: configured for UDMA/133
Apr 27 20:05:01 debian-philip kernel: [  441.074261] ata1: EH complete
Apr 27 20:05:04 debian-philip kernel: [  444.229128] ata1.00: configured for UDMA/133
Apr 27 20:05:04 debian-philip kernel: [  444.229147] ata1: EH complete
Apr 27 20:05:07 debian-philip kernel: [  447.429426] ata1.00: configured for UDMA/133
Apr 27 20:05:07 debian-philip kernel: [  447.429444] ata1: EH complete
Apr 27 20:05:10 debian-philip kernel: [  450.568967] ata1.00: configured for UDMA/133
Apr 27 20:05:10 debian-philip kernel: [  450.568991] sd 0:0:0:0: [sda] Unhandled sense code
Apr 27 20:05:10 debian-philip kernel: [  450.568996] sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 27 20:05:10 debian-philip kernel: [  450.569003] sd 0:0:0:0: [sda]  Sense Key : Medium Error [current] [descriptor]
Apr 27 20:05:10 debian-philip kernel: [  450.569011] Descriptor sense data with sense descriptors (in hex):
Apr 27 20:05:10 debian-philip kernel: [  450.569015]        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 27 20:05:10 debian-philip kernel: [  450.569030]        21 a7 c6 89
Apr 27 20:05:10 debian-philip kernel: [  450.569037] sd 0:0:0:0: [sda]  Add. Sense: Unrecovered read error - auto reallocate failed
Apr 27 20:05:10 debian-philip kernel: [  450.569045] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 21 a7 c4 90 00 02 00 00
Apr 27 20:05:10 debian-philip kernel: [  450.569091] ata1: EH complete
Apr 27 20:05:14 debian-philip kernel: [  454.234946] ata1.00: configured for UDMA/133
Apr 27 20:05:14 debian-philip kernel: [  454.234964] ata1: EH complete
Apr 27 20:05:17 debian-philip kernel: [  457.383465] ata1.00: configured for UDMA/133
Apr 27 20:05:17 debian-philip kernel: [  457.383506] ata1: EH complete
Apr 27 20:05:20 debian-philip kernel: [  460.630580] ata1.00: configured for UDMA/133
Apr 27 20:05:20 debian-philip kernel: [  460.630615] ata1: EH complete
Apr 27 20:05:23 debian-philip kernel: [  463.783669] ata1.00: configured for UDMA/133
Apr 27 20:05:23 debian-philip kernel: [  463.783696] ata1: EH complete
Apr 27 20:05:27 debian-philip kernel: [  466.969666] ata1.00: configured for UDMA/133
Apr 27 20:05:27 debian-philip kernel: [  466.969682] ata1: EH complete
Apr 27 20:05:30 debian-philip kernel: [  470.145186] ata1.00: configured for UDMA/133
Apr 27 20:05:30 debian-philip kernel: [  470.145202] sd 0:0:0:0: [sda] Unhandled sense code
Apr 27 20:05:30 debian-philip kernel: [  470.145204] sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 27 20:05:30 debian-philip kernel: [  470.145206] sd 0:0:0:0: [sda]  Sense Key : Medium Error [current] [descriptor]
Apr 27 20:05:30 debian-philip kernel: [  470.145210] Descriptor sense data with sense descriptors (in hex):
Apr 27 20:05:30 debian-philip kernel: [  470.145211]        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 27 20:05:30 debian-philip kernel: [  470.145216]        21 a7 c6 89
Apr 27 20:05:30 debian-philip kernel: [  470.145219] sd 0:0:0:0: [sda]  Add. Sense: Unrecovered read error - auto reallocate failed
Apr 27 20:05:30 debian-philip kernel: [  470.145222] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 21 a7 c6 88 00 00 08 00
Apr 27 20:05:30 debian-philip kernel: [  470.145240] ata1: EH complete
Apr 27 20:05:34 debian-philip kernel: [  473.956012] ata1.00: configured for UDMA/133
Apr 27 20:05:34 debian-philip kernel: [  473.956021] ata1: EH complete
Apr 27 20:05:37 debian-philip kernel: [  477.077193] ata1.00: configured for UDMA/133
Apr 27 20:05:37 debian-philip kernel: [  477.077209] ata1: EH complete
Apr 27 20:05:40 debian-philip kernel: [  480.176145] ata1.00: configured for UDMA/133
Apr 27 20:05:40 debian-philip kernel: [  480.176154] ata1: EH complete
Apr 27 20:05:43 debian-philip kernel: [  483.300293] ata1.00: configured for UDMA/133
Apr 27 20:05:43 debian-philip kernel: [  483.300307] ata1: EH complete
Apr 27 20:05:46 debian-philip kernel: [  486.415735] ata1.00: configured for UDMA/133
Apr 27 20:05:46 debian-philip kernel: [  486.415751] ata1: EH complete
Apr 27 20:05:49 debian-philip kernel: [  489.538812] ata1.00: configured for UDMA/133
Apr 27 20:05:49 debian-philip kernel: [  489.538832] sd 0:0:0:0: [sda] Unhandled sense code
Apr 27 20:05:49 debian-philip kernel: [  489.538837] sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 27 20:05:49 debian-philip kernel: [  489.538844] sd 0:0:0:0: [sda]  Sense Key : Medium Error [current] [descriptor]
Apr 27 20:05:49 debian-philip kernel: [  489.538852] Descriptor sense data with sense descriptors (in hex):
Apr 27 20:05:49 debian-philip kernel: [  489.538856]        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 27 20:05:49 debian-philip kernel: [  489.538870]        21 a7 c6 89
Apr 27 20:05:49 debian-philip kernel: [  489.538877] sd 0:0:0:0: [sda]  Add. Sense: Unrecovered read error - auto reallocate failed
Apr 27 20:05:49 debian-philip kernel: [  489.538886] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 21 a7 c6 88 00 00 08 00
Apr 27 20:05:49 debian-philip kernel: [  489.539940] ata1: EH complete
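
The hex dumps in these kernel messages can be decoded by hand. Below is a minimal sketch, assuming the standard SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8) and descriptor-format sense data with a single information descriptor. The function names are made up for illustration; the byte strings are copied from the log above.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (start_lba, block_count) from a READ(10) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command block")
    start_lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    block_count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return start_lba, block_count


def failing_lba(sense_hex: str) -> int:
    """Return the LBA stored in the information descriptor of the sense data."""
    sense = bytes.fromhex(sense_hex)
    # 0x72 = descriptor-format sense data; descriptor type 0x00 = information
    if len(sense) < 20 or sense[0] != 0x72 or sense[8] != 0x00:
        raise ValueError("unexpected sense data layout")
    return int.from_bytes(sense[12:20], "big")     # 8-byte information field


# Values copied from the log lines at [  470.1452xx]
start, count = decode_read10("28 00 21 a7 c6 88 00 00 08 00")
bad = failing_lba(
    "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 21 a7 c6 89"
)
print(f"read of {count} blocks starting at LBA {start}; failed at LBA {bad}")
print("failing sector inside the request:", start <= bad < start + count)
```

Under those assumptions the request covers 8 blocks starting at LBA 564643464, and the sector the drive could not read is LBA 564643465, the second block of that request.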

Additionally, several times (but not always), I have been unable to log in with gdm3. It brings up the login screen but seems to stop just before it shows the list of users and becomes unresponsive.

Finally, and this may be unrelated, my wireless mouse, which has worked perfectly since I got it about six months ago, randomly stops working until I pull out the receiver and put it back in.

These are the symptoms; here are a few of my theories. One thing worth noting is that the left hinge was damaged some time around when these symptoms started up. Correlation does not prove causation, but, in the words of Randall Munroe, "it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there'." The hinge makes awful cracking noises when I open the lid, and it pulls the bezel away from the screen. Since the cables to the screen pass through the hinges, I wonder if that could cause some kind of hardware malfunction that confused the NVIDIA drivers enough to crash the system. I'm not much of a hardware guy, so I'm a little out of my depth there.

Another (possibly-related) possibility is overheating. By monitoring with lm-sensors, I usually see something like this:

Code:

acpitz-virtual-0
Adapter: Virtual device
temp1:      +72.0°C  (crit = +98.0°C)
temp2:      +72.0°C  (crit = +98.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +71.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:      +71.0°C  (high = +84.0°C, crit = +100.0°C)
Core 2:      +71.0°C  (high = +84.0°C, crit = +100.0°C)
Core 3:      +73.0°C  (high = +84.0°C, crit = +100.0°C)

By stress testing the CPU (with "dd if=/dev/urandom of=/dev/null" four times (one for each core of my i7)) I was able to raise these temps into the upper eighties, but no further. When I ran Unigine Tropics for the GPU, though, I was able to get the first two lines (from acpitz-virtual-0) up to 97.0, momentarily. The question, then, would be why is this happening now when it has not happened before in the two years I've owned this laptop (most of which has been spent using Debian). This could be related to the damaged hinge.
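
The margin between each reading and its critical threshold can be worked out from the sensors text. This is a rough sketch that assumes the output has the same shape as the sample above (a chip name on its own line, then "label: +temp°C (... crit = +temp°C)" lines); the regular expression and function names are illustrative, not part of lm-sensors.

```python
import re

# Sample copied from the post; on a live system this text would come from
# running the `sensors` command.
SAMPLE = """\
acpitz-virtual-0
Adapter: Virtual device
temp1:      +72.0°C  (crit = +98.0°C)
temp2:      +72.0°C  (crit = +98.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +71.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:      +71.0°C  (high = +84.0°C, crit = +100.0°C)
Core 2:      +71.0°C  (high = +84.0°C, crit = +100.0°C)
Core 3:      +73.0°C  (high = +84.0°C, crit = +100.0°C)
"""

# Matches "label: +72.0°C ... crit = +98.0°C" and captures both temperatures.
READING = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C"
    r".*crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C"
)


def parse_sensors(text):
    """Map 'chip/label' to (current, critical) temperatures in degrees C."""
    readings = {}
    chip = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = READING.match(line)
        if match:
            key = f"{chip}/{match['label']}"
            readings[key] = (float(match["temp"]), float(match["crit"]))
        elif ":" not in line:
            chip = line  # a bare line such as "coretemp-isa-0000" names the chip
    return readings


def headroom(readings):
    """Degrees left before each sensor reaches its critical threshold."""
    return {name: crit - temp for name, (temp, crit) in readings.items()}


margins = headroom(parse_sensors(SAMPLE))
tightest = min(margins, key=margins.get)
print(tightest, margins[tightest])
```

For the idle sample the tightest margin is 26 degrees on the acpitz sensor. At the 97.0 reading seen under Unigine Tropics, the same subtraction leaves only 1 degree before the 98.0 critical point.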

A final possibility would be that the filesystem corruption was actually the cause, and not a symptom, and that the random crashes are from this.

So, my question is, then, do you guys think that repairing the hinge would eliminate these problems? Should I just reinstall Debian (I've been planning to clean out my system for a while now)? I don't want to go through that trouble if the problems will still reappear. At any rate, I can't do anything drastic until after finals are done a week from Tuesday.

Any other suggestions?

Code:

root@debian-philip:/home/philip# uname -a
Linux debian-philip 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.32-1~bpo60+1 x86_64 GNU/Linux

Code:

root@debian-philip:/home# lspci | grep -i vga
01:00.0 VGA compatible controller: nVidia Corporation GT218 [GeForce 310M] (rev a2)


TobiSGD 04-29-2013 04:57 PM

This does indeed look like overheating. Why it didn't happen before most likely has a very simple reason: previously, the cooling system was clean enough to keep the system below the shutdown temperature. The first thing you should do is clean out the cooling system, and you should do it soon; running an overheating system will degrade your hardware.
After you have done that (and checked the temperatures), I would also recommend running the hard disk manufacturer's diagnostic tool; it seems that your disk is damaged as well.

pcm 04-29-2013 05:12 PM

Thank you for your response.

As noted, I'm not much of a hardware guy. To clean out the cooling system, do I just unscrew most of the panels on the back of the laptop and blow it out with pressurized air?

A couple of things that occurred to me: Someone I've talked to in the past saw that I had a habit of placing my laptop on top of papers. He thought it might be more likely to clog the cooling system. I've since avoided doing that, but is it possible that that was part of the cause? Also, I've noticed that every time except once, it has crashed while I'm actually using it on my lap (rather than on a desk). Is that likely to cause overheating as well? I'm just trying to find out what I can do to stop this from happening again.

TobiSGD 04-29-2013 05:41 PM

Quote:

Originally Posted by pcm (Post 4941452)
As noted, I'm not much of a hardware guy. To clean out the cooling system, do I just unscrew most of the panels on the back of the laptop and blow it out with pressurized air?

It comes down to that, but you should find a service manual for your laptop on the manufacturer's website that shows you which screws need to be removed. If you are uncomfortable with opening the laptop, I would recommend letting a service technician do the job.

Quote:

A couple of things that occurred to me: Someone I've talked to in the past saw that I had a habit of placing my laptop on top of papers. He thought it might be more likely to clog the cooling system. I've since avoided doing that, but is it possible that that was part of the cause? Also, I've noticed that every time except once, it has crashed while I'm actually using it on my lap (rather than on a desk). Is that likely to cause overheating as well? I'm just trying to find out what I can do to stop this from happening again.

If you use the laptop on your lap you may block the vents, which will heat up the system due to the lack of airflow through the case. This is only a temporary overheat that is resolved as soon as the vents are unblocked.
Tip: For using my laptop on soft surfaces (like my lap or my bed) I have a simple piece of wood about the size of the laptop that serves as a mini-table for the machine and keeps the vents from being blocked.

John VV 04-29-2013 08:17 PM

The nouveau driver can cause overheating.

On my new GeForce GTX card it does.
It also causes the old GeForce2 card to overheat.

pcm 04-29-2013 11:36 PM

Okay, I opened up everything I could and blew it all out. Temps are now around 55, going up to around 69 while stress testing, so it looks like that helped a lot.

I've been unable to get Hitachi's tool to work (I have a Hitachi drive), but I've had success (or, rather, failure :) ) with smartmontools. In particular, the self-tests are reporting a number of errors:

Code:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed: read failure       60%     15420         562307480
# 2  Selective offline   Completed: read failure       90%     15419         584102672
# 3  Selective offline   Completed: read failure       90%     15419         564616602
# 4  Selective offline   Completed: read failure       90%     15419         564613724
# 5  Selective offline   Completed: read failure       90%     15419         562307480
# 6  Selective offline   Completed without error       00%     15419         -
# 7  Selective offline   Completed: read failure       90%     15418         562307480
# 8  Selective offline   Aborted by host               90%     15418         -
# 9  Selective offline   Completed: read failure       90%     15417         7026831
#10  Selective offline   Completed: read failure       90%     15417         5499834
#11  Selective offline   Completed: read failure       90%     15417         1104445
#12  Selective offline   Completed: read failure       90%     15417         408128
#13  Selective offline   Completed: read failure       90%     15417         372513
#14  Extended offline    Completed: read failure       90%     15417         368694
#15  Extended offline    Completed: read failure       90%     15417         368694
#16  Short offline       Completed: read failure       70%     15417         368694
#17  Extended offline    Completed: read failure       90%     15417         368694
#18  Short offline       Completed: read failure       60%     15417         368694
#19  Short offline       Completed: read failure       60%     15417         368694

My /home partition starts around 500000000, and no errors showed up when I started the test at 600000000. It also tells me that there are 43 "Current_Pending_Sector"s. It looks like there are a number of errors in the first partition (which I believe is an old non-functioning Windows 7 installation) and then several more errors in the beginning of the /home partition -- could that mean the journal got messed up somehow?
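
The split between the two regions can be tallied from the self-test log. This is a minimal sketch that assumes the log text looks like the table above and takes 500000000 as the approximate start of /home, as stated in the post; the real boundary would come from the partition table. The helper names are hypothetical.

```python
# Self-test log copied from the post (smartmontools output).
SELFTEST_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed: read failure       60%     15420         562307480
# 2  Selective offline   Completed: read failure       90%     15419         584102672
# 3  Selective offline   Completed: read failure       90%     15419         564616602
# 4  Selective offline   Completed: read failure       90%     15419         564613724
# 5  Selective offline   Completed: read failure       90%     15419         562307480
# 6  Selective offline   Completed without error       00%     15419         -
# 7  Selective offline   Completed: read failure       90%     15418         562307480
# 8  Selective offline   Aborted by host               90%     15418         -
# 9  Selective offline   Completed: read failure       90%     15417         7026831
#10  Selective offline   Completed: read failure       90%     15417         5499834
#11  Selective offline   Completed: read failure       90%     15417         1104445
#12  Selective offline   Completed: read failure       90%     15417         408128
#13  Selective offline   Completed: read failure       90%     15417         372513
#14  Extended offline    Completed: read failure       90%     15417         368694
#15  Extended offline    Completed: read failure       90%     15417         368694
#16  Short offline       Completed: read failure       70%     15417         368694
#17  Extended offline    Completed: read failure       90%     15417         368694
#18  Short offline       Completed: read failure       60%     15417         368694
#19  Short offline       Completed: read failure       60%     15417         368694
"""

# Assumption: approximate first sector of /home ("around 500000000" in the post).
HOME_START_LBA = 500_000_000


def first_error_lbas(log_text):
    """Return the sorted, de-duplicated LBA_of_first_error values."""
    lbas = set()
    for line in log_text.splitlines():
        if not line.startswith("#"):
            continue  # skip the header row
        last_field = line.split()[-1]
        if last_field.isdigit():  # "-" means the test reported no failing LBA
            lbas.add(int(last_field))
    return sorted(lbas)


def split_at(lbas, boundary):
    """Partition LBAs into those below the boundary and those at or above it."""
    before = [lba for lba in lbas if lba < boundary]
    after = [lba for lba in lbas if lba >= boundary]
    return before, after


before_home, in_or_after_home = split_at(first_error_lbas(SELFTEST_LOG), HOME_START_LBA)
print("below boundary:", before_home)
print("at or above boundary:", in_or_after_home)
```

On this log the sketch finds ten distinct failing LBAs: six below the assumed boundary and four at or above it, the highest being 584102672.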

@JohnVV: I'm using the NVIDIA driver, so I don't think that's the problem.

TobiSGD 04-30-2013 12:33 AM

Quote:

Originally Posted by pcm (Post 4941584)
It looks like there are a number of errors in the first partition (which I believe is an old non-functioning Windows 7 installation) and then several more errors in the beginning of the /home partition -- could that mean the journal got messed up somehow?

smartmontools doesn't care at all about partitions or file systems (or, by extension, journals); it tests the disk itself. Those read failures mean that your disk is dying and should be replaced.

pcm 04-30-2013 12:56 AM

Okay, thanks for all your help. I'll probably cross my fingers until after finals, and then I'll get a new hard drive. Maybe this is a good opportunity to make the jump to an SSD.

TobiSGD 04-30-2013 03:06 AM

Make sure to always have a backup on an external storage device; your internal disk is not trustworthy anymore.
And good luck with your finals!


All times are GMT -5.