Is this a harddisk or a software error?

TTL_2 · 04-12-2009, 03:46 AM

Hello,
Yesterday I found the following messages in my syslog:

Code:

Apr 11 23:35:01 pc2 /USR/SBIN/CRON[13202]: (root) CMD (if [ -x /usr/bin/vnstat ] && [ `ls /var/lib/vnstat/ | wc -l` -ge 1 ]; then /usr/bin/vnstat -u; fi)
Apr 11 23:40:23 pc2 kernel: [ 8753.037306] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Apr 11 23:40:23 pc2 kernel: [ 8753.037306] end_request: I/O error, dev sda, sector 449453684
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] BUG: soft lockup - CPU#1 stuck for 148s! [Xorg:2998]
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] Modules linked in: cpufreq_userspace ppdev lp fglrx(P) ipv6 fuse ext2 sha256_generic aes_i586 aes_generic cbc dm_crypt crypto_blkcipher dm_snapshot dm_mirror dm_log dm_mod it87 hwmon_vid eeprom powernow_k8 freq_table pktcdvd pl2303 usbserial parport_pc parport snd_pcsp k8temp snd_hda_intel i2c_piix4 snd_pcm_oss snd_mixer_oss i2c_core snd_pcm snd_timer snd soundcore snd_page_alloc shpchp pci_hotplug ati_agp agpgart button evdev ext3 jbd mbcache usbhid hid ff_memless ide_cd_mod cdrom ata_generic atiixp sd_mod r8169 ide_pci_generic ide_core ehci_hcd ohci_hcd ahci libata usbcore scsi_mod dock thermal processor fan thermal_sys
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] 
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] Pid: 2998, comm: Xorg Tainted: P          (2.6.26-1-686 #1)
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] EIP: 0073:[<b78ac8e1>] EFLAGS: 00203286 CPU: 1
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] EIP is at 0xb78ac8e1
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] EAX: 00020002 EBX: b7b58340 ECX: 00001827 EDX: b772e000
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] ESI: bf8a3620 EDI: 09b31158 EBP: bf8a35d8 ESP: bf8a35bc
Apr 11 23:40:23 pc2 kernel: [ 8911.949475]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] CR0: 8005003b CR2: b7ab5ecc CR3: 3781e000 CR4: 000006d0
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Apr 11 23:40:23 pc2 kernel: [ 8911.949475] DR6: ffff0ff0 DR7: 00000400
Apr 11 23:40:23 pc2 kernel: [ 8911.949475]  =======================
Apr 11 23:40:23 pc2 kernel: [ 8755.973639] Aborting journal on device dm-2.
Apr 11 23:40:23 pc2 kernel: [ 8911.985543] ext3_abort called.
Apr 11 23:40:23 pc2 kernel: [ 8911.985549] EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Apr 11 23:40:23 pc2 kernel: [ 8911.985556] Remounting filesystem read-only
Apr 11 23:40:41 pc2 /USR/SBIN/CRON[13214]: (root) CMD (  [ -d /var/lib/php4 ] && find /var/lib/php4/ -type f -cmin +$(/usr/lib/php4/maxlifetime) -print0 | xargs -r -0 rm)
Apr 11 23:40:41 pc2 /USR/SBIN/CRON[13222]: (root) CMD (if [ -x /usr/bin/vnstat ] && [ `ls /var/lib/vnstat/ | wc -l` -ge 1 ]; then /usr/bin/vnstat -u; fi)
Apr 11 23:41:21 pc2 kernel: [ 8975.167488] [fglrx] It's not necessary to adjust system aperture on this ASIC 
Apr 11 23:41:28 pc2 kdm: :0[13294]: Can't update authorization file in home dir /home/<my home>
Apr 11 23:42:14 pc2 shutdown[13299]: shutting down for system reboot
Apr 11 23:42:14 pc2 init: Switching to runlevel: 6
Apr 11 23:42:26 pc2 kernel: [ 9042.156366] fuse exit
Apr 11 23:42:29 pc2 lwresd[2966]: shutting down
Apr 11 23:42:29 pc2 lwresd[2966]: exiting
Apr 11 23:42:29 pc2 avahi-daemon[2948]: Got SIGTERM, quitting.
Apr 11 23:42:29 pc2 avahi-daemon[2948]: Leaving mDNS multicast group on interface eth1.IPv6 with address fe80::21a:4dff:fe83:354.
Apr 11 23:42:29 pc2 avahi-daemon[2948]: Leaving mDNS multicast group on interface eth1.IPv4 with address 192.168.1.23.
Apr 11 23:42:29 pc2 kernel: Kernel logging (proc) stopped.
Apr 11 23:42:29 pc2 kernel: Kernel log daemon terminating.
Apr 11 23:42:30 pc2 exiting on signal 15
Apr 11 23:49:00 pc2 syslogd 1.5.0#5: restart.

Now I wonder, is this a hard drive, a cable, a motherboard error or a bug in Linux?
And what does the Xorg process have to do with the hard drive?
I never had an 'Aborting journal on device' error before. Note that the dm-2 device is an encrypted partition mounted on sda.
However, the 'end_request: I/O error' messages remind me of my old hard drive, which developed bad sectors (since then I have bought two new drives from different brands and use one as a backup of the other).

TTL

1337 · 04-13-2009, 09:40 AM

It looks hardware related: the driver logged a timeout (DRIVER_TIMEOUT), meaning the device timed out, probably while reading or writing sector 449453684. Without being able to debug it I can't really be sure it's caused by hardware, but I have my suspicions.
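To see roughly where that sector sits on the disk, here is a minimal Python sketch (not part of the original replies). It pulls the sector number out of an `end_request` line and converts it to a byte offset. It assumes 512-byte sectors and the total of 625140335 sectors that the kernel reports for sda in the dmesg output further down this thread; the function names are made up for illustration.

```python
import re

SECTOR_SIZE = 512            # assumption: 512-byte sectors, as dmesg reports for sda
TOTAL_SECTORS = 625140335    # from the "[sda] ... 512-byte hardware sectors" dmesg line

IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")

def find_io_errors(lines):
    """Return (device, sector) pairs for every end_request I/O error line."""
    hits = []
    for line in lines:
        m = IO_ERROR.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2))))
    return hits

def describe(sector):
    """Return the byte offset and the relative position (0..1) of a sector."""
    return sector * SECTOR_SIZE, sector / TOTAL_SECTORS

if __name__ == "__main__":
    log = ["Apr 11 23:40:23 pc2 kernel: [ 8753.037306] "
           "end_request: I/O error, dev sda, sector 449453684"]
    for dev, sector in find_io_errors(log):
        offset, position = describe(sector)
        print(f"{dev}: sector {sector} = byte {offset}, {position:.1%} into the disk")
```

A single timed-out request says little on its own. If later errors keep landing near the same sector, a bad spot on the disk becomes a more plausible explanation; errors scattered all over the disk would point more towards the cable, controller or power.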

This may be helpful...
http://tldp.org/HOWTO/archived/SCSI-...-HOWTO-21.html

TTL_2 · 04-13-2009, 03:45 PM

Thank you for your answer. The tables of the website are interesting.
In the meantime I ran a long SMART test on the drive, which did not find any errors, and I could rsync many GB to my second drive without any problems. Now the bad news:
1. While playing with smartctl, one of the (as far as I remember harmless) SMART commands caused the drive to perform a reset. Linux recovered and continued running normally.
2. While playing a 3D game (bzflag), some connection problems happened. After exiting the game the CPU did not clock down as usual, and commands like "top" were not executed properly. However, dmesg did not show any problems. At least I was able to reboot the system normally.

Then I guessed this could be a heat problem: the graphics card (ATI with fglrx) is an onboard one, I was also using 3D acceleration right before the first hard drive problem, and there are several reports from other users of my motherboard saying that the northbridge gets very hot. I started to play bzflag again, this time while watching the temperatures (ssh -X to my second PC), but nothing went above 50°C (and no problems occurred this time).
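For watching temperatures from a second machine over ssh, a small Python sketch along these lines could log the sensor readings while the game runs. It is only an illustration under assumptions: it expects the hwmon sysfs layout of current kernels (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius), which may differ on a 2.6.26 system, and the 50°C limit and the function names are hypothetical.

```python
import glob
import os
import time

def read_temps(root="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_C} for every temp*_input file under root."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()          # e.g. "k8temp" or "it87"
        except OSError:
            chip = os.path.basename(chip_dir)    # fall back to "hwmon0" etc.
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue                             # sensor not readable right now
        temps[f"{chip}/{os.path.basename(path)}"] = millideg / 1000.0
    return temps

def watch(rounds=12, interval=5.0, limit_c=50.0):
    """Print all readings `rounds` times; limit_c is a hypothetical threshold."""
    for _ in range(rounds):
        stamp = time.strftime("%H:%M:%S")
        for sensor, deg in read_temps().items():
            flag = "  <-- above limit" if deg > limit_c else ""
            print(f"{stamp} {sensor}: {deg:.1f} C{flag}")
        time.sleep(interval)
```

Calling `watch()` in an ssh session and redirecting the output to a file leaves a timestamped record that can be compared with the kernel timestamps if the lockup happens again.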

thorkelljarl · 04-13-2009, 06:58 PM

It would do no harm...

You might clean the machine all the same, especially the leaves or fins of the various coolers to see that the breeze gets down there where it chills.

You can also try StressLinux and Memtest. A question might be how old are the components and how many hours have you gamed on them?

TTL_2 · 05-04-2009, 02:49 PM

OK, I ran memtest86+ for a little more than an hour and there weren't any errors. Then I cleaned the heat spreaders with a vacuum cleaner. I ran two instances of burnK7 for twenty minutes without any problem.
I played games again, and at least two or three times nothing happened. Until now.
Again programs locked up in a similar way as reported previously. But this time I did not reboot and waited a few minutes. After that the system suddenly continued to operate normally again. At this point the following appeared in dmesg:

Code:

[10663.755822] ata1.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6 frozen
[10663.755822] ata1.00: cmd 60/08:10:18:a1:e6/00:00:23:00:00/40 tag 2 ncq 4096 in
[10663.755822]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[10663.755822] ata1.00: status: { DRDY }
[10663.755822] ata1: hard resetting link
===== At this point the system worked normal again ====
[10793.284084] ata1: softreset failed (device not ready)
[10793.284084] ata1: failed due to HW bug, retry pmp=0
[10793.284084] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[10793.284084] ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
[10793.284084] ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
[10793.284084] ata1.00: configured for UDMA/133
[10793.284084] ata1: EH complete
[10497.338267] sd 0:0:0:0: [sda] 625140335 512-byte hardware sectors (320072 MB)
[10497.338267] sd 0:0:0:0: [sda] Write Protect is off
[10497.338267] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[10497.338267] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[10793.284084] BUG: soft lockup - CPU#0 stuck for 276s! [kate:4327]
[10793.284084] Modules linked in: cpufreq_userspace ppdev lp fglrx(P) ipv6 fuse ext2 sha256_generic aes_i586 aes_generic cbc dm_crypt crypto_blkcipher dm_snapshot dm_mirror dm_log dm_mod it87 hwmon_vid eeprom powernow_k8 freq_table pktcdvd parport_pc parport k8temp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm i2c_piix4 i2c_core snd_timer snd soundcore snd_page_alloc button ati_agp agpgart shpchp pci_hotplug evdev ext3 jbd mbcache ide_cd_mod cdrom ata_generic usbhid hid ff_memless sd_mod atiixp r8169 ide_pci_generic ide_core ehci_hcd ahci ohci_hcd libata scsi_mod dock usbcore thermal processor fan thermal_sys
[10793.284084]
[10793.284084] Pid: 4327, comm: kate Tainted: P          (2.6.26-2-686 #1)
[10793.284084] EIP: 0073:[<b7769e45>] EFLAGS: 00200296 CPU: 0
[10793.284084] EIP is at 0xb7769e45
[10793.284084] EAX: bf87ca48 EBX: b78faee8 ECX: b7dd6160 EDX: bf87ca48
[10793.284084] ESI: bf87ca48 EDI: bf87caa6 EBP: bf87ca28 ESP: bf87ca10
[10793.284084]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[10793.284084] CR0: 80050033 CR2: b7bb6800 CR3: 348f5000 CR4: 000006d0
[10793.284084] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[10793.284084] DR6: ffff0ff0 DR7: 00000400
[10793.284084]  =======================

The motherboard is ~20 months old. I have been playing 3D games for perhaps an hour every two or three days for the last year.

thorkelljarl · 05-04-2009, 06:18 PM

This is not my speciality...

The memtest you ran was not long enough to tell you very much. You have to test one RAM module at a time, and for many hours. Overnight, that is 8 to 12 hours, is good. The same duration would apply to stressing the CPU.

Have you had any success in finding what the error messages in the first four lines mean? You have had no software or other problems when you are not gaming?
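As a rough guide to those first four lines: "Emask 0x4 (timeout)" means the command got no answer in time, "SAct 0x4" has bit 2 set (matching "tag 2"), and "status: { DRDY }" shows the drive reporting ready without an error bit. The `cmd` line itself can be decoded. The Python sketch below is an illustration, not something posted in the thread; it assumes the libata field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device and only handles the NCQ commands 0x60 (READ FPDMA QUEUED) and 0x61 (WRITE FPDMA QUEUED), where the sector count travels in the feature bytes and the tag in the upper bits of nsect.

```python
import re

# Twelve two-digit hex bytes separated by "/" or ":" after the word "cmd".
CMD_FIELD = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

def decode_ncq_cmd(line, sector_size=512):
    """Decode the taskfile bytes of a libata 'cmd' line for an NCQ read/write."""
    m = CMD_FIELD.search(line)
    if not m:
        raise ValueError("no libata cmd field found")
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    if command not in (0x60, 0x61):
        raise ValueError("not an NCQ read/write command")
    # 48-bit LBA: the "hob" (high order) bytes are the upper 24 bits.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    sectors = hob_feature << 8 | feature   # NCQ keeps the count in feature
    return {
        "op": "read" if command == 0x60 else "write",
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,    # assumes 512-byte sectors
        "tag": nsect >> 3,                 # NCQ tag sits in bits 7:3 of nsect
    }

if __name__ == "__main__":
    line = "[10663.755822] ata1.00: cmd 60/08:10:18:a1:e6/00:00:23:00:00/40 tag 2 ncq 4096 in"
    print(decode_ncq_cmd(line))
```

For the line from the dmesg output above this gives an 8-sector (4096-byte) read with tag 2, which agrees with the "tag 2 ncq 4096 in" text the kernel printed, starting at LBA 602317080. That is a different area of the disk than sector 449453684 from the first incident.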

Try using the command "lshw" as root to find the make and model of your HDD, then try to find the maker's HDD utility and run it. It may pick up a fault that SMART did not.

The problem could also be a component on the motherboard that is now subject to sporadic heat failure. The real problem is that it could be so many things, and only a few can be readily tested.

nini09 · 05-04-2009, 06:54 PM

Based on your last dmesg, the hard disk ran into trouble first, after you had been playing the game. After a while the hard disk driver reset the link, and then the system came back. My guess is that the problem is the hard disk.