Linux - Hardware (LinuxQuestions.org forum for hardware issues)
Now I wonder: is this a hard drive, cable, or motherboard error, or a bug in Linux?
And what does the Xorg process have to do with the hard drive?
I have never had an 'Aborting journal on device' error before. Note that the dm-2 device is an encrypted partition on sda.
However, the 'end_request: I/O error' messages remind me of my old hard drive, which developed bad sectors (since then I have bought two new drives from different brands and use one as a backup of the other).
TTL
Last edited by TTL_2; 04-12-2009 at 04:01 AM.
Reason: spelling
It looks hardware related: the device sent a timeout message to the device driver, meaning the device timed out, probably while reading or writing at sector 449453684. Without being able to debug it I can't really be sure it's caused by hardware, but I have my suspicions.
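To get a feel for where that sector sits on the disk, here is a rough sketch in Python. It assumes the sector number refers to sda itself (not to the dm-2 mapping on top of it), that sectors are 512 bytes, and that the drive has the 625140335 sectors the kernel reports for sda in the dmesg output posted in this thread. The function name `sector_to_offset` is made up for illustration.

```python
SECTOR_SIZE = 512  # bytes; assumption based on the "512-byte hardware sectors" kernel message


def sector_to_offset(sector, total_sectors, sector_size=SECTOR_SIZE):
    """Return (byte_offset, fraction_of_device) for a sector number."""
    if not 0 <= sector < total_sectors:
        raise ValueError("sector lies outside the device")
    return sector * sector_size, sector / total_sectors


# Values taken from this thread: the suspected sector and the size reported for sda.
offset, fraction = sector_to_offset(449453684, 625140335)
print(f"byte offset: {offset} ({offset / 1e9:.1f} GB)")
print(f"position:    {fraction:.1%} of the way through the device")
```

If the number falls inside the device, as it does here, the error at least points at a real location on the disk and not at a nonsense address.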
Thank you for your answer. The tables on the website are interesting.
In the meantime I ran a long SMART test on the drive, which did not find any errors, and I could rsync many GB with my second drive without any problems. Now the bad news:
1. While I was playing with smartctl, one of the (as far as I remember harmless) SMART commands caused the drive to perform a reset. Linux recovered and continued running normally.
2. While I was playing a 3D game (bzflag), some connection problems happened. After I exited the game, the CPU did not clock down as usual, and commands like "top" did not execute properly. However, dmesg did not show any problems. At least I was able to reboot the system normally.
Then I guessed this could be a heat problem: the graphics card (ATI with fglrx) is an onboard one, I had also used 3D acceleration right before the first hard drive problem, and there are several reports from other users of my motherboard saying that the northbridge gets very hot. I started playing bzflag again, this time while watching the temperatures (ssh with X forwarding to my second PC), but nothing went above 50°C (and no problems occurred this time).
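For anyone who wants to log the temperatures instead of watching them by eye, the following is a minimal Python sketch. It assumes the sensor drivers expose readings through the kernel's hwmon sysfs files (`temp*_input`, in millidegrees Celsius); where exactly those files live differs between kernel versions, so two common locations are tried. The function names and the 50°C default limit (borrowed from the figure mentioned above) are only illustrative.

```python
import glob
import os


def read_temps(root="/sys/class/hwmon"):
    """Collect temperatures in degrees Celsius from hwmon temp*_input files."""
    readings = {}
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        os.path.join(root, "hwmon*", "device", "temp*_input"),  # layout on some older kernels
    ]
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as fh:
                    # hwmon reports millidegrees Celsius as a plain integer
                    readings[path] = int(fh.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor vanished or returned something unreadable
    return readings


def too_hot(readings, limit_c=50.0):
    """Return only the sensors whose reading exceeds limit_c."""
    return {path: temp for path, temp in readings.items() if temp > limit_c}


if __name__ == "__main__":
    for path, temp in too_hot(read_temps()).items():
        print(f"{path}: {temp:.1f} C")
```

Run in a loop while the game is running, this would leave a record to compare against the time of the next lockup. Note that it only sees sensors the kernel knows about, so a hot northbridge without its own sensor would still go unnoticed.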
OK, I ran memtest86+ for a little more than an hour; there weren't any errors. Then I cleaned the heat spreaders with a vacuum cleaner. I let two instances of burnK7 run for twenty minutes without any problem.
I played games again, and two or three times nothing happened. Until now.
Again programs locked up in a similar way, as reported previously. But this time I did not reboot and waited some minutes. After that the system suddenly continued to operate normally. At that point the following appeared in dmesg:
Code:
[10663.755822] ata1.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6 frozen
[10663.755822] ata1.00: cmd 60/08:10:18:a1:e6/00:00:23:00:00/40 tag 2 ncq 4096 in
[10663.755822] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[10663.755822] ata1.00: status: { DRDY }
[10663.755822] ata1: hard resetting link
===== At this point the system worked normally again =====
[10793.284084] ata1: softreset failed (device not ready)
[10793.284084] ata1: failed due to HW bug, retry pmp=0
[10793.284084] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[10793.284084] ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
[10793.284084] ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
[10793.284084] ata1.00: configured for UDMA/133
[10793.284084] ata1: EH complete
[10497.338267] sd 0:0:0:0: [sda] 625140335 512-byte hardware sectors (320072 MB)
[10497.338267] sd 0:0:0:0: [sda] Write Protect is off
[10497.338267] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[10497.338267] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[10793.284084] BUG: soft lockup - CPU#0 stuck for 276s! [kate:4327]
[10793.284084] Modules linked in: cpufreq_userspace ppdev lp fglrx(P) ipv6 fuse ext2 sha256_generic aes_i586 aes_generic cbc dm_crypt crypto_blkcipher dm_snapshot dm_mirror dm_log dm_mod it87 hwmon_vid eeprom powernow_k8 freq_table pktcdvd parport_pc parport k8temp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm i2c_piix4 i2c_core snd_timer snd soundcore snd_page_alloc button ati_agp agpgart shpchp pci_hotplug evdev ext3 jbd mbcache ide_cd_mod cdrom ata_generic usbhid hid ff_memless sd_mod atiixp r8169 ide_pci_generic ide_core ehci_hcd ahci ohci_hcd libata scsi_mod dock usbcore thermal processor fan thermal_sys
[10793.284084]
[10793.284084] Pid: 4327, comm: kate Tainted: P (2.6.26-2-686 #1)
[10793.284084] EIP: 0073:[<b7769e45>] EFLAGS: 00200296 CPU: 0
[10793.284084] EIP is at 0xb7769e45
[10793.284084] EAX: bf87ca48 EBX: b78faee8 ECX: b7dd6160 EDX: bf87ca48
[10793.284084] ESI: bf87ca48 EDI: bf87caa6 EBP: bf87ca28 ESP: bf87ca10
[10793.284084] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[10793.284084] CR0: 80050033 CR2: b7bb6800 CR3: 348f5000 CR4: 000006d0
[10793.284084] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[10793.284084] DR6: ffff0ff0 DR7: 00000400
[10793.284084] =======================
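The `cmd` line near the top of that output can be partly decoded. The sketch below is Python and rests on an assumption about how libata prints the taskfile: command, then feature:nsect:lbal:lbam:lbah, then the same five "hob" (high-order) bytes, then the device byte. For an NCQ command (0x60 read, 0x61 write) the feature bytes hold the sector count and nsect holds the tag shifted left by three bits. The function name `decode_ncq_cmd` is made up.

```python
import re

# cmd CC/FF:NN:L0:L1:L2/ff:nn:l3:l4:l5/DD  (assumed libata taskfile layout)
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})/([0-9a-f:]{14})/([0-9a-f:]{14})/([0-9a-f]{2})"
)


def decode_ncq_cmd(line):
    """Decode LBA, sector count and tag from a libata NCQ 'cmd' line."""
    match = CMD_RE.search(line)
    if match is None:
        raise ValueError("no libata cmd taskfile found")
    command = int(match.group(1), 16)
    if command not in (0x60, 0x61):
        raise ValueError("not an NCQ read/write command")
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in match.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in match.group(3).split(":")
    )
    # 48-bit LBA: the hob bytes carry bits 24..47, the plain bytes bits 0..23.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "op": "read" if command == 0x60 else "write",
        "lba": lba,
        "sectors": (hob_feature << 8) | feature,
        "tag": nsect >> 3,
    }


line = "[10663.755822] ata1.00: cmd 60/08:10:18:a1:e6/00:00:23:00:00/40 tag 2 ncq 4096 in"
print(decode_ncq_cmd(line))
```

For the line above this gives an 8-sector read with tag 2 at LBA 602317080. The count and tag agree with the "tag 2 ncq 4096 in" that the kernel prints on the same line (8 × 512 = 4096 bytes), which suggests the assumed layout is right, and the LBA is below the 625140335 sectors reported for sda. The `res` line with `Emask 0x4 (timeout)` then says the drive never answered that read.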
The motherboard is about 20 months old. For the last year I have been playing 3D games for perhaps an hour every two or three days.
The memtest you ran was not long enough to tell you very much. You have to test one RAM module at a time, and for many hours. Overnight, that is 8 to 12 hours, is good. The same duration applies to stress-testing the CPU.
Have you had any success in finding out what the error messages in the first four lines mean? And you have had no software or other problems when you are not gaming?
Try running the command "lshw" as root to find the make and model of your HDD, then look for the manufacturer's HDD utility and run it. It may pick up a fault that SMART did not.
The problem could also be a component on the motherboard that is now subject to sporadic heat failure. The real problem is that it could be so many things, and only a few can be readily tested.
Last edited by thorkelljarl; 05-05-2009 at 07:52 AM.
Based on your last dmesg, your hard disk first ran into trouble after you played the game. After a while, the hard disk driver reset the link and the system came back. I guess the problem is the hard disk.
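How long the link was down can be estimated from the kernel timestamps. This is a small Python sketch, not a general dmesg parser: it only pairs each "hard resetting link" message with the next "EH complete" on the same port, and the function name `recovery_gaps` is made up. The timestamps in the posted log look uneven (many lines share one stamp and some are out of order), so the result is approximate at best.

```python
import re
from decimal import Decimal

# "[10663.755822] ata1: hard resetting link" -> timestamp, port, message
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+)(?:\.\d+)?: (.*)$")


def recovery_gaps(dmesg_text):
    """Return [(port, seconds)] between a link reset and the end of error handling."""
    pending = {}  # port -> timestamp of the last unmatched hard reset
    gaps = []
    for raw in dmesg_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue
        stamp, port, message = Decimal(match.group(1)), match.group(2), match.group(3)
        if message.startswith("hard resetting link"):
            pending[port] = stamp
        elif message.startswith("EH complete") and port in pending:
            gaps.append((port, stamp - pending.pop(port)))
    return gaps


sample = """\
[10663.755822] ata1: hard resetting link
[10793.284084] ata1: softreset failed (device not ready)
[10793.284084] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[10793.284084] ata1: EH complete
"""
for port, seconds in recovery_gaps(sample):
    print(f"{port}: about {seconds} s between reset and recovery")
```

With the lines quoted from the log this comes to roughly 130 seconds, which fits the description of the system hanging for some minutes and then carrying on.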