Permanent filesystem corruption on reiserfs, ext3 and ext4 - disk failure?
I have been having problems with filesystem corruption on my eeepc 1000H for a long time now. I have tried using different filesystems, kernels and distributions (arch, slackware) to no effect. I am starting to grow suspicious that this problem lies somewhere else, as I haven't seen anyone else having similar problems in such a variety of scenarios.
I have tested my RAM with memtest86+, which found nothing after a full pass. I have also run e2fsck -c to check for bad blocks; it finds none. I had a go at using smartctl but wasn't really sure what I was doing. I did a long test and it came up with nothing anyway.
This problem is in addition to the problems I've been having with my intel graphics chip and KMS. A lot of the time there are lockups when booting into X, which can only be gotten out of by a hard reset. This is sometimes what causes the original filesystem errors. I've stopped messing around with KMS for now to eliminate this, but my current system is unbootable.
I'm guessing my disk is wrecked but have as yet seen no definitive proof. Can anyone recommend anything that I should do? I am currently on ext4 with a custom kernel 2.6.33-rc6 (the stock kernel shipping with slackware does not have the elantech extension for psmouse included). When I was using arch, I was just using the stock kernels.
Thanks, Tom |
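For reference, the SMART side of those checks can be sketched roughly as below. This is only a sketch: /dev/sda is taken from later in the thread, and the run helper and the DRY_RUN variable are made up here so the commands can be previewed without touching the disk. The smartctl options themselves (-H, -t long, -l selftest, -l error) are standard smartmontools ones. The long test runs inside the drive, so the self-test log only means something once the test has finished.

```shell
#!/bin/sh
# Sketch: SMART checks for a suspect disk (run as root).
# "run" and DRY_RUN are hypothetical helpers for this example only.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

smart_checks() {
    dev="$1"
    run smartctl -H "$dev"           # overall health self-assessment
    run smartctl -t long "$dev"      # start the extended self-test
    run smartctl -l selftest "$dev"  # self-test log (re-read after the test ends)
    run smartctl -l error "$dev"     # errors the drive itself has logged
}

# Example (preview only): DRY_RUN=1; smart_checks /dev/sda
```

With DRY_RUN=1 the function only prints the four commands, which is a cheap way to check what will be run before doing it for real.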
I'm still on ext3 because I want others to have the headaches with ext4 and to have all of them sorted before I go near it.
If the tools keep finding no disk errors, the disk is probably OK. I'd go to 'safe defaults' in the BIOS. Next I'd back up fully elsewhere. Elaborate on your problems with KMS. You can boot from an install cd or, if the kernel boots, boot with init=/bin/bash as a kernel option. You'll get a root shell instead of running init. You'll have to work at it from there (/sbin/mount /dev/whatsit -t ext4 /, a PATH statement, etc.), but you can get in and repair things. |
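A rough sketch of what working from an init=/bin/bash shell might look like. Assumptions, not facts from the thread: the root filesystem is /dev/sda1 (the device used in the fsck command further down), and the function names are invented for this example. It differs slightly from the mount command above in that it runs fsck while root is still read-only and only remounts read-write afterwards.

```shell
#!/bin/sh
# Sketch for a shell started with init=/bin/bash.
# root_is_readonly and repair_root are hypothetical names.

# Succeeds if the root filesystem is mounted read-only.
# $1: mounts table to read (defaults to /proc/mounts).
root_is_readonly() {
    awk '$2 == "/" { opts = $4 }
         END { if (opts ~ /(^|,)ro(,|$)/) exit 0; exit 1 }' "${1:-/proc/mounts}"
}

repair_root() {
    dev="$1"
    # init has not run, so set a usable PATH by hand.
    PATH=/sbin:/bin:/usr/sbin:/usr/bin; export PATH
    # /proc is not mounted yet either.
    [ -e /proc/mounts ] || mount -t proc proc /proc
    if root_is_readonly; then
        fsck -v -y "$dev"        # repair while nothing can write to it
        mount -o remount,rw /    # then make / writable for further work
    else
        echo "root is mounted read-write; not running fsck on it" >&2
        return 1
    fi
}

# Example (not run here): repair_root /dev/sda1
```

The read-only check is there because repairing a filesystem that is mounted read-write can make the damage worse.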
Any error messages you could post?
It could be something related to the SATA controller. What kind of HDD do you have? What southbridge? What motherboard? |
HDD model is ST980811AS from hdparm -I. From google I get that it's a Seagate Momentus 80GB
Here's my lspci -v output:
Code:
00:00.0 Host bridge: Intel Corporation Mobile 945GME Express Memory Controller Hub (rev 03)
I've tried booting from a usb stick and running fsck from there, it returns the same results. When I boot up normally in slackware it always returns me to the single-user shell anyway so that's no different is it? I'll attach a couple of outputs from consecutive runs of
Code:
fsck -v -y /dev/sda1
Tom |
How old is this disk? Can you also post the output of 'smartctl -a /dev/sda'?
|
The disk is a year and a half old. It came with the computer in late August 2008
Code:
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen |
I think that this drive is reasonably likely to fail. The reasons being:
Code:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
There's also the error listed there. This suggests the drive is old and wearing out, which may be a reason for the persistent corruption. It also seems that the drive has been overheating, and has overheated at least once. |
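The attribute rows quoted in that post did not survive here, so which ones were singled out is not known. As a hedged illustration of this kind of reading, the sketch below filters `smartctl -A` output using the column layout from the header above. The choice of attributes (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable) and the function name are assumptions of this example, not the rule used in the thread.

```shell
#!/bin/sh
# Sketch: flag worrying rows in "smartctl -A" output read from stdin.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
# flag_smart_attrs is a hypothetical name; the attribute list is an assumption.
flag_smart_attrs() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            name = $2; worst = $5 + 0; thresh = $6 + 0; raw = $10 + 0
            # Sector-remapping counters: any non-zero raw value is a warning.
            if ((name == "Reallocated_Sector_Ct" ||
                 name == "Current_Pending_Sector" ||
                 name == "Offline_Uncorrectable") && raw > 0)
                print name ": raw value " raw
            # Normalised value has touched the vendor threshold at some point.
            if (thresh > 0 && worst <= thresh)
                print name ": worst " worst " at or below threshold " thresh
        }'
}

# Example (needs root): smartctl -A /dev/sda | flag_smart_attrs
```

An empty result does not prove the disk is healthy; it only means none of these particular checks tripped.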
That's the second laptop I've killed then. I guess it serves me right for using them more or less as desktops! Thanks a lot for interpreting that smartctl output H_TeXMeX_H. Now seems like as good a time as any to have a look at the world of SSDs!
|
These might be silly questions, but bear with me.
Have you any errors in the logs? particularly /var/log/messages? Is acpid running? Do you hear a fan blowing warm air and does it change speed? My laptop always blows warm air. If the box is lightly used, the fan remains slow. If I'm compiling, the fan speed rises. Is the hard disk spinning down? Lots of good tips on http://www.lesswatts.org
If you set up again, install acpitools. Then you can run acpitool -t and find your internal temperatures. Also run lm_sensors, let it set up your sensors, and run acpid with the -l option for a week. Then you can look in the logs and see all the signals the box gives, and set up scripts and events to do what you want with them. Slackware sadly gives you squat there. For instance, my laptop has a hibernate script on the lid switch, so you turn off by closing the lid. |
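On the log question, a small sketch of how /var/log/messages could be searched for disk trouble. The patterns are assumptions based on how kernels of this era typically word such messages (libata exceptions, block-layer I/O errors, ext3/ext4 error reports); they are not taken from this thread's logs, and disk_errors is a name invented for the example.

```shell
#!/bin/sh
# Sketch: print log lines that look like disk or filesystem errors.
# $1: log file to search (defaults to /var/log/messages).
# disk_errors is a hypothetical name; the patterns are assumptions.
disk_errors() {
    grep -E -i 'I/O error|ata[0-9]+(\.[0-9]+)?: (exception|failed command)|EXT[34]-fs error' \
        "${1:-/var/log/messages}"
}

# Example: disk_errors /var/log/messages | tail -n 20
```

Hits that line up in time with the X lockups and hard resets would point at the resets rather than the disk; hits during normal use are more suspicious.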
Thanks for the tips business_kid. There's already a package that I use to handle acpi, it's called eeepc-acpi-scripts, made by Alien Bob. This, along with kde and the eeepc kernel module, handles most powersaving, suspend2ram etc. and I dynamically underclock anyway. I've kept my eye on the temperature in the past but as there's only an intel atom processor inside the eee 1000h heat has not really been an issue in general.
Having said that, I recently sent it back to be repaired for a separate issue and they (ASUS) replaced my motherboard without putting a heatsink on the processor. I didn't realise that this was the case until it suddenly shut down, having reached the critical temperature. As you can imagine, my heartfelt thanks and respect go out to the helpful folk at ASUS. It is worth noting that I was experiencing these errors before this happened (hence not mentioning it), but that would probably explain why smartctl also shows some high temperatures in the past. Anyway, I've bitten the bullet and ordered myself a 30GB SSD. I think I'm still within my warranty period on the old hard disk but I kinda wanted an SSD anyway! I'll mark this thread as solved when it arrives this weekend, assuming that was the problem of course. Thanks again to both of you for your help, Tom |