Ext3 partition read/write hiccups
Greets
So, where to begin... I have a Linksys NSLU2 (ARM architecture) running Debian etch, with an external 500GB USB Western Digital Elements drive. The little slugger serves my home network 24/7 with a handful of common services, including network storage. The drive is split into 3 partitions: sda1 for swap, sda2 (ext3) for the system, and sda3 (ext3) for storage, which takes up most of the drive's space. I bought all of it no more than two months ago, and it ran flawlessly up until a couple of weeks ago, when it locked up and I had to hard shutdown the NSLU2 by physically cutting the power.
Since then, the sda3 partition has been getting read/write (mostly read) errors frequently, apparently at random, and I've had to "fsck -y" it about 4-5 times already in about a week. A ton of things get fixed and the partition gets marked clean, but the read/write errors continue. I'm hoping it's not a hardware problem, but I can't be sure. The problems seem to be confined to the sda3 partition. Also, I haven't tried checking for bad blocks yet because the scan would take about 35 hours or more to complete.
Specifically, when there's a prolonged read operation going on, such as a computer on the network playing a video stored on the device, the video often hangs for about 3-5 seconds at random intervals. The video file itself is not corrupted: I copied it to another computer and it checked out ok there, no hangs, no errors, nothing. Immediately after a hang stops, I get this kernel message:
May 21 00:11:02 nslu2 kernel: sda: Current: sense key=0x0
May 21 00:11:02 nslu2 kernel: ASC=0x0 ASCQ=0x0
May 21 00:11:02 nslu2 kernel: Info fld=0x0
...which, according to a Google search, doesn't mean much. More rarely, I also get several messages of this type in a row:
May 20 06:19:10 nslu2 kernel: attempt to access beyond end of device
May 20 06:19:10 nslu2 kernel: sda3: rw=0, want=1199311376, limit=967964445
...when reading or writing to disk. Maybe some metadata got screwed, but shouldn't fsck fix that too? Currently, the partition is marked clean and I'm getting these messages practically every time I read or write something on sda3 (and only on sda3), be it large or small files. One effect I found strange while playing MP3 files was that playback seemed to skip forwards and/or backwards for a couple of seconds, after which it resumed exactly where it left off.
I've used tune2fs to set the partition to remount read-only when an error occurs. The messages quoted above don't trigger that, but when actual errors happen they warrant an fsck sweep, which every time fixes tens if not hundreds of things and sometimes needs several fsck passes. Needless to say, the lost+found directory is crawling with stuff, even though there seem to be almost no (noticeable) corrupted user files.
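For what it's worth, here's a quick sketch of what the want/limit numbers in that second message work out to. It assumes the kernel reports both values in 512-byte sectors (which, as far as I can tell, is the case for this message); the function name is just something I made up for the calculation.

```python
SECTOR_BYTES = 512  # assumption: want/limit are counted in 512-byte sectors


def overshoot(want, limit, sector_bytes=SECTOR_BYTES):
    """Return (sectors, bytes) by which a request runs past the device end."""
    sectors = max(0, want - limit)
    return sectors, sectors * sector_bytes


# Values copied from the "attempt to access beyond end of device" log line.
want, limit = 1199311376, 967964445
sectors, nbytes = overshoot(want, limit)

print(f"partition size: {limit * SECTOR_BYTES / 1e9:.1f} GB")
print(f"requested end:  {want * SECTOR_BYTES / 1e9:.1f} GB")
print(f"overshoot:      {sectors} sectors ({nbytes / 1e9:.1f} GB)")
```

If the sector assumption holds, the request ends roughly 118 GB past the end of sda3, which to me looks more like a bogus block pointer somewhere in the metadata than a near miss at the end of the partition, though that's only a guess.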
What can I do about this, short of scrapping the partition or drive? What's going on? Any hints?
Thanks