SMARTD reported disk sector read error

hsugawar · 10-27-2010, 12:47 PM

Two months ago, I built a dedicated backup server using four 2TB SATA drives in software RAID5 (6TB total). A few days ago, smartd sent out mail saying it had found an unreadable sector on one of the drives. I ran a self-test using smartctl and the error persisted. So far, neither software RAID nor the ext4 file system running on top of the RAID volume has reported any errors.

What is the best thing to do now?
MOST OPTIMISTIC: It is normal to have a few bad sectors among billions. Software RAID will work around them. Keep using the system until RAID raises a more serious warning in the future.

MOST PESSIMISTIC: It is a really bad sign that could lead to disaster. No good disk drive should have bad sectors, especially when it is only 2 months old. The error has simply not been detected yet by software RAID and ext4. Replace the drive immediately and return the bad drive to the supplier.

I would welcome any useful suggestions. Thanks.
hiro

TobiSGD · 10-27-2010, 04:39 PM

Most pessimistic is the way to go. Every drive has spare sectors to use if a bad sector is found. The report from SMART usually only comes up if there are no more spare sectors free. Replace the drive.

H_TeXMeX_H · 10-29-2010, 08:54 AM

Bad sectors are often the first sign of failure, and if it is 2 months old, send it back for a replacement.

hsugawar · 11-02-2010, 11:55 AM

I examined the smartctl report a little bit more carefully and found the sector read error is "pending." "Pending" seems to mean "sector replacement is delayed until a write-error is detected on the sector."

Further long self-tests "completed without error." How should I interpret this?

hiro

catkin · 11-02-2010, 12:21 PM

Quote:

Originally Posted by hsugawar

I examined the smartctl report a little bit more carefully and found the sector read error is "pending." "Pending" seems to mean "sector replacement is delayed until a write-error is detected on the sector."

Correct; bad block remapping only happens when the block is written. SMART has detected that the block is bad and is reporting it. Your problem is that you don't know how significant the data on that block is. A fuller explanation here. It is possible, but non-trivial to find out where the block is and hence its significance to you. Procedure detailed here. If that's too much to take on or you have a good backup of all the files you can use the HDD manufacturer's utility to fix it -- and take the risk that the bad block held something important to you.

hsugawar · 11-02-2010, 07:26 PM

catkin,

Thank you for the good suggestions. Yes, I had read the smartmontools article before coming here. Yeah, it's not trivial, and I wondered if there might be an easy solution.

Below are the dumps for the drive in question. The bad sector seems to lie near the beginning of /dev/sda3, which is part of a software RAID5 volume (/dev/md2). Fortunately, this volume is only used for a backup repository file system and it can easily be taken offline (unmounted). So, I think running a simple program like the following on the first few thousand sectors can detect the bad sector and trigger auto-remapping by the disk. Do you think I am correct?

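#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(void) {
    char buf[512];
    int fd = open("/dev/sda3", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }
    for (int i = 0; i < 10000; i++) {
        off_t off = (off_t)i * 512;            /* explicit offset: a failed read does not advance the file position */
        ssize_t n = pread(fd, buf, 512, off);
        if (n == 0) break;                     /* end of device */
        if (n != 512) {                        /* read error (-1) or short read */
            fprintf(stderr, "Bad sector (%d)\n", i);
            memset(buf, 0, sizeof buf);        /* the old contents are unreadable, so write zeros */
            if (pwrite(fd, buf, 512, off) != 512) perror("pwrite");
        }
    }
    return 0;
}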

Thanks,
hiro

[root@shadow ~]# smartctl -l selftest /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1515         -
# 2  Extended offline    Completed without error       00%      1499         -
# 3  Extended offline    Completed: read failure       80%      1473         98047576
# 4  Extended offline    Aborted by host               10%      1472         -
# 5  Short offline       Completed without error       00%      1446         -

[root@shadow ~]# fdisk -lu /dev/sda

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00042339

   Device Boot       Start         End      Blocks  Id  System
/dev/sda1   *           63      385559     192748+  fd  Linux raid autodetect
/dev/sda2           385560    98044694   48829567+  fd  Linux raid autodetect
/dev/sda3         98044695  3893609789 1897782547+  fd  Linux raid autodetect
/dev/sda4       3893610496  3907028991    6709248    5  Extended
/dev/sda5       3893614592  3894638591     512000   82  Linux swap / Solaris
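
For reference, the location of the failing LBA can be checked with simple arithmetic on the two dumps above. The sketch below is only an illustration: the helper names lba_to_partition_sector and sector_to_byte_offset are made up for this example, and it assumes the LBA reported by smartctl and the fdisk start/end values are both counted in 512-byte sectors from the start of the disk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sector index of `lba` relative to the start of a
 * partition, or -1 if the LBA lies outside [part_start, part_end]. */
int64_t lba_to_partition_sector(uint64_t lba, uint64_t part_start,
                                uint64_t part_end)
{
    assert(part_start <= part_end);
    if (lba < part_start || lba > part_end)
        return -1;
    return (int64_t)(lba - part_start);
}

/* Hypothetical helper: byte offset of that sector inside the partition
 * device, given the logical sector size (512 bytes on this drive). */
int64_t sector_to_byte_offset(int64_t sector, uint32_t sector_size)
{
    assert(sector >= 0 && sector_size > 0);
    return sector * (int64_t)sector_size;
}
```

With the values from the dumps, lba_to_partition_sector(98047576, 98044695, 3893609789) gives 2881, so the reported LBA falls 2881 sectors (1475072 bytes) into /dev/sda3, which is inside the first 10000 sectors that the program above would scan.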

catkin · 11-03-2010, 08:49 AM

I don't know whether reading the bad sector would return an error -- but it would be interesting to find out.

The writes should trigger bad-sector mapping but what about the block contents? Is there any guarantee they would be valid? Might it be safer to remove the affected drive, re-initialise it and let RAID 5 re-load it with valid data? Or is it OK to let the RAID 5 correct the possibly invalid sector? I'm no RAID 5 expert.
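
On the question of whether RAID 5 can supply valid contents for the sector: in RAID 5 the parity block of a stripe is the XOR of its data blocks, so any single missing block equals the XOR of the remaining blocks in that stripe. The following is a minimal sketch of that parity math only, not the MD driver's code; the function name raid5_reconstruct and the in-memory layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rebuild the block belonging to member `missing` by XOR-ing the matching
 * block from every other member of the stripe (data and parity alike).
 * blocks[m] points to `len` bytes; blocks[missing] is never read. */
void raid5_reconstruct(const uint8_t *const blocks[], size_t members,
                       size_t missing, uint8_t *out, size_t len)
{
    assert(members >= 3 && missing < members);
    memset(out, 0, len);
    for (size_t m = 0; m < members; m++) {
        if (m == missing)
            continue;               /* skip the unreadable member */
        for (size_t i = 0; i < len; i++)
            out[i] ^= blocks[m][i];
    }
}
```

For a four-drive array like the one in this thread, the content of one unreadable sector can therefore be recomputed from the corresponding sectors on the other three drives, provided those three are readable.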

H_TeXMeX_H · 11-03-2010, 02:08 PM

If a SMART long test came up clean, the drive may be OK. I'm pretty sure the SMART long test WILL detect bad blocks; the short one will not.

Some useful info:
http://smartmontools.sourceforge.net/badblockhowto.html

hsugawar · 11-11-2010, 05:21 PM

catkin and H_TeXMeX_H,

Thank you very much for the very useful comments.

I tried another long test on the drive and it completed successfully. So, for the time being, I optimistically assume that the fault was temporary or that sector remapping is already in effect.

Yes, it would be far better to let SMART perform a write attempt on the suspicious sector than to write my own program. It would just be a matter of pulling out the SATA cable for a while and reconnecting it. The MD daemon should then start reconstructing the MD data structures, possibly with an update to the sector in question.

I will try the let-MD approach if I find the sector read error persists.

Thanks a lot!!