LinuxQuestions.org - degraded raid due to pending sectors?

Hi all,

I'm trying to fix a degraded RAID1 array (mdadm) on a CentOS 5.10 server. It has four RAID1 arrays:

Code:

# cat /proc/mdstat 

Personalities : [raid1] 

md0 : active raid1 sda1[0] sdb1[1]

      104320 blocks [2/2] [UU]

      

md3 : active raid1 sdb3[1] sda3[0]

      2096384 blocks [2/2] [UU]

      

md2 : active raid1 sda5[0] sdb5[1]

      1930828096 blocks [2/2] [UU]

      

md1 : active raid1 sda2[2](F) sdb2[1]

      20482752 blocks [2/1] [_U]

      

unused devices: <none>
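To pick out the degraded array without reading the whole listing, something like the following could be used. It is only a sketch: `degraded_arrays` is a made-up helper name, the sample text is a shortened copy of the listing above, and it assumes the status field (`[UU]`, `[_U]`) is the last field on the "blocks" line, as it is here.

```shell
# Shortened sample copied from the /proc/mdstat listing above.
sample_mdstat='Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]
md1 : active raid1 sda2[2](F) sdb2[1]
      20482752 blocks [2/1] [_U]
unused devices: <none>'

# Hypothetical helper: print the name of every array whose status
# field contains an underscore, i.e. has a missing or failed member.
degraded_arrays() {
    awk '/^md[0-9]+ :/ { name = $1 }
         / blocks / && $NF ~ /_/ { print name }'
}

printf '%s\n' "$sample_mdstat" | degraded_arrays
```

On the server itself the input would be `degraded_arrays < /proc/mdstat` instead of the sample text.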

Yesterday I noticed that every array apart from md3 (used for swap) was missing its sda member, so I started re-adding them with

# mdadm /dev/md0 --add /dev/sda1

and so on. This morning I checked whether the resync had finished and noticed that md1 has a faulty member, sda2 (see above). I checked the syslog and found these entries:

Code:

...

end_request: I/O error, dev sda, sector 36321768

...

end_request: I/O error, dev sda, sector 36321869

...

end_request: I/O error, dev sda, sector 36322168

...

end_request: I/O error, dev sda, sector 36319053

...

end_request: I/O error, dev sda, sector 36322253

...

end_request: I/O error, dev sda, sector 36318925

...

end_request: I/O error, dev sda, sector 36318752

...

end_request: I/O error, dev sda, sector 36318797

...

end_request: I/O error, dev sda, sector 36321741

...

end_request: I/O error, dev sda, sector 36322125

...

end_request: I/O error, dev sda, sector 36318669

...
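A sketch for reducing entries like these to a sorted list of unique sector numbers. `failing_sectors` is a hypothetical helper, and the sample text holds only a few of the lines above (with one repeated to show the de-duplication); the real input would be the syslog file, whose path is not shown in this post.

```shell
# A few lines in the same format as the syslog excerpt above.
sample_log='end_request: I/O error, dev sda, sector 36321768
end_request: I/O error, dev sda, sector 36318669
end_request: I/O error, dev sda, sector 36322253
end_request: I/O error, dev sda, sector 36318669'

# Hypothetical helper: read log text on stdin and print the unique
# failing sectors of the device named in $1, in ascending order.
failing_sectors() {
    sed -n "s|.*I/O error, dev $1, sector \([0-9][0-9]*\).*|\1|p" | sort -n -u
}

printf '%s\n' "$sample_log" | failing_sectors sda
```

The first and last lines of that output give the span of the damaged area on the disk.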

If fdisk -lu /dev/sda uses the same sector numbering as these log entries, then all of those sectors are in /dev/sda2 (208845 to 41174594):

Code:

# fdisk -lu /dev/sda



Disk /dev/sda: 2000.3 GB, 2000398934016 bytes

255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors

Units = sectors of 1 * 512 = 512 bytes



  Device Boot      Start        End      Blocks  Id  System

/dev/sda1  *          63      208844      104391  fd  Linux raid autodetect

/dev/sda2          208845    41174594    20482875  fd  Linux raid autodetect

/dev/sda3        41174595    45367559    2096482+  fd  Linux raid autodetect

/dev/sda4        45367560  3907024064  1930828252+  5  Extended

/dev/sda5        45367623  3907024064  1930828221  fd  Linux raid autodetect
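The range check can be done mechanically. The sketch below hard-codes the start and end sectors from the fdisk table above and assumes, as questioned earlier, that the kernel log and fdisk count 512-byte sectors from the same origin. `locate_sector` is a hypothetical helper; it prints the partition name and the offset of the sector from the start of that partition.

```shell
# Hypothetical helper: map an absolute disk sector to "partition offset".
# The boundaries are copied from the fdisk -lu output above; the extended
# partition sda4 is skipped because sda5 lives inside it.
locate_sector() {
    sector=$1
    for p in sda1:63:208844 sda2:208845:41174594 \
             sda3:41174595:45367559 sda5:45367623:3907024064; do
        name=${p%%:*}
        rest=${p#*:}
        first=${rest%%:*}
        last=${rest#*:}
        if [ "$sector" -ge "$first" ] && [ "$sector" -le "$last" ]; then
            echo "$name $((sector - first))"
            return 0
        fi
    done
    echo "none"
    return 1
}

# Lowest and highest sector from the syslog entries.
for s in 36318669 36322253; do
    echo "$s: $(locate_sector "$s")"
done
```

Both ends of the failing range land in sda2, so every sector between them does too.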

I then started a long SMART self-test, which appears to have aborted with 90% remaining due to a read failure:

Code:

# smartctl -a /dev/sda

smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.18-238.12.1.el5] (local build)

Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net



=== START OF INFORMATION SECTION ===

Model Family:    SAMSUNG SpinPoint F4 EG (AFT)

Device Model:    SAMSUNG HD204UI

Serial Number:    S2H7JD2B226718

LU WWN Device Id: 5 0024e9 004b88df5

Firmware Version: 1AQ10001

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Size:      512 bytes logical/physical

Device is:        In smartctl database [for details use: -P show]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 6

Local Time is:    Sun Aug 23 15:47:32 2015 CEST



==> WARNING: Using smartmontools or hdparm with this

drive may result in data loss due to a firmware bug.

****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******

Buggy and fixed firmware report same version number!

See the following web pages for details:

http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386

http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks



SMART support is: Available - device has SMART capability.

SMART support is: Enabled



=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED



General SMART Values:

Offline data collection status:  (0x00) Offline data collection activity

                                        was never started.

                                        Auto Offline Data Collection: Disabled.

Self-test execution status:      ( 121) The previous self-test completed having

                                        the read element of the test failed.

Total time to complete Offline 

data collection:                (20400) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine 

recommended polling time:        (  2) minutes.

Extended self-test routine

recommended polling time:        ( 255) minutes.

SCT capabilities:              (0x003f) SCT Status supported.

                                        SCT Error Recovery Control supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.



SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  100  100  051    Pre-fail  Always      -      145

  2 Throughput_Performance  0x0026  252  252  000    Old_age  Always      -      0

  3 Spin_Up_Time            0x0023  067  067  025    Pre-fail  Always      -      10056

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      42

  5 Reallocated_Sector_Ct  0x0033  252  252  010    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x002e  252  252  051    Old_age  Always      -      0

  8 Seek_Time_Performance  0x0024  252  252  015    Old_age  Offline      -      0

  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      35397

 10 Spin_Retry_Count        0x0032  252  252  051    Old_age  Always      -      0

 11 Calibration_Retry_Count 0x0032  252  252  000    Old_age  Always      -      0

 12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      43

181 Program_Fail_Cnt_Total  0x0022  099  099  000    Old_age  Always      -      30601315

191 G-Sense_Error_Rate      0x0022  252  252  000    Old_age  Always      -      0

192 Power-Off_Retract_Count 0x0022  252  252  000    Old_age  Always      -      0

194 Temperature_Celsius    0x0002  064  048  000    Old_age  Always      -      28 (Min/Max 11/52)

195 Hardware_ECC_Recovered  0x003a  100  100  000    Old_age  Always      -      0

196 Reallocated_Event_Count 0x0032  252  252  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0032  100  100  000    Old_age  Always      -      2

198 Offline_Uncorrectable  0x0030  252  252  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0036  098  098  000    Old_age  Always      -      1413

200 Multi_Zone_Error_Rate  0x002a  100  100  000    Old_age  Always      -      0

223 Load_Retry_Count        0x0032  252  252  000    Old_age  Always      -      0

225 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      43



SMART Error Log Version: 1

ATA Error Count: 104 (device log contains only the most recent five errors)

        CR = Command Register [HEX]

        FR = Features Register [HEX]

        SC = Sector Count Register [HEX]

        SN = Sector Number Register [HEX]

        CL = Cylinder Low Register [HEX]

        CH = Cylinder High Register [HEX]

        DH = Device/Head Register [HEX]

        DC = Device Command Register [HEX]

        ER = Error register [HEX]

        ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.



Error 104 occurred at disk power-on lifetime: 11 hours (0 days + 11 hours)

  When the command that caused the error occurred, the device was active or idle.



  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 fc 6f 03 e0  Error: ICRC, ABRT at LBA = 0x00036ffc = 225276



  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 f5 6f 03 e0 08      00:00:01.070  WRITE DMA

  27 00 00 00 00 00 e0 08      00:00:01.070  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08      00:00:01.070  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      00:00:01.070  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 08      00:00:01.070  READ NATIVE MAX ADDRESS EXT



Error 103 occurred at disk power-on lifetime: 11 hours (0 days + 11 hours)

  When the command that caused the error occurred, the device was active or idle.



  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 fc 6f 03 e0  Error: ICRC, ABRT at LBA = 0x00036ffc = 225276



  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 f5 6f 03 e0 08      00:00:01.070  WRITE DMA

  27 00 00 00 00 00 e0 08      00:00:01.070  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08      00:00:01.070  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      00:00:01.070  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 08      00:00:01.070  READ NATIVE MAX ADDRESS EXT



Error 102 occurred at disk power-on lifetime: 11 hours (0 days + 11 hours)

  When the command that caused the error occurred, the device was active or idle.



  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 fc 6f 03 e0  Error: ICRC, ABRT at LBA = 0x00036ffc = 225276



  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 f5 6f 03 e0 08      00:00:01.069  WRITE DMA

  27 00 00 00 00 00 e0 08      00:00:01.069  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08      00:00:01.069  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      00:00:01.069  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 08      00:00:01.069  READ NATIVE MAX ADDRESS EXT



Error 101 occurred at disk power-on lifetime: 11 hours (0 days + 11 hours)

  When the command that caused the error occurred, the device was active or idle.



  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 fc 6f 03 e0  Error: ICRC, ABRT at LBA = 0x00036ffc = 225276



  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 f5 6f 03 e0 08      00:00:01.069  WRITE DMA

  27 00 00 00 00 00 e0 08      00:00:01.069  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08      00:00:01.069  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      00:00:01.069  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 08      00:00:01.069  READ NATIVE MAX ADDRESS EXT



Error 100 occurred at disk power-on lifetime: 11 hours (0 days + 11 hours)

  When the command that caused the error occurred, the device was active or idle.



  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 fc 6f 03 e0  Error: ICRC, ABRT at LBA = 0x00036ffc = 225276



  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 f5 6f 03 e0 08      00:00:01.069  WRITE DMA

  27 00 00 00 00 00 e0 08      00:00:01.069  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08      00:00:01.069  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      00:00:01.069  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 08      00:00:01.069  READ NATIVE MAX ADDRESS EXT



SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure      90%    35393        36318760



Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run

SMART Selective self-test log data structure revision number 0

Note: revision number not 1 implies that no selective self-test has ever been run

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Completed_read_failure [90% left] (0-65535)

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

I don't understand why there are 0 reallocated sectors and 0 offline uncorrectable sectors, but only 2 pending sectors. Is it because the SMART self-test didn't finish?
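For comparing these three counters before and after any rewrite, a sketch like this pulls them out of the attribute table. `smart_counters` is a hypothetical helper and the sample rows are copied from the output above. Note that attribute 198 is marked "Offline" in the UPDATED column while the same output says offline data collection was never started, which may be part of why it still reads 0.

```shell
# Three rows copied from the attribute table above.
sample_attrs='  5 Reallocated_Sector_Ct  0x0033  252  252  010    Pre-fail  Always      -      0
197 Current_Pending_Sector  0x0032  100  100  000    Old_age  Always      -      2
198 Offline_Uncorrectable  0x0030  252  252  000    Old_age  Offline      -      0'

# Hypothetical helper: print NAME=RAW_VALUE for attributes 5, 197 and 198.
# Field 1 is the ID, field 2 the name, field 10 the raw value.
smart_counters() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}

printf '%s\n' "$sample_attrs" | smart_counters
```

Against the live drive the input would be `smartctl -A /dev/sda | smart_counters`, bearing in mind the firmware warning smartctl prints for this model.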

So my plan now is to remove sda2 from md1 and fill it with zeros

dd if=/dev/zero of=/dev/sda2

in the hope that all pending sectors get reallocated, and then re-add sda2 to md1. Does this sound OK, or are there any pitfalls I haven't thought of yet?
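Written out as a script, the plan might look roughly like the sketch below. It is a dry run by default and only prints the commands. The `run` and `rebuild_member` helpers and the `bs=1M` block size are assumptions added for illustration, and whether zeroing really makes this drive reallocate the pending sectors is exactly the open question.

```shell
DRY_RUN=1   # set to 0 only when the commands should really be executed

# Hypothetical wrapper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 = array, $2 = member partition to wipe and re-add.
rebuild_member() {
    # Take the already-faulty member out of the array.
    run mdadm "$1" --remove "$2"
    # Overwrite the whole partition. This destroys everything on it,
    # and dd normally stops with "No space left on device" at the end.
    run dd if=/dev/zero of="$2" bs=1M
    # Put the partition back so md resyncs it from the good mirror.
    run mdadm "$1" --add "$2"
}

rebuild_member /dev/md1 /dev/sda2
```

Checking /proc/mdstat and the SMART counters after the dd step, before the re-add, would show whether the pending count actually dropped.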

The problem is that I'm not on-site (I'm not even in the same country at the moment), and the only person on-site is a Windows support guy with basically no Linux knowledge. If all else fails, I guess I'll have to get him to buy a new hard disk and replace the failing one.