RAID degraded, partition missing from md0

Ser Olmy · 11-15-2013, 12:23 PM

Must be the drives. There's no way other hardware or software can make a drive report "pending sectors" via S.M.A.R.T. Media error is the only possibility.

reano · 11-15-2013, 01:55 PM

OK, I'm on the premises. I turned off the server (it was hanging with a lot of error messages, like you predicted). I removed sdb (I matched the serial number on the drive casing against the serial number smartctl reported for sdb).

Booted up, and it's running now. But here's the really strange thing:

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md3 : active raid1 sdc1[1] sdb1[0]
      2930133824 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]
      
md1 : active (auto-read-only) raid1 sda2[0]
      24006528 blocks super 1.2 [2/1] [U_]
      
md2 : active raid1 sda3[0]
      1441268544 blocks super 1.2 [2/1] [U_]
      
md4 : active raid1 sdd2[0] sde2[1]
      2929939264 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

I definitely removed sdb, yet now sdf is missing and sdb is still there. That mdstat output doesn't make sense either; look at it closely. It looks like sdf became sdb, or something. Compare it with how mdstat looked before:

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2] sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  4.3% (63596480/1464710976) finish=318.5min speed=73315K/sec

md1 : active raid1 sda2[0] sdb2[1]
      24006528 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>
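
One way to compare the two listings without eyeballing them is to parse each into a map of array to member partitions. The sketch below is only an illustration: `parse_mdstat` and `degraded` are hypothetical helper names, and the regular expressions assume the /proc/mdstat layout shown above. Keep in mind that the kernel hands out sdX names in detection order, so pulling one disk can shift the remaining disks to earlier letters after a reboot.

```python
import re


def parse_mdstat(text):
    """Map each md array to its member partitions and its [UU]/[U_] status."""
    arrays = {}
    name = None
    for line in text.splitlines():
        header = re.match(r"(md\d+) : ", line)
        if header:
            name = header.group(1)
            # Members look like "sda1[0]"; the bracketed value is the device number.
            members = {int(num): dev
                       for dev, num in re.findall(r"(\w+)\[(\d+)\]", line)}
            arrays[name] = {"members": members, "status": None}
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if name and status and arrays[name]["status"] is None:
            arrays[name]["status"] = status.group(1)
    return arrays


def degraded(arrays):
    """Names of arrays that have at least one missing member ("_")."""
    return sorted(n for n, a in arrays.items()
                  if a["status"] and "_" in a["status"])


# Shortened sample taken from the first listing in this post.
sample = """\
Personalities : [raid1]
md3 : active raid1 sdc1[1] sdb1[0]
      2930133824 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]
"""

print(degraded(parse_mdstat(sample)))
```

Running the same function over the old and the new listing and comparing the `members` maps shows which partition names moved between arrays.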

Btw, my swap partition runs on md1, but it shows as auto read-only?

EDIT: Here are the md device details:

Code:

/dev/md0:
        Version : 1.2
  Creation Time : Sat Dec 29 17:09:45 2012
     Raid Level : raid1
     Array Size : 1464710976 (1396.86 GiB 1499.86 GB)
  Used Dev Size : 1464710976 (1396.86 GiB 1499.86 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 22:08:29 2013
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : lia:0  (local to host lia)
           UUID : eb302d19:ff70c7bf:401d63af:ed042d59
         Events : 513922

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

Code:

/dev/md1:
        Version : 1.2
  Creation Time : Sat Dec 29 17:09:50 2012
     Raid Level : raid1
     Array Size : 24006528 (22.89 GiB 24.58 GB)
  Used Dev Size : 24006528 (22.89 GiB 24.58 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 15:36:33 2013
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : lia:1  (local to host lia)
           UUID : 1f8dff14:bc317bcb:d3587249:9ffc0b42
         Events : 58

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       0        0        1      removed

Code:

/dev/md2:
        Version : 1.2
  Creation Time : Sat Dec 29 17:09:59 2012
     Raid Level : raid1
     Array Size : 1441268544 (1374.50 GiB 1475.86 GB)
  Used Dev Size : 1441268544 (1374.50 GiB 1475.86 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 21:42:19 2013
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : lia:2  (local to host lia)
           UUID : 543b8db0:660e4e18:d388dec8:b9fe81cb
         Events : 103

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       0        0        1      removed

Code:

/dev/md3:
        Version : 1.2
  Creation Time : Sat Dec 29 17:10:04 2012
     Raid Level : raid1
     Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 21:48:23 2013
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : lia:3  (local to host lia)
           UUID : 2a35faa7:b076b115:f2e45d70:e9e0f885
         Events : 72

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1

Code:

/dev/md4:
        Version : 1.2
  Creation Time : Sat Dec 29 17:10:15 2012
     Raid Level : raid1
     Array Size : 2929939264 (2794.21 GiB 3000.26 GB)
  Used Dev Size : 2929939264 (2794.21 GiB 3000.26 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 22:08:50 2013
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : lia:4  (local to host lia)
           UUID : 18cafde6:cdd0d6ad:e80fe7e2:a346e157
         Events : 196

    Number   Major   Minor   RaidDevice State
       0       8       50        0      active sync   /dev/sdd2
       1       8       66        1      active sync   /dev/sde2

I'll post the smartctl stats in the next post, since this one is getting a bit long.

reano · 11-15-2013, 02:22 PM

...continued from previous post...

sda:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000VX000-9YW166
Serial Number:    Z1F0SK6G
LU WWN Device Id: 5 000c50 04dcd6768
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:17:20 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       157046752
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       193004742
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8982
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       97
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       896
190 Airflow_Temperature_Cel 0x0022   063   055   045    Old_age   Always       -       37 (Min/Max 32/37)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       89
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       326
194 Temperature_Celsius     0x0022   037   045   000    Old_age   Always       -       37 (0 16 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sdb:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000VX000-9YW166
Serial Number:    Z1F0SN8B
LU WWN Device Id: 5 000c50 04dcd6911
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:18:19 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       142164536
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   070   060   030    Pre-fail  Always       -       11890152
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8983
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       97
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       114
190 Airflow_Temperature_Cel 0x0022   068   059   045    Old_age   Always       -       32 (Min/Max 31/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       89
193 Load_Cycle_Count        0x0032   090   090   000    Old_age   Always       -       21074
194 Temperature_Celsius     0x0022   032   041   000    Old_age   Always       -       32 (0 15 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sdc:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000VX000-9YW166
Serial Number:    Z1F0SML8
LU WWN Device Id: 5 000c50 04dcd1e8e
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:19:47 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   098   006    Pre-fail  Always       -       66583096
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   070   060   030    Pre-fail  Always       -       11716429
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8981
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       97
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       263
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       314
190 Airflow_Temperature_Cel 0x0022   066   058   045    Old_age   Always       -       34 (Min/Max 31/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       89
193 Load_Cycle_Count        0x0032   090   090   000    Old_age   Always       -       20770
194 Temperature_Celsius     0x0022   034   042   000    Old_age   Always       -       34 (0 14 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       24
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 248 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 248 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:28.428  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:28.427  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:28.419  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:28.339  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:28.331  READ NATIVE MAX ADDRESS EXT

Error 247 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:25.531  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:25.531  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:25.522  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:25.443  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:25.435  READ NATIVE MAX ADDRESS EXT

Error 246 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:22.671  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:22.670  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:22.662  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:22.590  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:22.574  READ NATIVE MAX ADDRESS EXT

Error 245 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:19.803  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:19.802  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:19.794  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:19.714  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:19.706  READ NATIVE MAX ADDRESS EXT

Error 244 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:16.934  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:16.933  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:16.925  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:16.846  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:16.830  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
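
To pick the interesting rows out of dumps this long, the attribute table can be filtered mechanically. This is a rough sketch, not a health verdict: `flag_attributes` is a hypothetical helper, and the choice of watched IDs (5, 187, 197, 198) is an assumption about which raw values are worth a closer look when they are non-zero. It expects the ten-column table layout printed by smartctl above.

```python
# Attribute IDs to watch; this selection is an assumption of the sketch.
WATCHED = {5, 187, 197, 198}


def flag_attributes(text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, 0x flag, value, worst, thresh,
        # type, updated, when_failed, raw value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue  # skips error-log register rows that also start with digits
        attr_id, raw = int(fields[0]), fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[fields[1]] = int(raw)
    return flagged


# Rows copied from the sdc table above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       263
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       24
"""

print(flag_attributes(sample))
```

For the drive listed as sdc here (serial Z1F0SML8), this picks out the uncorrectable-error count and the 24 pending sectors.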

...continued in next post...

reano · 11-15-2013, 02:23 PM

...continued from previous post...

sdd:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000VX000-9YW166
Serial Number:    Z1F0R4EY
LU WWN Device Id: 5 000c50 04dc4a62e
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:20:51 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       117184888
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       53608287
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8988
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       97
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   046   046   000    Old_age   Always       -       54
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   020   020   000    Old_age   Always       -       80
190 Airflow_Temperature_Cel 0x0022   064   059   045    Old_age   Always       -       36 (Min/Max 31/36)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       89
193 Load_Cycle_Count        0x0032   063   063   000    Old_age   Always       -       75120
194 Temperature_Celsius     0x0022   036   041   000    Old_age   Always       -       36 (0 16 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 54 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 54 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:44.943  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:44.942  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:44.894  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:44.886  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:44.886  READ NATIVE MAX ADDRESS EXT

Error 53 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.


  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:42.102  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:42.101  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:42.094  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:42.094  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:42.093  READ NATIVE MAX ADDRESS EXT

Error 52 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:39.290  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:39.289  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:39.216  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:39.209  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:39.209  READ NATIVE MAX ADDRESS EXT

Error 51 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:36.421  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:36.420  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:36.364  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:36.356  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:36.356  READ NATIVE MAX ADDRESS EXT

Error 50 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:33.584  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:33.583  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:33.576  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:33.575  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:33.575  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sde:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000VX000-9YW166
Serial Number:    Z1F0SMES
LU WWN Device Id: 5 000c50 04dcd3ad1
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:21:56 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   099   006    Pre-fail  Always       -       54310680
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       55382099
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8988
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       97
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       393
190 Airflow_Temperature_Cel 0x0022   066   062   045    Old_age   Always       -       34 (Min/Max 29/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       89
193 Load_Cycle_Count        0x0032   064   064   000    Old_age   Always       -       73794
194 Temperature_Celsius     0x0022   034   040   000    Old_age   Always       -       34 (0 15 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

reano · 11-15-2013, 02:24 PM

...continued from previous post...

As you can see from all the stats in the above 3 posts, the sdb device doesn't have the original sdb serial number. Seems sdf renamed itself to sdb. Bizarre...

Ser Olmy · 11-15-2013, 02:55 PM

It looks like your current /dev/sdc may have issues. You should resync md3 immediately.

I've actually never seen an md device become read-only before. I found a forum post describing what seems to be a similar issue. Are you by any chance accessing Intel software RAID sets with mdadm?

As for the device names, well, welcome to the SCSI system, where device names are assigned by the kernel on a "first come, first served" basis.

When you removed sdb, that name became vacant. Normally that would mean that every device gets to move one step up the ladder (sdc becomes sdb, sdd becomes sdc and so on), but on some (if not most) distributions, daemons like udev may interfere and try to preserve device-to-node mappings.

Thankfully, it doesn't really matter to the md driver what name is assigned to devices and partitions, as every component is labeled with a UUID. It does, however, make it difficult to determine exactly which physical device has any given device name at any given time. If you start off with six drives:

Code:

Normal setup:

  1     2     3     4     5     6    
[sda] [sdb] [sdc] [sdd] [sde] [sdf]

...and one is hot-removed, the device disappears:

Code:

After hot-removing sdb:

  1     2     3     4     5     6    
[sda] ----- [sdc] [sdd] [sde] [sdf]

But after a reboot, there's always a risk that device names may have been reassigned:

Code:

After a reboot, and after being subjected
to typically inconsistent udev behaviour:

  1     2     3     4     5     6    
[sda] ----- [sdc] [sdd] [sde] [sdb]

It's mostly just a nuisance, unless you're using device names rather than labels or UUIDs in /etc/fstab.
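
To see which physical drive currently sits behind each name, the persistent links under /dev/disk/by-id can help, since they embed the model and serial number. A minimal sketch, assuming the drives show up there with an ata- prefix (the function name list_disk_ids and its optional directory argument are made up for illustration):

Code:

```shell
#!/bin/sh
# Print "kernel-name  persistent-id" for every whole disk.
# by_id_dir is a parameter only so the function can be tried on a scratch directory.
list_disk_ids() {
    by_id_dir="${1:-/dev/disk/by-id}"
    for link in "$by_id_dir"/ata-*; do
        [ -L "$link" ] || continue                   # skip if the glob matched nothing
        case "$link" in *-part*) continue ;; esac    # ignore partition links
        # The symlink target looks like ../../sdb; keep only the last component.
        printf '%s  %s\n' "$(basename "$(readlink "$link")")" "$(basename "$link")"
    done
}
```

Called without an argument, list_disk_ids reads /dev/disk/by-id. The serial number in each link name can then be matched against the label on the drive casing, or against the output of smartctl -i.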

reano · 11-15-2013, 03:11 PM

Thanks for the explanation - I suspected that it was simply a device being renamed into an empty slot, but at this stage I'm so paranoid that I'm pessimistic about anything strange :P Luckily we do use UUIDs in the fstab, yes, so it should be all good.

I've read the post you've linked, but am still unsure what resolution to follow regarding the read-only swap md device. Not sure what you mean by Intel software raid - we didn't set up the raid devices using the onboard raid utility, we set them up during the original Linux installation using Ubuntu's software raid. So I guess the answer is no?

Could resyncing md3 result in the same catastrophic crash that we experienced when resyncing md0 earlier today? (Also, how exactly do I resync the "right" way?)

PS: I really owe you for sticking with me through this. Much appreciated!

Ser Olmy · 11-15-2013, 04:34 PM

Quote:

Originally Posted by reano

I've read the post you've linked, but am still unsure what resolution to follow regarding the read-only swap md device. Not sure what you mean by Intel software raid - we didn't set up the raid devices using the onboard raid utility, we set them up during the original Linux installation using Ubuntu's software raid. So I guess the answer is no?

I guess so. The person in the thread ended up destroying and recreating the RAID device/array, and I guess you could do the same, if the device in question is only used for swap (which obviously isn't working now, with the device being read-only).

Quote:

Originally Posted by reano

Could resyncing md3 result in the same catastrophic crash that we experienced when resyncing md0 earlier today? (Also, how exactly do I resync the "right" way?)

A resync is highly unlikely to cause any problems, quite the opposite. The md driver is remarkably tolerant of errors, and will try to rewrite a bad sector several times using data from another device in the array before failing a RAID member.

Your experience with the drive that used to be sdb is very much atypical, but problems can occur if a device is allowed to "bit rot" for an extended period of time. Arrays need to be verified/"scrubbed" regularly, and the S.M.A.R.T. status of all drives should be continuously monitored.

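You can scrub an md device by writing "check" to /sys/devices/virtual/block/<device>/md/sync_action. This reads every sector on every member; unreadable sectors are rewritten from the other mirror, while mismatches between the mirrors are only counted. In this case, this command should initiate a verify/resync: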

Code:

echo check > /sys/devices/virtual/block/md3/md/sync_action
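
If you want to avoid kicking off a check while a recovery is still running, the same write can be wrapped in a small guard. This is only a sketch: start_md_check and its optional second argument (an alternative sysfs root, handy for dry runs) are names made up for illustration, and it assumes sync_action reads "idle" when nothing is in progress.

Code:

```shell
#!/bin/sh
# Start a check on one array unless it is already busy.
# sys_root exists only so the function can be exercised against a scratch directory.
start_md_check() {
    md="$1"
    sys_root="${2:-/sys/devices/virtual/block}"
    action_file="$sys_root/$md/md/sync_action"
    state=$(cat "$action_file") || return 1      # unreadable file: no such array
    if [ "$state" != "idle" ]; then
        echo "$md is busy ($state), not starting a check" >&2
        return 1
    fi
    echo check > "$action_file"
    echo "check started on $md"
}
```

Run as root: start_md_check md3. Once the check has finished, the mismatch_cnt file in the same md directory should show how much data differed between the mirrors.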

Quote:

Originally Posted by reano

PS: I really owe you for sticking with me through this. Much appreciated!

You're welcome.

reano · 11-15-2013, 05:50 PM

Quote:

Originally Posted by Ser Olmy

I guess so. The person in the thread ended up destroying and recreating the RAID device/array, and I guess you could do the same, if the device in question is only used for swap (which obviously isn't working now, with the device being read-only).

Strange, the read-only flag suddenly disappeared. I'll see what it does after the next reboot (which will probably only be after md3's resync, and preferably on Monday when I'm onsite again to monitor the boot process).

Quote:

Originally Posted by Ser Olmy

A resync is highly unlikely to cause any problems, quite the opposite. The md driver is remarkably tolerant of errors, and will try to rewrite a bad sector several times using data from another device in the array before failing a RAID member.

Your experience with the drive that used to be sdb is very much atypical, but problems can occur if a device is allowed to "bit rot" for an extended period of time. Arrays need to be verified/"scrubbed" regularly, and the S.M.A.R.T. status of all drives should be continuously monitored.

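You can scrub an md device by writing "check" to /sys/devices/virtual/block/<device>/md/sync_action. This reads every sector on every member; unreadable sectors are rewritten from the other mirror, while mismatches between the mirrors are only counted. In this case, this command should initiate a verify/resync: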

Code:

echo check > /sys/devices/virtual/block/md3/md/sync_action

Thanks, I'll do that. How do I check the progress of the resync? Also via /proc/mdstat?

By the way, I've noticed something else. Every night at 30mins past midnight, the server backs up the contents of the /home directory to a NAS drive. This process usually takes about 20 minutes, but now it lasted over 90mins. /home resides on md3 - why would it take so long this time? I haven't started the resync on md3 yet, so it can't be that?

Ser Olmy · 11-15-2013, 05:59 PM

Quote:

Originally Posted by reano

Thanks, I'll do that. How do I check the progress of the resync? Also via /proc/mdstat?

That, or run mdadm --detail /dev/md3
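
For a scripted progress check, the relevant line can be pulled out of /proc/mdstat. A rough sketch: md_progress is a made-up helper name, and it assumes the progress line contains "check", "resync" or "recovery" followed by "=", as in the mdstat output you posted earlier.

Code:

```shell
#!/bin/sh
# Print the progress line for one array from mdstat-formatted text on stdin.
# Prints nothing if the array is idle or not listed.
md_progress() {
    awk -v dev="$1" '
        # A stanza header looks like "md3 : active raid1 ..."; track whether it is ours.
        $2 == ":" { in_dev = ($1 == dev); next }
        # Inside our stanza, keep only the line that reports progress.
        in_dev && /(check|resync|recovery) *=/ { sub(/^[ \t]+/, ""); print }
    '
}
```

Usage: md_progress md3 < /proc/mdstat. An empty result means no check, resync or recovery is currently running on that array.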

Quote:

Originally Posted by reano

By the way, I've noticed something else. Every night at 30mins past midnight, the server backs up the contents of the /home directory to a NAS drive. This process usually takes about 20 minutes, but now it lasted over 90mins. /home resides on md3 - why would it take so long this time? I haven't started the resync on md3 yet, so it can't be that?

The md driver implements "read balancing" for RAID 1 sets, so I'd expect read performance to suffer with one device missing.

reano · 11-15-2013, 06:01 PM

Quote:

Originally Posted by Ser Olmy

That, or run mdadm --detail /dev/md3

The md driver implements "read balancing" for RAID 1 sets, so I'd expect read performance to suffer with one device missing.

The device isn't missing though. md3 is running on both devices (sdb1, sdc1). The one device (sdc) has some pending sectors, could it be that?

Ser Olmy · 11-15-2013, 06:04 PM

Quote:

Originally Posted by reano

The device isn't missing though. md3 is running on both devices (sdb1, sdc1). The one device (sdc) has some pending sectors, could it be that?

That could certainly be the reason, in which case you should see read errors in the logs.

reano · 11-15-2013, 06:06 PM

Quote:

Originally Posted by Ser Olmy

That could certainly be the reason, in which case you should see read errors in the logs.

Doing the resync now on md3 - this is going to take a few hours. Perfect excuse to get some sleep, it's about 2:30AM here now and it's (literally and figuratively) been a stormy night. Non-stop lightning since early evening. Seemed extremely appropriate to the situation, too - irony is a right bastard sometimes, hehe.

Okay, so it seems both the (old) sdb and the current sdc are faulty. So you'd recommend I replace both those drives, right?

Btw, what do I check for specifically in the SMART status to determine if a drive is going AWOL on me? Only Pending sectors and Reallocated sectors, or is there another red flag to watch out for?

Ser Olmy · 11-15-2013, 06:52 PM

Yes, I would recommend replacing both drives.

A growing number of defects is the first sign of a drive (slowly) going bad. The sectors first show up in Current_Pending_Sector as the drive lists them for reallocation, and once reallocated they are counted in Reallocated_Sector_Ct.

The problem with S.M.A.R.T. is that the drive has to detect the errors for them to show up among the attributes. A bad sector will go undetected until you attempt to read it. Even regular backups might not cause such a sector to be read, as incremental or delta backups and de-duplication have become common features. That's why regular verification/scrubbing of a RAID array is of the utmost importance.

As for other S.M.A.R.T. attributes, they can usually be ignored unless the drive status changes to "failing". smartd can be configured to send an e-mail whenever an attribute changes, something I would strongly recommend. Combined with mdadm in "--monitor" mode, you'll be informed if there's trouble brewing.
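
Until smartd is set up, the defect counters can be picked out of the attribute table by hand. A minimal sketch, assuming the ten-column layout of the smartctl output posted above; smart_defects is a made-up name, and including attributes 187 and 198 alongside 5 and 197 is my own choice rather than a rule:

Code:

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print every defect counter
# whose raw value (10th column) is above zero.
#   5 = Reallocated_Sector_Ct    187 = Reported_Uncorrect
# 197 = Current_Pending_Sector   198 = Offline_Uncorrectable
smart_defects() {
    awk '($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
             print $2 "=" $10
         }'
}
```

Usage: smartctl -A /dev/sdc | smart_defects. No output means all four counters are zero. For the md side, mdadm --monitor --scan --oneshot --test should send a test alert for each array, which is a quick way to confirm that the mail delivery actually works.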

reano · 11-15-2013, 06:53 PM

I'm planning to do a weekly scrubbing/resync of the arrays. My plan is to do it via a cron job (echo check > /sys/devices/virtual/block/<md_device>/md/sync_action) as follows:

- md1 (swap, only a few GB) on Wednesday mornings at 3AM, should finish before 4AM.
- md0 (root filesystem, 1.5TB) on Thursday mornings at 3AM, should finish by about 7AM.
- md2 (shared user data, 1.5TB) on Friday mornings at 3AM, should finish by about 7AM.
- md3 (home directories, 3.0TB) on Saturday mornings at 3AM, should finish by about 11AM.
- md4 (user IMAP mails, 3.0TB) on Saturday afternoons at 12PM, should finish by about 8PM.
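
As a rough sketch, the schedule above could be written as cron.d-style entries like the ones this helper prints. The name print_scrub_schedule and the idea of installing the result as a file under /etc/cron.d are just my assumptions, and I still need to check whether Ubuntu's mdadm package already schedules its own array check, so the two don't overlap.

Code:

```shell
#!/bin/sh
# Emit a cron.d-style schedule matching the plan above (local server time).
# Fields: minute hour day-of-month month day-of-week user command
# Day-of-week: 3 = Wednesday, 4 = Thursday, 5 = Friday, 6 = Saturday
print_scrub_schedule() {
    sys=/sys/devices/virtual/block
    cat <<EOF
0 3  * * 3 root echo check > $sys/md1/md/sync_action
0 3  * * 4 root echo check > $sys/md0/md/sync_action
0 3  * * 5 root echo check > $sys/md2/md/sync_action
0 3  * * 6 root echo check > $sys/md3/md/sync_action
0 12 * * 6 root echo check > $sys/md4/md/sync_action
EOF
}
```

Running print_scrub_schedule prints the five entries so they can be reviewed before being installed.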

Can users use the system, shared resources, mails, homedirs, etc. while the resync is taking place, just in case there are some early birds starting work before 7AM or on Saturdays? Or will they experience serious slowdowns?

Then, further to that, I want to do another cronjob to mail me the smartctl output of all drives, daily, every morning at 8AM.

Does this make sense (and was I more or less correct with my time/duration estimates), or would you recommend any changes to the plan above?