How to recover from RAID “group descriptors corrupted” failure?

jaypifer · 02-12-2017, 10:17 AM

I have a dedicated RAID box (ReadyNAS Ultra 4) that lost power suddenly. Now it will not mount due to a variety of errors, and I'm seeking guidance from anyone who may be able to offer advice, as I'm a bit out of my depth.

The setup is a four-disk RAID 5 array with three 2TB drives and an older 250GB drive. It has been stable for years, which let my backup habits lapse, so I cannot afford to lose the data. The smaller drive has been throwing errors, but I do not believe I have lost it; I believe the power failure is the cause.

I have been doing quite a bit of data gathering the past few days, but have done nothing to endanger the data (I hope).

From the syslog, this is the first error.

Code:

Feb  7 12:29:42 kernel: EXT4-fs (dm-0): ext4_check_descriptors: Block bitmap for group 17216 not in group (block 623566614080)!       
Feb  7 12:29:42 kernel: EXT4-fs (dm-0): group descriptors corrupted!

And here is some basic information about the array:

Code:

# cat /proc/mdstat 
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md3 : active raid5 sda6[0] sdc6[2] sdb6[1]
      3418630528 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md2 : active raid5 sda5[0] sdd5[3] sdc5[5] sdb5[4]
      718431744 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid5 sda2[0] sdd2[3] sdc2[5] sdb2[4]
      1572672 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 sdc1[5] sdb1[4] sda1[0] sdd1[3]
      4194292 blocks super 1.2 [4/4] [UUUU]

unused devices: <none>

I ran smartctl --test=long on each of the four drives; the highlights follow:

Code:

# for i in /dev/sd[a-d]; do echo $i; smartctl -a $i | egrep "Sector|Hours|Error|Uncorr"; done;
/dev/sda
Sector Size:      512 bytes logical/physical
Error logging capability:        (0x01) Error logging supported.
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       30361
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
/dev/sdb
Sector Sizes:     512 bytes logical, 4096 bytes physical
Error logging capability:        (0x01) Error logging supported.
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       30253
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
/dev/sdc
Sector Size:      512 bytes logical/physical
Error logging capability:        (0x01) Error logging supported.
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27111
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
/dev/sdd
Sector Size:      512 bytes logical/physical
Error logging capability:        (0x01) Error logging supported.
  1 Raw_Read_Error_Rate     0x000b   094   094   016    Pre-fail  Always       -       655376
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       37
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   088   088   000    Old_age   Always       -       89338
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       46
SMART Error Log Version: 1
ATA Error Count: 7 (device log contains only the most recent five errors)
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    ER = Error register [HEX]
Error 7 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  84 51 00 07 00 00 40  Error: ABRT at LBA = 0x00000007 = 7
Error 6 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  84 51 00 00 00 00 40  Error: ABRT at LBA = 0x00000000 = 0
Error 5 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  00 51 00 00 00 00 40  Error:  at LBA = 0x00000000 = 0
Error 4 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  84 51 00 43 00 00 40  Error: ABRT at LBA = 0x00000043 = 67
Error 3 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  84 51 00 00 00 00 40  Error: ABRT at LBA = 0x00000000 = 0

Running e2fsck -n /dev/c/c produces a large amount of output, starting as shown below, and it wants to relocate ~25,000 blocks.

Code:

ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
Block bitmap for group 5504 is not in group.  (block 18375699958922960920)
Relocate? no

Inode bitmap for group 5504 is not in group.  (block 408137612634012487)
Relocate? no

Inode table for group 5504 is not in group.  (block 9710113478488063446)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no
...

Mount gives this error:

Code:

# cat /etc/fstab | grep /c  
/dev/c/c           /c                   ext4       defaults,acl,user_xattr,usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv1,noatime,nodiratime      0      2
# mount /dev/c/c /c 
    mount: wrong fs type, bad option, bad superblock on /dev/c/c,
           missing codepage or helper program, or other error
           In some cases useful info is found in syslog - try
           dmesg | tail  or so

I had thought I had a corrupt superblock and tried mounting the backups I found with dumpe2fs with no success. I'd like to back up the data I have before doing any rescue attempts but assume I would need to buy four 2TB disks to dd over the data.

My current line of thinking is that I need to run e2fsck -f -y, but I've continued research to ensure I've covered all my options before modifying the disk. Also, I think I may be missing something obvious or elements of a raid that I haven't tried as I am not familiar with the mdadm commands. For example, perhaps I could break and rebuild the array with the three 2TB drives?

Anyway, I appreciate it if you've read this far and am happy to hear about any ideas you may have.

smallpond · 02-12-2017, 04:13 PM

Your four smartctl results are the same, which seems suspicious.

jaypifer · 02-12-2017, 04:59 PM

Quote:

Originally Posted by smallpond

Your four smartctl results are the same, which seems suspicious.

Apologies, in my attempt at brevity I lost clarity. These are the full logs:

Code:

# for i in /dev/sd[a-d]; do echo $i; smartctl -a $i; done;                                    
/dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.37.6.RNx86_64.2.4] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD2002FAEX-007BA0
Serial Number:    WD-WCAY00925674
LU WWN Device Id: 5 0014ee 2b283bb04
Firmware Version: 05.01D05
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Feb 12 17:54:56 2017 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(30900) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3037)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       8583
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       116
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       30369
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       61
194 Temperature_Celsius     0x0022   116   103   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     30300         -
# 2  Short offline       Completed without error       00%     30295         -
# 3  Short offline       Completed without error       00%         9         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.37.6.RNx86_64.2.4] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EZRX-00DC0B0
Serial Number:    WD-WMC300372748
LU WWN Device Id: 5 0014ee 058c587d6
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 (revision not indicated)
Local Time is:    Sun Feb 12 17:54:56 2017 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(28020) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x70b5)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4233
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       109
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       30262
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       109
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       45
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2818628
194 Temperature_Celsius     0x0022   115   104   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     30198         -
# 2  Short offline       Completed without error       00%        97         -
# 3  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdc
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.37.6.RNx86_64.2.4] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD2002FAEX-007BA0
Serial Number:    WD-WCAY01040283
LU WWN Device Id: 5 0014ee 207ed9e21
Firmware Version: 05.01D05
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Feb 12 17:54:56 2017 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(30000) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3037)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       8741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27119
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       95
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       44
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       52
194 Temperature_Celsius     0x0022   116   105   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     27065         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdd
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.37.6.RNx86_64.2.4] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar T7K250
Device Model:     HDT722525DLA380
Serial Number:    VDB41AT4C2BLBA
LU WWN Device Id: 5 000cca 20bc11442
Firmware Version: V44OA60A
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Sun Feb 12 17:54:56 2017 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		( 4797) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  80) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   094   094   016    Pre-fail  Always       -       655376
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   106   106   024    Pre-fail  Always       -       329 (Average 328)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2314
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       37
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   088   088   000    Old_age   Always       -       89346
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       427
192 Power-Off_Retract_Count 0x0032   096   096   050    Old_age   Always       -       5739
193 Load_Cycle_Count        0x0012   096   096   050    Old_age   Always       -       5739
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       35 (Min/Max 15/62)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       41
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       46

SMART Error Log Version: 1
ATA Error Count: 7 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 7 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 07 00 00 40  Error: ABRT at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c4 ff 08 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 04 40 00 00 40 00   1d+15:08:42.100  READ MULTIPLE

Error 6 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 40  Error: ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 04 40 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:11.600  READ MULTIPLE

Error 5 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 51 00 00 00 00 40  Error:  at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 04 40 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:11.600  READ MULTIPLE

Error 4 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 43 00 00 40  Error: ABRT at LBA = 0x00000043 = 67

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c4 ff 04 40 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:11.600  READ MULTIPLE
  c5 ff 08 2f 00 5e 40 00   1d+15:06:11.600  WRITE MULTIPLE
  c5 ff 08 47 00 5e 40 00   1d+15:06:11.600  WRITE MULTIPLE

Error 3 occurred at disk power-on lifetime: 5351 hours (222 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 40  Error: ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c4 ff 01 00 00 00 40 00   1d+15:08:42.100  READ MULTIPLE
  c4 ff 01 00 00 00 40 00   1d+15:08:11.600  READ MULTIPLE
  c5 ff 08 2f 00 5e 40 00   1d+15:06:11.600  WRITE MULTIPLE
  c5 ff 08 47 00 5e 40 00   1d+15:06:11.600  WRITE MULTIPLE
  c5 ff 10 7f 63 5e 40 00   1d+15:06:11.600  WRITE MULTIPLE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     23758         -
# 2  Short offline       Completed without error       00%     39677         -

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

#

syg00 · 02-12-2017, 05:21 PM

Not at all - perhaps you would like to explain further, @smallpond?

To me it looks like the array itself is ok, but the filesystem is hosed - so apparently I am in agreement with the OP.
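
A rough sanity check on that reading, as a sketch only. The kernel message in the first post puts the block bitmap for group 17216 at block 623566614080. Assuming the mke2fs defaults of 4 KiB blocks and 32768 blocks per group (neither is shown in the thread), and assuming /dev/c/c sits on md2 and md3 only, the arithmetic looks like this:

```shell
#!/bin/sh
# Values taken from the kernel message in the first post.
group=17216
reported_block=623566614080

# Assumption (not shown in the thread): mke2fs defaults of 4 KiB blocks,
# which gives 32768 blocks per group.
blocks_per_group=32768

# Block range that group 17216 itself covers.
first=$((group * blocks_per_group))
last=$((first + blocks_per_group - 1))

# Upper bound on the filesystem size, assuming /dev/c/c spans md2 and md3.
# /proc/mdstat reports sizes in 1 KiB blocks, so divide by 4 for 4 KiB blocks.
fs_blocks=$(( (718431744 + 3418630528) / 4 ))

echo "group $group covers blocks $first..$last"
echo "filesystem has at most $fs_blocks blocks"

if [ "$reported_block" -gt "$fs_blocks" ]; then
    verdict="beyond end of filesystem"
else
    verdict="inside filesystem"
fi
echo "reported block $reported_block: $verdict"
```

With the flex_bg feature a bitmap may legitimately live outside its own group, so the comparison that matters is the one against the end of the filesystem. Under the assumptions above the reported block is roughly 600 times past it, which points at garbage in the descriptor rather than at a missing RAID member.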

Quote:

Originally Posted by jaypifer

I'd like to back up the data I have before doing any rescue attempts but assume I would need to buy four 2TB disks to dd over the data.

No like about it - you must get a backup of all the data before you start playing around. Work on the copy only. Usually it's actually preferable to have two copies so you can simply re-copy (from the second copy) to try different scenarios, but in your case the hardware looks ok (that 250 drive might be a worry), so you can recopy from the array in need.

Quote:

My current line of thinking is that I need to run e2fsck -f -y, but I've continued research to ensure I've covered all my options before modifying the disk. Also, I think I may be missing something obvious or elements of a raid that I haven't tried as I am not familiar with the mdadm commands. For example, perhaps I could break and rebuild the array with the three 2TB drives

fsck is for fixing the filesystem so it is consistent. Your data may be compromised, as you have been warned. You may get all the fragments saved in lost+found. Or you may not.
And even if you do, you may not be able to ascertain which files were affected. Or how.

You can't break the RAID up as you suggested - you would have to re-create the array, and lose all your data.

RAID is not a substitute for good backups. There is no substitute.
Once you have an image to work on, maybe try something like photorec - it does more than just photos. It's a scraper, and will take forever, but it may do what you need. There are other similar forensic tools, but they will all add significantly to the time, and may not add much benefit.

jaypifer · 02-12-2017, 08:34 PM

Quote:

Originally Posted by syg00

No like about it - you must get a backup of all the data before you start playing around. Work on the copy only. Usually it's actually preferable to have two copies so you can simply re-copy (from the second copy) to try different scenarios, but in your case the hardware looks ok (that 250 drive might be a worry), so you can recopy from the array in need.

fsck is for fixing the filesystem so it is consistent. Your data may be compromised, as you have been warned. You may get all the fragments saved in lost+found. Or you may not.

Okay, sounds like I didn't miss something obvious or a simple fix.

I'll order the four drives and roll up my sleeves in a few days after they arrive. It's a good idea to have the copies around for several failures until success is achieved. I'll update here on progress.

I understand that the data may have issues, but hope that all is well. Nothing was written to the drives since failure, nor do I believe any files were in use at the time.

rknichols · 02-13-2017, 08:47 AM

Quote:

Originally Posted by jaypifer

It's a good idea to have the copies around for several failures until success is achieved.

Definitely! Running "fsck -y" always has the possibility of unrecoverable data loss. Note the "WARNING: SEVERE DATA LOSS POSSIBLE" in one of your "fsck -n" runs. It's the job of fsck to make the filesystem consistent, and sometimes that comes at the expense of user data. There are programs like testdisk that can recover files from damaged filesystems, but that can become much harder once that damage has been "fixed".
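
The report-first, repair-second ordering described above can be sketched as follows. This is only an illustration: the function name `plan_fsck` and the log file name are hypothetical, the device path is the one that appears later in this thread, and the script prints the commands rather than running them so nothing is touched by accident.

```shell
#!/bin/sh
# Sketch only: prints the commands in the order they would be run.
# DEV is a placeholder; it should point at the copy, not the original.
DEV=/dev/c/c

plan_fsck() {
    # -n opens the filesystem read-only and answers "no" to every question,
    # so this pass only reports what a repair would change.
    printf 'e2fsck -n -f %s > fsck-dry-run.log 2>&1\n' "$1"
    # -y answers "yes" to every question; this is the pass that can discard data.
    printf 'e2fsck -y -f -v %s\n' "$1"
}

plan_fsck "$DEV"
```

Reading the dry-run log before the second command is the point: if it contains warnings like the one quoted above, that is the moment to try a recovery tool against the untouched copy first.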

jaypifer · 02-24-2017, 11:09 AM

As an update, I got the four drives and ended up taking the drives out of the array one by one and cloning them using my desktop. A bit of research told me to use ddrescue rather than dd. Carefully checking each drive as I swapped them out, I was able to run this:

Code:

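# One run per drive: the source (/dev/sdc) and destination (/dev/sdb) disks
# were physically swapped between runs, so the device names stay the same.
sudo ddrescue -f /dev/sdc /dev/sdb drive1.log
sudo ddrescue -f /dev/sdc /dev/sdb drive2.log
sudo ddrescue -f /dev/sdc /dev/sdb drive3.log
sudo ddrescue -f /dev/sdc /dev/sdb drive4.log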
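
For anyone repeating this on a drive that does have read errors, a common pattern from the GNU ddrescue manual is two passes sharing one map file (the third argument, called a log file in older versions). The sketch below is not what was run above; `clone_cmds` is a hypothetical helper that only prints the command lines, and the device names are placeholders that must be checked on every swap.

```shell
#!/bin/sh
# Sketch only: prints the ddrescue command lines instead of running them.
# Device names and the map-file name are hypothetical placeholders.

clone_cmds() {
    src=$1 dst=$2 map=$3
    # Guard against one easy mistake: naming the same device twice.
    if [ "$src" = "$dst" ]; then
        echo "source and destination are the same device: $src" >&2
        return 1
    fi
    # Pass 1: copy everything readable, skipping the slow scraping phase (-n).
    printf 'ddrescue -f -n %s %s %s\n' "$src" "$dst" "$map"
    # Pass 2: reuse the same map file and retry the bad areas up to 3 times (-r3).
    printf 'ddrescue -f -r3 %s %s %s\n' "$src" "$dst" "$map"
}

clone_cmds /dev/sdc /dev/sdb drive1.log
```

Because the map file records what has already been copied, the second pass only revisits the areas that failed the first time.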

Each 2TB drive took 8 hours to copy. Now I've kicked off:

Code:

e2fsck -y -f -v /dev/c/c

In hindsight, I could and should have used the -C flag to know what's going on. It has been running three days now with the NAS CPU pegged at 99.7%. I tried to killall -USR1 e2fsck to no avail. I've checked /sys/block/sd*/stat and reads and writes seem to be happening so I guess I'll just wait a few weeks to see if it finishes.
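
The /sys/block check mentioned above can be made a little more systematic. In the kernel's block-layer stat file the third field is sectors read and the seventh is sectors written, so sampling those two numbers a minute apart shows whether a silent e2fsck is still doing I/O. A minimal sketch, with `io_sectors` as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: pull the "sectors read" and "sectors written" counters out of the
# contents of a /sys/block/<dev>/stat file (fields 3 and 7).

io_sectors() {
    # Intentionally unquoted so the shell splits the line into fields.
    set -- $1
    printf 'read=%s written=%s\n' "$3" "$7"
}

# Print one line per sd* disk that exists on this machine.
for f in /sys/block/sd*/stat; do
    [ -r "$f" ] || continue
    printf '%s ' "$f"
    io_sectors "$(cat "$f")"
done
```

Running it twice and comparing the counters gives a rough rate. For progress from e2fsck itself, its man page documents `-C 0` for a completion bar and SIGUSR1 as the signal that asks a running e2fsck to start displaying progress.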

jaypifer · 03-20-2017, 09:09 AM

Just wanted to close this out for anyone who may have the same issue. After five days, I did end up killing the process and restarting it. I did some analysis and saw that it had been working the whole time, but slowly. I restarted with the -C flag and enjoyed about three hours of additional progress reporting before the CPU pegged at 99.8%.

I waited another six days and the process finished. More analysis, then I rebooted and mounted. From what I can tell, almost everything is safe and sound. I do have 991 items in lost+found/, but they appear to be directory listings or possibly things that were previously in the recycle bin. I'll go through those slowly at a later time.

Thanks for the advice!