mdadm RAID5 degraded, recreated, partition problems??

codemastermm · 01-08-2013, 08:34 AM

So I was at work yesterday and received the following automated e-mail from my home server about my RAID array:

Code:

This is an automatically generated mail message from mdadm
running on cyrus

A Fail event had been detected on md device /dev/md/2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid5 sdf2[2] sdg2[3] sde2[1] sdd2[0]
      11002368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      
md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)
      2185873920 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
      
md0 : active raid5 sdf1[2] sdg1[3] sde1[1] sdd1[0]
      437760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>

I rebooted the system into a Live CD and tried to re-add the failed drive to the array (sometimes the cable comes loose). --re-add simply wouldn't work, so I figured I had to re-create the array with --assume-clean (after all, the array was working fine!).
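For anyone reading the mail above: the `[4/3] [_UUU]` at the end of the md2 entry is the quickest way to spot a degraded array, since each `_` is a member slot that is not up. A minimal sketch of pulling that out of /proc/mdstat, assuming the two-lines-per-array layout shown in the mail (`degraded_arrays` is a made-up helper name, not an mdadm feature):

```shell
# Print the name of every md array whose status map contains "_",
# i.e. at least one member slot is not up. Reads mdstat text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/      { name = $1 }   # remember which array this block belongs to
        /\[[U_]*_[U_]*\]/  { print name }  # status map such as [_UUU]
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

Fed the mdstat text from the mail, this should print only md2; md0 and md1 both show `[UUUU]`.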

I checked using gparted to find that... nope. It was unable to detect my partition in there. Crap.

So I tried to at least get it back to the way it was: I ran the create with --assume-clean again, but with the "failed" drive (/dev/sdd3) given as missing. Again, no partition listed in gparted. Lovely.

So now here I am asking you all for help!
Current ideas I have so far:
- the drive numbering I have for some reason is a bit weird (1,3,4,5?). Maybe that is related or can help with assembling this correctly somehow...
- I was reading/googling around and saw one guy mention zeroing out the superblock and having mdadm attempt to reassemble the drives. I haven't tried this yet, though.

Any help on this whatsoever would be a huge lifesaver. Thanks so much guys!

Edit: Still reading up on this issue. One possible idea is that perhaps I can run a create with (missing /dev/sde3 missing /dev/sdg3), which would assign slots 1 and 3 appropriately and then add in /dev/sdd3 (4) and then /dev/sdf3 (5)? Not too sure if that would be the best way to do it, but certainly that is a thought to get the slot numbers right....
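On the numbering: as far as I understand the md driver, the number in square brackets in /proc/mdstat is the device number md assigned to that member, not its slot in the array. With 1.x metadata a replaced or re-added member gets a new number, which would fit the 4 and 5 seen here. The slot itself is what `mdadm --examine` reports as "Device Role" and what `mdadm --detail` lists in the RaidDevice column. A small sketch for listing the bracketed numbers from one mdstat array line (`member_numbers` is a made-up helper name):

```shell
# Read one mdstat array line on stdin and print "member number" pairs,
# e.g. "sdd3 4". Flags such as (F) or (S) after the bracket are dropped.
member_numbers() {
    tr ' ' '\n' | sed -n 's/^\([a-z0-9]*\)\[\([0-9]*\)].*$/\1 \2/p'
}

# Usage:
#   grep '^md2 ' /proc/mdstat | member_numbers
```

Comparing this list against the "Device Role" of each member is a safer way to work out the original order than guessing from the bracketed numbers alone.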

codemastermm · 01-08-2013, 10:36 AM

Reading over a bunch of articles, it seems that a possibility is as follows:
- Create the mdadm array with missing disks for the missing disk slots (mdadm --create /dev/md2 --assume-clean missing /dev/sde3 missing /dev/sdg3)
- Add the disks that are in the later slots of 4 and 5 (mdadm --add /dev/md2 /dev/sdd3; mdadm --add /dev/md2 /dev/sdf3)

...Possibly this will work? I'd appreciate a sanity check from anyone before I try it out.

Edit: Ended up trying this in a crazy sort of way:
sudo mdadm --create /dev/md2 --assume-clean --level=5 --verbose --force --raid-devices=4 /dev/sdd3 /dev/sde3 /dev/sdf3 /dev/sdg3
sudo mdadm --manage /dev/md2 --fail /dev/sdd3
sudo mdadm --manage /dev/md2 --fail /dev/sdf3
sudo mdadm --zero-superblock /dev/sdd3
sudo mdadm --zero-superblock /dev/sdf3
sudo mdadm --manage /dev/md2 --add /dev/sdd3
sudo mdadm --manage /dev/md2 --add /dev/sdf3

In the end, this gave me the following when I ran `cat /proc/mdstat`:
md2 : active raid5 sdf3[5](S) sdd3[4](S) sdg3[3] sde3[1]
2186264064 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [_U_U]

gparted, however, showed -no- partition on the data which... sucks.
I wonder, though, is that because the two other drives (sdf3 and sdd3) haven't synced back into the array yet?
What should my next two steps be??
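A note on reading the `[4/2] [_U_U]` line: RAID5 carries one member's worth of parity, so it can serve data with at most one member missing. The `(S)` marks sdd3 and sdf3 as spares, and with only 2 of 4 slots active there is nothing for them to rebuild from, so they should not be expected to sync in on their own. The arithmetic as a tiny sketch (`raid5_can_run` is a hypothetical helper, not an mdadm command):

```shell
# Succeed (exit 0) if a RAID5 array with $1 slots and $2 active members
# still has enough members to serve data, i.e. at most one is missing.
raid5_can_run() {
    total=$1
    active=$2
    [ "$active" -ge $((total - 1)) ]
}

# The two states seen in this thread:
raid5_can_run 4 3 && echo "[4/3]: degraded but usable"
raid5_can_run 4 2 || echo "[4/2]: too few members, data not readable"
```

So the original `[4/3]` state was still readable, while the current `[4/2]` state is not, whatever gparted reports.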

Thanks again.

tunixman · 01-08-2013, 12:06 PM

First check the SMART logs for the device. It looks to me like the device has actually legitimately failed and should be replaced. However, recreating the partition table as it was before it vanished would at least make the RAID partition visible to the system and it should get added back and resynced.
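A quick way to pull the attributes that usually matter for this kind of diagnosis out of the SMART attribute table: IDs 5, 187, 197 and 198 (reallocated, reported-uncorrectable, pending and offline-uncorrectable sectors) and 199 (CRC errors, which tend to point at cabling rather than the platters). A sketch assuming the ten-column attribute table that smartctl prints; `smart_summary` is a made-up helper name:

```shell
# Print "ATTRIBUTE_NAME RAW_VALUE" for a handful of SMART attribute IDs.
# Reads `smartctl -A` (or `smartctl -a`) output on stdin.
# NF >= 10 skips other tables whose first column happens to be a small number.
smart_summary() {
    awk 'NF >= 10 && $1 ~ /^(5|187|197|198|199)$/ { print $2, $10 }'
}

# Usage (needs root):
#   smartctl -A /dev/sdd | smart_summary
```

Non-zero raw values for the sector counters are worth taking seriously even when the overall self-assessment says PASSED.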

codemastermm · 01-08-2013, 12:12 PM

Yup, that's my plan! At least get my data back so I can go out and buy a new drive to toss it on.

The drives are getting a bit old, so I wouldn't be too surprised if one is dying/dead. Literally yesterday morning I was planning out a backup scheme, right before this happened...

Here's my smartctl output, with /dev/sdd on top (the device that decided to jump out of the array in question):

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3750640AS
Serial Number:    5QD334BQ
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:09:00 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41)	The self-test routine was interrupted
					by the host with a hard or soft reset.
Total time to complete Offline 
data collection: 		(  430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   069   006    Pre-fail  Always       -       6582973
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       256
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       33
  7 Seek_Error_Rate         0x000f   090   060   030    Pre-fail  Always       -       981902468
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       23228
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       466
187 Reported_Uncorrect      0x0032   078   078   000    Old_age   Always       -       22
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   024   045    Old_age   Always   In_the_past 39 (0 22 43 37)
194 Temperature_Celsius     0x0022   039   076   000    Old_age   Always       -       39 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a   063   052   000    Old_age   Always       -       241173855
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       11
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 22 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 22 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:52.009  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:51.946  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:51.930  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:51.917  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:51.858  READ NATIVE MAX ADDRESS EXT

Error 21 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:52.009  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:51.946  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:51.930  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:51.917  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:51.858  READ NATIVE MAX ADDRESS EXT

Error 20 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:47.637  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:47.583  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:47.579  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

Error 19 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:47.637  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:47.583  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:47.579  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

Error 18 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:43.448  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:43.447  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:43.435  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      90%      6579         -
# 2  Short offline       Completed without error       00%      6492         -
# 3  Short offline       Completed without error       00%      6471         -
# 4  Short offline       Completed without error       00%      5931         -
# 5  Short offline       Completed without error       00%      4676         -
# 6  Short offline       Completed without error       00%      4400         -
# 7  Short offline       Completed without error       00%      4289         -
# 8  Short offline       Completed without error       00%      2396         -
# 9  Short offline       Completed without error       00%      2239         -
#10  Short offline       Completed without error       00%       797         -
#11  Short offline       Interrupted (host reset)      90%       795         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

And just for clarity, I suppose, here's the rest:
/dev/sde:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3750640AS
Serial Number:    5QD40LEC
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:10:58 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41)	The self-test routine was interrupted
					by the host with a hard or soft reset.
Total time to complete Offline 
data collection: 		(  430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   091   076   006    Pre-fail  Always       -       76035097
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       262
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       15
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       34385094
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       23284
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       470
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   022   045    Old_age   Always   In_the_past 35 (0 98 39 33)
194 Temperature_Celsius     0x0022   035   078   000    Old_age   Always       -       35 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   051   049   000    Old_age   Always       -       73134178
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      90%      6571         -
# 2  Short offline       Completed without error       00%      6485         -
# 3  Short offline       Completed without error       00%      5924         -
# 4  Short offline       Completed without error       00%      4676         -
# 5  Short offline       Completed without error       00%      4400         -
# 6  Short offline       Completed without error       00%      4289         -
# 7  Short offline       Completed without error       00%      2396         -
# 8  Short offline       Completed without error       00%      2239         -
# 9  Short offline       Completed without error       00%       797         -
#10  Short offline       Interrupted (host reset)      90%       795         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdf:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3750640AS
Serial Number:    5QD3P86D
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:11:30 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  25)	The self-test routine was aborted by
					the host.
Total time to complete Offline 
data collection: 		(  430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   102   069   006    Pre-fail  Always       -       164141535
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       166
  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       85
  7 Seek_Error_Rate         0x000f   082   055   030    Pre-fail  Always       -       13480475557
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16824
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       295
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   022   045    Old_age   Always   In_the_past 35 (0 90 38 33)
194 Temperature_Celsius     0x0022   035   078   000    Old_age   Always       -       35 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   058   046   000    Old_age   Always       -       126512467
197 Current_Pending_Sector  0x0012   100   098   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0010   100   098   000    Old_age   Offline      -       19
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 14587 hours (607 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d3 bd b8 e0  Error: UNC at LBA = 0x00b8bdd3 = 12107219

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 68 bb b8 e0 00      04:18:22.540  READ DMA EXT
  27 00 00 00 00 00 e0 00      04:18:20.249  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      04:18:20.159  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      04:18:20.151  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:18:20.097  READ NATIVE MAX ADDRESS EXT

Error 1 occurred at disk power-on lifetime: 14587 hours (607 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d3 bd b8 e0  Error: UNC at LBA = 0x00b8bdd3 = 12107219

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 68 bb b8 e0 00      04:18:17.515  READ DMA EXT
  25 00 08 68 c2 b8 e0 00      04:18:20.249  READ DMA EXT
  25 00 08 60 c2 b8 e0 00      04:18:20.159  READ DMA EXT
  27 00 00 00 00 00 e0 00      04:18:20.151  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      04:18:20.097  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               90%     13738         -
# 2  Short offline       Interrupted (host reset)      90%       117         -
# 3  Short offline       Completed without error       00%         6         -
# 4  Extended offline    Completed without error       00%         4         -
# 5  Short offline       Completed without error       00%         0         -
# 6  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdg:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12
Device Model:     ST3750528AS
Serial Number:    9VP1KKD9
LU WWN Device Id: 5 000c50 0159f1da1
Firmware Version: CC38
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Jan  8 18:11:55 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  617) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 154) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       219266637
  3 Spin_Up_Time            0x0003   096   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       223
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       18404170
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3518
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       106
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       5
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   040   045    Old_age   Always   In_the_past 32 (7 62 35 30)
194 Temperature_Celsius     0x0022   032   060   000    Old_age   Always       -       32 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   050   038   000    Old_age   Always       -       219266637
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       17871358922355
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2880802208
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       833055224

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

tunixman · 01-08-2013, 12:12 PM

Also, there's probably no reason to completely reconfigure the array because of a failed device. Troubleshooting efforts should probably be focused on why the device has failed and not why the RAID array has disabled the failed device, since that's actually what it's meant to do.

tunixman · 01-08-2013, 12:14 PM

Is your data inaccessible now? It should actually be still available even with a failed drive unless more has gone on since the original drive failed.

tunixman · 01-08-2013, 12:18 PM

I'll admit I'm having a lot of trouble following what's going on here. With a failed drive the RAID should continue working. I did see you mentioned something about running "create" again, and that would certainly cause data loss if it actually succeeded, but I think it requires lots of extra options to be convinced to actually run on physical devices with RAID superblocks.

codemastermm · 01-08-2013, 12:40 PM

My create did succeed - I made sure to run it with --assume-clean only, though, which I believe only changes the superblock and not any actual data on the drive. I have never seen it starting to sync in /proc/mdstat either, which I believe is a GOOD thing in this case.

One thing still worth noting: the slot numbers on the drives in my RAID are a bit wonky (skipping over 0 and 2). I still think this might be causing half of my woes, but I am uncertain.

tunixman · 01-08-2013, 12:49 PM

I'm having a lot of trouble following this, but it does sound a lot like a drive failed, the partition was no longer visible, RAID took it out of the array, and then it got forced back in with the partition still not visible. If you could lay out the order of events more precisely, it would really help.

I don't think the "slot numbers" matter (I'm not quite sure what you're talking about either), since the superblock records which drives get which stripes.

Also, whether create succeeded or failed, if there's a drive that's not good and you force it back in without checking it first that's usually a very efficient way to lose data, and that may be what's going on here.

tunixman · 01-08-2013, 12:50 PM

Also this seems like it may help: http://ubuntuforums.org/showthread.php?t=1979610

codemastermm · 01-08-2013, 12:57 PM

What's happened is I had a drive seemingly fail and get knocked out of the RAID. I tried to force it back in (via --create --assume-clean, kind of like the link you posted -> http://ubuntuforums.org/showpost.php...79&postcount=3). This way, I figured, it wouldn't be mucking with the data actually on the drives.

I launched gparted and it showed a broken partition (see: http://i.imgur.com/UA2CT.png), so I figured I'd attempt to have it fully repair itself. (Now that I think about it, perhaps I should have gone with that and run fsck instead?)

I then started reading around online, trying to find better information on how to approach this. I noticed, at that time, that my disk slots had very odd numbering (md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F); notice they're disks 1, 3, 4, and 5, skipping over 0 and 2). I found a bit of information (http://serverfault.com/questions/447...re-raid-arrays) that spoke about creating the mdadm array and then adding the drives back in. I attempted that and was met with no partition at all, which is probably even worse. Thankfully, I again used --assume-clean, so mdadm wouldn't start resyncing the drives.
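
To make that numbering easier to compare between runs, here is a minimal shell sketch that splits an mdstat member list into device, bracketed number, and fault flag. The function name mdstat_members is made up for illustration, and it assumes the line format quoted above. One caveat worth verifying: the bracketed number appears to be md's internal device number rather than the position in the stripe, so gaps such as 1, 3, 4, 5 can show up after members have been replaced. The role itself is what `mdadm --examine` reports as "Device Role".

```shell
#!/bin/sh
# Hypothetical helper: read an mdstat array line on stdin and print one
# "device number state" row per member. Assumes members look like
# name[number] with an optional trailing (F) for a faulty device.
mdstat_members() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[a-z0-9]+\[[0-9]+\]/) {
                lb = index($i, "[")
                rb = index($i, "]")
                dev = substr($i, 1, lb - 1)
                num = substr($i, lb + 1, rb - lb - 1)
                state = ($i ~ /\(F\)$/) ? "faulty" : "ok"
                print dev, num, state
            }
        }
    }'
}

# The md2 line from the failure e-mail, used as sample input.
echo 'md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)' | mdstat_members
```

For the line above this lists sdf3, sdg3 and sde3 as ok and sdd3 (number 4) as faulty.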

My thoughts are now, maybe I should try going back to the point where my partition is somewhat shown and try a fsck?

tunixman · 01-08-2013, 01:01 PM

I'd try assembling the array with all the drives, but keeping the failed drive failed, so it's running degraded, and then see if the partition on the RAID device is still present. (I'm still not sure which partition you're talking about not being visible, the physical disk partition or the RAID partition.) Then run fsck, and replace the drive that came out of the array. The slot numbering shouldn't matter since the striping information is determined by the superblock.
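
A rough sketch of that sequence, written so that it only prints the commands instead of running them. The device names are taken from this thread (sdd3 is the member that failed, so it is left out), and the helper name plan_degraded_assemble is made up. Two things here are assumptions rather than part of the suggestion above: starting the array with --readonly so nothing gets written, and using fsck -n, which for ext filesystems checks without changing anything. It also assumes the filesystem sits directly on /dev/md2.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands for a degraded,
# read-only assemble followed by a no-change filesystem check.
# Nothing is executed; review the output before running any of it.
plan_degraded_assemble() {
    md=$1
    shift
    # Stop whatever is currently assembled under this name.
    echo "mdadm --stop $md"
    # --run starts the array even with a member missing;
    # --readonly keeps md from writing to the members.
    echo "mdadm --assemble --readonly --run $md $*"
    # Confirm the array shows up degraded, e.g. [4/3] [_UUU].
    echo "cat /proc/mdstat"
    # -n: report problems only, answer "no" to every repair prompt.
    echo "fsck -n $md"
}

plan_degraded_assemble /dev/md2 /dev/sde3 /dev/sdf3 /dev/sdg3
```

Printing first is deliberate: with an array that has already been re-created, it seems safer to read each command before anything touches the disks.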

codemastermm · 01-08-2013, 01:04 PM

Oh, my apologies. I meant the RAID partition.
Upon assembling the drives, I still see the "unknown partition" (http://i.imgur.com/UA2CT.png), so would you recommend running fsck on that?
If so, it's starting to seem a bit more hopeful!!

tunixman · 01-08-2013, 01:07 PM

I wouldn't recommend fsck on that unknown partition. It sounds like the underlying RAID is reading data incorrectly, probably because it's been reorganized. What does /proc/mdstat look like?
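
One quick way to test the "reorganized" idea is to compare the block count mdstat reported before the re-create with the one it reports afterwards. The sketch below does that arithmetic; the function name size_delta_per_member is made up. It uses 2185873920 blocks from the original failure e-mail and 2186264064 blocks from the mdstat output posted in the next reply, and assumes mdstat blocks are 1 KiB and that a 4-drive RAID5 has 3 data-bearing members.

```shell
#!/bin/sh
# Hypothetical helper: compare two mdstat "blocks" figures (1 KiB units)
# and spread the difference over the data-bearing members
# (for RAID5 that is the number of members minus one).
size_delta_per_member() {
    before=$1
    after=$2
    data_members=$3
    delta=$((after - before))
    per_member=$((delta / data_members))
    # One KiB is two 512-byte sectors.
    echo "total_kib=$delta per_member_kib=$per_member per_member_sectors=$((per_member * 2))"
}

size_delta_per_member 2185873920 2186264064 3
```

This prints total_kib=390144 per_member_kib=130048 per_member_sectors=260096, so the re-created array is 127 MiB larger per member than the original. A plausible reading, though only a guess until `mdadm --examine` output for the members is posted, is that the re-create placed the data area at a different offset on each drive, which would explain why the filesystem no longer appears where it is expected.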

codemastermm · 01-08-2013, 01:07 PM

Code:

Personalities : [raid6] [raid5] [raid4] 
md2 : active raid5 sdg3[3] sdf3[2] sde3[1]
      2186264064 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]