LinuxQuestions.org

codemastermm 01-08-2013 08:34 AM

mdadm RAID5 degraded, recreated, partition problems??
 
So I was at work yesterday and received the following automated e-mail from my home server about my RAID array:

Code:

This is an automatically generated mail message from mdadm
running on cyrus

A Fail event had been detected on md device /dev/md/2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdf2[2] sdg2[3] sde2[1] sdd2[0]
      11002368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
     
md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)
      2185873920 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
     
md0 : active raid5 sdf1[2] sdg1[3] sde1[1] sdd1[0]
      437760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
     
unused devices: <none>
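
For anyone reading the mail above: in /proc/mdstat, `[4/3]` means the array has 4 member slots of which 3 are in use, and the `U`/`_` string shows one character per slot, with `_` marking a missing or failed member. So md2 has lost its first slot, while md0 and md1 are intact. A minimal sketch of picking that out automatically follows; the function name `degraded_arrays` is made up for this example, and it only reads text from stdin, it never touches a device.

```shell
# List md arrays whose status string contains "_" (a missing or failed member).
# Input: /proc/mdstat-formatted text on stdin. Read-only.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }           # remember the array named on this line
        /\[[U_]+\][ \t]*$/ {                  # status line ends in e.g. [_UUU]
            if ($NF ~ /_/) print name, $(NF-1), $NF
        }
    '
}

# Usage on a live system (hypothetical invocation):
#   degraded_arrays < /proc/mdstat
```

Fed the mdstat text from the mail, this should report only md2 together with its `[4/3] [_UUU]` status.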

I rebooted the system into a Live CD and tried to re-add the failed drive to the array (the cable sometimes works loose). --re-add simply wouldn't work, so I figured I had to re-create the array with --assume-clean (after all, the array was working fine!).

I checked with gparted and... nope. It was unable to detect my partition on the array. Crap.

So I tried to at least get it back to the way it was: I did another create with --assume-clean, but with the "failed" drive (/dev/sdd3) given as missing. Again, no partition listed in gparted. Lovely.

So now here I am asking you all for help!
Current ideas I have so far:
- The drive numbering I have is a bit weird for some reason (1,3,4,5?). Maybe that is related, or can help with assembling this correctly somehow...
- I was googling around and saw one guy mention zeroing out the superblock and having mdadm attempt to reassemble the drives. I haven't tried this yet, though.

Any help on this whatsoever would be a huge lifesaver. Thanks so much guys!

Edit: Still reading up on this issue. One possible idea is to run a create with (missing /dev/sde3 missing /dev/sdg3), which would assign slots 1 and 3 appropriately, and then add /dev/sdd3 (4) and then /dev/sdf3 (5). Not too sure if that would be the best way to do it, but it is certainly a thought for getting the slot numbers right...
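
A note on the numbering question, offered as an assumption rather than a confirmed diagnosis: as far as I understand it, with 1.x metadata the number in brackets in /proc/mdstat is the member's device number in the superblock, not necessarily its slot in the array, so members that were replaced or re-added can show 4 or 5 on a 4-disk array. The field that states the actual slot is the "Device Role" line printed by `mdadm --examine` for each member. md0 and md1 on the same four disks list sdd, sde, sdf, sdg as 0 to 3, which hints at (but does not prove) the original member order of md2. The sketch below just splits an mdstat line into its parts so the numbers and flags are easier to compare; `mdstat_members` is a hypothetical helper name.

```shell
# Split an mdstat array line into "device number flag" rows.
# The bracketed number is reported as-is; it is not assumed to be the slot.
mdstat_members() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i !~ /^[a-z0-9]+\[[0-9]+\]/) continue   # skip "md2", ":", "raid5", ...
            lb = index($i, "[")
            rb = index($i, "]")
            dev = substr($i, 1, lb - 1)                  # e.g. sdd3
            num = substr($i, lb + 1, rb - lb - 1)        # e.g. 4
            flag = "-"                                   # no flag shown in mdstat
            if ($i ~ /\(F\)/) flag = "faulty"
            if ($i ~ /\(S\)/) flag = "spare"
            print dev, num, flag
        }
    }'
}

# Usage (hypothetical): grep '^md2 ' /proc/mdstat | mdstat_members
```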

codemastermm 01-08-2013 10:36 AM

Reading over a bunch of articles, it seems that a possibility is as follows:
- Create the mdadm array with missing disks for the missing disk slots (mdadm --create /dev/md2 --assume-clean missing /dev/sde3 missing /dev/sdg3)
- Add the disks that are in the later slots of 4 and 5 (mdadm --add /dev/md2 /dev/sdd3; mdadm --add /dev/md2 /dev/sdf3)

...Possibly this will work? Can anyone give me a sanity check before I try it out?

Edit: Ended up trying this in a crazy sort of way:
Code:

sudo mdadm --create /dev/md2 --assume-clean --level=5 --verbose --force --raid-devices=4 /dev/sdd3 /dev/sde3 /dev/sdf3 /dev/sdg3
sudo mdadm --manage /dev/md2 --fail /dev/sdd3
sudo mdadm --manage /dev/md2 --fail /dev/sdf3
sudo mdadm --zero-superblock /dev/sdd3
sudo mdadm --zero-superblock /dev/sdf3
sudo mdadm --manage /dev/md2 --add /dev/sdd3
sudo mdadm --manage /dev/md2 --add /dev/sdf3

In the end, this gave me the following when I ran `cat /proc/mdstat`:
md2 : active raid5 sdf3[5](S) sdd3[4](S) sdg3[3] sde3[1]
2186264064 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [_U_U]

gparted, however, showed -no- partition on the data which... sucks.
I wonder, though, is that because the two other drives (sdf3 and sdd3) haven't synced back into the array yet?
What should my next two steps be??

Thanks again.
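
One detail in the two mdstat excerpts may be worth a look. The original md2 reported 2185873920 blocks, while the re-created md2 reports 2186264064. /proc/mdstat counts 1 KiB blocks, and a 4-device RAID5 holds 3 devices' worth of data, so the difference can be turned into a per-member figure. This is only arithmetic on the numbers posted above; the reading of it (that the re-created superblocks place the data area at a different position on each member, for instance because the Live CD's mdadm picked a different data offset) is an assumption, not something confirmed in the thread. If it holds, it would be enough on its own for gparted to find no filesystem. The helper name `per_device_delta_kib` is invented for this sketch.

```shell
# Difference in usable space per data device, in KiB, between two array sizes
# as reported in /proc/mdstat (which counts 1 KiB blocks).
#   $1 = earlier array size, $2 = later array size, $3 = number of data devices
per_device_delta_kib() {
    echo $(( ($2 - $1) / $3 ))
}

# Original md2 vs. re-created md2; 4-device RAID5 means 3 data devices.
per_device_delta_kib 2185873920 2186264064 3    # prints 130048 (KiB), i.e. 127 MiB
```

Comparing the "Data Offset" line of `mdadm --examine` on a member against whatever was recorded before the re-create would be the way to check this, if such a record exists.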

tunixman 01-08-2013 12:06 PM

SMART Logs
 
First check the SMART logs for the device. It looks to me like the device has actually legitimately failed and should be replaced. However, recreating the partition table as it was before it vanished would at least make the RAID partition visible to the system and it should get added back and resynced.
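
A minimal sketch of that check, assuming smartmontools is installed: `smartctl -A` prints the vendor attribute table, and the three attributes usually looked at first for failing media are 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable). The filter name `smart_key_attrs` is made up for this example, and the device name in the usage comment is only an example.

```shell
# Print the raw values of SMART attributes 5, 197 and 198.
# Input: `smartctl -A` style attribute rows on stdin
# (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw).
smart_key_attrs() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) { print $2 "=" $10 }'
}

# Usage on a live system (device name is an example):
#   sudo smartctl -A /dev/sdd | smart_key_attrs
```

Non-zero raw values here do not prove the drive is dead, but pending or uncorrectable sectors are a common reason for md to kick a member out.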

codemastermm 01-08-2013 12:12 PM

Yup, that's my plan! At least get my data back so I can go out and buy a new drive to toss it on :)
The drives are getting a bit old, so I wouldn't be too surprised if one is dying/dead. Literally yesterday morning I was planning out a backup scheme, right before this happened...

Here's my smartctl output, with /dev/sdd on top (the device that decided to jump out of the array in question):
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:    Seagate Barracuda 7200.10
Device Model:    ST3750640AS
Serial Number:    5QD334BQ
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:09:00 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)        Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41)        The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                (  430) seconds.
Offline data collection
capabilities:                          (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)        Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)        Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          (  1) minutes.
Extended self-test routine
recommended polling time:          ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  099  069  006    Pre-fail  Always      -      6582973
  3 Spin_Up_Time            0x0003  096  093  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      256
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      33
  7 Seek_Error_Rate        0x000f  090  060  030    Pre-fail  Always      -      981902468
  9 Power_On_Hours          0x0032  074  074  000    Old_age  Always      -      23228
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      466
187 Reported_Uncorrect      0x0032  078  078  000    Old_age  Always      -      22
189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0
190 Airflow_Temperature_Cel 0x0022  061  024  045    Old_age  Always  In_the_past 39 (0 22 43 37)
194 Temperature_Celsius    0x0022  039  076  000    Old_age  Always      -      39 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a  063  052  000    Old_age  Always      -      241173855
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      2
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      2
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      11
200 Multi_Zone_Error_Rate  0x0000  100  253  000    Old_age  Offline      -      0
202 Data_Address_Mark_Errs  0x0032  100  253  000    Old_age  Always      -      0

SMART Error Log Version: 1
ATA Error Count: 22 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 22 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:52.009  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:51.946  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:51.930  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:51.917  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:51.858  READ NATIVE MAX ADDRESS EXT

Error 21 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:52.009  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:51.946  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:51.930  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:51.917  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:51.858  READ NATIVE MAX ADDRESS EXT

Error 20 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:47.637  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:47.583  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:47.579  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

Error 19 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:47.637  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:47.583  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:47.579  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

Error 18 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:43.448  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:43.447  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:43.435  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Interrupted (host reset)      90%      6579        -
# 2  Short offline      Completed without error      00%      6492        -
# 3  Short offline      Completed without error      00%      6471        -
# 4  Short offline      Completed without error      00%      5931        -
# 5  Short offline      Completed without error      00%      4676        -
# 6  Short offline      Completed without error      00%      4400        -
# 7  Short offline      Completed without error      00%      4289        -
# 8  Short offline      Completed without error      00%      2396        -
# 9  Short offline      Completed without error      00%      2239        -
#10  Short offline      Completed without error      00%      797        -
#11  Short offline      Interrupted (host reset)      90%      795        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

And just for clarity, I suppose, here's the rest:
/dev/sde:
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:    Seagate Barracuda 7200.10
Device Model:    ST3750640AS
Serial Number:    5QD40LEC
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:10:58 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)        Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41)        The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                (  430) seconds.
Offline data collection
capabilities:                          (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)        Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)        Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          (  1) minutes.
Extended self-test routine
recommended polling time:          ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  091  076  006    Pre-fail  Always      -      76035097
  3 Spin_Up_Time            0x0003  096  093  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      262
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      15
  7 Seek_Error_Rate        0x000f  075  060  030    Pre-fail  Always      -      34385094
  9 Power_On_Hours          0x0032  074  074  000    Old_age  Always      -      23284
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      470
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0
190 Airflow_Temperature_Cel 0x0022  065  022  045    Old_age  Always  In_the_past 35 (0 98 39 33)
194 Temperature_Celsius    0x0022  035  078  000    Old_age  Always      -      35 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a  051  049  000    Old_age  Always      -      73134178
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0000  100  253  000    Old_age  Offline      -      0
202 Data_Address_Mark_Errs  0x0032  100  253  000    Old_age  Always      -      0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Interrupted (host reset)      90%      6571        -
# 2  Short offline      Completed without error      00%      6485        -
# 3  Short offline      Completed without error      00%      5924        -
# 4  Short offline      Completed without error      00%      4676        -
# 5  Short offline      Completed without error      00%      4400        -
# 6  Short offline      Completed without error      00%      4289        -
# 7  Short offline      Completed without error      00%      2396        -
# 8  Short offline      Completed without error      00%      2239        -
# 9  Short offline      Completed without error      00%      797        -
#10  Short offline      Interrupted (host reset)      90%      795        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdf:
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:    Seagate Barracuda 7200.10
Device Model:    ST3750640AS
Serial Number:    5QD3P86D
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:11:30 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)        Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  25)        The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                (  430) seconds.
Offline data collection
capabilities:                          (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)        Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)        Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          (  1) minutes.
Extended self-test routine
recommended polling time:          ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  102  069  006    Pre-fail  Always      -      164141535
  3 Spin_Up_Time            0x0003  096  093  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      166
  5 Reallocated_Sector_Ct  0x0033  098  098  036    Pre-fail  Always      -      85
  7 Seek_Error_Rate        0x000f  082  055  030    Pre-fail  Always      -      13480475557
  9 Power_On_Hours          0x0032  081  081  000    Old_age  Always      -      16824
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      295
187 Reported_Uncorrect      0x0032  098  098  000    Old_age  Always      -      2
189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0
190 Airflow_Temperature_Cel 0x0022  065  022  045    Old_age  Always  In_the_past 35 (0 90 38 33)
194 Temperature_Celsius    0x0022  035  078  000    Old_age  Always      -      35 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a  058  046  000    Old_age  Always      -      126512467
197 Current_Pending_Sector  0x0012  100  098  000    Old_age  Always      -      19
198 Offline_Uncorrectable  0x0010  100  098  000    Old_age  Offline      -      19
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0000  100  253  000    Old_age  Offline      -      0
202 Data_Address_Mark_Errs  0x0032  100  253  000    Old_age  Always      -      0

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 14587 hours (607 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d3 bd b8 e0  Error: UNC at LBA = 0x00b8bdd3 = 12107219

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 68 bb b8 e0 00      04:18:22.540  READ DMA EXT
  27 00 00 00 00 00 e0 00      04:18:20.249  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      04:18:20.159  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      04:18:20.151  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:18:20.097  READ NATIVE MAX ADDRESS EXT

Error 1 occurred at disk power-on lifetime: 14587 hours (607 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d3 bd b8 e0  Error: UNC at LBA = 0x00b8bdd3 = 12107219

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 68 bb b8 e0 00      04:18:17.515  READ DMA EXT
  25 00 08 68 c2 b8 e0 00      04:18:20.249  READ DMA EXT
  25 00 08 60 c2 b8 e0 00      04:18:20.159  READ DMA EXT
  27 00 00 00 00 00 e0 00      04:18:20.151  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      04:18:20.097  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Aborted by host              90%    13738        -
# 2  Short offline      Interrupted (host reset)      90%      117        -
# 3  Short offline      Completed without error      00%        6        -
# 4  Extended offline    Completed without error      00%        4        -
# 5  Short offline      Completed without error      00%        0        -
# 6  Short offline      Completed without error      00%        0        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdg:
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:    Seagate Barracuda 7200.12
Device Model:    ST3750528AS
Serial Number:    9VP1KKD9
LU WWN Device Id: 5 000c50 0159f1da1
Firmware Version: CC38
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Jan  8 18:11:55 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)        Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)        The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  617) seconds.
Offline data collection
capabilities:                          (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)        Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)        Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          (  1) minutes.
Extended self-test routine
recommended polling time:          ( 154) minutes.
Conveyance self-test routine
recommended polling time:          (  2) minutes.
SCT capabilities:                (0x103f)        SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  119  099  006    Pre-fail  Always      -      219266637
  3 Spin_Up_Time            0x0003  096  095  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      223
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  072  060  030    Pre-fail  Always      -      18404170
  9 Power_On_Hours          0x0032  096  096  000    Old_age  Always      -      3518
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      106
183 Runtime_Bad_Block      0x0032  099  099  000    Old_age  Always      -      1
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  098  000    Old_age  Always      -      5
189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0
190 Airflow_Temperature_Cel 0x0022  068  040  045    Old_age  Always  In_the_past 32 (7 62 35 30)
194 Temperature_Celsius    0x0022  032  060  000    Old_age  Always      -      32 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a  050  038  000    Old_age  Always      -      219266637
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0
240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      17871358922355
241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      2880802208
242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      833055224

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


tunixman 01-08-2013 12:12 PM

Also, there's probably no reason to completely reconfigure the array because of a failed device. Troubleshooting efforts should probably be focused on why the device has failed and not why the RAID array has disabled the failed device, since that's actually what it's meant to do.

tunixman 01-08-2013 12:14 PM

Is your data inaccessible now? It should actually be still available even with a failed drive unless more has gone on since the original drive failed.

tunixman 01-08-2013 12:18 PM

I'll admit I'm having a lot of trouble following what's going on here. With a failed drive the RAID should continue working. I did see you mentioned something about running "create" again, and that would certainly cause data loss if it actually succeeded, but I think it requires lots of extra options to be convinced to actually run on physical devices with RAID superblocks.

codemastermm 01-08-2013 12:40 PM

My create did succeed - I made sure to run it with --assume-clean only, though, which I believe only changes the superblock and not any actual data on the drive. I have never seen it starting to sync in /proc/mdstat either, which I believe is a GOOD thing in this case.

One more thing to note: the slot numbers on the drives in my RAID are a bit wonky (skipping over 0 and 2). I still think this might be causing half of my woes, but I am uncertain.

tunixman 01-08-2013 12:49 PM

I'm having a lot of trouble following this. It sounds a lot like a drive failed, the partition was no longer visible, RAID took it out of the array, and then it got forced back in with the partition still not visible. If you could be less vague about the order of events, it would really help.

I don't think the "slot numbers" matter (I'm not quite sure what you're talking about either), since the superblock records which drives get which stripes.

Also, whether create succeeded or failed, if there's a drive that's not good and you force it back in without checking it first, that's usually a very efficient way to lose data, and that may be what's going on here.

tunixman 01-08-2013 12:50 PM

Also this seems like it may help: http://ubuntuforums.org/showthread.php?t=1979610

codemastermm 01-08-2013 12:57 PM

What's happened is I had a drive seemingly fail and get knocked out of the RAID. I tried to force it back in (via --create --assume-clean, kind of like the link you posted -> http://ubuntuforums.org/showpost.php...79&postcount=3). This way, I figured, it wouldn't be mucking with the data actually on the drives.

I launched gparted and it showed a broken partition (see: http://i.imgur.com/UA2CT.png), so I figured I'd attempt to have it fully repair itself. (Now that I think about it, perhaps I should have gone with that and run fsck instead?)

I then started reading around online, trying to find better information on how to approach this. I noticed, at that time, that my disk slots had very odd numbering (md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F); notice they're disks 1, 3, 4, and 5, skipping over 0 and 2). I found a bit of information (http://serverfault.com/questions/447...re-raid-arrays) that spoke about creating the mdadm array and then adding the drives back in. I attempted that and was met with no partition at all, which is probably even worse. Thankfully, I again used --assume-clean, so that mdadm wouldn't start resyncing the drives.

My thoughts are now, maybe I should try going back to the point where my partition is somewhat shown and try a fsck?
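
For reference, the bracketed numbers in that mdstat line can be pulled out with a few lines of shell. As far as I know, the number in brackets is md's internal device number for each member, and with 1.x metadata it does not have to equal the member's role in the stripe; the role itself is what `mdadm --examine` reports on its "Device Role" line. The helper name `list_members` is made up for this sketch, and the sample line is copied from the post above rather than read from a live system.

```shell
# Sample line quoted in the thread; on a live system it would come from /proc/mdstat.
line='md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)'

# list_members (hypothetical helper): print "<number> <device><flags>" for each
# member in one mdstat line, sorted by the bracketed number.
list_members() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        sed -n 's/^\([a-z0-9]*\)\[\([0-9]*\)]\(.*\)$/\2 \1\3/p' |
        sort -n
}

list_members "$line"
```

Comparing this list with the "Device Role" lines from `mdadm --examine` on each member would show whether the odd numbering actually says anything about member order.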

tunixman 01-08-2013 01:01 PM

I'd try assembling the array with all the drives, but keeping the failed drive failed, so it's running degraded, and then see if the partition on the RAID device is still present. (I'm not sure still which partition you're talking about not being visible, the physical disk partition or the RAID partition). Then run fsck, and replace the drive that came out of the array. The slot numbering shouldn't matter since the striping information is determined by the superblock.
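
A rough sketch of that sequence, for anyone following along. The device names are assumptions taken from the mdstat output in this thread, `run` and `check_degraded` are made-up helper names, and the whole thing only prints the commands unless DRY_RUN is set to 0. The array is assembled read-only and fsck is called with -n (report only, for ext filesystems), so nothing on the members should be modified by the check itself.

```shell
# run (hypothetical wrapper): with DRY_RUN=1, the default here, only print the command.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# check_degraded MD_DEVICE MEMBER...
# Members are the drives still considered good; the failed one is left out.
check_degraded() {
    md=$1
    shift
    for dev in "$@"; do
        run mdadm --examine "$dev"                    # dump each member's superblock first
    done
    run mdadm --assemble --run --readonly "$md" "$@"  # start degraded and read-only
    run cat /proc/mdstat                              # confirm the array state
    run blkid "$md"                                   # is a filesystem signature visible?
    run fsck -n "$md"                                 # -n: report problems, change nothing
}

check_degraded /dev/md2 /dev/sde3 /dev/sdf3 /dev/sdg3
```

If blkid finds no filesystem on the md device, a read-only fsck is unlikely to tell you much more, and it is probably better to stop there than to let anything write to the array.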

codemastermm 01-08-2013 01:04 PM

Oh, my apologies. I meant the RAID partition.
Upon assembling the drives, I still see the "unknown partition" (http://i.imgur.com/UA2CT.png), so would you recommend running fsck on that?
If so, it's starting to seem a bit more hopeful!!

tunixman 01-08-2013 01:07 PM

I wouldn't recommend fsck on that unknown partition. It sounds like the underlying RAID is reading data incorrectly, probably because it's been reorganized. What does /proc/mdstat look like?

codemastermm 01-08-2013 01:07 PM

Code:

Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdg3[3] sdf3[2] sde3[1]
      2186264064 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]


tunixman 01-08-2013 01:13 PM

Well. It's definitely different geometry. And I think this means that the RAID system has moved the drives around in the RAID, so all of the stripe/parity information will be invalid. You've overridden the original slot information, so you may need to run create again, but with the drives specified in the right order and leaving slot 2 missing. But there's been quite a bit of creative troubleshooting here and we're rapidly leaving what I've done with mdadm, so I may not be able to do much except read man pages to you past this point.
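
To put a number on the geometry change: the failure e-mail at the top of the thread reported md2 at 2185873920 blocks, and the mdstat just above reports 2186264064. The arithmetic below is only a sketch, assuming mdstat's usual 1 KiB blocks and three data-bearing members in a 4-device RAID5.

```shell
old_blocks=2185873920   # md2 size in the original failure e-mail (1 KiB blocks)
new_blocks=2186264064   # md2 size after the re-create
data_members=3          # 4-device RAID5: one device's worth of space holds parity

diff_kib=$(( new_blocks - old_blocks ))
per_member_kib=$(( diff_kib / data_members ))
per_member_mib=$(( per_member_kib / 1024 ))

echo "array grew by ${diff_kib} KiB: ${per_member_kib} KiB (${per_member_mib} MiB) per member"
```

One possible explanation, which is an assumption and not something confirmed in this thread, is that the mdadm used for the re-create picked a different data offset than the one the array was originally built with. If so, the data would start at a different place on every member even with the drives in the right order. The "Data Offset" line in `mdadm --examine` shows the value currently in use.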

codemastermm 01-08-2013 03:14 PM

Yup, that was my thought - having to recreate the array with everything in the exact same position as mdadm had them previously (which is why I was starting to think that the disk slot # was an issue).

I've so far tried creating a RAID5 device purposely degraded (so as to leave disk 1 and 2 out and add them in as 4 and 5), but I am receiving a "RUN_ARRAY failed: Input/output error", which isn't too comforting. Here's what I ran:

Code:

sudo mdadm --create --assume-clean /dev/md2 --level=5 --raid-devices=4 missing /dev/sde3 missing /dev/sdg3
mdadm: /dev/sde3 appears to contain an ext2fs file system
    size=728756224K  mtime=Thu Jan  1 00:00:00 1970
mdadm: /dev/sde3 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Tue Jan  8 19:07:39 2013
mdadm: /dev/sdg3 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Tue Jan  8 19:07:39 2013
Continue creating array? yes
mdadm: Defaulting to version 1.2 metadata
mdadm: RUN_ARRAY failed: Input/output error
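
A likely reading of that RUN_ARRAY error, offered as an assumption since the thread ends here: RAID5 can run with at most one member absent, and the command above lists two of the four slots as missing, so there is nothing the kernel can start. The check below is just a sketch of that rule; `can_start_raid5` is a made-up helper and does not call mdadm at all.

```shell
# can_start_raid5 (hypothetical helper): takes the slot list as it would be given
# to "mdadm --create" and succeeds only if at most one slot is "missing".
can_start_raid5() {
    missing=0
    for slot in "$@"; do
        if [ "$slot" = missing ]; then
            missing=$(( missing + 1 ))
        fi
    done
    echo "slots=$# missing=$missing"
    [ "$missing" -le 1 ]
}

# The slot list from the command above.
can_start_raid5 missing /dev/sde3 missing /dev/sdg3 ||
    echo "too many missing slots for RAID5"
```

With only one slot left as missing and the other three members listed in their original order, the same check passes, which matches the degraded [4/3] state the array was in before.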


