LinuxQuestions.org
01-08-2013, 08:34 AM   #1
codemastermm
LQ Newbie

Registered: Jan 2013
Posts: 8

mdadm RAID5 degraded, recreated, partition problems??


So I was at work yesterday and received the following automated e-mail from my home server about my RAID array:

Code:
This is an automatically generated mail message from mdadm
running on cyrus

A Fail event had been detected on md device /dev/md/2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid5 sdf2[2] sdg2[3] sde2[1] sdd2[0]
      11002368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      
md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)
      2185873920 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
      
md0 : active raid5 sdf1[2] sdg1[3] sde1[1] sdd1[0]
      437760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>
I rebooted the system from a Live CD and tried to re-add the failed drive to the array (the cable sometimes comes loose). --re-add simply wouldn't work, so I figured I had to re-create the array with --assume-clean (after all, the array was working fine!).
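
For reference, the degraded state can be read directly from the mdstat text in that mail: `[4/3]` means four slots with three active, and the `_` in `[_UUU]` marks the empty slot. Below is a minimal sketch in POSIX shell and awk that lists degraded arrays. The function name `degraded_arrays` is made up for illustration, and the sample is trimmed from the mail above; on a live system the input would be /proc/mdstat.

```shell
# degraded_arrays is a hypothetical helper: it reads mdstat-formatted text on
# stdin and prints each array whose [slots/active] counter shows fewer active
# members than slots.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the array being described
        match($0, /\[[0-9]+\/[0-9]+\]/) {       # counter such as [4/3]
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
        }
    '
}

# Sample trimmed from the mail quoted in this post.
sample='md1 : active raid5 sdf2[2] sdg2[3] sde2[1] sdd2[0]
      11002368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)
      2185873920 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]'

printf '%s\n' "$sample" | degraded_arrays
```

On the sample this prints only md2. To check a running system the same function could be fed with `degraded_arrays < /proc/mdstat`.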

I checked using gparted to find that... nope. It was unable to detect my partition in there. Crap.

So I tried to at least get it back to the way it was: I ran the create with --assume-clean again, but with the "failed" drive (/dev/sdd3) given as missing. Again, no partition listed in gparted. Lovely.

So now here I am asking you all for help!
Current ideas I have so far:
- the device numbering in /proc/mdstat is a bit weird for some reason (1, 3, 4, 5?). Maybe that is related, or can help with assembling this correctly somehow...
- I was reading/googling around and saw one guy mention zeroing out the superblock and having mdadm attempt to reassemble the drives. I haven't tried this yet, though.

Any help on this whatsoever would be a huge lifesaver. Thanks so much guys!

Edit: Still reading up on this issue. One possible idea is that perhaps I can run a create with (missing /dev/sde3 missing /dev/sdg3), which would assign slots 1 and 3 appropriately and then add in /dev/sdd3 (4) and then /dev/sdf3 (5)? Not too sure if that would be the best way to do it, but certainly that is a thought to get the slot numbers right....

Last edited by codemastermm; 01-08-2013 at 08:56 AM. Reason: new ideas?
 
01-08-2013, 10:36 AM   #2
codemastermm
LQ Newbie

Registered: Jan 2013
Posts: 8

Original Poster
Reading over a bunch of articles, it seems that a possibility is as follows:
- Create the mdadm array with missing disks for the missing disk slots (mdadm --create /dev/md2 --assume-clean missing /dev/sde3 missing /dev/sdg3)
- Add the disks that are in the later slots of 4 and 5 (mdadm --add /dev/md2 /dev/sdd3; mdadm --add /dev/md2 /dev/sdf3)

...Possibly this will work? I'm curious if anyone can give me a sanity check before I try it out.

Edit: Ended up trying this in a crazy sort of way:
Code:
sudo mdadm --create /dev/md2 --assume-clean --level=5 --verbose --force --raid-devices=4 /dev/sdd3 /dev/sde3 /dev/sdf3 /dev/sdg3
sudo mdadm --manage /dev/md2 --fail /dev/sdd3
sudo mdadm --manage /dev/md2 --fail /dev/sdf3
sudo mdadm --zero-superblock /dev/sdd3
sudo mdadm --zero-superblock /dev/sdf3
sudo mdadm --manage /dev/md2 --add /dev/sdd3
sudo mdadm --manage /dev/md2 --add /dev/sdf3

In the end, this gave me the following when I ran `cat /proc/mdstat`:
Code:
md2 : active raid5 sdf3[5](S) sdd3[4](S) sdg3[3] sde3[1]
      2186264064 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [_U_U]

gparted, however, showed -no- partition on the data which... sucks.
I wonder, though, is that because the two other drives (sdf3 and sdd3) haven't synced back into the array yet?
What should my next two steps be??
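
One thing the two mdstat listings do show is that md2 changed size across the re-create. Here is a small shell calculation using only the block counts posted in this thread. The interpretation is an assumption and is not confirmed anywhere in the thread: if each member now contributes more space to the array, the new superblocks probably place the start of the data at a different offset than the original ones did, and the old filesystem would then no longer begin where gparted looks for it. `mdadm --examine` prints a "Data Offset" line for each member with 1.x metadata, which would be the value to compare.

```shell
# Sizes are the 1 KiB block counts that /proc/mdstat reported for md2.
before=2185873920   # original array, from the mdadm mail
after=2186264064    # after mdadm --create
data_members=3      # a 4-device RAID5 stores data on 3 members per stripe

diff_total=$((after - before))
diff_per_member=$((diff_total / data_members))

echo "total difference:      ${diff_total} KiB"
echo "per-member difference: ${diff_per_member} KiB ($((diff_per_member / 1024)) MiB)"
```

With the numbers above the array grew by 390144 KiB, which is 130048 KiB (127 MiB) per member.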

Thanks again.

Last edited by codemastermm; 01-08-2013 at 11:52 AM.
 
01-08-2013, 12:06 PM   #3
tunixman
LQ Newbie

Registered: Jan 2013
Posts: 9

SMART Logs

First check the SMART logs for the device. It looks to me like the device has actually legitimately failed and should be replaced. However, recreating the partition table as it was before it vanished would at least make the RAID partition visible to the system and it should get added back and resynced.
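
As a rough illustration of that check, the sketch below filters `smartctl -A` style output down to a handful of attributes and prints the ones with a non-zero raw value. The choice of attribute IDs (5, 187, 197, 198, 199) is an assumption about what is commonly watched, not something stated in this thread, and `smart_flags` is a made-up helper name. The sample rows are copied from the /dev/sdd report posted in this thread.

```shell
# smart_flags is a hypothetical helper: it reads attribute-table text on stdin
# and prints NAME=RAW for the watched IDs whose raw value (column 10) is > 0.
smart_flags() {
    awk '$1 ~ /^(5|187|197|198|199)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Rows taken from the /dev/sdd attribute table in this thread.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       33
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       23228
187 Reported_Uncorrect      0x0032   078   078   000    Old_age   Always       -       22
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       11'

printf '%s\n' "$sample" | smart_flags
```

Against a real drive the input would come from something like `smartctl -A /dev/sdd | smart_flags`. A non-zero count is a reason to look closer rather than proof of failure, and a CRC error count in particular can point at cabling rather than the disk itself.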
 
01-08-2013, 12:12 PM   #4
codemastermm
LQ Newbie

Registered: Jan 2013
Posts: 8

Original Poster
Yup, that's my plan! At least get my data back so I can go out and buy a new drive to put it on.
The drives are getting a bit old, so I wouldn't be too surprised if one is dying/dead. Literally yesterday morning I was planning out a backup scheme, right before this happened...

Here's my smartctl output, with /dev/sdd on top (the device that decided to jump out of the array in question):
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3750640AS
Serial Number:    5QD334BQ
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:09:00 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41)	The self-test routine was interrupted
					by the host with a hard or soft reset.
Total time to complete Offline 
data collection: 		(  430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   069   006    Pre-fail  Always       -       6582973
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       256
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       33
  7 Seek_Error_Rate         0x000f   090   060   030    Pre-fail  Always       -       981902468
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       23228
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       466
187 Reported_Uncorrect      0x0032   078   078   000    Old_age   Always       -       22
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   024   045    Old_age   Always   In_the_past 39 (0 22 43 37)
194 Temperature_Celsius     0x0022   039   076   000    Old_age   Always       -       39 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a   063   052   000    Old_age   Always       -       241173855
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       11
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 22 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 22 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:52.009  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:51.946  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:51.930  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:51.917  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:51.858  READ NATIVE MAX ADDRESS EXT

Error 21 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:52.009  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:51.946  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:51.930  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:51.917  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:51.858  READ NATIVE MAX ADDRESS EXT

Error 20 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:47.637  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:47.583  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:47.579  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

Error 19 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:47.637  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:47.583  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:47.579  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

Error 18 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 40 2f 0e e0  Error: UNC at LBA = 0x000e2f40 = 929600

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 50 2d 0e e0 00      06:22:43.448  READ DMA EXT
  27 00 00 00 00 00 e0 00      06:22:43.447  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      06:22:43.435  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      06:22:45.610  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      06:22:45.540  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      90%      6579         -
# 2  Short offline       Completed without error       00%      6492         -
# 3  Short offline       Completed without error       00%      6471         -
# 4  Short offline       Completed without error       00%      5931         -
# 5  Short offline       Completed without error       00%      4676         -
# 6  Short offline       Completed without error       00%      4400         -
# 7  Short offline       Completed without error       00%      4289         -
# 8  Short offline       Completed without error       00%      2396         -
# 9  Short offline       Completed without error       00%      2239         -
#10  Short offline       Completed without error       00%       797         -
#11  Short offline       Interrupted (host reset)      90%       795         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
And just for clarity, I suppose, here's the rest:
/dev/sde:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3750640AS
Serial Number:    5QD40LEC
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:10:58 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41)	The self-test routine was interrupted
					by the host with a hard or soft reset.
Total time to complete Offline 
data collection: 		(  430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   091   076   006    Pre-fail  Always       -       76035097
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       262
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       15
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       34385094
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       23284
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       470
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   022   045    Old_age   Always   In_the_past 35 (0 98 39 33)
194 Temperature_Celsius     0x0022   035   078   000    Old_age   Always       -       35 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   051   049   000    Old_age   Always       -       73134178
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      90%      6571         -
# 2  Short offline       Completed without error       00%      6485         -
# 3  Short offline       Completed without error       00%      5924         -
# 4  Short offline       Completed without error       00%      4676         -
# 5  Short offline       Completed without error       00%      4400         -
# 6  Short offline       Completed without error       00%      4289         -
# 7  Short offline       Completed without error       00%      2396         -
# 8  Short offline       Completed without error       00%      2239         -
# 9  Short offline       Completed without error       00%       797         -
#10  Short offline       Interrupted (host reset)      90%       795         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sdf:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3750640AS
Serial Number:    5QD3P86D
Firmware Version: 3.AAE
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  8 18:11:30 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  25)	The self-test routine was aborted by
					the host.
Total time to complete Offline 
data collection: 		(  430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 202) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   102   069   006    Pre-fail  Always       -       164141535
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       166
  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       85
  7 Seek_Error_Rate         0x000f   082   055   030    Pre-fail  Always       -       13480475557
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16824
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       295
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   022   045    Old_age   Always   In_the_past 35 (0 90 38 33)
194 Temperature_Celsius     0x0022   035   078   000    Old_age   Always       -       35 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   058   046   000    Old_age   Always       -       126512467
197 Current_Pending_Sector  0x0012   100   098   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0010   100   098   000    Old_age   Offline      -       19
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 14587 hours (607 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d3 bd b8 e0  Error: UNC at LBA = 0x00b8bdd3 = 12107219

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 68 bb b8 e0 00      04:18:22.540  READ DMA EXT
  27 00 00 00 00 00 e0 00      04:18:20.249  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      04:18:20.159  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02      04:18:20.151  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:18:20.097  READ NATIVE MAX ADDRESS EXT

Error 1 occurred at disk power-on lifetime: 14587 hours (607 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d3 bd b8 e0  Error: UNC at LBA = 0x00b8bdd3 = 12107219

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 68 bb b8 e0 00      04:18:17.515  READ DMA EXT
  25 00 08 68 c2 b8 e0 00      04:18:20.249  READ DMA EXT
  25 00 08 60 c2 b8 e0 00      04:18:20.159  READ DMA EXT
  27 00 00 00 00 00 e0 00      04:18:20.151  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      04:18:20.097  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               90%     13738         -
# 2  Short offline       Interrupted (host reset)      90%       117         -
# 3  Short offline       Completed without error       00%         6         -
# 4  Extended offline    Completed without error       00%         4         -
# 5  Short offline       Completed without error       00%         0         -
# 6  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sdg:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12
Device Model:     ST3750528AS
Serial Number:    9VP1KKD9
LU WWN Device Id: 5 000c50 0159f1da1
Firmware Version: CC38
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Jan  8 18:11:55 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  617) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 154) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       219266637
  3 Spin_Up_Time            0x0003   096   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       223
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       18404170
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3518
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       106
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       5
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   040   045    Old_age   Always   In_the_past 32 (7 62 35 30)
194 Temperature_Celsius     0x0022   032   060   000    Old_age   Always       -       32 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   050   038   000    Old_age   Always       -       219266637
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       17871358922355
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2880802208
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       833055224

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
Old 01-08-2013, 12:12 PM   #5
tunixman
LQ Newbie
 
Registered: Jan 2013
Posts: 9

Rep: Reputation: Disabled
Also, there's probably no reason to completely reconfigure the array because of a failed device. Troubleshooting efforts should probably be focused on why the device has failed and not why the RAID array has disabled the failed device, since that's actually what it's meant to do.
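One way to start on the "why did the device fail" question is to scan the SMART attribute table for rows that have ever crossed their threshold. The following is a minimal sketch, not a diagnostic tool: the sample rows are copied from the table posted earlier in this thread, and the function names `parse_smart_rows` and `suspicious` are made up for illustration.

```python
# Sketch: flag SMART attributes worth a closer look in `smartctl -A` style output.
# Sample rows are copied from the table posted earlier in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       5
190 Airflow_Temperature_Cel 0x0022   068   040   045    Old_age   Always   In_the_past 32 (7 62 35 30)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_smart_rows(text):
    """Split each attribute row into its ten columns (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # The last column (RAW_VALUE) may contain spaces, so split at most 9 times.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers and blank lines
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9].strip(),
        })
    return rows


def suspicious(rows):
    """Return names of attributes that failed now or in the past."""
    flagged = []
    for r in rows:
        crossed = r["thresh"] > 0 and r["worst"] <= r["thresh"]
        if r["when_failed"] != "-" or crossed:
            flagged.append(r["name"])
    return flagged


rows = parse_smart_rows(SAMPLE)
flagged = suspicious(rows)
print(flagged)
```

On these rows the only hit is the airflow temperature attribute, which says the drive ran hot at some point. That alone does not explain the drive being kicked from the array; the reallocated and pending sector counts in the posted table are both 0.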
 
Old 01-08-2013, 12:14 PM   #6
tunixman
LQ Newbie
 
Registered: Jan 2013
Posts: 9

Rep: Reputation: Disabled
Is your data inaccessible now? It should actually be still available even with a failed drive unless more has gone on since the original drive failed.
 
Old 01-08-2013, 12:18 PM   #7
tunixman
LQ Newbie
 
Registered: Jan 2013
Posts: 9

Rep: Reputation: Disabled
I'll admit I'm having a lot of trouble following what's going on here. With a failed drive the RAID should continue working. I did see you mentioned something about running "create" again, and that would certainly cause data loss if it actually succeeded, but I think it requires lots of extra options to be convinced to actually run on physical devices with RAID superblocks.
 
Old 01-08-2013, 12:40 PM   #8
codemastermm
LQ Newbie
 
Registered: Jan 2013
Posts: 8

Original Poster
Rep: Reputation: Disabled
My create did succeed - I made sure to run it with --assume-clean only, though, which I believe only changes the superblock and not any actual data on the drive. I have never seen it starting to sync in /proc/mdstat either, which I believe is a GOOD thing in this case.

One thing still worth noting: the slot numbers on the drives in my RAID are a bit wonky (skipping over 0 and 2). I still think this might be causing half of my woes, but I am uncertain.

Last edited by codemastermm; 01-08-2013 at 12:42 PM.
 
Old 01-08-2013, 12:49 PM   #9
tunixman
LQ Newbie
 
Registered: Jan 2013
Posts: 9

Rep: Reputation: Disabled
I'm having a lot of trouble following this, but it sounds a lot like a drive failed, the partition was no longer visible, RAID took it out of the array, and then it got forced back in with the partition still not visible. If you could be less vague about the order of events, it would really help.

I don't think the "slot numbers" matter (I'm not quite sure what you're talking about either), since the superblock records which drives get which stripes.
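To make the "slot numbers" concrete, here is a small sketch that pulls the bracketed numbers and the (F) flag out of the md2 line from the original alert e-mail. As far as I understand it, the number in brackets in /proc/mdstat is a per-device number assigned by md and need not match the device's position in the stripe; the "Device Role" line in `mdadm --examine` output is the place to check the actual role. The helper name `parse_members` is hypothetical.

```python
import re

# The md2 line from the alert e-mail in the first post.
MDSTAT_LINE = "md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)"

# name[number] optionally followed by single-letter flags such as (F).
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\]((?:\([A-Z]\))*)")


def parse_members(line):
    """Return member devices sorted by their bracketed number (illustrative helper)."""
    members = []
    for name, number, flags in MEMBER_RE.findall(line):
        members.append({
            "device": name,
            "number": int(number),
            "faulty": "(F)" in flags,
        })
    return sorted(members, key=lambda m: m["number"])


members = parse_members(MDSTAT_LINE)
for m in members:
    state = "faulty" if m["faulty"] else "ok"
    print(f'{m["device"]}: number {m["number"]}, {state}')
```

Gaps in these numbers (here 0 and 2 are missing) would be consistent with members having been replaced or re-added at some point, though the thread does not confirm that.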

Also, whether create succeeded or failed, if there's a drive that's not good and you force it back in without checking it first, that's usually a very efficient way to lose data, and that may be what's going on here.
 
Old 01-08-2013, 12:50 PM   #10
tunixman
LQ Newbie
 
Registered: Jan 2013
Posts: 9

Rep: Reputation: Disabled
Also this seems like it may help: http://ubuntuforums.org/showthread.php?t=1979610
 
Old 01-08-2013, 12:57 PM   #11
codemastermm
LQ Newbie
 
Registered: Jan 2013
Posts: 8

Original Poster
Rep: Reputation: Disabled
What's happened is I had a drive seemingly fail and get knocked out of the RAID. I tried to force it back in (via --create --assume-clean, kind of like the link you posted -> http://ubuntuforums.org/showpost.php...79&postcount=3). This way, I figured, it wouldn't be mucking with the data actually on the drives.

I launched gparted and it showed a broken partition (see: http://i.imgur.com/UA2CT.png), so I figured I'd attempt to have it fully repair itself. (Now that I think about it, perhaps I should have gone with that and run fsck instead?)

I then started reading around online, trying to find better information on how to approach this. I noticed, at that time, that my disk slots had very odd numbering (md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F), notice they're disk 1, 3, 4, and 5; skipping over 0 and 2). I found a bit of information (http://serverfault.com/questions/447...re-raid-arrays) that spoke about creating the mdadm array and then adding the drives back in. I attempted that and was met with no partition at all, which is probably even worse. Thankfully, I again used --assume-clean, so that mdadm wouldn't start resyncing the drives.

My thoughts are now, maybe I should try going back to the point where my partition is somewhat shown and try a fsck?
 
Old 01-08-2013, 01:01 PM   #12
tunixman
LQ Newbie
 
Registered: Jan 2013
Posts: 9

Rep: Reputation: Disabled
I'd try assembling the array with all the drives, but keeping the failed drive failed, so it's running degraded, and then see if the partition on the RAID device is still present. (I'm still not sure which partition you mean as not being visible: the physical disk partition or the RAID partition.) Then run fsck, and replace the drive that came out of the array. The slot numbering shouldn't matter since the striping information is determined by the superblock.
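The sequence described above could be written down as a dry-run plan before anything is typed into a root shell. This is only a sketch under assumptions: the device names come from the mdstat output in this thread, the helper `degraded_check_plan` is hypothetical, and the `--readonly` flag and the no-changes `fsck -n` are my additions to keep the steps non-destructive, so check both against the local man pages. Nothing is executed; the commands are only printed.

```python
import shlex


def degraded_check_plan(md_device, good_members):
    """Return the commands as argv lists; nothing is executed here."""
    return [
        # Inspect the superblocks on the surviving members first.
        ["mdadm", "--examine", *good_members],
        # Assemble from the surviving members only; --run starts the array
        # even though one member is missing.
        ["mdadm", "--assemble", "--run", "--readonly", md_device, *good_members],
        # Confirm the array came up degraded, e.g. [4/3] [_UUU].
        ["cat", "/proc/mdstat"],
        # Check the filesystem without letting fsck change anything.
        ["fsck", "-n", md_device],
    ]


plan = degraded_check_plan("/dev/md2", ["/dev/sde3", "/dev/sdf3", "/dev/sdg3"])
for argv in plan:
    print(shlex.join(argv))
```

Leaving /dev/sdd3 out of the member list is what keeps the failed drive failed; it is only added back after the filesystem has been checked.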
 
Old 01-08-2013, 01:04 PM   #13
codemastermm
LQ Newbie
 
Registered: Jan 2013
Posts: 8

Original Poster
Rep: Reputation: Disabled
Oh, my apologies. I meant the RAID partition.
Upon assembling the drives, I still see the "unknown partition" (http://i.imgur.com/UA2CT.png), so would you recommend running fsck on that?
If so, it's starting to seem a bit more hopeful!!
 
Old 01-08-2013, 01:07 PM   #14
tunixman
LQ Newbie
 
Registered: Jan 2013
Posts: 9

Rep: Reputation: Disabled
I wouldn't recommend fsck on that unknown partition. It sounds like the underlying RAID is reading data incorrectly, probably because it's been reorganized. What does /proc/mdstat look like?
 
Old 01-08-2013, 01:07 PM   #15
codemastermm
LQ Newbie
 
Registered: Jan 2013
Posts: 8

Original Poster
Rep: Reputation: Disabled
Code:
Personalities : [raid6] [raid5] [raid4] 
md2 : active raid5 sdg3[3] sdf3[2] sde3[1]
      2186264064 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
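The block count above differs from the one in the original alert e-mail (2185873920), so the re-created md2 is not the same size as the old one. A quick arithmetic sketch, assuming the mdstat "blocks" figure is in 1 KiB units and that a 4-member RAID5 holds three members' worth of data:

```python
# Block counts (1 KiB units) for md2, taken from the two mdstat outputs in this thread.
KIB_BEFORE = 2185873920   # from the alert e-mail, before --create was run
KIB_AFTER = 2186264064    # from the post above, after --create --assume-clean
DATA_DISKS = 4 - 1        # RAID5 over 4 members: one member's worth is parity


def per_member_delta_kib(before, after, data_disks):
    """How much more (or less) space each member contributes after re-creation."""
    delta = after - before
    if delta % data_disks:
        raise ValueError("size change does not divide evenly across members")
    return delta // data_disks


delta = per_member_delta_kib(KIB_BEFORE, KIB_AFTER, DATA_DISKS)
print(delta, "KiB per member =", delta // 1024, "MiB")
```

Each member now contributes 127 MiB more than before. One possible reading, which the thread does not confirm, is that the re-created array starts its data at a different point on each partition than the original did; that would fit the filesystem no longer being recognized. Comparing the "Data Offset" line in `mdadm --examine` output against whatever the original array used would be the way to check.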
 
  

