So I was at work yesterday and received the following automated e-mail from my home server about my RAID array:
Code:
This is an automatically generated mail message from mdadm
running on cyrus
A Fail event had been detected on md device /dev/md/2.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdf2[2] sdg2[3] sde2[1] sdd2[0]
11002368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)
2185873920 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
md0 : active raid5 sdf1[2] sdg1[3] sde1[1] sdd1[0]
437760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
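The /proc/mdstat snapshot in that mail can also be read mechanically. Below is a minimal Python sketch, not part of mdadm, that assumes the two-lines-per-array layout shown above. The names `parse_mdstat` and `degraded` are made up for illustration. The number in square brackets is stored simply as the number mdstat prints, without assuming it is the member's slot in the array.

```python
import re

# Sample text copied from the mdadm mail above.
MDSTAT = """\
md1 : active raid5 sdf2[2] sdg2[3] sde2[1] sdd2[0]
11002368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F)
2185873920 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
md0 : active raid5 sdf1[2] sdg1[3] sde1[1] sdd1[0]
437760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
"""

ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.*)$")
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\](?:\((\w)\))?")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return {array name: info dict} for each 'mdN : ...' entry."""
    arrays = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, rest = header.groups()
            members = [
                {"dev": dev, "number": int(num), "flag": flag}
                for dev, num, flag in MEMBER_RE.findall(rest)
            ]
            current = {"state": state, "level": level, "members": members}
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            # [4/3] means 4 devices expected, 3 currently in use.
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
            current["map"] = status.group(3)
    return arrays


def degraded(arrays):
    """Map each array with fewer active than expected members to its (F) devices."""
    return {
        name: [m["dev"] for m in info["members"] if m["flag"] == "F"]
        for name, info in arrays.items()
        if info.get("active", 0) < info.get("expected", 0)
    }


if __name__ == "__main__":
    print(degraded(parse_mdstat(MDSTAT)))
```

On the snapshot above this reports only md2, with sdd3 as the member marked (F); md0 and md1 on the same disks are still complete.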
I rebooted the system into a Live CD and attempted to re-add the failed drive to the array (sometimes the cable becomes loose). --re-add simply wouldn't work, so I figured I had to re-create the array with --assume-clean (after all, the array was working fine!).
I checked using gparted to find that... nope. It was unable to detect my partition in there. Crap.
So I attempted to at least get it back to the way it was: I tried doing a create with --assume-clean again, but with the "failed" drive (/dev/sdd3) left out as missing. Again, no partition listed in gparted. Lovely.
So now here I am asking you all for help!
Current ideas I have so far:
- the drive numbering I have for some reason is a bit weird (1,3,4,5?). Maybe that is related or can help with assembling this correctly somehow...
- I was reading/googling around and saw one guy mention zeroing out the superblock and having mdadm attempt to reassemble the drives. I haven't tried this yet, though.
Any help on this whatsoever would be a huge lifesaver. Thanks so much guys!
Edit: Still reading up on this issue. One possible idea is that perhaps I can run a create with (missing /dev/sde3 missing /dev/sdg3), which would assign slots 1 and 3 appropriately and then add in /dev/sdd3 (4) and then /dev/sdf3 (5)? Not too sure if that would be the best way to do it, but certainly that is a thought to get the slot numbers right....
Last edited by codemastermm; 01-08-2013 at 08:56 AM.
Reason: new ideas?
Reading over a bunch of articles, it seems that a possibility is as follows:
- Create the mdadm array with missing disks for the missing disk slots (mdadm --create /dev/md2 --assume-clean missing /dev/sde3 missing /dev/sdg3)
- Add the disks that are in the later slots of 4 and 5 (mdadm --add /dev/md2 /dev/sdd3; mdadm --add /dev/md2 /dev/sdf3)
...Possibly this will work? I'm curious if anyone can give me a sanity check before I try it out.
In the end, this gave me the following when I ran `cat /proc/mdstat`:
md2 : active raid5 sdf3[5](S) sdd3[4](S) sdg3[3] sde3[1]
2186264064 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [_U_U]
gparted, however, showed -no- partition on the data which... sucks.
I wonder, though, is that because the two other drives (sdf3 and sdd3) haven't synced back into the array yet?
What should my next two steps be??
Thanks again.
Last edited by codemastermm; 01-08-2013 at 11:52 AM.
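On the question of whether the two (S) members matter: RAID5 stores one member's worth of parity, so its data can only be presented with at most one member missing. The sketch below is a rough illustration of that rule applied to the status lines quoted in this thread. The function name `can_present_data` and the `TOLERATED` table are assumptions made for the example, and the table covers only the parity levels with a fixed tolerance.

```python
import re

STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")

# How many missing members each level survives (standard RAID definitions).
TOLERATED = {"raid0": 0, "raid4": 1, "raid5": 1, "raid6": 2}


def can_present_data(level, status_line):
    """True if the [expected/active] counts leave enough members to read data."""
    match = STATUS_RE.search(status_line)
    if match is None:
        raise ValueError("no [n/m] [U_] status found in line")
    expected, active = int(match.group(1)), int(match.group(2))
    missing = expected - active
    return missing <= TOLERATED[level]


if __name__ == "__main__":
    # Status line from the original mail: one member failed.
    print(can_present_data("raid5", "algorithm 2 [4/3] [_UUU]"))
    # Status line after the re-create with two 'missing' slots.
    print(can_present_data("raid5", "algorithm 2 [4/2] [_U_U]"))
```

By this rule the [4/3] array from the mail was still readable, while the [4/2] array is not, which would be consistent with gparted finding nothing on it.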
First check the SMART logs for the device. It looks to me like the device has actually legitimately failed and should be replaced. However, recreating the partition table as it was before it vanished would at least make the RAID partition visible to the system and it should get added back and resynced.
Yup, that's my plan! At least get my data back so I can go out and buy a new drive to toss it on.
The drives are getting a bit old, so I wouldn't be too surprised if one is dying/dead. Literally yesterday morning I was planning a backup scheme out right before this happened...
Here's my smartctl output, with /dev/sdd on top (the device that decided to jump out of the array in question):
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3750640AS
Serial Number: 5QD334BQ
Firmware Version: 3.AAE
User Capacity: 750,156,374,016 bytes [750 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Jan 8 18:09:00 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 41) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 202) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 099 069 006 Pre-fail Always - 6582973
3 Spin_Up_Time 0x0003 096 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 256
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 33
7 Seek_Error_Rate 0x000f 090 060 030 Pre-fail Always - 981902468
9 Power_On_Hours 0x0032 074 074 000 Old_age Always - 23228
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 466
187 Reported_Uncorrect 0x0032 078 078 000 Old_age Always - 22
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 061 024 045 Old_age Always In_the_past 39 (0 22 43 37)
194 Temperature_Celsius 0x0022 039 076 000 Old_age Always - 39 (0 14 0 0)
195 Hardware_ECC_Recovered 0x001a 063 052 000 Old_age Always - 241173855
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 11
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 22 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 22 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 40 2f 0e e0 Error: UNC at LBA = 0x000e2f40 = 929600
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 50 2d 0e e0 00 06:22:52.009 READ DMA EXT
27 00 00 00 00 00 e0 00 06:22:51.946 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 06:22:51.930 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 02 06:22:51.917 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 06:22:51.858 READ NATIVE MAX ADDRESS EXT
Error 21 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 40 2f 0e e0 Error: UNC at LBA = 0x000e2f40 = 929600
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 50 2d 0e e0 00 06:22:52.009 READ DMA EXT
27 00 00 00 00 00 e0 00 06:22:51.946 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 06:22:51.930 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 02 06:22:51.917 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 06:22:51.858 READ NATIVE MAX ADDRESS EXT
Error 20 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 40 2f 0e e0 Error: UNC at LBA = 0x000e2f40 = 929600
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 50 2d 0e e0 00 06:22:47.637 READ DMA EXT
27 00 00 00 00 00 e0 00 06:22:47.583 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 06:22:47.579 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 02 06:22:45.610 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 06:22:45.540 READ NATIVE MAX ADDRESS EXT
Error 19 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 40 2f 0e e0 Error: UNC at LBA = 0x000e2f40 = 929600
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 50 2d 0e e0 00 06:22:47.637 READ DMA EXT
27 00 00 00 00 00 e0 00 06:22:47.583 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 06:22:47.579 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 02 06:22:45.610 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 06:22:45.540 READ NATIVE MAX ADDRESS EXT
Error 18 occurred at disk power-on lifetime: 22835 hours (951 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 40 2f 0e e0 Error: UNC at LBA = 0x000e2f40 = 929600
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 50 2d 0e e0 00 06:22:43.448 READ DMA EXT
27 00 00 00 00 00 e0 00 06:22:43.447 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 06:22:43.435 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 02 06:22:45.610 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 06:22:45.540 READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Interrupted (host reset) 90% 6579 -
# 2 Short offline Completed without error 00% 6492 -
# 3 Short offline Completed without error 00% 6471 -
# 4 Short offline Completed without error 00% 5931 -
# 5 Short offline Completed without error 00% 4676 -
# 6 Short offline Completed without error 00% 4400 -
# 7 Short offline Completed without error 00% 4289 -
# 8 Short offline Completed without error 00% 2396 -
# 9 Short offline Completed without error 00% 2239 -
#10 Short offline Completed without error 00% 797 -
#11 Short offline Interrupted (host reset) 90% 795 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
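The "Error: UNC at LBA" figures in the log above come from the register bytes printed on the same line. As a hedged illustration, assuming the usual 28-bit layout (low nibble of DH, then CH, CL, SN), the address can be rebuilt like this. The helper names are invented for the example, and the 512-byte sector size is taken from the "Sector Size" line of the dump.

```python
SECTOR_BYTES = 512  # from "Sector Size: 512 bytes logical/physical"


def lba28_from_registers(sn, cl, ch, dh):
    """Combine the SN, CL, CH and DH register bytes into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def lba_from_error_line(line):
    """Parse a register line laid out as: ER ST SC SN CL CH DH ..."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    return lba28_from_registers(sn, cl, ch, dh)


if __name__ == "__main__":
    line = "40 51 00 40 2f 0e e0 Error: UNC at LBA = 0x000e2f40 = 929600"
    lba = lba_from_error_line(line)
    print(hex(lba), lba, lba * SECTOR_BYTES)
```

All five logged errors on /dev/sdd decode to the same sector, so the log shows repeated failed reads of one location rather than five separate bad spots.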
And just for clarity, I suppose, here's the rest:
/dev/sde:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3750640AS
Serial Number: 5QD40LEC
Firmware Version: 3.AAE
User Capacity: 750,156,374,016 bytes [750 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Jan 8 18:10:58 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 41) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 202) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 091 076 006 Pre-fail Always - 76035097
3 Spin_Up_Time 0x0003 096 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 262
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 15
7 Seek_Error_Rate 0x000f 075 060 030 Pre-fail Always - 34385094
9 Power_On_Hours 0x0032 074 074 000 Old_age Always - 23284
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 470
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 065 022 045 Old_age Always In_the_past 35 (0 98 39 33)
194 Temperature_Celsius 0x0022 035 078 000 Old_age Always - 35 (0 13 0 0)
195 Hardware_ECC_Recovered 0x001a 051 049 000 Old_age Always - 73134178
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Interrupted (host reset) 90% 6571 -
# 2 Short offline Completed without error 00% 6485 -
# 3 Short offline Completed without error 00% 5924 -
# 4 Short offline Completed without error 00% 4676 -
# 5 Short offline Completed without error 00% 4400 -
# 6 Short offline Completed without error 00% 4289 -
# 7 Short offline Completed without error 00% 2396 -
# 8 Short offline Completed without error 00% 2239 -
# 9 Short offline Completed without error 00% 797 -
#10 Short offline Interrupted (host reset) 90% 795 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sdf:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3750640AS
Serial Number: 5QD3P86D
Firmware Version: 3.AAE
User Capacity: 750,156,374,016 bytes [750 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Jan 8 18:11:30 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 25) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 202) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 102 069 006 Pre-fail Always - 164141535
3 Spin_Up_Time 0x0003 096 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 166
5 Reallocated_Sector_Ct 0x0033 098 098 036 Pre-fail Always - 85
7 Seek_Error_Rate 0x000f 082 055 030 Pre-fail Always - 13480475557
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 16824
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 295
187 Reported_Uncorrect 0x0032 098 098 000 Old_age Always - 2
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 065 022 045 Old_age Always In_the_past 35 (0 90 38 33)
194 Temperature_Celsius 0x0022 035 078 000 Old_age Always - 35 (0 13 0 0)
195 Hardware_ECC_Recovered 0x001a 058 046 000 Old_age Always - 126512467
197 Current_Pending_Sector 0x0012 100 098 000 Old_age Always - 19
198 Offline_Uncorrectable 0x0010 100 098 000 Old_age Offline - 19
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 occurred at disk power-on lifetime: 14587 hours (607 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 d3 bd b8 e0 Error: UNC at LBA = 0x00b8bdd3 = 12107219
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 68 bb b8 e0 00 04:18:22.540 READ DMA EXT
27 00 00 00 00 00 e0 00 04:18:20.249 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 04:18:20.159 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 02 04:18:20.151 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 04:18:20.097 READ NATIVE MAX ADDRESS EXT
Error 1 occurred at disk power-on lifetime: 14587 hours (607 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 d3 bd b8 e0 Error: UNC at LBA = 0x00b8bdd3 = 12107219
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 68 bb b8 e0 00 04:18:17.515 READ DMA EXT
25 00 08 68 c2 b8 e0 00 04:18:20.249 READ DMA EXT
25 00 08 60 c2 b8 e0 00 04:18:20.159 READ DMA EXT
27 00 00 00 00 00 e0 00 04:18:20.151 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 04:18:20.097 IDENTIFY DEVICE
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Aborted by host 90% 13738 -
# 2 Short offline Interrupted (host reset) 90% 117 -
# 3 Short offline Completed without error 00% 6 -
# 4 Extended offline Completed without error 00% 4 -
# 5 Short offline Completed without error 00% 0 -
# 6 Short offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sdg:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12
Device Model: ST3750528AS
Serial Number: 9VP1KKD9
LU WWN Device Id: 5 000c50 0159f1da1
Firmware Version: CC38
User Capacity: 750,156,374,016 bytes [750 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Tue Jan 8 18:11:55 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 617) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 154) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 219266637
3 Spin_Up_Time 0x0003 096 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 223
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 18404170
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3518
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 106
183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 098 000 Old_age Always - 5
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 040 045 Old_age Always In_the_past 32 (7 62 35 30)
194 Temperature_Celsius 0x0022 032 060 000 Old_age Always - 32 (0 20 0 0)
195 Hardware_ECC_Recovered 0x001a 050 038 000 Old_age Always - 219266637
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 17871358922355
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2880802208
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 833055224
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
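To compare the four drives side by side, here is a small Python sketch that pulls the raw values of a few sector and link error counters out of the attribute tables above. Only the relevant rows are copied in, and the choice of attributes in `WATCHED` is a judgment call for this example, not an official health criterion.

```python
# IDs: 5 reallocated, 187 reported uncorrectable, 197 pending,
# 198 offline uncorrectable, 199 UDMA CRC errors.
WATCHED = {5, 187, 197, 198, 199}

# Rows copied from the smartctl output posted above.
SMART_ROWS = {
    "sdd": """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 33
187 Reported_Uncorrect 0x0032 078 078 000 Old_age Always - 22
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 11""",
    "sde": """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 15
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0""",
    "sdf": """\
5 Reallocated_Sector_Ct 0x0033 098 098 036 Pre-fail Always - 85
187 Reported_Uncorrect 0x0032 098 098 000 Old_age Always - 2
197 Current_Pending_Sector 0x0012 100 098 000 Old_age Always - 19
198 Offline_Uncorrectable 0x0010 100 098 000 Old_age Offline - 19
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0""",
    "sdg": """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0""",
}


def raw_values(table_text):
    """Return {attribute name: raw value} for the watched attribute IDs."""
    values = {}
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # not an attribute row
        if int(fields[0]) in WATCHED:
            values[fields[1]] = int(fields[9])  # first token of RAW_VALUE
    return values


def nonzero_counters(rows_by_drive):
    """Keep only drives and attributes whose raw counter is above zero."""
    report = {}
    for drive, text in rows_by_drive.items():
        flagged = {k: v for k, v in raw_values(text).items() if v > 0}
        if flagged:
            report[drive] = flagged
    return report


if __name__ == "__main__":
    for drive, flagged in nonzero_counters(SMART_ROWS).items():
        print(drive, flagged)
```

Two things stand out in these numbers. /dev/sdd is the only drive with UDMA CRC errors (11), a counter commonly associated with cabling, which fits the loose-cable remark in the first post. And /dev/sdf, which was not kicked from the array, actually has more pending sectors (19) than /dev/sdd (2), which matters because a rebuild has to read every remaining member in full.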
Also, there's probably no reason to completely reconfigure the array because of a failed device. Troubleshooting efforts should probably be focused on why the device has failed and not why the RAID array has disabled the failed device, since that's actually what it's meant to do.
I'll admit I'm having a lot of trouble following what's going on here. With a failed drive the RAID should continue working. I did see you mentioned something about running "create" again, and that would certainly cause data loss if it actually succeeded, but I think it requires lots of extra options to be convinced to actually run on physical devices with RAID superblocks.
My create did succeed - I made sure to run it with --assume-clean only, though, which I believe only changes the superblock and not any actual data on the drive. I have never seen it starting to sync in /proc/mdstat either, which I believe is a GOOD thing in this case.
One thing to still note: the slot numbers on the drives in my RAID are a bit wonky (skipping over 0 and 2). I still think this might be causing half of my woes, but I am uncertain.
Last edited by codemastermm; 01-08-2013 at 12:42 PM.
I'm having a lot of trouble following this, but it does sound a lot like a drive failed, the partition was no longer visible, RAID took it out of the array, and then it got forced back in with the partition still not visible. If you could be less vague about the order of events, it would really help.
I don't think the "slot numbers" matter (I'm not quite sure what you're talking about either), since the superblock records which drives get which stripes.
Also, whether create succeeded or failed, if there's a drive that's not good and you force it back in without checking it first, that's usually a very efficient way to lose data, and that may be what's going on here.
What's happened is I had a drive seemingly fail and get knocked out of the RAID. I tried to force it back in (via --create --assume-clean, kind of like the link you posted -> http://ubuntuforums.org/showpost.php...79&postcount=3). This way, I figured, it wouldn't be mucking with the data actually on the drives.
I launched gparted and it showed a broken partition (see: http://i.imgur.com/UA2CT.png), so I figured I'd attempt to have it fully repair itself. (Now that I think about it, perhaps I should have gone with that and run fsck instead?)
I then started reading around online, trying to find better information on how to approach this. I noticed, at that time, that my disk slots were a very odd numbering (md2 : active raid5 sdf3[5] sdg3[3] sde3[1] sdd3[4](F), notice they're disk 1, 3, 4, and 5; skipping over 0 and 2). I found a bit of information (http://serverfault.com/questions/447...re-raid-arrays) that spoke about creating the mdadm array and then adding the drives back in. I attempted that and was met with no partition at all, which is probably even worse. Thankfully, I again used --assume-clean, so that mdadm wouldn't start resyncing the drives.
My thoughts are now, maybe I should try going back to the point where my partition is somewhat shown and try a fsck?
I'd try assembling the array with all the drives, but keeping the failed drive failed, so it's running degraded, and then see if the partition on the RAID device is still present. (I'm still not sure which partition you're talking about not being visible, the physical disk partition or the RAID partition.) Then run fsck, and replace the drive that came out of the array. The slot numbering shouldn't matter since the striping information is determined by the superblock.
Oh, my apologies. I meant the RAID partition.
Upon assembling the drives, I still see the "unknown partition" (http://i.imgur.com/UA2CT.png), so would you recommend running fsck on that?
If so, it's starting to seem a bit more hopeful!!
I wouldn't recommend fsck on that unknown partition. It sounds like the underlying RAID is reading data incorrectly, probably because it's been reorganized. What does /proc/mdstat look like?
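The two /proc/mdstat snippets already posted in this thread give one hint on that point: md2 was 2185873920 blocks in the original mail and 2186264064 blocks after the re-create. A quick worked calculation, assuming mdstat's 1 KiB blocks and the n - 1 data members of a 4-device RAID5 (the helper name `per_member_kib` is made up for the example):

```python
def per_member_kib(array_blocks, raid_devices):
    """Data area per member in KiB for RAID5: array size / (n - 1)."""
    data_members = raid_devices - 1
    if array_blocks % data_members:
        raise ValueError("array size is not a multiple of the data member count")
    return array_blocks // data_members


if __name__ == "__main__":
    before = per_member_kib(2185873920, 4)  # from the mdadm mail
    after = per_member_kib(2186264064, 4)   # after --create --assume-clean
    diff = after - before
    print(before, after, diff, diff // 1024)
```

Each member contributes 130048 KiB (127 MiB) more to the re-created array than to the original one. One possible explanation, and only a guess from these numbers, is that the re-create placed the data area at a different offset on each partition than the original array used; if so, the filesystem would no longer start where md expects it. Comparing the "Data Offset" field of `mdadm --examine` on the members against any record of the original array would be a way to check.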