Check the /dev directory and see if the /dev/sdb1 device actually exists. If it doesn't, you'll need to recreate it with fdisk, parted or whatever tool you prefer to use to manage partitions.
If the device is missing but the partition seems to be there, try running partprobe then check the /dev directory again.
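The check described above could be scripted roughly like this. It's only a sketch: the function name check_partition_node is made up for illustration, and it reports what it finds without changing anything on the disk.

```shell
#!/bin/sh
# Hypothetical helper: report whether a partition device node exists.
# Usage: check_partition_node /dev/sdb1
check_partition_node() {
    node="$1"
    if [ -b "$node" ]; then
        echo "present: $node"
    else
        echo "missing: $node"
        return 1
    fi
}

# Example (as root; partprobe asks the kernel to re-read the partition table):
#   check_partition_node /dev/sdb1 || { partprobe /dev/sdb; check_partition_node /dev/sdb1; }
```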
The next step is to figure out why mdadm returns an error message when you try to reference /dev/sdb1. See what
Code:
mdadm --examine /dev/sdb1
has to say about that partition.
According to /proc/mdstat (in your first post), /dev/md0 only has one member, /dev/sda1. As long as the /dev/sdb1 partition is valid and identical in size to /dev/sda1 (which fdisk -l /dev/sdb or parted /dev/sdb print should be able to confirm or deny), you should be able to re-add /dev/sdb1 with the following command:
Code:
mdadm --manage /dev/md0 --add /dev/sdb1
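If you want to spot arrays with a missing member without reading /proc/mdstat by eye, something along these lines might help. It's a minimal sketch, assuming the usual mdstat layout where the status field looks like [UU] for a healthy mirror and [U_] when a member is missing; the function name degraded_arrays is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list md arrays whose status field (e.g. [U_]) shows a missing member.
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember which array is being described
        /\[[U_]+\]/ && name != "" {
            # the last field on the status line looks like [UU] or [U_]
            if ($NF ~ /_/) print name
            name = ""
        }
    '
}

# Example:
#   degraded_arrays < /proc/mdstat
```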
You may also want to check the health of /dev/sdb with:
Code:
smartctl -a /dev/sdb
In particular, examine the Reallocated_Sector_Ct and Current_Pending_Sector attributes. There has to be a reason why the partition was dropped from the RAID device.
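To pull out just those two counters (attribute IDs 5 and 197), a small filter like the one below could be used. This is a sketch that assumes the ten-column attribute table smartctl normally prints, with the raw value in the last column; smart_raw is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print the raw value of one SMART attribute, selected by ID.
# Reads smartctl attribute rows on stdin; rows with fewer than 10 columns are ignored.
smart_raw() {
    awk -v id="$1" 'NF >= 10 && $1 == id { print $10 }'
}

# Example (as root):
#   smartctl -A /dev/sdb | smart_raw 5      # reallocated sectors
#   smartctl -A /dev/sdb | smart_raw 197    # pending sectors
```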
parted /dev/sda print:
Code:
Model: ATA ST3000VX000-9YW1 (scsi)
Disk /dev/sda: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number  Start   End     Size    File system  Name  Flags
 1      1049kB  1500GB  1500GB  ext4               raid
 2      1500GB  1525GB  24,6GB                     raid
 3      1525GB  3001GB  1476GB                     raid
parted /dev/sdb print:
Code:
Model: ATA ST3000VX000-9YW1 (scsi)
Disk /dev/sdb: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number  Start   End     Size    File system  Name  Flags
 1      1049kB  1500GB  1500GB                     raid
 2      1500GB  1525GB  24,6GB                     raid
 3      1525GB  3001GB  1476GB                     raid
mdadm --manage /dev/md0 --add /dev/sdb1:
Code:
mdadm: add new device failed for /dev/sdb1 as 2: Invalid argument
smartctl -a /dev/sdb:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: ST3000VX000-9YW166
Serial Number: W1F0VJ95
LU WWN Device Id: 5 000c50 052d36854
Firmware Version: CV13
User Capacity: 3 000 592 982 016 bytes [3,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Nov 14 11:17:32 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 584) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10b9) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       34112212
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       189255078
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8951
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       97
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   032   032   000    Old_age   Always       -       68
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       12885098499
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       264
190 Airflow_Temperature_Cel 0x0022   063   059   045    Old_age   Always       -       37 (Min/Max 34/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       89
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1191
194 Temperature_Celsius     0x0022   037   041   000    Old_age   Always       -       37 (0 16 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       15
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
ATA Error Count: 3338 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 3338 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 09 08 00 00 Error: UNC at LBA = 0x00000809 = 2057
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 08 00 e0 00 7d+02:46:03.171 READ DMA
27 00 00 00 00 00 e0 00 7d+02:46:03.159 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 7d+02:46:03.151 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 7d+02:46:03.103 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 7d+02:46:03.087 READ NATIVE MAX ADDRESS EXT
Error 3337 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 09 08 00 00 Error: UNC at LBA = 0x00000809 = 2057
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 08 00 e0 00 7d+02:46:03.171 READ DMA
27 00 00 00 00 00 e0 00 7d+02:46:03.159 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 7d+02:46:03.151 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 7d+02:46:03.103 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 7d+02:46:03.087 READ NATIVE MAX ADDRESS EXT
Error 3336 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 09 08 00 00 Error: UNC at LBA = 0x00000809 = 2057
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 08 00 e0 00 7d+02:46:02.819 READ DMA
27 00 00 00 00 00 e0 00 7d+02:46:02.807 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 7d+02:46:02.799 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 7d+02:46:02.727 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 7d+02:46:02.707 READ NATIVE MAX ADDRESS EXT
Error 3335 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 09 08 00 00 Error: UNC at LBA = 0x00000809 = 2057
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 08 00 e0 00 7d+02:46:02.819 READ DMA
27 00 00 00 00 00 e0 00 7d+02:46:02.807 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 7d+02:46:02.799 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 7d+02:46:02.727 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 7d+02:46:02.707 READ NATIVE MAX ADDRESS EXT
Error 3334 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 09 08 00 00 Error: UNC at LBA = 0x00000809 = 2057
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 08 00 e0 00 7d+02:46:02.436 READ DMA
27 00 00 00 00 00 e0 00 7d+02:46:02.435 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 7d+02:46:02.427 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 7d+02:46:02.371 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 7d+02:46:02.363 READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 8933 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The /dev/sdb device has 15 "pending" sectors, meaning it's waiting for a write command to reallocate those sectors. While 15 is not an alarmingly large number, the fact that they're all "pending" rather than "reallocated" suggests the defects may have appeared at approximately the same time, which could be an indication of drive failure. You should run badblocks -ns on /dev/sdb1 before proceeding, and check the S.M.A.R.T. status for /dev/sdb again when it's done.
The "invalid argument" error is usually caused by a non-removed device. The "--add" command is only valid if the array is online and can be expanded, or if a device has been removed. However, the output from mdadm --detail /dev/md0 in post #8 does indeed show the second device as "removed". Strange.
Could you post the output from:
Code:
ls /sys/block/md0/md/
Also, do any messages appear in the logs when you try to add back /dev/sdb1 to the array?
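For a more compact view than a plain ls, the member entries under that sysfs directory can be summarised as sketched below. This assumes the kernel's usual layout, where each member has a dev-NAME subdirectory containing slot and state files; the function name md_members is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the member devices of an md array with their slot and state.
# $1 is the md directory of the array, e.g. /sys/block/md0/md
md_members() {
    for d in "$1"/dev-*; do
        [ -d "$d" ] || continue     # no members, or the glob did not match anything
        printf '%s slot=%s state=%s\n' \
            "${d##*/dev-}" "$(cat "$d/slot")" "$(cat "$d/state")"
    done
}

# Example:
#   md_members /sys/block/md0/md
```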
Do a tail -f /var/log/messages in one terminal window while you attempt to add /dev/sdb1 to md0 in another.
The files in /sys/block/md0/md confirm that there's no reference from md0 to anything other than /dev/sda1. It should be possible to add another device/partition.
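The name of the log file varies between distributions, so for the tail -f step a small helper that picks the first log that actually exists may save some guessing. This is just a sketch; pick_kernel_log is a made-up name and the candidate paths in the example are common defaults, not a complete list.

```shell
#!/bin/sh
# Hypothetical helper: print the first readable file from a list of candidate logs.
pick_kernel_log() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Example (run in one terminal, then attempt the --add in another):
#   log=$(pick_kernel_log /var/log/messages /var/log/syslog /var/log/kern.log) && tail -f "$log"
```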
I don't have a /var/log/messages, but I did do a tail on the syslog, and it showed the following while trying to add the partition back to md0:
Code:
Nov 15 08:38:25 lia kernel: [674827.954967] ata1: EH complete
Nov 15 08:38:25 lia kernel: [674828.187410] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 15 08:38:25 lia kernel: [674828.187416] ata1.01: failed command: READ DMA
Nov 15 08:38:25 lia kernel: [674828.187422] ata1.01: cmd c8/00:08:08:08:00/00:00:00:00:00/f0 tag 0 dma 4096 in
Nov 15 08:38:25 lia kernel: [674828.187424] res 51/40:00:09:08:00/00:00:00:00:00/10 Emask 0x9 (media error)
Nov 15 08:38:25 lia kernel: [674828.187427] ata1.01: status: { DRDY ERR }
Nov 15 08:38:25 lia kernel: [674828.187430] ata1.01: error: { UNC }
Nov 15 08:38:25 lia kernel: [674828.242074] ata1.00: configured for UDMA/133
It seems the md driver ran into one of the bad sectors on the drive. If you can't run badblocks, try using dd to overwrite the partition with zeros:
Code:
dd if=/dev/zero of=/dev/sdb1 bs=8192 oflag=direct
That should trigger a reallocation of any bad sectors.
The "oflag=direct" parameter bypasses the cache, and has the effect of slowing the process down significantly. With any luck, the other users won't notice anything. The real reason it's there, however, is to prevent cache management from doing read-ahead, as that would cause it to attempt to read the bad sectors, which in turn would cause dd to abort.
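To see where the failing sector sits relative to the partition, the LBA from the S.M.A.R.T. error log (2057) can be compared with the start of /dev/sdb1. The sketch below assumes the partition starts at sector 2048, which is what parted's rounded "1049kB" usually corresponds to with 512-byte logical sectors; parted /dev/sdb unit s print should show the exact value. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: position of a whole-disk LBA inside a partition.
# $1 = failing LBA on the disk, $2 = first sector of the partition,
# $3 = logical sector size in bytes
lba_offset_in_partition() {
    sectors=$(( $1 - $2 ))
    echo "sector $sectors, byte $(( sectors * $3 ))"
}

# Example with the numbers from this thread (start sector 2048 is an assumption):
#   lba_offset_in_partition 2057 2048 512
```

If the start sector really is 2048, the unreadable sector is only a few KiB into the partition. Version 1.2 md metadata is stored 4 KiB from the start of the member device, so the bad sector may well sit in the metadata area, which would fit the failed READ DMA in the log.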
Thank you! I'll do that now. Once it's done, is there anything specific I need to do BEFORE trying to --add the sdb1 partition to md0 again?
I'd check the S.M.A.R.T. status again. The Current_Pending_Sector counter should show a number lower than 15 (0, ideally).
Other than that, there's nothing in particular you need to consider before attempting to add the partition to the RAID array again.
Thank you so much. After the process completed, there were 0 pending sectors. I then successfully re-added sdb1 to md0, and it is now busy with recovery!
I just hope the recovery process completes without any issues. I'll let you know!
One thing that strikes me as a bit weird though: in all the other arrays, the disks have IDs 0 and 1. But on md0, sda1 has ID 0, and the re-added sdb1 has ID 2, not ID 1. Does that make a difference?
Output of cat /proc/mdstat:
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2] sda1[0]
1464710976 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 4.3% (63596480/1464710976) finish=318.5min speed=73315K/sec
md1 : active raid1 sda2[0] sdb2[1]
24006528 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
1441268544 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sdc1[0] sdd1[1]
2930133824 blocks super 1.2 [2/2] [UU]
md4 : active raid1 sdf2[1] sde2[0]
2929939264 blocks super 1.2 [2/2] [UU]
unused devices: <none>
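To keep an eye on the rebuild without rereading the whole file, the progress figure can be extracted from the mdstat output as sketched here. It assumes the "recovery = N%" (or "resync = N%") line format shown above; recovery_progress is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print "<array> <percent>" for every array that is rebuilding.
# Reads /proc/mdstat-formatted text on stdin.
recovery_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember which array is being described
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) {
                    v = $i
                    sub(/^=/, "", v)       # the "=" can be glued to the number
                    print name, v
                }
        }
    '
}

# Example:
#   recovery_progress < /proc/mdstat
```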