LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
-   -   RAID1: can't replace faulty spare (marked again as 'faulty spare' within seconds) (https://www.linuxquestions.org/questions/linux-general-1/raid1-cant-replace-faulty-spare-marked-again-as-faulty-spare-within-seconds-4175483746/)

Thambry 11-07-2013 10:19 AM

RAID1: can't replace faulty spare (marked again as 'faulty spare' within seconds)
 
I have a problem that I cannot solve: our fileserver runs Xubuntu and 3 RAID1 arrays. One of them, consisting of sdb and sdc, has had a problem since Monday: sdb was marked as faulty by mdadm for unknown reasons. I used --remove to take it out of the RAID and then --add to put it back. All was fine and re-syncing started, but it never got above 0%, and after a few seconds sdb was again marked as a 'faulty spare' (leaving the RAID degraded, but clean).
So I saved the first 512 bytes of the old sdb to a file, bought a new HDD of the same size (4TB), shut down the computer, replaced sdb physically, switched the computer back on, and wrote the 512 bytes back to the new drive so it would have the same partition info as the old drive (both are the same type, from the same company). But the new drive shows the same behaviour as the old one: I can add it, re-syncing starts, and after a few seconds it is marked as a 'faulty spare'.
Here is exactly what I did:
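The save-and-restore of the first sector boils down to two dd calls. Here is a minimal simulation of that procedure using image files as stand-ins for the real drives (on actual disks these commands are destructive and must be run as root against /dev/sdX devices):

```shell
# Stand-ins for the old and new drives (image files, not real devices)
dd if=/dev/urandom of=old-disk.img bs=512 count=8 status=none
dd if=/dev/zero    of=new-disk.img bs=512 count=8 status=none

# Save the first 512 bytes (the MBR / partition-table sector) of the "old drive"
dd if=old-disk.img of=mbr.bin bs=512 count=1 status=none

# Write the saved sector onto the "new drive" without truncating it
dd if=mbr.bin of=new-disk.img bs=512 count=1 conv=notrunc status=none

# The first sectors now match
cmp -n 512 old-disk.img new-disk.img && echo "first sector matches"
```

Note that this copies only sector 0; on a GPT-partitioned drive (which any drive above 2TB must be) most of the partition table lives beyond the first 512 bytes.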

mdadm --remove /dev/md/1 /dev/sdb
mdadm --detail /dev/md/1 gives me:

/dev/md/1:
Version : 1.2
Creation Time : Sat Jun 8 22:32:05 2013
Raid Level : raid1
Array Size : 3906887360 (3725.90 GiB 4000.65 GB)
Used Dev Size : 3906887360 (3725.90 GiB 4000.65 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent

Update Time : Thu Nov 7 06:56:13 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0

Name : File-Server:1 (local to host File-Server)
UUID : 44ed561f:b733e946:e69820f4:aba9b223
Events : 2424

Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 32 1 active sync /dev/sdc

mdadm --add /dev/md/1 /dev/sdb
mdadm --detail /dev/md/1 gives me:

/dev/md/1:
Version : 1.2
Creation Time : Sat Jun 8 22:32:05 2013
Raid Level : raid1
Array Size : 3906887360 (3725.90 GiB 4000.65 GB)
Used Dev Size : 3906887360 (3725.90 GiB 4000.65 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Update Time : Thu Nov 7 06:57:49 2013
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

Rebuild Status : 0% complete

Name : File-Server:1 (local to host File-Server)
UUID : 44ed561f:b733e946:e69820f4:aba9b223
Events : 2431

Number Major Minor RaidDevice State
2 8 16 0 faulty spare rebuilding /dev/sdb
1 8 32 1 active sync /dev/sdc

and after a few seconds:
/dev/md/1:
Version : 1.2
Creation Time : Sat Jun 8 22:32:05 2013
Raid Level : raid1
Array Size : 3906887360 (3725.90 GiB 4000.65 GB)
Used Dev Size : 3906887360 (3725.90 GiB 4000.65 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Update Time : Thu Nov 7 06:57:50 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

Name : File-Server:1 (local to host File-Server)
UUID : 44ed561f:b733e946:e69820f4:aba9b223
Events : 2436

Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 32 1 active sync /dev/sdc
2 8 16 - faulty spare /dev/sdb


The same behaviour occurs if I zero the superblock (mdadm --zero-superblock /dev/sdb) before adding sdb.
I run all commands as root, and the system holds three more 4TB drives, i.e. the mainboard can handle drives of this size. The old hard drive was checked for errors using badblocks, and it came back clean.

Does anybody have any idea what the problem is?

Ser Olmy 11-11-2013 10:21 AM

Have you ruled out the controller as a possible source?

Any error messages in /var/log/messages or /var/log/syslog?
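When md kicks a member, the kernel normally logs the reason. A sketch of where to look — shown here against a hypothetical log excerpt in the kernel's md message format, since the real machine's logs aren't available:

```shell
# Hypothetical syslog excerpt (illustrative; the real lines may differ)
cat > sample-syslog.txt <<'EOF'
Nov  7 06:57:49 File-Server kernel: md/raid1:md1: Disk failure on sdb, disabling device.
Nov  7 06:57:49 File-Server kernel: md/raid1:md1: Operation continuing on 1 devices.
EOF

# On the live system, the equivalent searches would be:
#   dmesg | grep -iE 'sdb|md1|ata'
#   grep -iE 'sdb|md1' /var/log/syslog
grep -iE 'sdb|md1' sample-syslog.txt
```

Checking SMART data with smartctl -a /dev/sdb (from smartmontools) would also show whether the drive itself reports errors, which helps separate a drive fault from a controller or cable fault.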

Thambry 11-14-2013 07:31 AM

In the meantime I (kind of) figured it out:
For unknown reasons, the original drive got kicked out of the RAID and its UUID apparently blacklisted. Because of that I could not even format it in the fileserver. By copying the first 512 bytes to the new drive, I probably transferred the blacklisted UUID to the new drive as well, causing it to be kicked out again.
Only when I formatted the old drive on a different computer (and checked it with badblocks for errors) and put it back into the fileserver (without copying the 512 bytes onto it) did everything work fine, and I was able to add it back to the RAID without any hiccups.
Copying the 512 bytes definitely caused the problems. I had read about that trick in some posts, but it probably referred to old HDDs below 2TB in size (with an MBR instead of a GUID partition table). In any case, I will never do that again.
Does this explanation make sense?


All times are GMT -5.