RAID1: can't replace faulty spare (marked again as 'faulty spare' within seconds)
I got a problem that I cannot solve: Our fileserver runs XUbuntu and 3 RAID1s. One has a problem since monday: it consists of sdb and sdc. sdb was marked as faulty by mdadm for unknown reasons. I used --remove to remove it from the RAID and then to add it by --add. All was fine, re-syncing started but never got above 0% and after a few seconds, sdb was again marked as 'faulty spare' (and therefore the RAID degraded, but clean).
So I saved the first 512 byte of the old sdb to a file, bought a new HDD of same size (4TB), shut down the computer and replaced sdb physically, switched the computer back on and wrote the 512 byte back to the new drive to have the same partition info as the old drive (both are the same type, from same company). But the new drive shows the same behaviour as the old: I can add, re-syncing starts and after a few seconds its marked as 'faulty spare'. Here exactly what i did: mdadm --remove /dev/md/1 /dev/sdb maadm --detail /dev/md/1 gives me: /dev/md/1: Version : 1.2 Creation Time : Sat Jun 8 22:32:05 2013 Raid Level : raid1 Array Size : 3906887360 (3725.90 GiB 4000.65 GB) Used Dev Size : 3906887360 (3725.90 GiB 4000.65 GB) Raid Devices : 2 Total Devices : 1 Persistence : Superblock is persistent Update Time : Thu Nov 7 06:56:13 2013 State : clean, degraded Active Devices : 1 Working Devices : 1 Failed Devices : 0 Spare Devices : 0 Name : File-Server:1 (local to host File-Server) UUID : 44ed561f:b733e946:e69820f4:aba9b223 Events : 2424 Number Major Minor RaidDevice State 0 0 0 0 removed 1 8 32 1 active sync /dev/sdc mdadm --add /dev/md/1 /dev/sdb mdadm --detail /dev/md/1 gives me: Version : 1.2 Creation Time : Sat Jun 8 22:32:05 2013 Raid Level : raid1 Array Size : 3906887360 (3725.90 GiB 4000.65 GB) Used Dev Size : 3906887360 (3725.90 GiB 4000.65 GB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Update Time : Thu Nov 7 06:57:49 2013 State : clean, degraded, recovering Active Devices : 1 Working Devices : 1 Failed Devices : 1 Spare Devices : 0 Rebuild Status : 0% complete Name : File-Server:1 (local to host File-Server) UUID : 44ed561f:b733e946:e69820f4:aba9b223 Events : 2431 Number Major Minor RaidDevice State 2 8 16 0 faulty spare rebuilding /dev/sdb 1 8 32 1 active sync /dev/sdc and after a few seconds: /dev/md/1: Version : 1.2 Creation Time : Sat Jun 8 22:32:05 2013 Raid Level : raid1 Array Size : 3906887360 (3725.90 GiB 4000.65 GB) Used Dev Size : 3906887360 (3725.90 GiB 4000.65 GB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Update Time : Thu Nov 7 06:57:50 2013 State : clean, degraded Active Devices : 1 Working Devices : 1 Failed Devices : 1 Spare Devices : 0 Name : File-Server:1 (local to host File-Server) UUID : 44ed561f:b733e946:e69820f4:aba9b223 Events : 2436 Number Major Minor RaidDevice State 0 0 0 0 removed 1 8 32 1 active sync /dev/sdc 2 8 16 - faulty spare /dev/sdb same behaviour if I zero the superblock (mdadm --zero-superblock /dev/sdb) before adding sdb. I do all commands as root and the system holds 3 more 4TB drives, ie the mainboard can handle them. The old harddrive was checked for errors using badblocks, but all is fine. Does anybody have any idea, what the problem is? |
Have you ruled out the controller as a possible source?
Any error messages in /var/log/messages or /var/log/syslog? |
In the meantime I (kind of) figured it out:
for unknows reasons, the original drive got kicked out of the RAID and its UUID blacklisted. Therefore I could not format it in the fileserver. By copying the first 512 bytes to the new drive, i probably transfered the blacklisted UUID to the new drive also causing it to be kicked out again. Only when I formated the old drive on a different computer (and checked it with badblocks for errors) and put it back in the fileserver (without copying the 512 bytes on it), it worked fine and i was able to add it back to the RAID without any hickups. Copying the 512 bytes for sure caused problems - i read it in some posts but this probably referred to old HDDs below 2TB size (FAT instad of GUID table). Fact is that i will never do that again Does this explanation make sense? |
All times are GMT -5. The time now is 08:40 PM. |