help interpreting MDADM readouts

DJCF · 04-11-2006, 08:41 AM

Hi all,

(Is this in the right place? Is it hardware or software related? Also, it's a repost because I posted it in the hardware section, but no replies.)

I have a Fedora Core 3 home server with three 320GB hard drives, which are fairly new -- only a month or so old. They are in a RAID-5 array, with partitions like this:

/dev/hda1 5GB root partition mounted as /
/dev/hdd1 and /dev/hdc1 are an 8GB Logical Volume Group thingumy mounted as /tmp
/dev/hda2 /dev/hdc3 and /dev/hdd3 are the 630 GB RAID-5 array mounted as /home

(There are some swap partitions there too, and some space is lost due to filesystem inefficiencies.)

This is all well and good, but late last night one of the hard drives (hdd, the secondary slave) started making loud clicking sounds at fairly regular intervals, about once a minute or so. Catting /proc/mdstat revealed one of the drives was faulty, but there was nothing I could do about it until today. I turned off the server, reseated the cables etc., turned it back on, and the drive wasn't recognised by the BIOS. I put the drive into my own workstation and it "click"ed on startup, but was recognised by both the BIOS and by Suse (though I didn't try to mount it -- obviously). So I put it back into the server and the BIOS recognised it OK, and it's been running for an hour or so with no clicking sounds. I don't think all is well, however; perhaps you guys can help me make sense of these RAID readouts?

# cat /proc/mdstat
Personalities : [raid5]
md0 : active raid5 hdc3[1] hda2[0]
614903808 blocks level 5, 256k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>

There seem to be only hdc3 and hda2 in this array -- no sign of hdd3. And what does [3/2] mean? Shouldn't it be [2/3], because it is two out of three drives?

# mdadm --examine /dev/md0
mdadm: No super block found on /dev/md0 (Expected magic a92b4efc, got 00000000)

What exactly does this mean?

# mdadm --query /dev/md0
/dev/md0:
Version : 00.90.01
Creation Time : Wed Feb 15 14:35:22 2006
Raid Level : raid5
Array Size : 614903808 (586.42 GiB 629.66 GB)
Device Size : 307451904 (293.21 GiB 314.83 GB)
Raid Devices : 3
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Fri Mar 3 14:39:02 2006
State : clean, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 256K

Number Major Minor RaidDevice State
0 3 2 0 active sync /dev/hda2
1 22 3 1 active sync /dev/hdc3
2 0 0 -1 removed
UUID : 8c8f0f62:9e69e701:409df450:89adf2fb
Events : 0.136964

This suggests to me that there are two drives in the array, not three -- we're missing HDD, right?

# mdadm --examine /dev/hda2

/dev/hda2:
Magic : a92b4efc
Version : 00.90.00
UUID : 8c8f0f62:9e69e701:409df450:89adf2fb
Creation Time : Wed Feb 15 14:35:22 2006
Raid Level : raid5
Device Size : 307451904 (293.21 GiB 314.83 GB)
Raid Devices : 3
Total Devices : 2
Preferred Minor : 0

Update Time : Fri Mar 3 14:39:24 2006
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Checksum : 38c744bb - correct
Events : 0.136972

Layout : left-symmetric
Chunk Size : 256K

Number Major Minor RaidDevice State
this 0 3 2 0 active sync /dev/hda2
0 0 3 2 0 active sync /dev/hda2
1 1 22 3 1 active sync /dev/hdc3
2 2 0 0 2 faulty removed

Why does /dev/hda2 appear in that list twice? Surely it should only appear once? And again, we're missing HDD, right?

# mdadm --examine /dev/hdc3
/dev/hdc3:
Magic : a92b4efc
Version : 00.90.00
UUID : 8c8f0f62:9e69e701:409df450:89adf2fb
Creation Time : Wed Feb 15 14:35:22 2006
Raid Level : raid5
Device Size : 307451904 (293.21 GiB 314.83 GB)
Raid Devices : 3
Total Devices : 2
Preferred Minor : 0

Update Time : Fri Mar 3 14:39:36 2006
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Checksum : 38c744e9 - correct
Events : 0.136978

Layout : left-symmetric
Chunk Size : 256K

Number Major Minor RaidDevice State
this 1 22 3 1 active sync /dev/hdc3
0 0 3 2 0 active sync /dev/hda2
1 1 22 3 1 active sync /dev/hdc3
2 2 0 0 2 faulty removed

Again, hdc3 appears twice (why?) and there is no sign of HDD.

Let's have a look for HDD...

# mdadm --examine /dev/hdd3
/dev/hdd3:
Magic : a92b4efc
Version : 00.90.00
UUID : 8c8f0f62:9e69e701:409df450:89adf2fb
Creation Time : Wed Feb 15 14:35:22 2006
Raid Level : raid5
Device Size : 307451904 (293.21 GiB 314.83 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0

Update Time : Thu Mar 2 18:30:44 2006
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Checksum : 38c60fdd - correct
Events : 0.133609

Layout : left-symmetric
Chunk Size : 256K

Number Major Minor RaidDevice State
this 2 22 67 2 active sync /dev/hdd3
0 0 3 2 0 active sync /dev/hda2
1 1 22 3 1 active sync /dev/hdc3
2 2 22 67 2 active sync /dev/hdd3

Now this is strange: we now have hdd listed twice, along with the others.

What exactly is going on here? Is it, as I think, that I'm running on only two drives? If so, how can I make the array reintegrate hdd3? Or if I'm wrong and everything is OK, where have I gone wrong in interpreting the readouts?

Cheers,

Daniel

Emmanuel_uk · 04-20-2006, 04:50 AM

UU_
I think it means one HD is not part of the raid anymore
confirmed by
State : clean, degraded
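
For anyone puzzling over the same line: in /proc/mdstat the bracketed pair reads [devices the array should have / devices currently working], so [3/2] is three expected and two up. Each U or _ in the second bracket is one slot, in order. Here is a minimal Python sketch of reading it. The function name parse_md_status is made up for illustration, and the status line is copied from the first post:

```python
import re

def parse_md_status(line):
    """Parse the '[n/m] [UU_]' tail of a /proc/mdstat status line."""
    match = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
    if match is None:
        raise ValueError("no [n/m] [UU_] status found")
    expected = int(match.group(1))   # devices the array is defined with
    working = int(match.group(2))    # devices currently in use
    flags = match.group(3)           # one character per slot: U = up, _ = missing
    missing_slots = [i for i, flag in enumerate(flags) if flag == "_"]
    return expected, working, missing_slots

line = "614903808 blocks level 5, 256k chunk, algorithm 2 [3/2] [UU_]"
expected, working, missing_slots = parse_md_status(line)
print(f"{working} of {expected} members up, missing slots: {missing_slots}")
# prints: 2 of 3 members up, missing slots: [2]
```

Slot 2 is the RaidDevice number that the --examine output of /dev/hdd3 reports for itself, which fits with hdd3 being the one that dropped out.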

AFAIK a three-disk raid5 can keep working with only 2 HD; that is the whole point of it
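
The reason two disks are enough can be sketched with XOR parity. This is a toy model, not how md lays data out on disk: the real array rotates parity across the members (the "left-symmetric" layout in the readouts), and the block contents below are invented. The device size is the "Device Size" figure from the first post.

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a three-disk array: two data blocks plus their parity.
data_a = b"home"                       # hypothetical block on the first disk
data_b = b"data"                       # hypothetical block on the second disk
parity = xor_blocks(data_a, data_b)    # block on the third disk

# Lose any one disk: XOR of the two survivors gives back the missing block.
assert xor_blocks(data_a, parity) == data_b
assert xor_blocks(data_b, parity) == data_a

# Usable capacity is (members - 1) * device size, in 1 KiB blocks.
device_kib = 307451904
raid_devices = 3
usable_kib = (raid_devices - 1) * device_kib
print(usable_kib)  # 614903808, the "Array Size" shown by mdadm
```

The catch is that a degraded array has no redundancy left: if a second disk fails before the rebuild finishes, there is nothing left to XOR against.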

Time to do some backups, buy a new HD, and "rebuild" the array

man mdadm (I have never rebuilt an array)

DJCF · 04-20-2006, 06:37 AM

Cheers for the help, looks like I have some work to do.

As I understand it, the third hard disk should be working fine physically, just not as part of the array. So I'll have to reintegrate it somehow. The persistent superblock (am I right?) will still be there, which will hamper my attempts to reintegrate it the "normal" way (tutorials, man pages, etc.)
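
On the superblocks themselves: in --examine output the "this" row describes the partition being examined, and the numbered rows list every slot as that superblock last recorded it, which is why each device seems to appear twice. The Events counter shows which copy is out of date. Below is a rough Python sketch comparing the three counters from the first post. The helper name events_counter is invented, and treating the "hi.lo" value as two 32-bit halves of one counter is my reading of the 0.90 superblock format, so take it as an assumption:

```python
def events_counter(text):
    """Return the Events value from `mdadm --examine` text as one integer."""
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Events":
            hi, lo = (int(part) for part in value.strip().split("."))
            # Assumption: 0.90 superblocks print a 64-bit counter as hi.lo
            return (hi << 32) | lo
    raise ValueError("no Events line found")

# Excerpts copied from the --examine readouts in the first post.
examine_excerpts = {
    "/dev/hda2": "Update Time : Fri Mar 3 14:39:24 2006\nEvents : 0.136972",
    "/dev/hdc3": "Update Time : Fri Mar 3 14:39:36 2006\nEvents : 0.136978",
    "/dev/hdd3": "Update Time : Thu Mar 2 18:30:44 2006\nEvents : 0.133609",
}

counters = {dev: events_counter(text) for dev, text in examine_excerpts.items()}
newest = max(counters.values())
lag = {dev: newest - count for dev, count in counters.items()}

for dev in sorted(lag):
    print(f"{dev}: {lag[dev]} events behind")
# prints:
# /dev/hda2: 6 events behind
# /dev/hdc3: 0 events behind
# /dev/hdd3: 3369 events behind
```

The small gap between hda2 and hdc3 is just the live array ticking over between the two commands. hdd3 is thousands of events behind and its superblock was last written on Thu Mar 2, so it still describes the array as it was before the drive dropped out (3 active devices, none failed). That is why its readout looks healthy while the other two list slot 2 as "faulty removed".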

Cheers for your help,

Daniel

Emmanuel_uk · 04-20-2006, 06:49 AM

If the drive makes noises and your server is critical,
then why would you put that faulty drive back?
If you try to add that faulty drive back to the array, I do not know what can happen
I suppose a new drive is needed

I played only with raid 0

Clicking sound after 1 month: send it back for a refund (3 yr warranty)

DJCF · 04-20-2006, 07:09 AM

I think it's a 5 year warranty actually, so very cool! (It's not even 5 months old yet.)

It was clicking, but after restarting the server and plugging the hard drive back in, the clicking stopped and it seems to be working fine now. (Both Suse and Fedora can see it in /dev, and I can query it using mdadm.) So I was planning to try and simply re-add it. Good idea, or do you think I should send it back? If I send it back, wouldn't they most likely plug it into a test computer, discover that it "works" (no clicking, recognised by the OS and the BIOS) and send it back to me?

Cheers,

Daniel

Emmanuel_uk · 04-20-2006, 07:21 AM

You can install smartmontools and look into
the life parameters of the HD
(having said that, only very recent kernels may support SMART on SATA)

http://smartmontools.sourceforge.net/
The vendor will accept just a printout

There is probably a Windows utility from the vendor to access SMART data

Be frank with the vendor and tell them that the noise stopped on putting it back

Using it again: how much is it worth, losing all your data?
You know it is partly faulty... why do you want to use it again?

Batch effect:
check that the serial numbers are not consecutive
if one HD from a batch failed, maybe the others will too

raid is no replacement for backups