Hi,
I have a desktop server with 3 x 1TB disks, partitioned, with some partitions in RAID 1 and others in RAID 5.
eg:
Code:
md0 : active raid1 sdc2[2](S) sdb2[1] sda2[0]
513984 blocks [2/2] [UU]
md1 : active raid5 sdc3[2] sdb3[1] sda3[0]
1595463168 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]
I have a cron job set up so that every week I receive a report on the RAID status and the disk health (using mdadm --detail and smartctl --test=short). Neither has ever reported any problems:
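For completeness, the weekly report comes from something like this (a sketch written from memory, not the exact script; the device names match my setup above):

```shell
#!/bin/sh
# Weekly RAID/SMART report - cron mails whatever this prints.

# Array status for both md devices:
for md in /dev/md0 /dev/md1; do
    mdadm --detail "$md"
done

# Kick off a short SMART self-test on each physical disk.
# The test runs in the background on the drive itself; the result
# shows up later in the self-test log (smartctl -l selftest /dev/sda).
for disk in /dev/sda /dev/sdb /dev/sdc; do
    smartctl --test=short "$disk"
done
```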
example output:
Code:
/dev/md0:
Version : 0.90
Creation Time : Tue Nov 17 19:18:18 2009
Raid Level : raid1
Array Size : 513984 (502.02 MiB 526.32 MB)
Used Dev Size : 513984 (502.02 MiB 526.32 MB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Jan 11 10:20:28 2010
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1
UUID : 514a3687:430c809d:8d977509:67cfc75f
Events : 0.30
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
2 8 34 - spare /dev/sdc2
/dev/md1:
Version : 0.90
Creation Time : Tue Nov 17 19:16:53 2009
Raid Level : raid5
Array Size : 1595463168 (1521.55 GiB 1633.75 GB)
Used Dev Size : 797731584 (760.78 GiB 816.88 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 8
Persistence : Superblock is persistent
Update Time : Mon Jan 11 13:51:35 2010
State : active
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 256K
UUID : 1557ef2c:c9c293be:e394a98f:485db1ea
Events : 0.1601
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
2 8 35 2 active sync /dev/sdc3
SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error        00%             1116  -
Yet, twice now, I have caught one of the disks being rebuilt. I was not doing anything to the partitions or RAID devices at those times. The arrays resynced fine automatically, and the disk health report was fine afterwards. I do not know why this is happening - any suggestions? Or any suggestions on how to find out? I have checked my shell history to make sure I didn't inadvertently do something stupid just beforehand (the first time only - the second time I was not present when the resync started, and no one was doing anything on the machine apart from using the NFS files it serves to 5 clients). As far as I can see, there is nothing in /var/log/messages that hints at why this occurred (although forgive me if I am wrong - I have no idea what a lot of these messages mean...).
For example, today the log shows a normal boot (the server had to be shut down over the weekend), and then about an hour later (in a new /var/log/messages file):
Code:
Jan 11 10:15:03 localhost syslogd 1.4.1: restart.
Jan 11 10:20:19 localhost kernel: md: syncing RAID array md0
Jan 11 10:20:19 localhost kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Jan 11 10:20:19 localhost kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Jan 11 10:20:19 localhost kernel: md: using 128k window, over a total of 513984 blocks.
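In case it's relevant, these are the checks I've been running to watch the arrays while this goes on (the md names and log path are as on my system; they may differ elsewhere):

```shell
# What the kernel is currently doing with each array:
cat /proc/mdstat

# Per-array sync state - "idle" when nothing is running,
# otherwise "resync", "recover", "check", etc.:
cat /sys/block/md0/md/sync_action
cat /sys/block/md1/md/sync_action

# Pull out md/raid-related kernel messages around the resync:
grep -i 'md[01]\|raid' /var/log/messages
```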
I would really like to get to the bottom of this... I don't know what is making it happen!
Any light shed on this would be greatly appreciated...
Thanks!