why is my RAID device resyncing sporadically?

ascfu · 01-11-2010, 08:42 AM

Hi,

I have a desktop server set up with 3 x 1Tb disks set up in a number of partitions. Some are under RAID 1, others under RAID 5.

eg:

Code:

md0 : active raid1 sdc2[2](S) sdb2[1] sda2[0]
      513984 blocks [2/2] [UU]
      
md1 : active raid5 sdc3[2] sdb3[1] sda3[0]
      1595463168 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]

I have a cron set up so that every week I recieve a report on the RAID status and the disk health (using mdadm --details and smartctl --test=short ). Never has there been reported any problems with either of these:

example output:

Code:

/dev/md0:
        Version : 0.90
  Creation Time : Tue Nov 17 19:18:18 2009
     Raid Level : raid1
     Array Size : 513984 (502.02 MiB 526.32 MB)
  Used Dev Size : 513984 (502.02 MiB 526.32 MB)
   Raid Devices : 2
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Jan 11 10:20:28 2010
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1

           UUID : 514a3687:430c809d:8d977509:67cfc75f
         Events : 0.30

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2

       2       8       34        -      spare   /dev/sdc2


/dev/md1:
        Version : 0.90
  Creation Time : Tue Nov 17 19:16:53 2009
     Raid Level : raid5
     Array Size : 1595463168 (1521.55 GiB 1633.75 GB)
  Used Dev Size : 797731584 (760.78 GiB 816.88 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 8
    Persistence : Superblock is persistent

    Update Time : Mon Jan 11 13:51:35 2010
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

           UUID : 1557ef2c:c9c293be:e394a98f:485db1ea
         Events : 0.1601

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3


SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours) 
LBA_of_first_error
# 1  Short offline       Completed without error       00%      1116         -

Yet, twice now, I have caught one of the disks being rebuilt. I have not been doing anything to do with the partitions, or RAID devices at these times. They have resynced fine automatically and the disk health report is fine afterwards. I do not know why this is happening - any suggestions? Or any suggestions on how to find out? I have checked my history to make sure I didn't inadventantly do something stupid just beforehand (the first time only - the other time I was not present when the resync started; no-one was doing anything on the machine apart from with the NFS files it serves 5 clients). As far as I can see, there is nothing in /var/log/messages that provides a hint to why this occurred (although forgive me if I am wrong - I have no idea what a lot of these messages mean...).

For example. today, the log shows a normal boot (the server had to be shutdown over the weekend), and then about an hour later (in a new /var/log/messages file):

Code:

Jan 11 10:15:03 localhost syslogd 1.4.1: restart.
Jan 11 10:20:19 localhost kernel: md: syncing RAID array md0
Jan 11 10:20:19 localhost kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Jan 11 10:20:19 localhost kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Jan 11 10:20:19 localhost kernel: md: using 128k window, over a total of 513984 blocks.

I would really like to get to the bottom of this... I don't know what is making it happen!

Any light shed on this would be greatly appreciated...
Thanks!

sleddog · 01-11-2010, 08:37 PM

Is this CentOS/Redhat 5.4? If so, there's an /etc/cron.weekly/99-raid-check script that's new in 5.4. Typically, it'll run on Sunday, however as your server was off during the weekend it was missed, and may have been called by anacron after you booted. This is just a guess....

bbrunner · 01-13-2010, 03:32 AM

Under special circumstances raid 1 arrays can get out of sync because data is discarded before it has been synched (this can happen with swap and memory mapped files that reside on raid 1). The weekly raid check may spot these and send you messages which will raise your neckhairs.

Of course it's good to get to the bottom of this, but it may be harmless.

ascfu · 01-14-2010, 05:23 AM

sleddog: yes is CentOS 5.4. No messages from the 99-raid-check script so I'm still at a bit of a loss...

bbunner: unfortunately wasn't just the RAID 1 swap partition, but all of the partitions. this includes a large LVM partition with a lot of data on it which is mounted onto several clients using NFS. so, really, having found no problems with disk health (I've run long smartctl tests and everything seems fine), the only thing that concerns me is the speed of the service provided as resyncing tends to slow things down a bit... and if it happens sporadically, I'd rather know why!

a thought... (apologies for the retardedness of this question, but always best to check... I'm not the most advanced user of linux and this is the first time I've used RAID... ;-) ) - resyncing and rebuilding are the same thing, right...?

thanks everyone for your input :-)

sleddog · 01-14-2010, 07:19 AM

The raid-check script will sync each of your RAID arrays when it runs (weekly). This is a health check; it doesn't mean there is anything wrong. If there is something wrong it will be in the syslog (and maybe mailed to you). The time it takes to sync an array will depend on the size of the array. If the time of the weekly run is inconvenient for you, you can change it by editing /etc/crontab.

Here's typical output (in /var/log/messages) from one of my arrays (~450 GB). You can see I rescheduled it to Sunday afternoon instead of the typical early morning, as for me early morning conflicted with extensive backup activity.

Code:

Jan 10 15:00:03 serverName kernel: md: syncing RAID array md3
Jan 10 17:08:39 serverName kernel: md: md3: sync done.

Again, this is new in CentOS 5.4 (it did not exist in 5.3 or earlier).

ascfu · 01-15-2010, 03:37 AM

aha!

I didn't know that - thank you. This probably explains it - some of the arrays are large (> 1Tb) and so maybe it was just running over into Mon morning... I'll maybe reschedule (also possible conflicts with a backup script).

Thank you, you've put my mind to rest. :-)

ascfu · 01-15-2010, 03:40 AM

ah, also re-reading your previous post, I now see this is what you meant originally! Doh.
Thanks again :-)