Old 07-04-2011, 06:43 PM   #1
jg167
Member
 
Registered: Jun 2004
Posts: 40

Rep: Reputation: 15
md policy on device selection for RAID-1


Does anyone know offhand how md selects which device to read from in a RAID-1? Does it aim for good performance by simply distributing reads round-robin, does it compute something like average-service-time * queue-length to guess where each read will finish first, or something else?

It looks like this choice can be influenced by
echo "writemostly" > /sys/block/md<n>/md/dev-<xxx>/state
which, according to md.txt, will cause reads to go to this device only if there is no other choice. That would do exactly what I want in a RAID-1 of a fast device and a slow device, where the system is mostly read-only.

Has anyone tried that?
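For reference, a minimal sketch of toggling that flag at runtime, based on the md.txt description; /dev/md0 and sdb1 are placeholder names, not my actual devices.

Code:
# Mark a slow RAID-1 member write-mostly at runtime (placeholder names).
echo writemostly > /sys/block/md0/md/dev-sdb1/state

# Clear it again if needed.
echo -writemostly > /sys/block/md0/md/dev-sdb1/state

# Write-mostly members show up with a (W) marker in /proc/mdstat.
cat /proc/mdstat
cat /sys/block/md0/md/dev-sdb1/state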

Last edited by jg167; 07-04-2011 at 06:51 PM. Reason: add link to md.txt
 
Old 07-05-2011, 01:07 AM   #2
nooneknowme
Member
 
Registered: Feb 2008
Location: Bangalore, India
Posts: 69

Rep: Reputation: 5
I have not tried that, but the issue sounded interesting. Reading the manual, I found an option which might do the trick for you:

Quote:

For Manage mode:
--write-mostly
Subsequent devices that are added or re-added will have the
'write-mostly' flag set. This is only valid for RAID1 and means
that the 'md' driver will avoid reading from these devices if
possible.

--readwrite
Subsequent devices that are added or re-added will have the
'write-mostly' flag cleared.
You have similar options when creating the RAID array as well.
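For instance, a rough sketch of what that could look like at create time (device names here are made up; per the man page, --write-mostly applies to the devices listed after it):

Code:
# Sketch only: /dev/sda1 is the fast member, /dev/sdb1 the slow one.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/sda1 --write-mostly /dev/sdb1

# The slow member should then appear with a (W) marker:
cat /proc/mdstat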
 
Old 07-06-2011, 09:01 PM   #3
jg167
Member
 
Registered: Jun 2004
Posts: 40

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by nooneknowme
I have not tried that, but the issue sounded interesting. Reading the manual, I found an option which might do the trick for you.
[...]
You have similar options when creating the RAID array as well.
Thanks, that'll make it even easier to set up.
 
Old 07-28-2011, 10:54 AM   #4
jg167
Member
 
Registered: Jun 2004
Posts: 40

Original Poster
Rep: Reputation: 15
Functionally it works, but performance is terrible

This method works; however, we are using a mirror of two md devices (i.e. two RAID0 stripes, one on flash cards and one on disks, with the disk stripe marked write-mostly). Functionally it all works, but stacked md configurations are very slow: reading through the mirror offers only about 50% of the bandwidth of reading the RAID0 stripe directly. This is true even for the disk side by itself, with the flash side removed (i.e. a mirror with one side failed).

Wondering how to create an md array that starts out with a piece missing? Use the word missing in place of the device, e.g.
mdadm --create /dev/md2 -l 1 -n 2 /dev/md1 missing
will create the md2 volume used below.
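For completeness, a sketch of the whole stacked construction (placeholder device names and counts, not the exact 14-device layout shown below):

Code:
# Sketch with placeholder names/counts (the real flash stripe has 14 devices).
# Flash-side RAID0 stripe (this is the /dev/md1 shown below):
mdadm --create /dev/md1 --level=0 --chunk=64 --raid-devices=2 /dev/sdb1 /dev/sdc1

# RAID1 mirror created with one half missing:
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/md1 missing

# Later, build the disk-side stripe (hypothetical /dev/md3) and add it as the
# write-mostly half; per the man page, --write-mostly flags the devices
# listed after it:
mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/sdd1 /dev/sde1
mdadm /dev/md2 --add --write-mostly /dev/md3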

Here are the details:

Code:
[root@pe-r910 ~]# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Tue Jul 26 23:13:59 2011
     Raid Level : raid1
     Array Size : 1998196216 (1905.63 GiB 2046.15 GB)
  Used Dev Size : 1998196216 (1905.63 GiB 2046.15 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Thu Jul 28 08:29:35 2011
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : pe-r910.ingres.prv:2  (local to host pe-r910.ingres.prv)
           UUID : 299ea821:756847a0:4db591e4:38769641
         Events : 160

    Number   Major   Minor   RaidDevice State
       0       9        1        0      active sync   /dev/md1
       1       0        0        1      removed
[root@pe-r910 ~]# mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Tue Jul 26 01:05:05 2011
     Raid Level : raid0
     Array Size : 1998197376 (1905.63 GiB 2046.15 GB)
   Raid Devices : 14
  Total Devices : 14
    Persistence : Superblock is persistent

    Update Time : Tue Jul 26 01:05:05 2011
          State : clean
 Active Devices : 14
Working Devices : 14
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           Name : pe-r910.ingres.prv:1  (local to host pe-r910.ingres.prv)
           UUID : 735bd502:62ed0509:08c33e15:19ae4f6b
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       8       97        5      active sync   /dev/sdg1
       6       8      113        6      active sync   /dev/sdh1
       7       8      129        7      active sync   /dev/sdi1
       8       8      145        8      active sync   /dev/sdj1
       9       8      161        9      active sync   /dev/sdk1
      10       8      177       10      active sync   /dev/sdl1
      11       8      193       11      active sync   /dev/sdm1
      12       8      209       12      active sync   /dev/sdn1
      13       8      225       13      active sync   /dev/sdo1
[root@pe-r910 ~]# dd if=/dev/md1 bs=512K count=10000 iflag=nonblock,direct of=/dev/null
10000+0 records in
10000+0 records out
5242880000 bytes (5.2 GB) copied, 3.45236 s, 1.5 GB/s
[root@pe-r910 ~]# dd if=/dev/md2 bs=512K count=10000 iflag=nonblock,direct of=/dev/null
10000+0 records in
10000+0 records out
5242880000 bytes (5.2 GB) copied, 6.81182 s, 770 MB/s
[root@pe-r910 ~]#
update:
iostat shows 64K reads going both to md1 and to its component devices when reading directly from md1. This is somewhat mysterious, since dd is asking for 512K reads, so I would have expected to see 512K requests to md1 and 64K requests (i.e. the chunk size) to its component devices.
But the killer is that when reading from md2 (the RAID1 volume with only one half present) it shows only 4K reads to md2, to md1, and to the component devices. Perhaps that is because md thinks that is the size it should use for error processing, but it's killing performance.
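(In case anyone wants to reproduce the observation, this is roughly how the request sizes can be watched while the dd runs; on sysstat of this vintage, avgrq-sz is reported in 512-byte sectors, so 128 = 64K and 8 = 4K. Device names are from my layout; adjust to yours.)

Code:
# Watch average request sizes on each layer while the dd is running.
iostat -x md1 md2 sdb sdc 1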

update2:
This looks to be an issue only for md on md. If I make a RAID1 directly on a disk, its I/O rate is the same as the disk's.
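One thing that might be worth checking here (just a guess on my part, not a confirmed fix): whether the top-level md device has a much smaller read-ahead than the stripe underneath it, since that would be consistent with the small requests.

Code:
# Hypothetical check: compare read-ahead (in 512-byte sectors) per layer.
blockdev --getra /dev/md1
blockdev --getra /dev/md2

# If md2's value is tiny, try raising it (8192 sectors = 4 MB, arbitrary):
blockdev --setra 8192 /dev/md2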

Last edited by jg167; 07-29-2011 at 02:20 AM.
 
  

