[SOLVED] Distributed RAID(5/6) over NBD, fail to assemble after rebooting

QYInst · 01-27-2015, 09:29 PM

Hi all,
I am trying to build a distributed RAID (5 or 6) with several linux servers in a LAN environment.
The goal is to make use of all the hard disks of all the serves, for safely I/O and data storage.
We do not care much about the I/O speed, since there are not many users (no frequently write/read), and only 3-5 machines.
In this case, I think RAID 5 or 6 are good enough, while RAID 1/10 waste too much storage.

On the other hand, in order to share local hard disks (block devices) with remote computers, and create RAID over networking, I decide to use RAID over NBD. Since it is simple and free.

I have done some tests using two ubuntu-14.10 desktops (let us say host0 and host1) so far. The softwares I used are:

Quote:

mdadm -- v3.3
nbd-server/client -- v3.8

host1 shares 4 block devices through LAN (/dev/sda{5,6,7,8}) using nbd-server, which are connected as nbd{0,1,2,3} at host0.
Then the RAID 5 system is created at host0 together with the local /dev/sda{5,6,7}:

Code:

# mdadm --create --auto=yes /dev/md0 --level=5 --raid-devices=5 --spare-devices=2 /dev/sda{5,6,7} /dev/nbd{0,1,2,3}
# mkfs.ext4 /dev/md0
# mount /dev/md0 /mnt/md

Up to now everything goes smoothly, it works very well.

However, since the electric power here is not very reliable, I tried rebooting host1, to see if the RAID can be recovered correctly.
After host1 starts, of course the file system mounted at host0:/mnt/md is now read-only. Then I umounted the RAID and checked the details:

Code:

# umount /mnt/md
# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jan 28 09:48:10 2015
     Raid Level : raid5
     Array Size : 195172352 (186.13 GiB 199.86 GB)
  Used Dev Size : 48793088 (46.53 GiB 49.96 GB)
   Raid Devices : 5
  Total Devices : 7
    Persistence : Superblock is persistent

    Update Time : Wed Jan 28 11:11:53 2015
          State : clean, FAILED 
 Active Devices : 3
Working Devices : 3
 Failed Devices : 4
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : host0:0  (local to host host0)
           UUID : fa43d095:47309bb4:4beaccca:fde903a4
         Events : 23

    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       8        6        1      active sync   /dev/sda6
       2       8        7        2      active sync   /dev/sda7
       6       0        0        6      removed
       8       0        0        8      removed

       3      43        0        -      faulty   /dev/nbd0
       5      43       32        -      faulty   /dev/nbd2
       6      43       48        -      faulty   /dev/nbd3
       7      43       16        -      faulty   /dev/nbd1

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid5 nbd1[7](F) nbd3[6](F) nbd2[5](F) nbd0[3](F) sda7[2] sda6[1] sda5[0]
      195172352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/3] [UUU__]
      
unused devices: <none>

These are what we may expected. In this case, I tried to assemble the RAID:

Code:

# mdadm --stop /dev/md0
# mdadm --assemble --force /dev/md0 /dev/sda{5,6,7} /dev/nbd{0,1,2,3}
mdadm: clearing FAULTY flag for device 5 in /dev/md0 for /dev/nbd2
mdadm: clearing FAULTY flag for device 6 in /dev/md0 for /dev/nbd3
mdadm: Marking array /dev/md0 as 'clean'
mdadm: /dev/md0 assembled from 3 drives and 2 spares - not enough to start the array.

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : inactive sda5[0](S) nbd3[6](S) nbd2[5](S) nbd1[7](S) nbd0[3](S) sda7[2](S) sda6[1](S)
      341563993 blocks super 1.2
       
unused devices: <none>

# mdadm --examine /dev/sda{5,6,7} /dev/nbd{0,1,2,3}
   *** All the states are 'clean' ***
   Array State : AAA..
   Array State : AAA..
   Array State : AAA..
   Array State : AAAAA
   Array State : AAAAA
   Array State : AAAAA
   Array State : AAAAA

Absolutely I failed to re-assemble the RAID. And mdadm --manage does not work at all.

However, I find re-creating the RAID works:

Code:

# mdadm --create --auto=yes /dev/md0 --level=5 --raid-devices=5 --spare-devices=2 /dev/sda{5,6,7} /dev/nbd{0,1,2,3}

I have check and all the data is there.

Sorry for the boring details. Here comes my questions:
1. Is re-creating the RAID always safe?
2. How can I re-assemble (or recover) the RAID after rebooting of one node? I did not write anything to the devices, and in principle all the data should be there, right?

BTW, RAID 6 is a solution for this case, which can rebuild 2 failed disks. However, it does not solve my problem, since all the nodes may shutdown at the same time due to tripping of the eclectric power.

Thank you very much for any comments and suggestions!

smallpond · 01-28-2015, 08:49 AM

You are tap-dancing on a landmine. Since ext filesystems don't have data integrity checking, you don't know whether "all the data is there". fsck only tells you the filesystem metadata is intact.

Any write going on at the time you disconnect multiple drives from a RAID 5 or 6 will leave the array corrupted and that stripe of data unrecoverable. If you care about your data put all the drives in a single enclosure and power that server and the drives with a UPS.

QYInst · 01-28-2015, 07:02 PM

Quote:

Originally Posted by smallpond

You are tap-dancing on a landmine. Since ext filesystems don't have data integrity checking, you don't know whether "all the data is there". fsck only tells you the filesystem metadata is intact.

Thanks. I have checked not only fsck for the filesystem. Actually I have copied a bunch of data for testing in advance, and after re-creating the RAID all the data is there.
I know I am taking risks with such operation. However, I cannot come up with a workaround. Maybe I have to give up on this idea of building RAID over NBD. This is what tests for, i.e. exploring problems in advance. ;-)

Quote:

Originally Posted by smallpond

Any write going on at the time you disconnect multiple drives from a RAID 5 or 6 will leave the array corrupted and that stripe of data unrecoverable. If you care about your data put all the drives in a single enclosure and power that server and the drives with a UPS.

Yes, you are right. I am still wondering if there is any safely workaround, for making use of the hard disks from different machines. Maybe I am just too naive.;-)

btmiller · 01-28-2015, 07:59 PM

Look at a real world cluster filesystem, e.g. Ceph, Gluster, Lustre, probably others, that is actually designed for this sort of thing with the necessary redundancy and recovery protocols. Trying to run multiple-machine RAID just does not seem like a good idea to me.

QYInst · 01-28-2015, 08:45 PM

Quote:

Originally Posted by btmiller

Look at a real world cluster filesystem, e.g. Ceph, Gluster, Lustre, probably others, that is actually designed for this sort of thing with the necessary redundancy and recovery protocols. Trying to run multiple-machine RAID just does not seem like a good idea to me.

Sounds great. I think those are what I have been looking for.
I will take a look at the filesystems. Thanks a lot!