mdadm RAID5 degraded/rebuild access issues.

dbrazeau · 04-13-2010, 07:14 PM

I am trying to test my system's RAID5 recovery, and I seem to be running into some issues.

I created a RAID5 array with 3 drives using mdadm. I then created an ext3 file system on the RAID and copied a 3.5GB test file to the new file system. Next I faulted and removed one of the drives from the RAID using mdadm. Up to this point all seems well: the now-degraded RAID is still mounted, and I can see the test file on it using "ls -l".
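For anyone who wants to reproduce this, the sequence described above might look roughly like the following sketch. The device names match the mdadm -D output posted later in the thread; the mount point /mnt/raid and the destination /tmp/testfile_degraded.tar are hypothetical placeholders, and the exact options used originally are not known. By default the script only prints the commands; it runs them only when RUN=1 is set (as root, on disks you can afford to wipe).

```shell
#!/bin/sh
# Sketch of the degrade test. Prints each command; executes only if RUN=1.
MD=/dev/md0
MNT=/mnt/raid            # hypothetical mount point
TESTFILE=testfile.tar

step() {
    # Show the command, then run it only when explicitly requested.
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

raid5_degrade_test() {
    step mdadm --create "$MD" --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
    step mkfs.ext3 "$MD"
    step mount "$MD" "$MNT"
    step cp "$TESTFILE" "$MNT/"
    step mdadm "$MD" --fail /dev/sdb1      # mark one member faulty
    step mdadm "$MD" --remove /dev/sdb1    # pull it out of the array
    step cp "$MNT/$TESTFILE" /tmp/testfile_degraded.tar   # read back while degraded
}

raid5_degrade_test
```

Keeping the steps in one function makes it easy to rerun the same test against a different set of disks.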

From here on is where the trouble starts.

With the RAID still in degraded mode, I try to copy (read) the test file and I get a bunch of errors. It doesn't matter whether I try to create a duplicate copy of it on the RAID or copy it to a drive not in the RAID. Adding the "faulted" drive back into the RAID and waiting for it to recover does not fix the issue; I still get the same errors.

Here are the errors I'm seeing:

Code:

# cp testfile.tar testfile_degraded.tar
attempt to access beyond end of device
md0: rw=0, want=15236514744, limit=22490368
__ratelimit: 626 callbacks suppressed
Buffer I/O error on device md0, logical block 1904564342
attempt to access beyond end of device
md0: rw=0, want=33612653048, limit=22490368
Buffer I/O error on device md0, logical block 4201581630
attempt to access beyond end of device
md0: rw=0, want=18764592712, limit=22490368
Buffer I/O error on device md0, logical block 2345574088
attempt to access beyond end of device
md0: rw=0, want=9395562552, limit=22490368
Buffer I/O error on device md0, logical block 1174445318
cp: read error: Input/output error

Does anyone know how to fix these errors?

Please let me know if there is any other information that would be helpful.

anuragccsu · 04-13-2010, 10:41 PM

Hi there,

Did you check the status of the array using the -D option of mdadm? What does it say? That option gives you the complete status of the RAID.
A few more things to check:
What was the status of the array before you failed the drive, and what is it after the failure?
Was your array full?
You can run the sync command to flush all unwritten buffers to disk, since your output shows some buffer I/O errors.
To me it seems your array is full. Try copying a small file, or try creating files using the touch command.

Thanks
Anurag

dbrazeau · 04-14-2010, 11:03 AM

My RAID is not full. I created a RAID5 that's a little bigger than 11GB. As I mentioned before, my test file is about 3.5GB. At the time I fail the drive and put my RAID in degraded mode, the 3.5GB test file is the only file on the RAID, so there is still about 7.5GB free.

Also, I have no problem creating a file using touch when the RAID is in degraded mode. When I try to create a duplicate copy of the test file on the degraded RAID, some of it actually gets copied. The first time I tried, about 700MB was copied before the errors started, and the second time about 350MB was copied. Either way, there should be plenty of available space on my RAID to complete my copy operation.

Based on the error returned from cp:

Code:

cp: read error: Input/output error

It seems like it is having trouble reading my test file, not writing more data to the RAID. Also to support this claim, as I mentioned before, I cannot copy the test file to a separate drive that is not part of the array.

So the basic issue is that when trying to read my test file off a degraded RAID5, I get a bunch of "out of bounds" errors.

Does anyone know what units are used in this error:

Code:

md0: rw=0, want=9395562552, limit=22490368

Is this 22490368 blocks, cylinders, bytes?
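One way to sanity-check the units is simple arithmetic; this is only a sketch under two assumptions. First, it assumes "limit" and "want" count 512-byte sectors, which fits an array "a little bigger than 11GB". Second, it assumes a 4 KiB filesystem block (8 sectors) and that "want" is the sector just past the end of the requested block, which fits every want/logical block pair in the log above. The helper names are made up for illustration.

```shell
#!/bin/sh
# Assumption: kernel "limit"/"want" values are in 512-byte sectors.
SECTOR=512
LIMIT=22490368

# Convert a sector count to bytes.
sectors_to_bytes() { echo $(( $1 * SECTOR )); }

# Assumption: 4 KiB blocks (8 sectors each); "want" is the end sector of the block.
block_to_want() { echo $(( ($1 + 1) * 8 )); }

echo "limit in bytes: $(sectors_to_bytes "$LIMIT")"
echo "want for logical block 1174445318: $(block_to_want 1174445318)"
```

Under these assumptions the limit works out to 11515068416 bytes, and the requested sectors are hundreds of times past the end of the device, so the block numbers being requested look like garbage rather than values that are slightly off.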

dbrazeau · 04-14-2010, 03:35 PM

Here are some more details. In this test I copied the 3.5GB test file to the RAID. At this point I can still read back the test file fine. Then I fault one of the partitions in my RAID. Now when I try to read the file I get the "attempt to access beyond end of device" errors. In this test I was able to copy about 3GB of the 3.5GB file before I started getting errors.

Here are the details about my RAID before failing one of the drives:

Code:

~ # mdadm -D /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Wed Apr 14 12:50:54 2010
     Raid Level : raid5
     Array Size : 11245184 (10.72 GiB 11.52 GB)
  Used Dev Size : 5622592 (5.36 GiB 5.76 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Apr 14 13:04:46 2010
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : ab1369c8:669b3be5:14975abc:932ab79d (local to host Testbox)
         Events : 0.38

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
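As a quick consistency check on the output above: a RAID5 array's usable capacity is (number of devices - 1) times the per-device size, since one device's worth of space holds parity. A minimal sketch, with a made-up function name:

```shell
#!/bin/sh
# Usable RAID5 size = (raid devices - 1) * used dev size.
# $1 = number of raid devices, $2 = used dev size (same unit as the result)
raid5_array_size() {
    echo $(( ($1 - 1) * $2 ))
}

# Values from the mdadm -D output: 3 devices, Used Dev Size 5622592
raid5_array_size 3 5622592
```

The result agrees with the reported Array Size of 11245184, so the array geometry itself looks sane.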

This is the command I use to fault one of the drives:

Code:

~ # mdadm /dev/md0 -f /dev/sdb1
raid5: Disk failure on sdb1, disabling device.
raid5: Operation continuing on 2 devices.
mdadm: set /dev/sdb1 faulty in /dev/md0
RAID5 conf printout:
 --- rd:3 wd:2
 disk 0, o:1, dev:sda1
 disk 1, o:0, dev:sdb1
 disk 2, o:1, dev:sdc1
RAID5 conf printout:
 --- rd:3 wd:2
 disk 0, o:1, dev:sda1
 disk 2, o:1, dev:sdc1

Here are the details about my RAID after faulting the drive:

Code:

~ # mdadm -D /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Wed Apr 14 12:50:54 2010
     Raid Level : raid5
     Array Size : 11245184 (10.72 GiB 11.52 GB)
  Used Dev Size : 5622592 (5.36 GiB 5.76 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Apr 14 13:18:40 2010
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : ab1369c8:669b3be5:14975abc:932ab79d (local to host Testbox)
         Events : 0.40

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed
       2       8       33        2      active sync   /dev/sdc1

       3       8       17        -      faulty spare   /dev/sdb1

~ # cat /proc/mdstat 
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] 
md0 : active raid5 sdb1[3](F) sdc1[2] sda1[0]
      11245184 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
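The [3/2] [U_U] fields above are what show the array running on 2 of its 3 devices. If you want to check for that from a script, a rough sketch could look like this; md_state is a hypothetical helper name, and it reads /proc/mdstat-style text on stdin so it can be tried on pasted output.

```shell
#!/bin/sh
# Prints "degraded" if any array line shows fewer working devices than
# configured (a field like [3/2]), otherwise prints "ok".
md_state() {
    awk '
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            # Strip the brackets and split "configured/working".
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) bad = 1
        }
        END { print (bad ? "degraded" : "ok") }
    '
}

# Sample input copied from the /proc/mdstat output in this post.
md_state <<'EOF'
md0 : active raid5 sdb1[3](F) sdc1[2] sda1[0]
      11245184 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
EOF
```

On a live system the same function could be fed with `md_state < /proc/mdstat`.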

I then try to copy the test file to a drive that is not in the RAID.
Here are the errors:

Code:

attempt to access beyond end of device
md0: rw=0, want=19245217848, limit=22490368
Buffer I/O error on device md0, logical block 2405652230
attempt to access beyond end of device
md0: rw=0, want=6566075848, limit=22490368
Buffer I/O error on device md0, logical block 820759480
attempt to access beyond end of device
md0: rw=0, want=27860599552, limit=22490368
Buffer I/O error on device md0, logical block 3482574943
attempt to access beyond end of device
md0: rw=0, want=16777306888, limit=22490368
Buffer I/O error on device md0, logical block 2097163360
...

I have googled this error, and it looks like others have had similar issues, but I was unable to find any solution to resolve it.

anuragccsu · 04-14-2010, 09:43 PM

Hi there,

From the error messages ("attempt to access beyond end of device"), I assume your hard disk itself has some bad sectors. After the failure you are still able to write some data, and you only run into errors when you hit a bad sector on the disk. Could you please carry out the same test with some other disks that are known to be good?

Thanks
Anurag

dbrazeau · 04-15-2010, 12:12 PM

Quote:

Originally Posted by anuragccsu

Hi there,

From the error messages ("attempt to access beyond end of device"), I assume your hard disk itself has some bad sectors. After the failure you are still able to write some data, and you only run into errors when you hit a bad sector on the disk. Could you please carry out the same test with some other disks that are known to be good?

Thanks
Anurag

I'll give it a go with some other disks, but I'm not so sure that is the problem, since I have not seen any errors while the RAID is running in non-degraded mode. I have also run some heavy data integrity tests on the non-degraded RAID with no errors. If there were a bad sector, I would expect the data integrity test to fail.
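The same kind of integrity check could also be applied across the degrade step: record a checksum of the test file while the array is healthy, then verify it again after faulting a drive. A minimal sketch follows; it is not the integrity test mentioned above. record_sum and verify_sum are hypothetical helper names, cksum is used only because it is available everywhere (sha256sum would be a stronger choice), and the demo runs on a temporary file rather than the real test file.

```shell
#!/bin/sh
# Print the checksum and size of a file.
record_sum() { cksum < "$1"; }

# Succeed only if the file still matches a previously recorded checksum.
# $1 = file, $2 = output of an earlier record_sum
verify_sum() { [ "$(cksum < "$1")" = "$2" ]; }

# Demo on a temporary file; on the real system this would be the test
# file on the array, checked before and after failing a drive.
tmp=$(mktemp)
printf 'raid test data\n' > "$tmp"
before=$(record_sum "$tmp")
if verify_sum "$tmp" "$before"; then
    echo "checksum matches"
else
    echo "checksum differs"
fi
rm -f "$tmp"
```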