LinuxQuestions.org
Old 12-16-2021, 06:10 PM   #1
T-Prime3797
LQ Newbie
 
Registered: Dec 2021
Posts: 3

Rep: Reputation: Disabled
RAID Failure


Good Day,

My RAID has failed, and I'm not sure what's going on. mdadm is giving me strange information (see below):


First off, this says raid0 when it should be raid5
Code:
/dev/md127:
           Version : 1.2
        Raid Level : raid0
     Total Devices : 4
       Persistence : Superblock is persistent

             State : inactive
   Working Devices : 4

              Name : ubuntu-server:Data_RAID
              UUID : e53ba358:1a0b2928:60fa66ce:d96f4138
            Events : 5967434

    Number   Major   Minor   RaidDevice

       -       8       64        -        /dev/sde
       -       8        0        -        /dev/sda
       -       8       48        -        /dev/sdd
       -       8       16        -        /dev/sdb

This one says the first device in the array is missing.
Code:
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : e53ba358:1a0b2928:60fa66ce:d96f4138
           Name : ubuntu-server:Data_RAID
  Creation Time : Tue Jul 23 01:24:47 2019
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
     Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
  Used Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=176 sectors
          State : clean
    Device UUID : 1860eee9:458f6d4d:afa39c8e:07fb048a

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Nov 25 14:03:43 2021
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : c94d6a9 - correct
         Events : 5967434

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : .AAA ('A' == active, '.' == missing, 'R' == replacing)

This one says the first and third devices are missing.
Code:
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : e53ba358:1a0b2928:60fa66ce:d96f4138
           Name : ubuntu-server:Data_RAID
  Creation Time : Tue Jul 23 01:24:47 2019
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
     Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
  Used Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=176 sectors
          State : clean
    Device UUID : 6f797783:0f21ab6a:69266265:14c4635b

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  5 00:57:01 2021
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 6b8d409a - correct
         Events : 5967440

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : .A.A ('A' == active, '.' == missing, 'R' == replacing)

This one says the first and third devices are missing, and has bad blocks.
Code:
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : e53ba358:1a0b2928:60fa66ce:d96f4138
           Name : ubuntu-server:Data_RAID
  Creation Time : Tue Jul 23 01:24:47 2019
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
     Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
  Used Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=176 sectors
          State : clean
    Device UUID : b2dd7d4b:a524b4b6:f80cb48e:b4c96bc6

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Dec  5 00:57:01 2021
  Bad Block Log : 512 entries available at offset 16 sectors - bad blocks present.
       Checksum : 13f69538 - correct
         Events : 5967440

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : .A.A ('A' == active, '.' == missing, 'R' == replacing)

And finally, this says all the devices are active.
Code:
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : e53ba358:1a0b2928:60fa66ce:d96f4138
           Name : ubuntu-server:Data_RAID
  Creation Time : Tue Jul 23 01:24:47 2019
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
     Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
  Used Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=176 sectors
          State : clean
    Device UUID : 23448303:be388788:1628ea60:0186e328

Internal Bitmap : 8 sectors from superblock
    Update Time : Tue Jul 27 00:34:58 2021
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : d173ef57 - correct
         Events : 53434

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
I don't understand what's happening. Can someone help?

Thank you.
 
Old 12-18-2021, 12:36 PM   #2
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 3,345

Rep: Reputation: 1484
First off, what commands are you using to get each of those outputs?

I cannot even fathom what may give those divergent results, so please update the post with the command used for each.

Without the commands we cannot even hope to know the answer.

It is possible that you have had one drive in a failed state for some time and a second failure took the array offline (and makes it unrecoverable). We need more info to know.

Also please post the output of
Code:
cat /proc/mdstat
 
Old 12-18-2021, 05:39 PM   #3
Crippled
Member
 
Registered: Sep 2015
Distribution: MX Linux 21.3 Xfce
Posts: 595

Rep: Reputation: Disabled
Do you have a hardware RAID or a software RAID? If you have a hardware RAID, just replace the defective drives and the RAID will rebuild itself. If you have a software RAID, it's trashed.
 
Old 12-18-2021, 06:15 PM   #4
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,138

Rep: Reputation: 1263
md stops writing to a device when it fails, so sdd failed first; all devices were good up to that point. It was device 0.

sde failed next; device 0 had been failed since Nov 25. sde was device 2.

sda and sdb both noted the missing drives (0 and 2) on Dec 5, which is when sde and the RAID failed. They are devices 3 and 1.

Ideally, you should set up monitoring with mdadm in monitor mode and have it email you (or otherwise alert you) when a drive dies.
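For example, a minimal sketch (the email address is a placeholder, and the config file path and service name vary by distro - /etc/mdadm/mdadm.conf and mdmonitor.service on many systems):
Code:
# one-off: run mdadm in monitor mode as a daemon and mail on failure events
mdadm --monitor --scan --daemonise --mail=you@example.com

# or persistently: set MAILADDR in mdadm.conf and enable the monitor service
echo "MAILADDR you@example.com" >> /etc/mdadm/mdadm.conf
systemctl enable --now mdmonitor.service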

No idea why it now thinks the array is RAID 0.
 
1 member found this post helpful.
Old 12-19-2021, 09:21 AM   #5
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Rep: Reputation: 550
Quote:
Originally Posted by computersavvy View Post
First off, what commands are you using to get each of those outputs?

I cannot even fathom what may give those divergent results, so please update the post with the command used for each.

Without the commands we cannot even hope to know the answer.
The first appears to be the output of something like:
Code:
mdadm --query --detail /dev/mdNNN
(I haven't figured out what resulted in the remainder, though.)

Update:
Code:
mdadm --examine /dev/sd<A><N>
(It's been a long time since I've had to dig that deeply into an md device.)

Last edited by rnturn; 12-19-2021 at 01:00 PM.
 
Old 12-19-2021, 10:05 AM   #6
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,681

Rep: Reputation: 5894
Quote:
Originally Posted by rnturn View Post
(I haven't figured out what resulted in the remainder, though.)
mdadm --examine /dev/sdb
 
Old 12-19-2021, 01:02 PM   #7
T-Prime3797
LQ Newbie
 
Registered: Dec 2021
Posts: 3

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by smallpond View Post
md stops writing to a device when it fails, so sdd failed first; all devices were good up to that point. It was device 0.

sde failed next; device 0 had been failed since Nov 25. sde was device 2.

sda and sdb both noted the missing drives (0 and 2) on Dec 5, which is when sde and the RAID failed. They are devices 3 and 1.

Ideally, you should set up monitoring with mdadm in monitor mode and have it email you (or otherwise alert you) when a drive dies.

No idea why it now thinks the array is RAID 0.
Okay, that makes sense to me. Unfortunately I was out of the country when all this happened, so even if I had been notified, I was in no position to do anything about it.

Right now I'm using 'dd' to pull data from the 4 drives in hopes I can rebuild the information at least long enough to recover some of the data. What are the odds of that actually working?
 
Old 12-19-2021, 01:06 PM   #8
T-Prime3797
LQ Newbie
 
Registered: Dec 2021
Posts: 3

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by computersavvy View Post
First off, what commands are you using to get each of those outputs?

I cannot even fathom what may give those divergent results, so please update the post with the command used for each.

Without the commands we cannot even hope to know the answer.

It is possible that you have had one drive in a failed state for some time and a second failure took the array offline (and makes it unrecoverable). We need more info to know.

Also please post the output of
Code:
cat /proc/mdstat
rnturn & michaelk are correct in their deductions of the commands I used. /proc/mdstat states:
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices: <none>
 
Old 12-19-2021, 06:37 PM   #9
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 3,345

Rep: Reputation: 1484
That output from /proc/mdstat is not surprising since the raid array is failed and not active.

I appreciate the confirmation on the commands.

I do not envy you the recovery process as it will certainly be tedious at best.

If attempting to use dd to recover the data, I would not suggest using anything other than /dev/sde (the last of the failed drives to drop out) for that recovery, since it lasted a lot longer than the other failed drive, and the first one to fail will have data that is way out of date.

If you can recover a good image of that one, write the data to a new drive, and thus get the array back online in a (still degraded) state, then you can add in a drive to replace the first one that failed. Once the array has rebuilt the data, you may have a fully functioning raid array again.
 
Old 12-19-2021, 07:05 PM   #10
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
You may find this an interesting read - especially the bit about using overlay files to save stressing dodgy drives. Also note the preference for ddrescue rather than dd where an image is actually required.
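For example, a rough ddrescue sketch (device names are placeholders - assuming /dev/sde is the failing source and /dev/sdf is a new drive of at least the same size; the map file lets ddrescue resume and retry only the bad areas):
Code:
# GNU ddrescue is packaged as "gddrescue" on Debian/Ubuntu
ddrescue -f -n /dev/sde /dev/sdf sde.map    # first pass, skip the slow scraping phase
ddrescue -f -r3 /dev/sde /dev/sdf sde.map   # then retry the bad areas up to 3 times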

Lotsa luck.
 
Old 12-19-2021, 07:49 PM   #11
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,681

Rep: Reputation: 5894
I don't understand why the status is different between the disks.

With RAID 5 the data is striped across the disks, and it requires at least 3 disks. As far as I know you need at least 2 disks to run RAID 5 in degraded mode.

testdisk can recover data from a RAID.
 
Old 12-20-2021, 05:40 AM   #12
lvm_
Member
 
Registered: Jul 2020
Posts: 912

Rep: Reputation: 314
mdadm --examine reports data stored on the individual devices. Once a device falls out of the array, md naturally stops writing to it, so different data on different devices is perfectly OK and lets you track the order in which the array collapsed: the device showing AAAA was the first to go, followed by the one showing .AAA, and after that the array stopped. Since the event counts on all devices are pretty close, the array should be [almost] ok after you force-assemble it. Run fsck and checkarray after that.
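A minimal sketch of that sequence, assuming the array is /dev/md127 as in the OP's output and that the filesystem sits directly on the md device; list only the member devices whose superblocks show matching, recent event counts (the checkarray path is the one Debian/Ubuntu ships):
Code:
mdadm --stop /dev/md127                      # clear the inactive, partially-assembled array
mdadm --assemble --force /dev/md127 /dev/sda /dev/sdb /dev/sde
cat /proc/mdstat                             # confirm it came up (degraded)
fsck -n /dev/md127                           # read-only filesystem check first
/usr/share/mdadm/checkarray /dev/md127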
 
Old 12-20-2021, 10:15 AM   #13
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 3,345

Rep: Reputation: 1484
Quote:
Originally Posted by michaelk View Post
I don't understand why the status is different between the disks.

With RAID 5 the data is striped across the disks, and it requires at least 3 disks. As far as I know you need at least 2 disks to run RAID 5 in degraded mode.

testdisk can recover data from a RAID.
His raid5 array was 4 disks. Raid 5 can tolerate only one drive failure and he has had 2 drives fail.
 
Old 12-20-2021, 10:27 AM   #14
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 3,345

Rep: Reputation: 1484
Quote:
Originally Posted by lvm_ View Post
mdadm --examine reports data stored on the individual devices. Once a device falls out of the array, md naturally stops writing to it, so different data on different devices is perfectly OK and lets you track the order in which the array collapsed: the device showing AAAA was the first to go, followed by the one showing .AAA, and after that the array stopped. Since the event counts on all devices are pretty close, the array should be [almost] ok after you force-assemble it. Run fsck and checkarray after that.
The event count on /dev/sdd is tiny compared to the other three. /dev/sde is only 6 events behind /dev/sda and /dev/sdb, so he may be able to force-assemble those 3 into a degraded array.

I would suggest, as you did, that he do an fsck and checkarray, but that he then immediately add a 4th disk to replace /dev/sdd and allow the array to fully rebuild before doing anything else, not even mounting it. Alternatively, he could back up the data from the array (read-only) while it is still in the degraded state.
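A rough sketch of those last steps, assuming the replacement drive shows up as /dev/sdf (a placeholder name) and the array assembled as /dev/md127:
Code:
mdadm --manage /dev/md127 --add /dev/sdf    # add the replacement for the failed sdd
watch cat /proc/mdstat                      # follow the rebuild; let it finish before doing anything else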
 
Old 12-21-2021, 02:14 AM   #15
lvm_
Member
 
Registered: Jul 2020
Posts: 912

Rep: Reputation: 314
Quote:
Originally Posted by computersavvy View Post
The event count on /dev/sdd is tiny compared to the other three.
Oh yes, missed that on a cursory reading - they are all 5-somethings :) So actually the first drive dropped out of the array ages ago, but the OP was not paying attention...
 
  

