
LinuxQuestions.org (/questions/)
-   Linux - Server (https://www.linuxquestions.org/questions/linux-server-73/)
-   -   Failed drive while converting raid5 to raid6, then a hard reboot (https://www.linuxquestions.org/questions/linux-server-73/failed-drive-while-converting-raid5-to-raid6-then-a-hard-reboot-942554/)

hakon.gislason 04-30-2012 09:08 AM

Failed drive while converting raid5 to raid6, then a hard reboot
 
Hello,
I've been having frequent drive "failures": a drive is reported failed/bad, mdadm emails me that something went wrong, and so on, but after a reboot or two it is perfectly fine again. I'm not sure what the cause is, but the server is quite new and I suspect something else is behind it, such as bad memory or the motherboard (I've been having other issues as well). I've had four drive "failures" this month, all on different drives except for one that "failed" twice, and all were fixed with a reboot or a rebuild (every drive reported bad by mdadm passed an extensive SMART test).
Because of this, I decided to convert my raid5 array to raid6 while I track down the root cause of the problem.
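
(For context, a raid5-to-raid6 conversion is typically kicked off with a grow command along these lines; the device name and backup-file path here are illustrative, not necessarily what I ran:)

Code:

# add the new disk, then reshape the raid5 into a raid6 spanning 5 devices
# /dev/sdX and the backup path are placeholders
mdadm /dev/md0 --add /dev/sdX
mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-grow.backup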

I started the conversion right after a drive failure and rebuild, but after it had reshaped approximately 4% (if I remember correctly; it was also going very slowly, roughly 7500 minutes to completion), another drive was reported bad and the conversion to raid6 stopped (it still said "rebuilding", but the speed was 0K/sec and the time left was a few million minutes).
After that happened, I tried to stop the array and reboot the server, as I had done previously to get a reportedly "bad" drive working again, but it wouldn't stop the array or reboot, nor could I unmount it; it just hung whenever I tried to do anything with /dev/md0. After trying to reboot a few times, I killed the power and restarted the machine. Admittedly, that was probably not the best thing I could have done at that point.

I have a backup of about 80% of the data; it has been a month since the last complete backup (because I ran out of backup disk space).

So, the big questions: can the array be activated, can it complete the conversion to raid6, and will I get my data back?
I hope the data can be rescued, and any help would be much appreciated!

I'm fairly new to raid in general, and have been using mdadm for about a month now.
Here's some data:

Code:

root@axiom:~# mdadm --examine --scan
ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1 name=axiom.is:0

root@axiom:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
      7814054240 blocks super 1.2

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
mdadm: /dev/md0 is already in use.

root@axiom:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
mdadm: Failed to restore critical section for reshape, sorry.
      Possibly you needed to specify the --backup-file

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0 --backup-file=/root/mdadm-backup-file
mdadm: Failed to restore critical section for reshape, sorry.

root@axiom:~# fdisk -l | grep 2000
Disk /dev/sda doesn't contain a valid partition table
Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
Disk /dev/sde: 2000.4 GB, 2000398934016 bytes
Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes

root@axiom:~# mdadm --examine /dev/sd{a,b,c,e,f}
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
    Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
          Name : axiom.is:0  (local to host axiom.is )
  Creation Time : Mon Apr  9 01:05:20 2012
    Raid Level : raid6
  Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
    Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
  Super Offset : 8 sectors
          State : active
    Device UUID : b11a7424:fc470ea7:51ba6ea0:158c0ce6

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
    New Layout : left-symmetric

    Update Time : Sun Oct 14 15:20:06 2012
      Checksum : 76ecd244 - correct
        Events : 138274

        Layout : left-symmetric-6
    Chunk Size : 32K

  Device Role : Active device 3
  Array State : .AAAA ('A' == active, '.' == missing)
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x6
    Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
          Name : axiom.is:0  (local to host axiom.is )
  Creation Time : Mon Apr  9 01:05:20 2012
    Raid Level : raid6
  Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
    Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
  Super Offset : 8 sectors
Recovery Offset : 161546240 sectors
          State : active
    Device UUID : 8389f39f:cc7fa027:f10cf717:1d41d40b

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
    New Layout : left-symmetric

    Update Time : Sun Oct 14 15:20:06 2012
      Checksum : 19ef8090 - correct
        Events : 138274

        Layout : left-symmetric-6
    Chunk Size : 32K

  Device Role : Active device 4
  Array State : .AAAA ('A' == active, '.' == missing)
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
    Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
          Name : axiom.is:0  (local to host axiom.is )
  Creation Time : Mon Apr  9 01:05:20 2012
    Raid Level : raid6
  Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
    Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
  Super Offset : 8 sectors
          State : clean
    Device UUID : b2cec17f:e526b42e:9e69e46b:23be5163

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
    New Layout : left-symmetric

    Update Time : Sun Oct 14 15:20:06 2012
      Checksum : a29b468a - correct
        Events : 138274

        Layout : left-symmetric-6
    Chunk Size : 32K

  Device Role : Active device 1
  Array State : .AAAA ('A' == active, '.' == missing)
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
    Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
          Name : axiom.is:0  (local to host axiom.is )
  Creation Time : Mon Apr  9 01:05:20 2012
    Raid Level : raid6
  Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
    Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
  Super Offset : 8 sectors
          State : active
    Device UUID : 21c799cd:58be3156:6830865b:fa984134

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
    New Layout : left-symmetric

    Update Time : Sun Oct 14 15:20:06 2012
      Checksum : d882780e - correct
        Events : 138274

        Layout : left-symmetric-6
    Chunk Size : 32K

  Device Role : Active device 2
  Array State : .AAAA ('A' == active, '.' == missing)
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
    Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
          Name : axiom.is:0  (local to host axiom.is )
  Creation Time : Mon Apr  9 01:05:20 2012
    Raid Level : raid6
  Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
    Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
  Super Offset : 8 sectors
          State : active
    Device UUID : 8b043488:8379f327:5f00e0fe:6a1e0bee

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
    New Layout : left-symmetric

    Update Time : Sat Apr 28 22:57:36 2012
      Checksum : c122639f - correct
        Events : 138241

        Layout : left-symmetric-6
    Chunk Size : 32K

  Device Role : Active device 0
  Array State : AAAAA ('A' == active, '.' == missing)


hakon.gislason 05-05-2012 11:35 AM

Over 225 views and nobody can help me?
I'd really appreciate help in getting this array online again.

lithos 05-05-2012 12:56 PM

Hi,

I'm sorry to read that you're having trouble with your RAID, but I see you're using software RAID within Linux, which I don't know and don't use.

I would recommend that in the future you use a true hardware RAID controller, which works at the hardware level rather than in software (in Linux).
I don't intend this as a commercial ad or anything like it, just to point out what a server should be using for RAID.

I hope someone with mdadm experience will help you out.


good luck

arandall 09-15-2012 08:47 AM

Interesting that you have been experiencing issues similar to mine. Your post was a while ago, but perhaps this will help someone else.

One thing I noticed in your post, which prompted my reply, is that you are not partitioning your drives. Typically one creates a single primary partition on each RAID member drive with a partition type of 0xFD (Linux RAID autodetect) - option 't' in fdisk.
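
For example, something along these lines sets that up (sdX is a placeholder for a member drive, and repartitioning is of course destructive to whatever is already on it):

Code:

# create one primary partition spanning the disk and tag it as Linux RAID autodetect
# sdX is a placeholder - double-check the device name before running this
fdisk /dev/sdX
#   n -> new partition, primary, number 1, accept the default start and end
#   t -> change the partition type, enter "fd" (Linux RAID autodetect)
#   w -> write the table and exit
fdisk -l /dev/sdX    # verify the new partition and its type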

Now onto the failed drives.

In the last few months, on one of my set-ups where I do not use partitions on the drives, I have noticed that a disk sometimes changes from a block size of 4096 bytes to 512 bytes (you can see this with `blockdev --getbsz /dev/sd?`). When that happens, the number of blocks reported by `cat /proc/partitions` changes as well, and that is directly related to drives being marked as faulty in the array, as you would expect. Often a reboot, as you describe, would let me re-add the drive to the array, and it could then run for weeks before hitting the problem again.

E.g., this is with Seagate 2TB drives; notice that the #blocks differ:

Code:

# cat /proc/partitions # (extract)
major minor  #blocks  name

  8      32 1953513527 sdc
  8      33 1953512001 sdc1
  8      16 1953514584 sdb
  8      17 1953512001 sdb1

# blockdev --getbsz /dev/sdc
512
# blockdev --getbsz /dev/sdb
4096

I never got to the bottom of why the block size changed, but since it happened a number of times, I changed the set-up to use partitions as described above. With the partitions in place, the size of /dev/sd?1 remains the same regardless of the reported block size, and the RAID is happy.
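
With that layout, putting a member back after one of these spurious failures is just a matter of something like the following (the array and device names are placeholders):

Code:

# if the member is still listed as faulty, remove it first, then add its partition back
mdadm /dev/md0 --remove /dev/sdc1
mdadm /dev/md0 --add /dev/sdc1
cat /proc/mdstat    # watch the rebuild progress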

devdol 03-07-2019 04:23 AM

Appending the "--invalid-backup" option in addition to "--backup-file=..." seems to do the trick.

After rebooting a server that got stuck while reshaping (RAID5 to RAID6), a situation much like the OP's and still relevant, we got a somewhat terrifying error message:

Code:

mdadm --stop /dev/md1
mdadm --assemble --force /dev/md1 /dev/sd[abcde]4 --backup-file=/path/to/md1.bak
mdadm: Failed to restore critical section for reshape, sorry.

However, the same sequence of commands with "--invalid-backup" added:
Code:

mdadm --stop /dev/md1
mdadm --assemble --force /dev/md1 /dev/sd[abcde]4 --backup-file=/path/to/md1.bak --invalid-backup

lead to "mdadm: /dev/md1 has been started with 5 drives.

This behaviour was always reproducible for this RAID.

It took us a long time to find this solution, because it seemed pointless to specify a backup file while simultaneously declaring it invalid. So perhaps this note will help someone else. :o)
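
Once the array is back, something like the following confirms that the reshape has actually resumed (md1 matches our naming; adjust to your array):

Code:

cat /proc/mdstat            # should show the reshape progressing again
mdadm --detail /dev/md1     # check the array state, level and reshape position
# optionally raise the minimum reshape/resync speed (value in KiB/s)
echo 50000 > /proc/sys/dev/raid/speed_limit_min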

