LinuxQuestions.org


lazardo 03-18-2015 02:20 AM

raid1 recovery w bitmap + PAR2
 
Code:

recovery = 49.9% (724112960/1448225792) finish=1.0min speed=11321520K/sec
The above is from an ancient Shuttle SN21G5 (with the problematic NVIDIA MCP51 chipset) recovering from an as-yet-unidentified RAID member dropout. The system started life as a roll-your-own media server nearly 10 years ago and holds 872GB of ripped DVD-quality video, audio CD transcodes and a large image gallery.

And yes, mdadm is showing an effective recovery rate of 11.3GB/s thanks to the internal write-intent bitmap. The whole process of recovering the 1.4TB took about 100 seconds on 64-bit Slackware 14.0, kernel 3.2.45, RAID10f2.
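
For anyone who wants to sanity-check a progress line like the one above, here is a minimal sketch that estimates the seconds remaining from the block counts and the reported speed. The function name mdstat_eta is made up for illustration, and it assumes the line carries a "(done/total)" pair in 1K blocks plus a "speed=...K/sec" field as shown.

```shell
#!/bin/bash
# Sketch: estimate seconds remaining from a /proc/mdstat progress line.
# mdstat_eta is a hypothetical helper name, not part of mdadm.
# Reads lines on stdin; prints one integer per line that has both fields.
mdstat_eta() {
    awk '
    match($0, /\([0-9]+\/[0-9]+\)/) {
        # strip the parentheses and split "done/total"
        split(substr($0, RSTART + 1, RLENGTH - 2), blk, "/")
        if (match($0, /speed=[0-9]+K\/sec/)) {
            # drop the leading "speed=" (6 chars) and trailing "K/sec" (5 chars)
            speed = substr($0, RSTART + 6, RLENGTH - 11) + 0
            if (speed > 0)
                printf "%.0f\n", (blk[2] - blk[1]) / speed
        }
    }'
}
```

Feeding it the line above, e.g. `grep recovery /proc/mdstat | mdstat_eta`, gives 64 seconds, which agrees with the reported finish=1.0min.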

After many hours and much angst looking at ZoL/btrfs current buglists I could not pull the trigger on conversion and so stuck with ext4, added the bitmaps and went with PAR2 for bitrot protection.
  • At 2% coverage, the tradeoff is about 2% disk space, e.g., American_Beauty_1999.MP4 is 1967MB and its par2 files total 41MB. And at an 11GB/s recovery rate I couldn't care less if bitmapped writes are a bit slower.
  • The source VOBs are on a separate machine, also with 2% PAR2 recovery files. Note that VOBs compress by about 2.5%, so good protection at net zero disk utilization (pigz and a few cores help too).
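
The overhead figure in the first bullet is easy to check on your own files. Below is a minimal sketch that adds up the recovery files and prints them as a percentage of the source size. The function name par2_overhead is hypothetical, and it only measures file sizes; it does not call par2.

```shell
#!/bin/bash
# Sketch: recovery-file size as a percentage of the source file size.
# par2_overhead is a hypothetical helper name.
# usage: par2_overhead SOURCE RECOVERY_FILE [RECOVERY_FILE ...]
par2_overhead() {
    local src=$1 src_bytes par_bytes=0 f
    shift
    src_bytes=$(wc -c < "$src") || return 1
    [ "$src_bytes" -gt 0 ] || return 1          # avoid dividing by zero
    for f in "$@"; do
        par_bytes=$(( par_bytes + $(wc -c < "$f") ))
    done
    awk -v p="$par_bytes" -v s="$src_bytes" 'BEGIN { printf "%.1f\n", 100 * p / s }'
}
```

With the sizes quoted above (41MB of par2 files for a 1967MB source) the same arithmetic comes to about 2.1%.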

Here's the quick script used to create checksums. Run a small test and see if it meets your needs, or even works. It has not been cleaned up.

I used the parallel par2 (http://slackbuilds.org/repository/14...r2cmdline-tbb/), which is why the simultaneous job management is so rudimentary.

Cheers,

[Update] Root cause was a marginal SATA cable; old logs show over a year of sporadic SATA resets. The above recovery was a simple hot-add of the same partition back into the array. A subsequent scrub took 7.07 hours on an otherwise unused system, so the bitmap in this case effectively reduced rebuild time by a factor of about 250 (100 vs 25.5K seconds).
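
For reference, the steps mentioned in the update can be sketched roughly as follows. This is not a transcript of what was run: /dev/md0 and /dev/sdb1 are placeholder names, the helper functions are made up for illustration, and by default the sketch only prints the commands (DRY_RUN=1) rather than executing them.

```shell
#!/bin/bash
# Sketch only: MD and PART are hypothetical device names.
MD=${MD:-/dev/md0}
PART=${PART:-/dev/sdb1}
DRY_RUN=${DRY_RUN:-1}        # 1 = print the commands instead of running them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Add an internal write-intent bitmap to an existing array.
add_bitmap() {
    run mdadm --grow --bitmap=internal "$MD"
}

# Return a dropped member; with a bitmap only the regions written
# while it was absent need to be resynced.
readd_member() {
    run mdadm "$MD" --re-add "$PART"
}

# Ask md for a full consistency check (scrub) of the array.
start_scrub() {
    local node=/sys/block/${MD##*/}/md/sync_action
    if [ "$DRY_RUN" = 1 ]; then
        echo "echo check > $node"
    else
        echo check > "$node"
    fi
}
```

Run the functions with DRY_RUN=1 first and check the device names before trying anything as root with DRY_RUN=0.
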
Code:

#!/bin/bash

PARLOC=/tmp/PAR2                # par2 file repository
MAXJOBS=3                        # more disk I/O than horsepower
PERCENT=2                        # par2 redundancy, 2% s/b bitrot safe

################################################################

MAXMEM=2048                        # max per instance par2 mem
GAP=0.2                                # seconds sleep between starts

# CPU count and total memory (kB / 1000, so roughly MB)
NUMCPU=$( awk '/^processor/{ ++NUMCPU }; END{print NUMCPU}' /proc/cpuinfo )
RAWMEM=$( awk '/^MemTotal:/{ print int($2/1000); exit }' /proc/meminfo )
RAWMEM=$(( RAWMEM / MAXJOBS / NUMCPU ))

# round the per-job share down to a power of two, capped at MAXMEM
for ((i=2; i<=MAXMEM; i*=2)); do
        if [ "$RAWMEM" -lt "$i" ]; then
                PAR2MEM=$(( i / 2 ))
                break
        fi
done
PAR2MEM=${PAR2MEM:-$MAXMEM}

################################################################

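# loop so that -f and -p can be combined
while [ $# -gt 0 ]; do
        case "$1" in
        -h) echo "use: crc [-f] [-p dir] file [ file ... ] - creates par2 recovery set"; exit
                ;;
        -f|-force) FORCE="1"; shift
                ;;
        -p) shift; PARLOC="$1"; shift
                ;;
        *) break
                ;;
        esac
done

FORCE=${FORCE:-0}                # parsed, but not used further in this script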

# keep recovery sets in a directory whose path contains PAR2
[ "$PARLOC" == "${PARLOC%%PAR2*}" ] && PARLOC=$PARLOC/PAR2
[ ! -d "$PARLOC" ] && mkdir -pv "$PARLOC"

echo "start $(date)"
for i in "$@"; do
        (( CNT++ ))
        j=$( basename "$i" )

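        # wait for a free job slot; each running job holds a par2.<name> directory
        while true; do
                sleep $GAP
                JOBS=( "$PARLOC"/par2.* )
                # an unmatched glob stays as the literal pattern,
                # so the count is never less than 1
                if [ ${#JOBS[*]} -lt $MAXJOBS ]; then
                        mkdir "$PARLOC/par2.$j"
                        break
                fi
        done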

        [ $(( CNT % 5 )) -eq 0 ] && echo ""
        echo -n "$i,"

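        (
        ionice -c 3 nice par2 create -r"$PERCENT" -m"$PAR2MEM" "$PARLOC/$j.par2" "$i" > /dev/null 2>&1

        if [ $? -ne 0 ]; then
                logger -st "$0" "par2 create failed for $i"
        else
                md5sum "$i" "$PARLOC/$j"*.par2 > "$PARLOC/$j.md5"
                MD5=$( head -1 "$PARLOC/$j.md5" )

                # bundle checksums and recovery files; the nested subshell
                # keeps the cd from affecting the rmdir below
                ( cd "$PARLOC" &&
                  tar cpf - "$j".*{md5,par2} > "$j.${MD5:0:5}.par2.tar" &&
                  rm "$j".*{md5,par2} )
        fi

        # release the job slot even when par2 failed
        rmdir "$PARLOC/par2.$j"
        ) &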
done

wait                                # let the background par2 jobs finish first
echo -e "\nfinish $(date)"
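
The memory sizing near the top of the script is the least obvious part, so here is that step pulled out as a standalone sketch: it rounds the per-job share down to a power of two and caps it. The function name par2_mem is made up for illustration, and it matches the script's loop only when the cap itself is a power of two (as with MAXMEM=2048).

```shell
#!/bin/bash
# Sketch: largest power of two <= share_mb, capped at max_mb, never below 1.
# par2_mem is a hypothetical helper name.
# usage: par2_mem SHARE_MB MAX_MB
par2_mem() {
    local share_mb=$1 max_mb=$2 mem=1
    while [ $(( mem * 2 )) -le "$share_mb" ] && [ $(( mem * 2 )) -le "$max_mb" ]; do
        mem=$(( mem * 2 ))
    done
    echo "$mem"
}
```

For example, roughly 8000MB of RAM shared between 3 jobs on 4 cores gives a share of 666, so `par2_mem 666 2048` prints 512.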


unSpawn 04-12-2015 04:47 AM

Thanks for posting your update, off of the 0-reply list now.


//NTLB

lazardo 02-15-2016 07:45 PM

Quote:

Originally Posted by lazardo (Post 5333880)
...
After many hours and much angst looking at ZoL/btrfs current buglists I could not pull the trigger on conversion and so stuck with ext4, added the bitmaps and went with PAR2 for bitrot protection.

End of scene:

After full recovery I went ahead with btrfs on one disk and duplicated files from ext4. After almost a year, 2 kernel updates and 1 btrfs-progs update, there were zero failures, corruption or other incidents on either disk. However, it was write performance that finally triggered replacing btrfs with ext4+PAR2:

Streaming writes from ext4 -> btrfs averaged just under 49MB/s while ext4 -> ext4 averaged 111MB/s even during ext4 lazy initialization.

Cheers,


All times are GMT -5.