LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   mass substitutions (https://www.linuxquestions.org/questions/programming-9/mass-substitutions-4175448362/)

ntubski 02-14-2013 08:15 PM

Quote:

Originally Posted by ntubski (Post 4883803)
More improvement is possible by using the right algorithm, ...
I can't quite figure out what trs does because all the comments and names in the source are Polish.

I did some tests and it looks like trs isn't using a fast algorithm. trs is faster than sed, but only by a constant factor. For both methods, doubling input quadruples runtime.

Since it appears that there is no good solution for this, I think I'll try writing one using MultiFast: Aho-Corasick C Library.

Code:

~/tmp/replacing$ for size in 10 100 1000 2000 ; do ./replace.sh sed $size ; done
0.01 real        0.00 user        -        sed (10)
0.05 real        0.04 user        -        sed (100)
2.93 real        2.86 user        -        sed (1000)
10.67 real        10.57 user        -        sed (2000)
~/tmp/replacing$ for size in 10 100 1000 2000 4000 8000 ; do ./replace.sh trs $size ; done
0.00 real        0.00 user        -        trs (10)
0.01 real        0.00 user        -        trs (100)
0.24 real        0.23 user        -        trs (1000)
0.83 real        0.82 user        -        trs (2000)
3.15 real        3.12 user        -        trs (4000)
12.44 real        12.34 user        -        trs (8000)

Code:

#!/bin/bash
# replace.sh

method=$1
size=$2

TIMEFORMAT=$'%2R real\t%2U user\t-\t'"$method ($size) "

case "$method" in
    sed)
        time sed -r 's|(^.*) (.*)|s/\1\/\2/g|' "$size-replacements.txt" | \
            sed -f - "$size-unreplaced.txt" > /dev/null
        ;;

    trs)
        time trs -f "$size-replacements.txt" < "$size-unreplaced.txt" > /dev/null
        ;;

    *) echo "I don't know how to do $method" ; exit 1
        ;;
esac

Code:

#!/bin/sh

# Some test inputs: replace a bunch numbers to xxxxxxx. The number of
# replacement pairs is equal to the size of input, so a naive
# algorithm will be show quadratic running times.

makeit()
{
    seq -f'%09.0f xxxxxxxxx' $1 > "$1-replacements.txt"
    seq -f'%09.0f' $1 > "$1-unreplaced.txt"
}

rm *-replacements.txt *-unreplaced.txt

makeit 10
makeit 100
makeit 1000
makeit 2000
makeit 4000
makeit 8000


ntubski 02-16-2013 12:25 PM

fast-replace shows linear run time:
Code:

~/tmp/replacing$ for size in 1000 2000 4000 8000 16000 32000 ; do ./replace.sh fast-replace $size ; done
0.01 real        0.00 user        -        fast-replace (1000)
0.02 real        0.01 user        -        fast-replace (2000)
0.03 real        0.02 user        -        fast-replace (4000)
0.06 real        0.04 user        -        fast-replace (8000)
0.10 real        0.08 user        -        fast-replace (16000)
0.20 real        0.17 user        -        fast-replace (32000)

Added case to replace.sh:
Code:

    fast-replace)
        time ./fast-replace "$size-replacements.txt" < "$size-unreplaced.txt" > /dev/null
        ;;



All times are GMT -5. The time now is 01:50 AM.