ntubski |
02-14-2013 08:15 PM |
Quote:
Originally Posted by ntubski
(Post 4883803)
More improvement is possible by using the right algorithm, ...
I can't quite figure out what trs does because all the comments and names in the source are Polish.
|
I did some tests and it looks like trs isn't using a fast algorithm. trs is faster than sed, but only by a constant factor. For both methods, doubling input quadruples runtime.
Since it appears that there is no good solution for this, I think I'll try writing one using MultiFast: Aho-Corasick C Library.
Code:
~/tmp/replacing$ for size in 10 100 1000 2000 ; do ./replace.sh sed $size ; done
0.01 real 0.00 user - sed (10)
0.05 real 0.04 user - sed (100)
2.93 real 2.86 user - sed (1000)
10.67 real 10.57 user - sed (2000)
~/tmp/replacing$ for size in 10 100 1000 2000 4000 8000 ; do ./replace.sh trs $size ; done
0.00 real 0.00 user - trs (10)
0.01 real 0.00 user - trs (100)
0.24 real 0.23 user - trs (1000)
0.83 real 0.82 user - trs (2000)
3.15 real 3.12 user - trs (4000)
12.44 real 12.34 user - trs (8000)
Code:
#!/bin/bash
# replace.sh
method=$1
size=$2
TIMEFORMAT=$'%2R real\t%2U user\t-\t'"$method ($size) "
case "$method" in
sed)
time sed -r 's|(^.*) (.*)|s/\1\/\2/g|' "$size-replacements.txt" | \
sed -f - "$size-unreplaced.txt" > /dev/null
;;
trs)
time trs -f "$size-replacements.txt" < "$size-unreplaced.txt" > /dev/null
;;
*) echo "I don't know how to do $method" ; exit 1
;;
esac
Code:
#!/bin/sh
# Some test inputs: replace a bunch numbers to xxxxxxx. The number of
# replacement pairs is equal to the size of input, so a naive
# algorithm will be show quadratic running times.
makeit()
{
seq -f'%09.0f xxxxxxxxx' $1 > "$1-replacements.txt"
seq -f'%09.0f' $1 > "$1-unreplaced.txt"
}
rm *-replacements.txt *-unreplaced.txt
makeit 10
makeit 100
makeit 1000
makeit 2000
makeit 4000
makeit 8000
|