[SOLVED] mass substitutions

schneidz · 02-02-2013, 01:03 PM

[aix]
i have a two column file like:

Code:

...
hello world
l33tz h4x0r
akuma gouki
quest tribe
salad carot
simon zelda
...

which has about 30,000 pairs.

i would like to take a file with about 400,000 lines (each line is about 4,000 characters) and replace each occurrence of the word on the left with the word on the right. e.g.:

Code:

hello my name is simon, and i like to do drawings; simon says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chun-li akuma ken ryu sakura
third-line: choppin broccoli -- helloproject2501helloceltics#35hello123
you dont win friends with salad
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called quest - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul

should become:

Code:

world my name is zelda, and i like to do drawings; zelda says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chun-li gouki ken ryu sakura
third-line: choppin broccoli -- worldproject2501worldceltics#35world123
you dont win friends with carot
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called tribe - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul

i tried piping in a while loop using sed s/col1/col2/g but each iteration takes about 40 seconds (would take a few weeks to finish).

i also tried creating a c program that would determine if the match occurred on a certain record then make the substitution at either byte offset 29 or byte 310 and 637 of the array (but it took about the same amount of time) -- i'll post code when i get back to work.

is there anything that anyone could suggest (possibly a way to do all 30,000 substitutions in one 40-second iteration)?
thanks,

H_TeXMeX_H · 02-02-2013, 01:12 PM

Some ideas:

Sort or group the pairs alphabetically so you can jump to them very quickly rather than iterating through 30000 elements. Match them letter by letter, and exit as soon as no match is found.

When looking through the large file, check every word for a match and replace it if one is found; that way you only go through the file once.
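
A minimal sketch of that single-pass idea, using an awk associative array in place of the sorted list (a hash lookup rather than an alphabetical jump). The function name replace_words and the decision to treat runs of letters and digits as "words" are assumptions made for illustration, not something from the original post. Under that assumption a key embedded in a longer token such as helloproject2501 is left alone, which differs from the expected output in the first post.

```shell
# replace_words PAIRS DATA  (hypothetical helper name)
# Assumes each PAIRS line is "old new" with no spaces inside either value.
replace_words() {
    awk '
        # First file: load the pairs into an associative array.
        NR == FNR { map[$1] = $2; next }
        # Second file: copy each line through, swapping whole words found in map.
        {
            out = ""
            rest = $0
            while (match(rest, /[A-Za-z0-9]+/)) {
                word = substr(rest, RSTART, RLENGTH)
                if (word in map) word = map[word]
                out = out substr(rest, 1, RSTART - 1) word
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    ' "$1" "$2"
}
```

It would be called as something like replace_words pairs.txt big.txt > big.new (file names are placeholders). Each data line is scanned once regardless of how many pairs there are, so the cost should grow with the size of the big file rather than with pairs times lines.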

danielbmartin · 02-02-2013, 02:30 PM

This problem is similar (maybe identical) to:
http://www.linuxquestions.org/questi...ce-4175432577/

Daniel B. Martin

firstfire · 02-02-2013, 02:59 PM

Hi.

There is a utility called trs (in the konwert Debian package). It can read replacement pairs from a file (-f option). Give it a try.

jroggow · 02-02-2013, 08:23 PM

I would think tr would do the trick. Something like:

EDIT:
I just reread the original post. Don't use this. I thought you were trying something else. tr replaces characters, not strings. It won't work. But it would be nifty if it did. I shouldn't be allowed to post on my way to bed.

Code:

for line in $file
do
    echo "$line" | tr "${column1[id]} ${column2[id]}"
done

should work for you. You'll need to store the respective column values in arrays and match some expressions, but I think it will be fairly speedy once those kinks are worked out.

I realize that code example was simplistic to the point of 'fucking stupid', but I'm just driving by.

Except the part about tr. That is a stroke of genius. Look into tr.

schneidz · 02-02-2013, 09:15 PM

Quote:

Originally Posted by danielbmartin

This problem is similar (maybe identical) to:
http://www.linuxquestions.org/questi...ce-4175432577/
Daniel B. Martin

eerily similar. too bad I don't have gnu awk on aix

ntubski · 02-02-2013, 10:27 PM

Quote:

Originally Posted by schneidz

eerily similar. too bad I don't have gnu awk on aix

I don't think that solution relies on any GNU-specific awk features. On the other hand, I don't know if it would be fast enough either.

trs looks interesting. Although it appears to be unmaintained (the homepage listed in the README is down), you can still get the source from the Debian package page.

@jroggow: I don't think you really understand what tr does...

danielbmartin · 02-03-2013, 11:47 AM

Quote:

Originally Posted by schneidz

... i would like to take a file with about 400,000 lines (each line is about 4,000 characters) and replace each occurrence of the word on the left with the word on the right. ...

You say "replace each occurrence of the word" yet your example has instances where you replaced each occurrence of the string. Which is correct? Since the files are large and execution time is an important consideration, this distinction is important.
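
To make the distinction concrete, here is a small sketch that applies a single pair both ways. The helper names replace_string and replace_word are invented for this illustration, and "word" is assumed to mean a run of letters and digits.

```shell
# Hypothetical helpers: read text on stdin, replace OLD ($1) with REPL ($2).

# Every occurrence of OLD, even inside a longer token (literal match, no regex).
replace_string() {
    awk -v old="$1" -v repl="$2" '{
        out = ""; rest = $0
        while ((i = index(rest, old)) > 0) {
            out = out substr(rest, 1, i - 1) repl
            rest = substr(rest, i + length(old))
        }
        print out rest
    }'
}

# Only tokens that are exactly OLD; longer tokens containing OLD are untouched.
replace_word() {
    awk -v old="$1" -v repl="$2" '{
        out = ""; rest = $0
        while (match(rest, /[A-Za-z0-9]+/)) {
            w = substr(rest, RSTART, RLENGTH)
            if (w == old) w = repl
            out = out substr(rest, 1, RSTART - 1) w
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

Fed the text "hello helloproject" with the pair hello/world, replace_string gives "world worldproject" while replace_word gives "world helloproject". The fourth line of the expected output in the first post matches the string behaviour.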

Daniel B. Martin

danielbmartin · 02-03-2013, 12:21 PM

Quote:

Originally Posted by schneidz

...i tried piping in a while loop using sed s/col1/col2/g but each iteration takes about 40 seconds (would take a few weeks to finish).

Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.

However, you might gain something by executing sed without a loop.

InFile1 ...

Code:

hello world
l33tz h4x0r
chunl akuma
quest tribe
salad carot
simon zelda

InFile2 ...

Code:

hello my name is simon, and i like to do drawings; simon says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chunli akuma ken ryu sakura
third-line: choppin broccoli -- helloproject2501helloceltics#35hello123
you dont win friends with salad
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called quest - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul

Code ...

Code:

 sed -r 's|(^.*) (.*)|s/\1\/\2/g|' $InFile1 \
|sed -f - $InFile2 > $OutFile1

OutFile1 ...

Code:

world my name is zelda, and i like to do drawings; zelda says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: akumai akuma ken ryu sakura
third-line: choppin broccoli -- worldproject2501worldceltics#35world123
you dont win friends with carot
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called tribe - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul

Daniel B. Martin

ntubski · 02-03-2013, 04:48 PM

Quote:

Originally Posted by danielbmartin

Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.

More improvement is possible by using the right algorithm, Aho-Corasick or Rabin-Karp seem like good choices. The Aho-Corasick page has links to some implementations but it's going to involve more effort than the sed/awk solutions.

I can't quite figure out what trs does because all the comments and names in the source are Polish.

grail · 02-04-2013, 04:25 AM

I like Daniel's idea and thought of suggesting something similar. However, I am not sure how the file operations are performed, i.e. is one line of the change file applied across the whole main file, or are all 30k changes applied to the first line before moving on to the second line of the main file?

On a side note for the OP: are we to assume that all lines in the change file are unique, and that no future change will be thwarted by a prior one?
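
For what it's worth, sed reads one input line, runs every command in the script against it in order, and only then reads the next line, so the whole script is applied to line one before line two is touched. The interaction between pairs can be seen in a small sketch; the third name, oni, is made up purely for the demonstration.

```shell
# chain_demo is a hypothetical name; it reads text on stdin.
chain_demo() {
    # Two commands, applied top to bottom to each input line in turn.
    sed -e 's/akuma/gouki/g' -e 's/gouki/oni/g'
}

printf 'akuma ryu\ngouki ken\n' | chain_demo
# prints:
# oni ryu
# oni ken
```

The first line ends up as oni rather than gouki because the second command sees the result of the first. If the right-hand side of one pair can appear as the left-hand side of a later pair, the order of the pairs file changes the result.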

schneidz · 02-04-2013, 08:09 AM

here's the code i promised:

Code:

#include <stdio.h>
#include <string.h>

#define KEYLEN 9   /* width of the old and new values in pairs.paste */

/* overwrite KEYLEN bytes at line+off with the new value if they equal the old one */
static int swap_at(char *line, size_t off, const char *pair)
{
 if (strlen(line) < off + KEYLEN || strncmp(pair, line + off, KEYLEN) != 0)
  return 0;
 memcpy(line + off, pair + KEYLEN + 1, KEYLEN);
 return 1;
}

int main(int argc, char *argv[])
{
 char xyz[6000], abc[50];
 FILE *fstream0, *fstream1, *fstream2;

 if (argc < 2 || (fstream2 = fopen("pairs.paste", "r")) == NULL)
  return 1;

 /* one full pass over the data file per pair: this is what makes it slow */
 while (fgets(abc, sizeof abc, fstream2) != NULL)
 {
  if (strlen(abc) < 2 * KEYLEN + 1)
   continue; /* skip malformed pair lines */
  fstream0 = fopen(argv[1], "r");
  fstream1 = fopen("xyz.tmp", "w");
  if (fstream0 == NULL || fstream1 == NULL)
   return 1;
  while (fgets(xyz, sizeof xyz, fstream0) != NULL)
  {
   /* offsets 69 and 2976 are only checked when offset 3 matched */
   if (swap_at(xyz, 3, abc))
   {
    swap_at(xyz, 69, abc);
    swap_at(xyz, 2976, abc);
   }
   fputs(xyz, fstream1);
  }
  fclose(fstream0); fclose(fstream1);
  rename("xyz.tmp", argv[1]);
 }
 fclose(fstream2);
 return 0;
}

@ grail: lines are probably unique.

edit: here is some information about the file i'm dealing with:

Code:

schneidz-str-search.ksh "substring" test.tmp
9       :  4  70  2977          # line begins with 123
10      :  4  70                # line begins with 456
11      :  4  70                # line begins with 456
12      :  4  70                # line begins with 456
13      :  4  70                # line begins with 456
14      :  4                    # line begins with 789
15      :  4                    # line begins with 789
16      :  4                    # line begins with 789

schneidz · 02-04-2013, 09:11 AM

Quote:

Originally Posted by danielbmartin

Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.

However, you might gain something by executing sed without a loop.

aix-sux:
this is the error i am getting with aix's version of sed:

Code:

sed -r 's|(^.*) (.*)|s/\1\/\2/g|' clm.tmp |sed -f - test.tmp
sed: Not a recognized flag: r
Usage:  sed [-n] [-u] Script [File ...]
        sed [-n] [-u] [-e Script] ... [-f Script_file] ... [File ...]
sed: 0602-420 Cannot open pattern file -.
Usage:  sed [-n] [-u] Script [File ...]
        sed [-n] [-u] [-e Script] ... [-f Script_file] ... [File ...]

danielbmartin · 02-04-2013, 09:48 AM

Quote:

Originally Posted by schneidz

... this is the error i am getting with aix's version of sed ...

Try this code variation ...

Code:

 sed 's|\(^.*\) \(.*\)|s/\1\/\2/g|' $InFile1 \
|sed -f - $InFile2 > $OutFile1

Daniel B. Martin

schneidz · 02-04-2013, 09:51 AM

Quote:

Originally Posted by danielbmartin

Try this code variation ...

Code:

 sed 's|\(^.*\) \(.*\)|s/\1\/\2/g|' $InFile1 \
|sed -f - $InFile2 > $OutFile1

Daniel B. Martin

thanks but it says the function s|\(^.*\) \(.*\)|s/\1\/\2/g| cannot be parsed
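
One possible workaround, offered as a sketch that has not been verified on AIX: since the earlier error shows AIX sed cannot read a script from "-", write the generated script to a temporary file, and keep the generator expression simpler by anchoring outside the group and using | as the delimiter of the generated commands. The function name apply_pairs and the temp file name are placeholders. It assumes the values in the pairs file contain no spaces, no | characters and no regex metacharacters.

```shell
# apply_pairs PAIRS DATA  (hypothetical helper name)
apply_pairs() {
    script=${TMPDIR:-/tmp}/pairs.$$.sed
    # Turn each "old new" line into the sed command  s|old|new|g
    sed 's/^\([^ ]*\) \(.*\)$/s|\1|\2|g/' "$1" > "$script"
    # Apply the whole generated script in a single pass over the data file.
    sed -f "$script" "$2"
    rm -f "$script"
}
```

Usage would be along the lines of apply_pairs clm.tmp test.tmp > out.tmp. With 30,000 generated commands sed still tries every pattern against every line, so this removes the outer loop and the repeated file rewrites but may well remain slow on the full data set.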