...
hello world
l33tz h4x0r
akuma gouki
quest tribe
salad carot
simon zelda
...
which has about 30,000 pairs.
i would like to take a file with about 400,000 lines (each line is about 4,000 characters) and replace each occurrence of the word on the left with the word on the right. e.g.:
Code:
hello my name is simon, and i like to do drawings; simon says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chun-li akuma ken ryu sakura
third-line: choppin broccoli -- helloproject2501helloceltics#35hello123
you dont win friends with salad
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called quest - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul
should become:
Code:
world my name is zelda, and i like to do drawings; zelda says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chun-li gouki ken ryu sakura
third-line: choppin broccoli -- worldproject2501worldceltics#35world123
you dont win friends with carot
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called tribe - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul
i tried piping in a while loop using sed s/col1/col2/g, but each iteration takes about 40 seconds (30,000 iterations would take a few weeks to finish).
i also tried creating a c program that would determine if the match occurred on a certain record, then make the substitution at either byte offset 29 or bytes 310 and 637 of the array (but it took about the same amount of time) -- i'll post code when i get back to work.
is there anything that anyone could suggest (possibly a way to do all 30,000 substitutions in one 40-second iteration)?
thanks,
Sort or group the pairs alphabetically so you can jump to them very quickly rather than iterating through 30000 elements. Match them letter by letter, and exit as soon as no match is found.
When looking through the large file, check every word for a match and replace it if one is found; that way you only go through the file once.
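A minimal sketch of that single-pass idea, assuming a POSIX awk is available. Instead of sorting the pairs, it loads them into an awk associative array, which gives a direct lookup per word. The function name `replace_words` is made up for illustration. Note that it only replaces whole words (runs of letters and digits), so a string like helloproject2501hello is left alone.

```shell
# replace_words PAIRS_FILE < input > output
# Hypothetical helper: one pass over the input, one array lookup per word.
replace_words() {
    awk '
        # First file (the pairs): remember column 1 -> column 2.
        NR == FNR { map[$1] = $2; next }
        # Remaining input: copy separators unchanged and swap any word
        # that has an entry in the map.
        {
            out = ""
            rest = $0
            while (match(rest, /[A-Za-z0-9]+/)) {
                word = substr(rest, RSTART, RLENGTH)
                out = out substr(rest, 1, RSTART - 1) ((word in map) ? map[word] : word)
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    ' "$1" -
}
```

Usage would be something like `replace_words pairs.txt < big-file > big-file.new`, where both file names are placeholders.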
Last edited by H_TeXMeX_H; 02-02-2013 at 01:14 PM.
I would think tr would do the trick. Something like:
EDIT:
I just reread the original post. Don't use this. I thought you were trying something else. tr replaces characters, not strings. It won't work. But it would be nifty if it did. I shouldn't be allowed to post on my way to bed.
Code:
for line in $file
do
echo "$line" | tr "${column1[id]} ${column2[id]}"
done
should work for you. You'll need to store the respective column values in arrays and match some expressions, but I think it will be fairly speedy once those kinks are worked out.
I realize that code example was simplistic to the point of 'fucking stupid', but I'm just driving by.
Except the part about tr. That is a stroke of genius. Look into tr.
Last edited by jroggow; 02-03-2013 at 05:54 AM.
Reason: Bouncy trackpad
eerily similar. too bad I don't have gnu awk on aix
I don't think that solution relies on any GNU-specific awk features. On the other hand, I don't know if it would be fast enough either.
trs looks interesting. It appears to be unmaintained (the homepage listed in the README is down), but you can still get the source from the Debian package page.
@jroggow: I don't think you really understand what tr does...
... i would like to take a file with about 400,000 lines (each line is about 4,000 characters) and replace each occurrence of the word on the left with the word on the right. ...
You say "replace each occurrence of the word" yet your example has instances where you replaced each occurrence of the string. Which is correct? Since the files are large and execution time is an important consideration, this distinction is important.
...i tried piping in a while loop using sed s/col1/col2/g, but each iteration takes about 40 seconds (would take a few weeks to finish).
Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.
However, you might gain something by executing sed without a loop.
InFile1 ...
Code:
hello world
l33tz h4x0r
chunl akuma
quest tribe
salad carot
simon zelda
InFile2 ...
Code:
hello my name is simon, and i like to do drawings; simon says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chunli akuma ken ryu sakura
third-line: choppin broccoli -- helloproject2501helloceltics#35hello123
you dont win friends with salad
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called quest - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul
world my name is zelda, and i like to do drawings; zelda says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: akumai akuma ken ryu sakura
third-line: choppin broccoli -- worldproject2501worldceltics#35world123
you dont win friends with carot
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called tribe - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul
Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.
More improvement is possible by using the right algorithm; Aho-Corasick or Rabin-Karp seem like good choices. The Aho-Corasick page has links to some implementations, but it's going to involve more effort than the sed/awk solutions.
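As a rough illustration of the "one pass, many patterns" idea, here is a sketch in POSIX awk. To be clear, this is neither Aho-Corasick nor Rabin-Karp proper (there is no automaton and no rolling hash); it is a simpler stand-in that tries an array lookup at every offset for each key length that occurs in the pairs file. The function name `replace_strings` is made up. It replaces substrings, not just whole words, and replaced text is not rescanned. If one key is a prefix of another, which one wins is unspecified.

```shell
# replace_strings PAIRS_FILE < input > output
# Hypothetical helper: substring replacement in a single pass per line.
replace_strings() {
    awk '
        # First file (the pairs): store each pair and note the key lengths.
        NR == FNR {
            if (NF >= 2) { map[$1] = $2; lens[length($1)] = 1 }
            next
        }
        # Remaining input: at each offset, try every known key length.
        {
            out = ""
            n = length($0)
            i = 1
            while (i <= n) {
                hit = 0
                for (len in lens) {
                    if (i + len - 1 > n) continue   # key would run past end of line
                    key = substr($0, i, len)
                    if (key in map) {
                        out = out map[key]
                        i += len
                        hit = 1
                        break
                    }
                }
                # No key starts here: copy one character and move on.
                if (!hit) { out = out substr($0, i, 1); i++ }
            }
            print out
        }
    ' "$1" -
}
```

In the sample data every key is five characters long, so only one lookup per offset would be needed; with many different key lengths a real Aho-Corasick implementation should scale better.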
I can't quite figure out what trs does because all the comments and names in the source are Polish.
I like Daniel's idea and thought of suggesting something similar. However, I am not sure how the file operations are performed, i.e. does one line in the change file get applied to the whole main file, or are all 30k changes applied to the first line of the main file before moving on to the second?
On a side note for the OP: are we to assume that all lines in the change file are unique and that no future change will be thwarted by a prior one?
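Both questions can be checked mechanically before running anything expensive. The following is a small sketch, assuming a POSIX awk and a two-column change file; the function name `check_pairs` and the message wording are made up. It only catches exact matches, so a replacement that merely contains some other key as a substring would go unnoticed.

```shell
# check_pairs PAIRS_FILE
# Hypothetical helper: reports repeated left-hand strings and replacements
# that are themselves left-hand strings (so a later rule could undo or
# re-replace an earlier one when rules are applied one after another).
check_pairs() {
    awk '
        NF >= 2 { n++; line[n] = NR; left[n] = $1; right[n] = $2; count[$1]++ }
        END {
            for (i = 1; i <= n; i++) {
                if (count[left[i]] > 1)
                    print "line " line[i] ": duplicate key: " left[i]
                if (right[i] in count)
                    print "line " line[i] ": replacement is also a key: " right[i]
            }
        }
    ' "$1"
}
```

No output means neither problem was found.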
edit: here is some information about the file i'm dealing with:
Code:
schneidz-str-search.ksh "substring" test.tmp
9 : 4 70 2977 # line begins with 123
10 : 4 70 # line begins with 456
11 : 4 70 # line begins with 456
12 : 4 70 # line begins with 456
13 : 4 70 # line begins with 456
14 : 4 # line begins with 789
15 : 4 # line begins with 789
16 : 4 # line begins with 789
Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.
However, you might gain something by executing sed without a loop.
Daniel B. Martin
aix-sux:
this is the error i am getting with aix's version of sed:
Code:
sed -r 's|(^.*) (.*)|s/\1\/\2/g|' clm.tmp |sed -f - test.tmp
sed: Not a recognized flag: r
Usage: sed [-n] [-u] Script [File ...]
sed [-n] [-u] [-e Script] ... [-f Script_file] ... [File ...]
sed: 0602-420 Cannot open pattern file -.
Usage: sed [-n] [-u] Script [File ...]
sed [-n] [-u] [-e Script] ... [-f Script_file] ... [File ...]
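Going by the usage message above, this sed has no -r flag and does not read a script from standard input with -f -, but it does accept -f Script_file. A possible workaround, sketched here with only POSIX sed features: use basic regular expressions (escaped parentheses) and write the generated script to a temporary file. The function name `sed_replace` and the temp file name are made up, and the sketch assumes the strings contain no /, &, backslash, or regex metacharacters. Whether a given sed copes with a 30,000-command script is something to test, not something this sketch guarantees.

```shell
# sed_replace PAIRS_FILE INPUT_FILE
# Hypothetical helper: builds one sed script from the pairs, then runs it once.
sed_replace() {
    script=${TMPDIR:-/tmp}/pairs.$$.sed
    # Turn each "old new" line into the command s/old/new/g.
    # \( \) are basic-regex groups, so no -r flag is needed.
    sed 's|^\([^ ]*\) \(.*\)$|s/\1/\2/g|' "$1" > "$script" &&
    sed -f "$script" "$2"
    status=$?
    rm -f "$script"
    return $status
}
```

With the file names from the command above, that would be `sed_replace clm.tmp test.tmp > test.new`, where test.new is a placeholder for the output file.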