LinuxQuestions.org
Old 02-02-2013, 01:03 PM   #1
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,095

mass substitutions


[aix]
i have a two column file like:
Code:
...
hello world
l33tz h4x0r
akuma gouki
quest tribe
salad carot
simon zelda
...
which has about 30,000 pairs.

i would like to take a file with about 400,000 lines (each line is about 4,000 characters) and replace each occurrence of the word on the left with the word on the right. e.g.:
Code:
hello my name is simon, and i like to do drawings; simon says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chun-li akuma ken ryu sakura
third-line: choppin broccoli -- helloproject2501helloceltics#35hello123
you dont win friends with salad
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called quest - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul
should become:
Code:
world my name is zelda, and i like to do drawings; zelda says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chun-li gouki ken ryu sakura
third-line: choppin broccoli -- worldproject2501worldceltics#35world123
you dont win friends with carot
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called tribe - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul
i tried piping in a while loop using sed s/col1/col2/g but each iteration takes about 40 seconds (would take a few weeks to finish).

i also tried creating a c program that would determine if the match occurred on a certain record and then make the substitution at either byte offset 29 or bytes 310 and 637 of the array (but it took about the same amount of time) -- i'll post code when i get back to work.

is there anything that anyone could suggest (possibly a way to do all 30,000 substitutions in one 40-second pass)?
thanks,

Last edited by schneidz; 02-03-2013 at 08:20 AM.
 
Old 02-02-2013, 01:12 PM   #2
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Some ideas:

Sort or group the pairs alphabetically so you can jump to them very quickly rather than iterating through 30000 elements. Match them letter by letter, and exit as soon as no match is found.

When looking through the large file, check every word for a match and replace it if one is found; that way you only go through the file once.
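A minimal sketch of the single-pass idea, assuming whole-word replacement on space-separated words and a POSIX awk. The function name `replace_words` and the file names are made up for illustration. The pairs are loaded once into an associative array, so each word in the big file costs one lookup instead of a walk through 30,000 pairs.

```shell
# replace_words PAIRS_FILE < input > output
# Assumption: PAIRS_FILE has two space-separated columns (old new).
replace_words() {
    awk '
        NR == FNR { map[$1] = $2; next }   # first file: load the pairs
        {
            for (i = 1; i <= NF; i++)       # data: one array lookup per word
                if ($i in map) $i = map[$i]
            print
        }
    ' "$1" -
}

printf 'hello world\nsimon zelda\n' > pairs.txt
echo 'hello my name is simon' | replace_words pairs.txt
# prints: world my name is zelda
```

This only touches whole space-delimited words, so `simon,` or `helloproject2501` would be left alone, and any line that gets changed has its runs of blanks collapsed to single spaces when awk rebuilds it.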

Last edited by H_TeXMeX_H; 02-02-2013 at 01:14 PM.
 
Old 02-02-2013, 02:30 PM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,136

Previous solution?

This problem is similar (maybe identical) to:
http://www.linuxquestions.org/questi...ce-4175432577/

Daniel B. Martin
 
Old 02-02-2013, 02:59 PM   #4
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Hi.

There is a utility called trs (in the konwert Debian package). It can read replacement pairs from a file (-f option). Give it a try.
 
Old 02-02-2013, 08:23 PM   #5
jroggow
Member
 
Registered: Mar 2006
Distribution: Slackware
Posts: 33

I would think tr would do the trick. Something like:

EDIT:
I just reread the original post. Don't use this. I thought you were trying something else. tr replaces characters, not strings. It won't work . But it would be nifty if it did. I shouldn't be allowed to post on my way to bed.

Code:
for line in $file
do
    echo "$line" | tr "${column1[id]} ${column2[id]}"
done
should work for you. You'll need to store the respective column values in arrays and match some expressions, but I think it will be fairly speedy once those kinks are worked out.

I realize that code example was simplistic to the point of 'fucking stupid', but I'm just driving by.

Except the part about tr. That is a stroke of genius. Look into tr.

Last edited by jroggow; 02-03-2013 at 05:54 AM. Reason: Bouncy trackpad
 
Old 02-02-2013, 09:15 PM   #6
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,095

Original Poster
Quote:
Originally Posted by danielbmartin View Post
This problem is similar (maybe identical) to:
http://www.linuxquestions.org/questi...ce-4175432577/
Daniel B. Martin
eerily similar. too bad I don't have gnu awk on aix
 
Old 02-02-2013, 10:27 PM   #7
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,495

Quote:
Originally Posted by schneidz View Post
eerily similar. too bad I don't have gnu awk on aix
I don't think that solution relies on any GNU-specific awk features. On the other hand, I don't know if it would be fast enough either.

trs looks interesting. Although it appears to be unmaintained (the homepage listed in the README is down), you can still get the source from the Debian package page.


@jroggow: I don't think you really understand what tr does...
 
Old 02-03-2013, 11:47 AM   #8
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,136

Quote:
Originally Posted by schneidz View Post
... i would like to take a file with about 400,000 lines (each line is about 4,000 characters) and replace each occurrence of the word on the left with the word on the right. ...
You say "replace each occurrence of the word" yet your example has instances where you replaced each occurrence of the string. Which is correct? Since the files are large and execution time is an important consideration, this distinction is important.
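To make the distinction concrete, here is a small sketch that applies one pair both ways. The function names `replace_substr` and `replace_word` are hypothetical, and both assume a POSIX awk; the search text is treated literally, not as a regular expression.

```shell
# replace_substr OLD NEW : replace every literal occurrence, even inside words
replace_substr() {
    awk -v old="$1" -v new="$2" '
        {
            out = ""
            while ((p = index($0, old)) > 0) {
                out = out substr($0, 1, p - 1) new
                $0 = substr($0, p + length(old))
            }
            print out $0
        }'
}

# replace_word OLD NEW : replace only whole space-delimited words
replace_word() {
    awk -v old="$1" -v new="$2" '
        {
            for (i = 1; i <= NF; i++)
                if (($i "") == old) $i = new   # "" forces a string comparison
            print
        }'
}

echo 'hello helloproject2501 hello123' | replace_substr hello world
# prints: world worldproject2501 world123
echo 'hello helloproject2501 hello123' | replace_word hello world
# prints: world helloproject2501 hello123
```

The original example (`helloproject2501` becoming `worldproject2501`) matches the first behaviour, which is the more expensive one because every position in the line is a potential match.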

Daniel B. Martin
 
Old 02-03-2013, 12:21 PM   #9
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,136

Quote:
Originally Posted by schneidz View Post
...i tried piping in a while loop using sed s/col1/col2/g but each iteration takes about 40 seconds (would take a few weeks to finish).
Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.

However, you might gain something by executing sed without a loop.

InFile1 ...
Code:
hello world
l33tz h4x0r
chunl akuma
quest tribe
salad carot
simon zelda
InFile2 ...
Code:
hello my name is simon, and i like to do drawings; simon says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chunli akuma ken ryu sakura
third-line: choppin broccoli -- helloproject2501helloceltics#35hello123
you dont win friends with salad
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called quest - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul
Code ...
Code:
 sed -r 's|(^.*) (.*)|s/\1\/\2/g|' $InFile1 \
|sed -f - $InFile2 > $OutFile1
OutFile1 ...
Code:
world my name is zelda, and i like to do drawings; zelda says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: akumai akuma ken ryu sakura
third-line: choppin broccoli -- worldproject2501worldceltics#35world123
you dont win friends with carot
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called tribe - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul
Daniel B. Martin
 
Old 02-03-2013, 04:48 PM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,495

Quote:
Originally Posted by danielbmartin View Post
Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.
More improvement is possible by using the right algorithm; Aho-Corasick or Rabin-Karp seem like good choices. The Aho-Corasick page has links to some implementations, but it's going to involve more effort than the sed/awk solutions.
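Short of a real Aho-Corasick implementation, a single left-to-right scan with a hash lookup at each position also handles substring matches in one pass over the data. The sketch below is that simpler technique, not Aho-Corasick or Rabin-Karp themselves. It assumes a POSIX awk and a two-column pairs file; the name `scan_replace` is made up.

```shell
# scan_replace PAIRS_FILE < data > out
# Assumption: PAIRS_FILE has two space-separated columns (old new).
scan_replace() {
    awk '
        NR == FNR {                            # load pairs, note distinct key lengths
            if (NF >= 2) {
                map[$1] = $2
                len = length($1)
                if (!(len in seen)) { seen[len] = 1; lens[++nl] = len }
            }
            next
        }
        {
            out = ""; i = 1; n = length($0)
            while (i <= n) {
                hit = 0
                for (k = 1; k <= nl; k++) {    # try each key length at position i
                    key = substr($0, i, lens[k])
                    if (length(key) == lens[k] && (key in map)) {
                        out = out map[key]; i += lens[k]; hit = 1
                        break
                    }
                }
                if (!hit) { out = out substr($0, i, 1); i++ }
            }
            print out
        }
    ' "$1" -
}

printf 'hello world\nsimon zelda\n' > pairs.txt
echo 'helloproject2501helloceltics#35hello123' | scan_replace pairs.txt
# prints: worldproject2501worldceltics#35world123
```

The work per position depends on the number of distinct key lengths rather than the number of pairs, which is one when all keys are five characters as in the sample. Replaced text is never rescanned, so unlike a chain of sed commands one pair cannot feed into another. Whether awk is fast enough at this volume is untested here.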

I can't quite figure out what trs does because all the comments and names in the source are Polish.
 
Old 02-04-2013, 04:25 AM   #11
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,564

I like Daniel's idea and thought of suggesting something similar; however, I am not sure how the file operations are performed, i.e. is one line of the change file applied to the whole main file, or are all 30k changes applied to the first line of the main file before moving on to the second?

On a side note for the OP, are we to assume that all lines in the change file are unique and that no future change will be thwarted by a prior one?
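On the first question: sed reads one input line, runs every command of the script on it in order, prints the result, and only then moves to the next line. That also means a later command sees the output of an earlier one, which is exactly the "thwarted by a prior one" case. A tiny illustration with two made-up pairs (a -> b, b -> c); the function name `chain_demo` is hypothetical:

```shell
# Two chained substitutions applied to every input line, in script order.
chain_demo() {
    sed -e 's/a/b/g' -e 's/b/c/g'
}

printf 'a\nb\n' | chain_demo
# prints "c" twice: the "a" line becomes "b" and is then rewritten to "c"
```

If the pairs file contains such chains, the order of its lines changes the result.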
 
Old 02-04-2013, 08:09 AM   #12
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,095

Original Poster
here's the code i promised:
Code:
#include <stdio.h>
#include <string.h>

#define KEY_LEN 9   /* width of the search key in pairs.paste */
#define VAL_OFF 10  /* replacement starts after the key and one separator */

/* overwrite KEY_LEN bytes at rec+off with the replacement if the key matches there */
static int swap_at(char *rec, size_t off, const char *pair)
{
 if(strlen(rec) < off + KEY_LEN || strncmp(pair, rec + off, KEY_LEN) != 0)
  return 0;
 memcpy(rec + off, pair + VAL_OFF, KEY_LEN);
 return 1;
}

int main(int argc, char *argv[])
{
 char xyz[6000], abc[50];
 FILE *fstream0, *fstream1, *fstream2;

 if(argc < 2 || (fstream2 = fopen("pairs.paste", "r")) == NULL)
  return 1;

 /* one full pass over the data file per pair: this is why it is slow */
 while(fgets(abc, sizeof abc, fstream2) != NULL)
 {
  if(strlen(abc) < VAL_OFF + KEY_LEN)
   continue; /* skip malformed pair lines */
  fstream0 = fopen(argv[1], "r");
  fstream1 = fopen("xyz.tmp", "w");
  if(fstream0 == NULL || fstream1 == NULL)
   return 1;
  while(fgets(xyz, sizeof xyz, fstream0) != NULL)
  {
   /* offsets 69 and 2976 are only checked when offset 3 matched */
   if(swap_at(xyz, 3, abc))
   {
    swap_at(xyz, 69, abc);
    swap_at(xyz, 2976, abc);
   }
   fputs(xyz, fstream1);
  }
  fclose(fstream0); fclose(fstream1);
  rename("xyz.tmp", argv[1]);
 }
 fclose(fstream2);
 return 0;
}

@ grail: lines are probably unique.



edit: here is some information about the file i'm dealing with:
Code:
schneidz-str-search.ksh "substring" test.tmp
9       :  4  70  2977          # line begins with 123
10      :  4  70                # line begins with 456
11      :  4  70                # line begins with 456
12      :  4  70                # line begins with 456
13      :  4  70                # line begins with 456
14      :  4                    # line begins with 789
15      :  4                    # line begins with 789
16      :  4                    # line begins with 789
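Given that the keys sit at fixed columns, a single pass with an array lookup at each column avoids rereading the data file once per pair. This is only a sketch under assumptions read off the post above: keys and replacements are both exactly 9 characters separated by one character in the pairs file, and keys can only appear at 1-based columns 4, 70 and 2977. The name `fixed_swap` is hypothetical. Unlike the C program, it checks each column independently.

```shell
# fixed_swap PAIRS_FILE < data > out
fixed_swap() {
    awk '
        BEGIN { n = split("4 70 2977", col, " ") }   # assumed key columns
        NR == FNR {                                   # pairs: 9-char key, 9-char value
            map[substr($0, 1, 9)] = substr($0, 11, 9)
            next
        }
        {
            for (k = 1; k <= n; k++) {
                c = col[k]
                key = substr($0, c, 9)
                if (length(key) == 9 && (key in map))
                    $0 = substr($0, 1, c - 1) map[key] substr($0, c + 9)
            }
            print
        }
    ' "$1" -
}

printf 'AAAAAAAAA BBBBBBBBB\n' > pairs.paste.sample
echo '123AAAAAAAAAxyz' | fixed_swap pairs.paste.sample
# prints: 123BBBBBBBBBxyz
```

With 4,000-character lines it is worth checking that the local awk accepts records that long before relying on this.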

Last edited by schneidz; 02-04-2013 at 08:42 AM.
 
Old 02-04-2013, 09:11 AM   #13
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,095

Original Poster
Quote:
Originally Posted by danielbmartin View Post
Considering the size of the input files, this transformation is a big job. Your wish for a major improvement in execution time may be unrealistic.

However, you might gain something by executing sed without a loop.

Code ...
Code:
 sed -r 's|(^.*) (.*)|s/\1\/\2/g|' $InFile1 \
|sed -f - $InFile2 > $OutFile1
Daniel B. Martin
aix-sux:
this is the error i am getting with aix's version of sed:
Code:
sed -r 's|(^.*) (.*)|s/\1\/\2/g|' clm.tmp |sed -f - test.tmp
sed: Not a recognized flag: r
Usage:  sed [-n] [-u] Script [File ...]
        sed [-n] [-u] [-e Script] ... [-f Script_file] ... [File ...]
sed: 0602-420 Cannot open pattern file -.
Usage:  sed [-n] [-u] Script [File ...]
        sed [-n] [-u] [-e Script] ... [-f Script_file] ... [File ...]
 
Old 02-04-2013, 09:48 AM   #14
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,136

Quote:
Originally Posted by schneidz View Post
... this is the error i am getting with aix's version of sed ...
Try this code variation ...
Code:
 sed 's|\(^.*\) \(.*\)|s/\1\/\2/g|' $InFile1 \
|sed -f - $InFile2 > $OutFile1
Daniel B. Martin
 
Old 02-04-2013, 09:51 AM   #15
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,095

Original Poster
Quote:
Originally Posted by danielbmartin View Post
Try this code variation ...
Code:
 sed 's|\(^.*\) \(.*\)|s/\1\/\2/g|' $InFile1 \
|sed -f - $InFile2 > $OutFile1
Daniel B. Martin
thanks but it says the function s|\(^.*\) \(.*\)|s/\1\/\2/g| cannot be parsed
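The thread does not show which part of that expression AIX sed refuses to parse, so the sketch below sidesteps it: awk builds the substitution script, and the script goes into a temporary file because the earlier error shows AIX sed treating `-f -` as a file literally named `-`. The function names and the temp-file path are made up, and it assumes neither column contains `/`, `&`, a backslash or regex metacharacters.

```shell
# make_sed_script PAIRS_FILE : print one s/old/new/g command per pair
make_sed_script() {
    awk 'NF == 2 { printf "s/%s/%s/g\n", $1, $2 }' "$1"
}

# apply_pairs PAIRS_FILE DATA_FILE > out
apply_pairs() {
    script=${TMPDIR:-/tmp}/pairs.$$.sed
    make_sed_script "$1" > "$script" &&
    sed -f "$script" "$2"
    rc=$?
    rm -f "$script"
    return $rc
}
```

Usage would be `apply_pairs clm.tmp test.tmp > out.tmp`. This reads the data file once instead of once per pair, but sed still tries all 30,000 substitutions on every line, so it may remain slow at this volume.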
 
  

