LinuxQuestions.org - [SOLVED] mass substitutions


Quote:

Originally Posted by schneidz (Post 4884178)

thanks but it says the function s|\(^.*\) \(.*\)|s/\1\/\2/g| cannot be parsed

Okay, let's try again by breaking that sed into smaller pieces, hoping they will be "digestible" by aix.

Code:

 sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
|sed -f - $InFile2 > $OutFile3

Daniel B. Martin

Quote:

Originally Posted by danielbmartin (Post 4884191)

Okay, let's try again by breaking that sed into smaller pieces, hoping they will be "digestible" by aix.

Code:

 sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
|sed -f - $InFile2 > $OutFile3

Daniel B. Martin

once again thanx, but according to the previous error it seems like aix sed cant read from stdin: sed: 0602-420 Cannot open pattern file -.

Quote:

Originally Posted by schneidz (Post 4884194)

once again thanx, but according to the previous error it seems like aix sed cant read from stdin: sed: 0602-420 Cannot open pattern file -.

Not giving up yet! This version uses an intermediate file.

Code:

 sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
> $Work1

sed -f $Work1 $InFile2 > $OutFile4
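To see what the three stages do to a single change pair, here is a minimal sketch. The function name `to_sed_command` and the sample pair are only illustrative; the stages themselves are the same `sed`, `tr`, `sed` sequence as above.

```shell
#!/bin/sh
# Sketch: run one change pair through the same three stages.
# to_sed_command is a made-up helper name, not part of the thread's code.
to_sed_command() {
  printf '%s\n' "$1" \
  | sed 's/^/s\//'   \
  | tr " " "/"       \
  | sed 's/$/\/g/'
}

to_sed_command "hello world"   # prints: s/hello/world/g
```

Each line of the work file therefore becomes one complete `s/old/new/g` command that `sed -f` can read.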

Daniel B. Martin

Quote:

Originally Posted by danielbmartin (Post 4884196)

Not giving up yet! This version uses an intermediate file.

Code:

 sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
> $Work1

sed -f $Work1 $InFile2 > $OutFile4

Daniel B. Martin

thanx, i tried it with a 22 line infile1 and a 7 line infile2 and it seems to work well.
now i will time it using the large datasets and see what happens.

thanks alot (even if unsuccessful, at least i learned a bit more about sed).

Quote:

Originally Posted by schneidz (Post 4884201)

thanx, i tried it with a 22 line infile1 and a 7 line infile2 and it seems to work well.
now i will time it using the large datasets and see what happens.

Suggestion: test timidly. Start with a full-size InFile1 and an InFile2 which is a 10% subset of the real thing. Then 20%, then 30%. It will be instructive if the execution time increases linearly.
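One way to run such a stepped test is sketched below. It is only an illustration: the file names (`changes.sed`, `dataset.txt`), the scratch directory and the tiny demo data are assumptions, and it relies on `date +%s` being available (GNU and BSD `date` have it) for coarse wall-clock timing.

```shell
#!/bin/sh
# Sketch: time "sed -f" against growing subsets of the source file.
# All file names and the demo data below are hypothetical placeholders.
work=${TMPDIR:-/tmp}/subset_timing.$$
mkdir -p "$work" && cd "$work" || exit 1

# Demo inputs so the sketch runs on its own; use the real files instead.
printf 's/hello/world/g\n' > changes.sed
awk 'BEGIN { for (i = 1; i <= 50; i++) print i, "hello" }' > dataset.txt

total=$(wc -l < dataset.txt)
for pct in 10 20 30
do
  lines=$(( total * pct / 100 ))
  head -n "$lines" dataset.txt > subset.txt
  begin=$(date +%s)
  sed -f changes.sed subset.txt > subset.out
  finish=$(date +%s)
  echo "$pct% ($lines lines): $(( finish - begin )) seconds"
done
```

If the reported seconds roughly double and triple from step to step, the run time is growing linearly with the size of the source file.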

Daniel B. Martin

If you would like to keep the initial changes all sed you could try:

Code:

sed 's/\(^\|$\| \)/\//g;s/^/s/' infile > workfile

Quote:

Originally Posted by danielbmartin (Post 4884211)

it takes about 2 minutes to cross-correlate a list of 10 substitutions against the large file.

however i get an error like:

Code:

time sed -f sed.f dataset.txt > dataset.sub
sed: 0602-405 There are too many commands for the s/123456789/schneidz5/g function.

when i try to do all the substitutions.

edit: 100 substitutions took about 8 and 1/2 minutes. i tried with 1000 but i got the error above.
(1 substitution took about 1minute 8seconds. so its not linear... its like a bulk discount)

Quote:

Originally Posted by schneidz (Post 4884268)

... however i get an error like:

Code:

time sed -f sed.f dataset.txt > dataset.sub
sed: 0602-405 There are too many commands for the s/123456789/schneidz5/g function.

when i try to do all the substitutions.

edit: 100 substitutions took about 8 and 1/2 minutes. i tried with 1000 but i got the error above.
(1 substitution took about 1minute 8seconds. so its not linear... its like a bulk discount)

100 substitutions ran; 1000 did not. It may be expedient (though not elegant) to run 500 subs at a time until the whole task is accomplished. This might be done with a loop in which each iteration chews off the next 500 lines of File1, and makes all the substitutions in File2.

500 is a guess, maybe the upper limit is a lower number.

There is light at the end of this tunnel!

Daniel B. Martin

Quote:

Originally Posted by danielbmartin (Post 4884327)

yes i am in the process of haxing something together using split grep and sed. so far looks promising.

Instead of split, grep and sed, maybe a simple awk can prepare your files:

Code:

awk '!(NR%500){n++}{print "s/"$1"/"$2"/g" > "workfile" n}' infile

Now you can simply loop through the files and use your sed -f option. Simply change 500 to whatever you find to be an acceptable number of changes :)
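The loop over the generated files might look something like the sketch below. The data file names (`data.txt`, `result.txt`), the scratch directory and the demo contents are assumptions made for illustration, and the chunk size is 2 only to keep the demo small. The output file name is parenthesised as `("workfile" n)` because an unparenthesised concatenation after `>` is not handled the same way by every awk.

```shell
#!/bin/sh
# Sketch: build chunked sed scripts with awk, then apply them one by one.
# File names and demo data are hypothetical placeholders.
work=${TMPDIR:-/tmp}/chunk_loop.$$
mkdir -p "$work" && cd "$work" || exit 1

printf 'hello world\nsalad carot\nsimon zelda\n' > infile
printf 'hello simon\nno salad here\n' > data.txt

# Same idea as the awk above, with a chunk size of 2 for the demo.
# Lines before the first increment go to "workfile" (n is still empty).
awk '!(NR%2){n++}{print "s/"$1"/"$2"/g" > ("workfile" n)}' infile

# Apply each chunk in turn, feeding the result of one pass into the next.
cp data.txt result.txt
for script in workfile*
do
  sed -f "$script" result.txt > result.tmp && mv result.tmp result.txt
done
cat result.txt
```

Note that the shell expands `workfile*` in lexical order (`workfile10` sorts before `workfile2`), which only matters if the substitutions depend on the order in which they are applied.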

Quote:

Originally Posted by grail (Post 4884558)

Instead of split, grep and sed, maybe a simple awk can prepare your files:

Code:

awk '!(NR%500){n++}{print "s/"$1"/"$2"/g" > "workfile" n}' infile

Now you can simply loop through the files and use your sed -f option. Simply change 500 to whatever you find to be an acceptable number of changes :)

I like this idea and attempted to construct a simple test case, but cannot make it work.

Code:

# Create a test file which contains 100 lines,
#  each of the form (number) XXXXX,
#  and break it into 5 equal segments.
seq -w 100          \
|sed 's/$/ XXXXX/'  \
 > $Work3

for ((pass=1;pass<=5;pass=pass+5))
do
  echo "This is loop iteration # $pass"
# grail said: awk '!(NR%500){n++}{print "s/"$1"/"$2"/g" > "workfile" n}' infile
              awk '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > "$Work4"  n}' $Work3
  echo; echo "Segment $pass of input file Work3 ..."; cat $Work4
done

File Work3 is created as desired but the awk isn't producing Work4.
This is what happened.

Code:

This is loop iteration # 1

Segment 1 of input file Work3 ...
cat: /home/daniel/Desktop/LQfiles/dbm614w04.txt: No such file or directory

Please advise.

Daniel B. Martin

Quote:

Originally Posted by danielbmartin (Post 4885614)

Code:

# grail said: awk '!(NR%500){n++}{print "s/"$1"/"$2"/g" > "workfile" n}' infile
              awk '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > "$Work4"  n}' $Work3

awk doesn't see shell variables (the program is inside single quotes, so the shell never expands $Work4); you need something like:

Code:

awk -vWork4="$Work4" '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > (Work4 n)}' $Work3
# or some trickiness with quoting:
awk '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > ("'"$Work4"'" n)}' $Work3
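To see the difference in isolation, here is a small sketch; the function name `show_awk_vars` and the path are made up for the demonstration. Inside a single-quoted awk program, `"$Work4"` is just a string literal that happens to contain a dollar sign, while `-v` copies the shell value into an awk variable.

```shell
#!/bin/sh
# Sketch: what awk sees with and without -v (names are hypothetical).
show_awk_vars() {
  Work4=/tmp/demo_work4
  # The shell does not expand $Work4 inside single quotes, and awk treats
  # "$Work4" as an ordinary string literal.
  echo x | awk '{ print "$Work4" }'
  # -v assigns the shell value to an awk variable before the program runs.
  echo x | awk -v Work4="$Work4" '{ print Work4 "1" }'
}

show_awk_vars
```

The first command prints the literal text `$Work4`; the second prints `/tmp/demo_work41`, the shell value with a suffix appended, which is the kind of file name the chunking code needs.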

Quote:

Originally Posted by ntubski (Post 4885709)

... you need something like:

Code:

awk '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > ("'"$Work4"'" n)}' $Work3

Thank you for getting me over that hurdle. The code runs but does not produce the expected output. For ease of testing I scaled back to a source file with only 9 lines and code which attempts to parcel them out 3 at a time.

This code ...

Code:

# Create a test file which contains 9 lines,
#  each of the form (number) XXXXX,
#  and break it into 3 equal segments.
seq -w 9 |sed 's/$/ XXXXX/' > $Work3

for ((pass=1;pass<=3;pass++))
do
  rm $Work5
  echo "This is loop iteration # $pass"
  awk '!(NR%4) {n++} {print "s/"$1"/"$2"/g" > ("'"$Work5"'" n)}' $Work3
  echo "Work5 ..."; cat $Work5
done

... produced this result ...

Code:

This is loop iteration # 1
Work5 ...
s/1/XXXXX/g
s/2/XXXXX/g
s/3/XXXXX/g
This is loop iteration # 2
Work5 ...
s/1/XXXXX/g
s/2/XXXXX/g
s/3/XXXXX/g
This is loop iteration # 3
Work5 ...
s/1/XXXXX/g
s/2/XXXXX/g
s/3/XXXXX/g

Observe that it dished out the same three lines on each iteration.

Please advise.

Daniel B. Martin

Quote:

Originally Posted by danielbmartin (Post 4885964)

The awk code grail proposed already writes each parcel to its own file in a single run, so there is no need to call it inside a loop. Try this:

Code:

seq -w 9 | sed 's/$/ XXXXX/' > "$Work3"

# modified n++ condition to avoid small hiccup on the first parcel
awk '(n*3 < NR) {n++} {print "s/"$1"/"$2"/g" > ("'"$Work5"'" n)}' "$Work3"

for work in "$Work5"* ; do
    echo "$work ..."
    cat "$work"
done
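A quick way to check how the two conditions distribute nine input lines is to print the suffix each line would get, without writing any files. This is only a sketch; the function name `trace_suffixes` and the variables `old_n` and `new_n` are made up here. With `!(NR%4)`, as in the earlier test, the first three lines get an empty suffix and so land in the bare `$Work5` file, which is why the loop kept showing the same three lines.

```shell
#!/bin/sh
# Sketch: show which file suffix each of 9 lines would receive.
trace_suffixes() {
  awk 'BEGIN { for (i = 1; i <= 9; i++) print i }' |
  awk '!(NR%4)        { old_n++ }   # condition used in the earlier test
       (new_n*3 < NR) { new_n++ }   # condition with the first-parcel fix
       { printf "line %d: old=\"%s\" new=\"%s\"\n", NR, old_n, new_n }'
}

trace_suffixes
```

With the modified condition, lines 1 to 3 get suffix 1, lines 4 to 6 get suffix 2, and lines 7 to 9 get suffix 3, so every parcel has the same size and a numbered file.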

InFile1 ...

Code:

hello world
l33tz h4x0r
chunl akuma
quest tribe
salad carot
simon zelda

InFile2 ...

Code:

hello my name is simon, and i like to do drawings; simon says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chunli akuma ken ryu sakura
third-line: choppin broccoli -- helloproject2501helloceltics#35hello123
you dont win friends with salad
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called quest - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul

Code ...

Code:

#
# Method of LQ Member danielbmartin #14 using sed
#  to break the change-pairs file into pieces,
#  and apply each piece individually to the source file.
#

# Rework InFile1 (the change pairs) into substitution pairs
#  for subsequent use by a "sed -f".
 sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
> $Work01

# Make a copy of InFile2 (the source file), which will be
#  incrementally transformed to the desired end product.
cat $InFile2 > $OutFile14

start=1
step=4  # step = number of lines in each subset
for ((start=1;;start=start+step))
do
  let stop=start+step-1
# Use sed to create Work09, a subset of the change file.
  sed $start','$stop'!d' $Work01 > $Work09
# If Work09 is an empty file, leave this for-loop.
# This escapes from what would otherwise be an infinite loop.
  if [ ! -s $Work09 ]; then break; fi
  echo; echo "Now applying this subset of the change file..."; cat $Work09
  sed -f $Work09 $OutFile14 > $Work14
  cat $Work14 > $OutFile14
done

This code applies the change-pairs 4 at a time.
In production use you would change the value of variable step to 300, 400, 500, whatever value your system can handle.
In production use you would disable the echo statements which are used for explanation.

Execution produced this on-screen display ...

Code:

Now applying this subset of the change file...
s/hello/world/g
s/l33tz/h4x0r/g
s/chunl/akuma/g
s/quest/tribe/g

Now applying this subset of the change file...
s/salad/carot/g
s/simon/zelda/g

... and produced this end product ...

Code:

world my name is zelda, and i like to do drawings; zelda says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: akumai akuma ken ryu sakura
third-line: choppin broccoli -- worldproject2501worldceltics#35world123
you dont win friends with carot
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called tribe - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul

Daniel B. Martin