LinuxQuestions.org

mass substitutions (https://www.linuxquestions.org/questions/programming-9/mass-substitutions-4175448362/)

danielbmartin 02-04-2013 10:08 AM

Quote:

Originally Posted by schneidz (Post 4884178)
thanks but it says the function s|\(^.*\) \(.*\)|s/\1\/\2/g| cannot be parsed

Okay, let's try again by breaking that sed into smaller pieces, hoping they will be "digestible" by aix.
Code:

sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
|sed -f - $InFile2 > $OutFile3

Daniel B. Martin

schneidz 02-04-2013 10:12 AM

Quote:

Originally Posted by danielbmartin (Post 4884191)
Okay, let's try again by breaking that sed into smaller pieces, hoping they will be "digestible" by aix.
Code:

sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
|sed -f - $InFile2 > $OutFile3

Daniel B. Martin

once again thanx, but according to the previous error it seems like aix sed can't read from stdin: sed: 0602-420 Cannot open pattern file -.

danielbmartin 02-04-2013 10:20 AM

Quote:

Originally Posted by schneidz (Post 4884194)
once again thanx, but according to the previous error it seems like aix sed can't read from stdin: sed: 0602-420 Cannot open pattern file -.

Not giving up yet! This version uses an intermediate file.
Code:

sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
> $Work1
sed -f $Work1 $InFile2 > $OutFile4

Daniel B. Martin
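
To make the intermediate-file version concrete, here is a minimal sketch that runs the same three-stage pipeline on a tiny sample. The temporary directory and the file names pairs.txt, source.txt and script.sed are placeholders invented for this illustration, not names from the thread. It assumes each pair line is exactly "old new" and that neither word contains a slash or a regex metacharacter.

```shell
# Hypothetical sample data; all file names here are placeholders.
work=$(mktemp -d)
printf '%s\n' 'hello world' 'simon zelda' > "$work/pairs.txt"
printf '%s\n' 'hello simon' > "$work/source.txt"

# Turn each "old new" pair into an "s/old/new/g" command:
#   1. prepend "s/"   2. turn the space into "/"   3. append "/g"
sed 's/^/s\//' "$work/pairs.txt" \
| tr ' ' '/' \
| sed 's/$/\/g/' \
> "$work/script.sed"

# Show the generated sed script.
cat "$work/script.sed"

# Apply it from a real file, since "sed -f -" was rejected on AIX.
result=$(sed -f "$work/script.sed" "$work/source.txt")
echo "$result"
```

The generated script holds one s command per pair, and sed applies them in order to every line of the source file.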

schneidz 02-04-2013 10:34 AM

Quote:

Originally Posted by danielbmartin (Post 4884196)
Not giving up yet! This version uses an intermediate file.
Code:

sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
> $Work1
sed -f $Work1 $InFile2 > $OutFile4

Daniel B. Martin

thanx, i tried it with a 22 line infile1 and a 7 line infile2 and it seems to work well.
now i will time it using the large datasets and see what happens.

thanks a lot (even if unsuccessful, at least i learned a bit more about sed).

danielbmartin 02-04-2013 10:53 AM

Quote:

Originally Posted by schneidz (Post 4884201)
thanx, i tried it with a 22 line infile1 and a 7 line infile2 and it seems to work well.
now i will time it using the large datasets and see what happens.

Suggestion: test timidly. Start with a full-size InFile1 and an InFile2 which is a 10% subset of the real thing. Then 20%, then 30%. It will be instructive if the execution time increases linearly.

Daniel B. Martin
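
The incremental timing test suggested above could be sketched as follows. The sample files are small stand-ins generated on the spot (the real InFile2 and sed script would take their place), and the names subset and subset.out are invented for this sketch. It relies on the shell's `time` keyword, as used elsewhere in this thread.

```shell
# Hypothetical stand-ins for the real source file and sed script.
work=$(mktemp -d)
seq 1 1000 | sed 's/^/line /' > "$work/InFile2"
printf '%s\n' 's/line/LINE/g' > "$work/script.sed"

total=$(wc -l < "$work/InFile2")
for pct in 10 20 30; do
  # Take the first pct% of the source file as the test subset.
  n=$(( total * pct / 100 ))
  head -n "$n" "$work/InFile2" > "$work/subset"
  echo "== $pct% of the source file ($n lines) =="
  time sed -f "$work/script.sed" "$work/subset" > "$work/subset.out"
done
```

If the reported times grow roughly in step with the subset size, a full run can be estimated before committing to it.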

grail 02-04-2013 11:53 AM

If you would like to keep the initial changes all sed you could try:
Code:

sed 's/\(^\|$\| \)/\//g;s/^/s/' infile > workfile
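
A portability note on the one-liner above: the `\|` alternation is a GNU extension to basic regular expressions, so it may not be accepted by the AIX sed discussed in this thread. As far as I can tell it also yields `s/old/new/` without the trailing g flag. A sketch of an all-sed equivalent that avoids alternation and appends `/g`, assuming each line is exactly "old new" with a single space (the sample file is a placeholder):

```shell
# Hypothetical sample input; the real change-pairs file would replace it.
work=$(mktemp -d)
printf '%s\n' 'hello world' 'l33tz h4x0r' > "$work/infile"

# Three plain substitutions, no alternation needed:
#   space -> "/", then prepend "s/", then append "/g".
sed -e 's/ /\//' -e 's/^/s\//' -e 's/$/\/g/' "$work/infile" > "$work/workfile"
cat "$work/workfile"
```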

schneidz 02-04-2013 12:43 PM

Quote:

Originally Posted by danielbmartin (Post 4884211)
Suggestion: test timidly. Start with a full-size InFile1 and an InFile2 which is a 10% subset of the real thing. Then 20%, then 30%. It will be instructive if the execution time increases linearly.

Daniel B. Martin

it takes about 2 minutes to cross-correlate a list of 10 substitutions against the large file.


however i get an error like:
Code:

time sed -f sed.f dataset.txt > dataset.sub
sed: 0602-405 There are too many commands for the s/123456789/schneidz5/g function.

when i try to do all the substitutions.

edit: 100 substitutions took about 8 and 1/2 minutes. i tried with 1000 but i got the error above.
(1 substitution took about 1 minute 8 seconds. so it's not linear... it's like a bulk discount)

danielbmartin 02-04-2013 02:28 PM

Quote:

Originally Posted by schneidz (Post 4884268)
... however i get an error like:
Code:

time sed -f sed.f dataset.txt > dataset.sub
sed: 0602-405 There are too many commands for the s/123456789/schneidz5/g function.

when i try to do all the substitutions.

edit: 100 substitutions took about 8 and 1/2 minutes. i tried with 1000 but i got the error above.
(1 substitution took about 1 minute 8 seconds. so it's not linear... it's like a bulk discount)

100 substitutions ran; 1000 did not. It may be expedient (though not elegant) to run 500 subs at a time until the whole task is accomplished. This might be done with a loop in which each iteration chews off the next 500 lines of File1, and makes all the substitutions in File2.

500 is a guess; the upper limit may be a lower number.

There is light at the end of this tunnel!

Daniel B. Martin
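
One way to sketch the chunked approach is with split, which cuts the full sed script into fixed-size pieces that are then applied one after another. The file names all.sed, data.txt, current and next are placeholders invented here, and the chunk size of 2 only keeps the demo small (the thread's guess is 500).

```shell
# Hypothetical full substitution script and data file.
work=$(mktemp -d)
printf '%s\n' 's/a/1/g' 's/b/2/g' 's/c/3/g' > "$work/all.sed"
printf '%s\n' 'a b c' > "$work/data.txt"

# Cut the script into pieces of 2 commands each: chunk.aa, chunk.ab, ...
split -l 2 "$work/all.sed" "$work/chunk."

# Apply the pieces in order, feeding each result into the next pass.
cp "$work/data.txt" "$work/current"
for chunk in "$work"/chunk.*; do
  sed -f "$chunk" "$work/current" > "$work/next"
  mv "$work/next" "$work/current"
done
cat "$work/current"
```

For plain s commands applied in the same order, the end result should match a single pass with the whole script; the cost is one extra read of the data file per chunk.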

schneidz 02-04-2013 03:05 PM

Quote:

Originally Posted by danielbmartin (Post 4884327)
100 substitutions ran; 1000 did not. It may be expedient (though not elegant) to run 500 subs at a time until the whole task is accomplished. This might be done with a loop in which each iteration chews off the next 500 lines of File1, and makes all the substitutions in File2.

500 is a guess; the upper limit may be a lower number.

There is light at the end of this tunnel!

Daniel B. Martin

yes i am in the process of haxing something together using split grep and sed. so far looks promising.

grail 02-04-2013 11:15 PM

Instead of split, grep and sed, maybe a simple awk can prepare your files:
Code:

awk '!(NR%500){n++}{print "s/"$1"/"$2"/g" > "workfile" n}' infile
Now you can simply loop through the files and use your sed -f option. Simply change 500 to whatever you find to be an acceptable number of changes :)
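
To see which files this awk pattern creates, here is a small sketch with a chunk size of 3 instead of 500. The output prefix is passed in with -v and the redirection target is parenthesized; both are choices made for this sketch rather than part of the original one-liner, and the 7-line input is invented.

```shell
# Hypothetical 7-line change-pairs file.
work=$(mktemp -d)
seq 1 7 | sed 's/$/ X/' > "$work/infile"

# n is unset until the first multiple of 3, so the first file has no suffix.
awk -v out="$work/workfile" \
  '!(NR%3){n++}{print "s/"$1"/"$2"/g" > (out n)}' "$work/infile"

wc -l "$work"/workfile*
```

Lines 1 and 2 land in "workfile" (no suffix), lines 3 to 5 in "workfile1", and lines 6 and 7 in "workfile2", so the first file is one line short of a full chunk.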

danielbmartin 02-06-2013 10:36 AM

Quote:

Originally Posted by grail (Post 4884558)
Instead of split, grep and sed, maybe a simple awk can prepare your files:
Code:

awk '!(NR%500){n++}{print "s/"$1"/"$2"/g" > "workfile" n}' infile
Now you can simply loop through the files and use your sed -f option. Simply change 500 to whatever you find to be an acceptable number of changes :)

I like this idea and attempted to construct a simple test case, but cannot make it work.
Code:

# Create a test file which contains 100 lines,
#  each of the form (number) XXXXX,
#  and break it into 5 equal segments.
seq -w 100          \
|sed 's/$/ XXXXX/'  \
 > $Work3
for ((pass=1;pass<=5;pass=pass+5))
do
  echo "This is loop iteration # $pass"
# grail said: awk '!(NR%500){n++}{print "s/"$1"/"$2"/g" > "workfile" n}' infile
              awk '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > "$Work4"  n}' $Work3
  echo; echo "Segment $pass of input file Work3 ..."; cat $Work4
done

File Work3 is created as desired but the awk isn't producing Work4.
This is what happened.
Code:

This is loop iteration # 1

Segment 1 of input file Work3 ...
cat: /home/daniel/Desktop/LQfiles/dbm614w04.txt: No such file or directory

Please advise.

Daniel B. Martin

ntubski 02-06-2013 12:38 PM

Quote:

Originally Posted by danielbmartin (Post 4885614)
Code:

# grail said: awk '!(NR%500){n++}{print "s/"$1"/"$2"/g" > "workfile" n}' infile
              awk '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > "$Work4"  n}' $Work3


awk doesn't see shell variables, you need something like:
Code:

awk -v Work4="$Work4" '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > (Work4 n)}' "$Work3"
# or some trickiness with quoting:
awk '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > ("'"$Work4"'" n)}' $Work3
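
A short sketch of why the earlier attempt left the expected file missing: inside single quotes the shell never expands $Work4, and awk strings do no variable interpolation, so awk writes to a file whose name is literally `$Work4`. The directory and the file name segment are placeholders for this demo.

```shell
work=$(mktemp -d)
Work4="$work/segment"

# Failing form: awk sees the string literal "$Work4" and creates a file
# with that literal name in the current directory.
( cd "$work" && printf '%s\n' '01 XXXXX' \
  | awk '{print "s/"$1"/"$2"/g" > "$Work4"}' )

# Working form: hand the shell value to awk as an awk variable.
printf '%s\n' '01 XXXXX' \
  | awk -v target="$Work4" '{print "s/"$1"/"$2"/g" > target}'

ls "$work"
cat "$Work4"
```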


danielbmartin 02-06-2013 08:54 PM

Quote:

Originally Posted by ntubski (Post 4885709)
... you need something like:
Code:

awk '!(NR%20) {n++}{print "s/"$1"/"$2"/g" > ("'"$Work4"'" n)}' $Work3

Thank you for getting me over that hurdle. The code runs but does not produce the expected output. For ease of testing I scaled back to a source file with only 9 lines and code which attempts to parcel them out 3 at a time.

This code ...
Code:

# Create a test file which contains 9 lines,
#  each of the form (number) XXXXX,
#  and break it into 3 equal segments.
seq -w 9 |sed 's/$/ XXXXX/' > $Work3
for ((pass=1;pass<=3;pass++))
do
  rm $Work5
  echo "This is loop iteration # $pass"
  awk '!(NR%4) {n++} {print "s/"$1"/"$2"/g" > ("'"$Work5"'" n)}' $Work3
  echo "Work5 ..."; cat $Work5             
done

... produced this result ...
Code:

This is loop iteration # 1
Work5 ...
s/1/XXXXX/g
s/2/XXXXX/g
s/3/XXXXX/g
This is loop iteration # 2
Work5 ...
s/1/XXXXX/g
s/2/XXXXX/g
s/3/XXXXX/g
This is loop iteration # 3
Work5 ...
s/1/XXXXX/g
s/2/XXXXX/g
s/3/XXXXX/g

Observe that it dished out the same three lines on each iteration.

Please advise.

Daniel B. Martin

ntubski 02-06-2013 10:33 PM

Quote:

Originally Posted by danielbmartin (Post 4885964)
Thank you for getting me over that hurdle. The code runs but does not produce the expected output. For ease of testing I scaled back to a source file with only 9 lines and code which attempts to parcel them out 3 at a time.

The awk code grail proposed already outputs to separate files, try this:
Code:

seq -w 9 | sed 's/$/ XXXXX/' > "$Work3"

# modified n++ condition to avoid small hiccup on the first parcel
awk '(n*3 < NR) {n++} {print "s/"$1"/"$2"/g" > ("'"$Work5"'" n)}' "$Work3"

for work in "$Work5"* ; do
    echo "$work ..."
    cat "$work"
done
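
One thing to watch with the glob loop once there are ten or more parcels: the shell expands `"$Work5"*` in lexical order, so a file ending in 10 sorts before the one ending in 2. If the order of substitutions matters, zero-padding the suffix is one way around it. The sketch below is a variation, not the code posted above; the prefix chunk, the 4-digit padding and the close() call are assumptions added here, with close() intended to keep the number of simultaneously open files small.

```shell
# Hypothetical 25-line change-pairs file, parcelled out 2 lines at a time.
work=$(mktemp -d)
seq 1 25 | sed 's/$/ X/' > "$work/pairs"

awk -v prefix="$work/chunk" '
  (n*2 < NR) {                  # time to start the next parcel
    if (file != "") close(file) # finished with the previous one
    n++
    file = sprintf("%s%04d", prefix, n)
  }
  { print "s/"$1"/"$2"/g" > file }
' "$work/pairs"

# Lexical order now matches numeric order: chunk0001 ... chunk0013.
for f in "$work"/chunk*; do
  echo "$f"
done
```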


danielbmartin 02-08-2013 11:52 AM

InFile1 ...
Code:

hello world
l33tz h4x0r
chunl akuma
quest tribe
salad carot
simon zelda

InFile2 ...
Code:

hello my name is simon, and i like to do drawings; simon says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: chunli akuma ken ryu sakura
third-line: choppin broccoli -- helloproject2501helloceltics#35hello123
you dont win friends with salad
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called quest - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul

Code ...
Code:

#
# Method of LQ Member danielbmartin #14 using sed
#  to break the change-pairs file into pieces,
#  and apply each piece individually to the source file.
#
# Rework InFile1 (the change pairs) into substitution pairs
#  for subsequent use by a "sed -f".
 sed 's/^/s\//' $InFile1 \
|tr " " "/"              \
|sed 's/$/\/g/'          \
> $Work01
# Make a copy of InFile2 (the source file), which will be
#  incrementally transformed to the desired end product.
cat $InFile2 > $OutFile14
start=1
step=4  # step = number of lines in each subset
for ((start=1;;start=start+step))
do
  let stop=start+step-1
# Use sed to create Work09, a subset of the change file.
  sed $start','$stop'!d' $Work01 > $Work09
# If Work09 is an empty file, leave this for-loop.
# This escapes from what would otherwise be an infinite loop.
  if [ ! -s $Work09 ]; then break; fi
  echo; echo "Now applying this subset of the change file..."; cat $Work09
  sed -f $Work09 $OutFile14 > $Work14
  cat $Work14 > $OutFile14
done

This code applies the change-pairs 4 at a time.
In production use you would change the value of variable step to 300, 400, 500, whatever value your system can handle.
In production use you would disable the echo statements which are used for explanation.

Execution produced this on-screen display ...
Code:

Now applying this subset of the change file...
s/hello/world/g
s/l33tz/h4x0r/g
s/chunl/akuma/g
s/quest/tribe/g

Now applying this subset of the change file...
s/salad/carot/g
s/simon/zelda/g

... and produced this end product ...
Code:

world my name is zelda, and i like to do drawings; zelda says.
lemonade was a popular drink in my day, and it still is.
g0t r00tz third-line: akumai akuma ken ryu sakura
third-line: choppin broccoli -- worldproject2501worldceltics#35world123
you dont win friends with carot
first-line: deus ex second-line: counter strike v1.6 third-line: burden of 80 proof fourth-line: battle field 2
first-line: a tribe called tribe - midnite marauders second-line: the perceptionists - black dialog third-line: buju banton - rasta got soul

Daniel B. Martin


All times are GMT -5.