[SOLVED] bash: get all possible substring substitution combinations (having fun)

danielbmartin · 02-04-2014, 09:51 AM

Quote:

Originally Posted by masavini

this works... it takes a few seconds with 4 occurrences strings and few minutes with 5 occurrences strings... did not test with more occurrences (i guess it would take hours)...

Please elaborate this execution time statement.

How many input lines with 4 occurrences? 5 occurrences? Does the real-world input file ever have strings with more than 5 occurrences?

Daniel B. Martin

masavini · 02-04-2014, 12:11 PM

Quote:

Originally Posted by danielbmartin

How many input lines with 4 occurrences? 5 occurrences? Does the real-world input file ever have strings with more than 5 occurrences?

i can't really tell... i only know the file is huge, changes frequently, and may contain many occurrences. even more, "[A-Z-]0" is not the only regex that should be subject of substitutions.

(un)fortunately, the is no chance that such a script can be used in the production system... but in the meantime, i probably understood how to make it work without the use of thermodynamics...

let's say a string has n occurrences. i just need to write numbers from 1 to (2^n-1) in base 2, using a n digits format. if n=2, i have the following base 2 numbers:
1-> 01
2-> 10
3-> 11

each of those strings indicates which occurrences must be substituted (1) and which must not (0).

i'm trying now to write it down, i hope i can post it tonight...

@grail
thanks for the ruby script, but i'm already paying someone to make a c program... i'm trying to solve the problem with bash just to learn some more... i will obviously post the c version as soon as i get it...

masavini · 02-04-2014, 12:41 PM

got it!

Code:

#!/bin/bash

string="A0~-D09101E012-0A0-0"

n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) 	# count occurrences

expectedN=$(echo "2 ^ $n - 1" | bc)
echo "$n occurrences, $expectedN expected output lines"

# substitute all of the occurrences, so i can save 1 cycle in the next loop
echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out

maps=$(seq 1 $(( $expectedN-1 )) | awk '{print "ibase=10;obase=2;" $1}' | bc | xargs printf "%0${n}d\n")
for map in $maps; do
	s=$string
	j=0	# substituted
	for (( i=0; i<${#map}; i++ )); do
		if (( ${map:$i:1} > 0 )); then
			let k=$i+1-$j    # add 1 to the char position (they go from 0 to n-1, i need to tell awk the position from 1 to n) and subtract previous substitutions number
			s=$(echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$k'", $2); print $1"~"out}')
			let j+=1	# substituted
		fi
	done
	echo "$s" >> out
done

wc -l out
cat out

i can't tell how proud i am of this script!!

danielbmartin · 02-04-2014, 01:00 PM

Quote:

Originally Posted by masavini

i'm trying to solve the problem with bash just to learn some more...

I wrote a solution to this problem, a bash script which uses awk to do the heavy lifting. I'd like to have some measure of performance of your script and post mine if it is competitive. Could you create a meaningful input file (~100 lines) which could be used to compare execution times?

Daniel B. Martin

masavini · 02-04-2014, 02:24 PM

what if i just repeat the same operation 100 times?

Code:

#!/bin/bash

string="A0~-D09101E012-0A0-0"

start=$(date +%s)

for c in $(seq 1 100); do

	n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) 	# count occurrences
	
	expectedN=$(echo "2 ^ $n - 1" | bc)
	
	echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out$c	# substitute all of the occurrences
	
	maps=$(seq 1 $(( $expectedN-1 )) | awk '{print "ibase=10;obase=2;" $1}' | bc | xargs printf "%0${n}d\n")
	for map in $maps; do
		s=$string
		j=0	# substituted
		for (( i=0; i<${#map}; i++ )); do
			if (( ${map:$i:1} > 0 )); then
				let k=$i-$j+1
				s=$(echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$k'", $2); print $1"~"out}')
				let j+=1	# substituted
			fi
		done
		echo "$s" >> out$c
	done
done

end=$(date +%s)

echo "executed in $(( $end - $start ))s"

cat out* > output
wc -l output

executed in 26s
3100 output

danielbmartin · 02-04-2014, 03:12 PM

Quote:

Originally Posted by masavini

what if i just repeat the same operation 100 times?

I built an input file with 100 identical lines.
Each line is: A0~-D09101E012-0A0-0

The result is:

Code:

daniel@daniel-desktop:~$ bash /home/daniel/Desktop/LQfiles/dbm1057.bin
Execution started:
Tue Feb  4 16:06:34 EST 2014
Execution ended:
Tue Feb  4 16:06:36 EST 2014

The input file contains 100 lines.
The output file contains 3100 lines.

Normal end of job.

The results look right to my eye, but you should judge for yourself.

Daniel B. Martin

danielbmartin · 02-04-2014, 03:18 PM

I tried to post the code and my computer went nuts. I'll try again.

Is there a limit to the size of a post on this forum?

Daniel B. Martin

danielbmartin · 02-04-2014, 03:19 PM

Code:

#!/bin/bash
#   Daniel B. Martin   Feb14
#
#   To execute this program, launch a terminal session and enter:
#   bash /home/daniel/Desktop/LQfiles/dbm1057.bin
#
#  This program inspired ...
#    http://www.linuxquestions.org/questions/programming-9/
#      bash-get-all-possible-substring-substitution-combinations-4175493478/
# File identification
   Path=$(cut -d'.' -f1 <<< ${0})
 InFile=$Path"inp.txt"
  Work1=$Path"w1.txt"
  Work2=$Path"w2.txt"
  Work3=$Path"w3.txt"
OutFile=$Path"out.txt"

# Have: a file with ~300k strings like this: A0~-D091010E12-0

# Want: all the possibile strings generated by the substitution of
#   '[A-Z-]0' with '\1o', but only AFTER the ~ char.
# The first field (~ delimited) is a suffix that must be printed
# with every output line...

#  A0~-D09101E012-0
# In this case, the regexp '[A-Z-]0' matches 3 substrings:
#  A0~-D09101E012-0
#      ^^    ^^  ^^
#      D0    E0  -0
 
# There would be 7 output lines from this 1 input line.
# A0~-Do9101E012-0
#      ^
# A0~-D09101Eo12-0
#            ^
# A0~-D09101E012-o
#                ^
# A0~-Do9101Eo12-0
#      ^     ^
# A0~-Do9101E012-o
#      ^         ^
# A0~-D09101Eo12-o
#            ^   ^
# A0~-Do9101Eo12-o
#     ^     ^   ^


echo "Execution started:"
date

# Create a binary table (in mirror form)
# This is done only once, no matter how large the InFile may be.

awk '{printf("%d ",  ($0*1.0000)%2);
      printf("%d ",  ($0*0.5000)%2);
      printf("%d ",  ($0*0.2500)%2);
      printf("%d ",  ($0*0.1250)%2);
      printf("%d\n", ($0*0.0625)  )}' <(seq -w 1 31) >$Work2

rm $OutFile  # Blow away any leftover output file.

# Bash loop ...
while read InLine  # Read one line from the InFile.
do

# Identify potential substitution locations.
awk -F "" '{for (j=index($0,"~");j<=NF;j++)
     {if (match(substr($0,j,2),/[A-Z-]0/))
       print j,substr($0,j,2),p,$0}}' <<<$InLine >$Work1
 
# Work1 looks like this:
# 5 D0  A0~-D09101E012-0
# 11 E0  A0~-D09101E012-0
# 15 -0  A0~-D09101E012-0

# Trim the binary table to suit this input line.
nc=$(wc -l <$Work1)
# Test for no substitutions required.
if [ $nc -eq 0 ]; then
  echo $InLine >>$OutFile
  continue
fi

let nr=-1+2**$nc
cut -d" " -f1-$nc <$Work2 |rev |head -$nr >$Work3

# Work3 contains a table which looks like this:
# 0 0 1
# 0 1 0
# 0 1 1
# 1 0 0
# 1 0 1
# 1 1 0
# 1 1 1
 
# FNR==NR is true only when reading the first input file.
awk 'FNR==NR {p[NR]=$1; str=$3; next}
 {s=str; for (j=1;j<=NF;j++) if ($j>0) {s=substr(s,1,p[j])"o"substr(s,p[j]+2)};
    print s}' $Work1 $Work3 >>$OutFile
done <$InFile  # End of bash loop

#echo "OutFile ..."; cat $OutFile; echo "End Of File"


echo "Execution ended:"
date
echo
echo "The input file contains" $(wc -l <$InFile) "lines."
echo "The output file contains" $(wc -l <$OutFile) "lines."



echo; echo "Normal end of job."; echo; exit

Daniel B. Martin

masavini · 02-04-2014, 04:44 PM

2 seconds?!?

absolutely great... amazing...

can you please explain this?

Quote:

Code:

awk 'FNR==NR {p[NR]=$1; str=$3; next}
 {s=str; for (j=1;j<=NF;j++) if ($j>0) {s=substr(s,1,p[j])"o"substr(s,p[j]+2)};
    print s}' $Work1 $Work3 >>$OutFile

i can understand what it does, but not how...
if i'm not wrong, this avoids the loop within the base2-string.

great... i'll test it against the c code the friend of mine is going to sell me...

masavini · 02-04-2014, 05:41 PM

i gave my script a 2nd chance... 21 seconds!

and yours took 1s on my shiny fast ultrabook...

danielbmartin · 02-04-2014, 05:43 PM

Quote:

Originally Posted by masavini

2 seconds?!?

Please test it on your computer to verify execution time and output validity.

Quote:

can you please explain this?

Code:

awk 'FNR==NR {p[NR]=$1; str=$3; next}
 {s=str; for (j=1;j<=NF;j++) if ($j>0) {s=substr(s,1,p[j])"o"substr(s,p[j]+2)};
    print s}' $Work1 $Work3 >>$OutFile

This part ...

Code:

FNR==NR {p[NR]=$1; str=$3; next}

... reads the first of two input files, Work1, which provides a list of substitution positions (columns) and the input line which is to be modified by those substitutions. The positions are saved in the awk array p (for position) and the input line (such as A0~-D09101E012-0) is saved in the awk variable str (for string).

The rest of this awk reads the second input file Work3 which guides the code to perform all possible permutations. For a three-substitution input line, Work3 looks like this ...

Code:

Reading top-to-bottom, you will recognize each row as binary 1, 2, ..., 7.

s=str copies the A0~-D09101E012-0 string before any substitutions are made.

for (j=1;j<=NF;j++) sets up a loop to apply each row (such as 1 1 0) from Work3.

if ($j>0) says a substitution is made only for a 1 in the row (such as 1 1 0).

s=substr(s,1,p[j])"o"substr(s,p[j]+2) makes one substitution. It is done by taking a "left hand" substring (such as A0~-D09101E), appending an "o", and appending a "right hand" substring (such as 12-0).

print s is performed after the j-loop has ended. It writes the awk variable s (the finished product) to the output file.

Whew!

Quote:

i'll test it against the c code the friend of mine is going to sell me...

My price beats his!

Daniel B. Martin

masavini · 02-04-2014, 06:15 PM

got it... i guessed that some "magic" awk script would have solved it, and it did...

a couple more of questions:
1) the script goes crazy if the first part of the record (the string before "~") contains spaces...
i guess it's because of $Work1 format:
# Work1 looks like this:
# 5 D0 A0~-D09101E012-0

i solved it by substituting all spaces with "_" before executing the script and applying the opposite substitution to the output file...

2) about $Work2: do you think it's worth to create a HUGE file, save it somewhere and reuse it everytime the script runs?

masavini · 02-04-2014, 06:41 PM

Quote:

Originally Posted by danielbmartin;5111596
[B

s=substr(s,1,p[j])"o"substr(s,p[j]+2)[/B] makes one substitution. It is done by taking a "left hand" substring (such as A0~-D09101E), appending an "o", and appending a "right hand" substring (such as 12-0).

this said, i'd expect the script to work with a regex like "[A-Z]0[A-Z]", but it doesn't seem to... it's late, i'm going to sleep... tomorrow i'll think about it... if you have suggestions (NOT the solution), they're always welcome...

danielbmartin · 02-04-2014, 07:19 PM

Quote:

Originally Posted by masavini

2) about $Work2: do you think it's worth to create a HUGE file, save it somewhere and reuse it everytime the script runs?

Work2 is built only once per execution and therefore contributes little to overall execution time. Each line in the InFile causes a new Work3 to be created, and it is always a subset of Work2. There are only five possible "versions" of Work3. An input line with two substitutions always creates the same Work3 ...

Code:

0 1
1 0
1 1

... and an input line with three substitutions always creates the same Work3 ...

Code:

... and so forth. This suggests the possibility of building five versions at the start of program execution and selecting the appropriate one for each InFile line. However, the time saving would be small because it saves only the time to perform the subsetting. That task is handled by this code snippet:

Code:

# Trim the binary table to suit this input line.
nc=$(wc -l <$Work1)
let nr=-1+2**$nc
cut -d" " -f1-$nc <$Work2 |rev |head -$nr >$Work3

There's not much processing here. No loops, no function calls, nothing complicated. I think the rev can be eliminated; it was coded that way for esthetic reasons.

Daniel B. Martin

grail · 02-05-2014, 12:46 AM

I think as most of the hard yards are done in awk, which can also easily do arithmetic and most if not all of your other commands, it would been cleaner and make more sense to do the lot in awk as opposed
to asking bash to collate everything. You can also probably do away with all the temp files as well if you use a single interface.

I know you have someone doing the C option, but here is the tidied up ruby that now handles repetition, ie "A0" appears 2 or more times (thanks Daniel for the ideas)

Code:

#!/usr/bin/env ruby

# String array used to store regex matches
sa_scan = []

# Loop over the file passed in as the first argument to script
# 'if' at the closing brace ensures we received at least 1 argument and
# that the file exists
IO.foreach(ARGV[0]){
	|line|

	# Separate line using '~' into a string array and create a copy of the
	# original string
	sa_split = line.split('~')
	s_copy = sa_split[1].clone

	# Loop over regex items found and store the index at where they are in
	# the original string into a number array
	sa_split[1].scan(/[A-Z-]0/){|m|
		na_scan << s_copy.index(m) + 1
		
		# Replace regex found by underscores in case of repeated patterns
		s_copy.sub!(m, '__')
	}

	# Set n to size of array.  If no matches found print original entry
	n = na_scan.size
    puts line if n == 0

	# Loop over number of entries found
	1.upto(n){|i|
		# Loop over the possible combinations
		na_scan.combination(i){|a|
			s_original = sa_split[1].clone
			# Replace indexed item with new character
			a.each{|s|
				s_original[s] = 'o'
			}
			# Join back new string with original and insert '~' where necessary
			puts [sa_split[0],s_original,sa_split[2..-1]].flatten.join('~')
		}
	}

	# Clear array to be re-used
    na_scan.clear
} if ! ARGV.empty? && File.exists?(ARGV[0])

This one now reads from a file and performed the 100 lines Daniel used as per below times:

Code:

real	0m0.837s
user	0m0.177s
sys	0m0.650s

This was run on a VM.

The other beauty here is it would be simple enough to also supply a second argument on the line, first being the file name, to be the regex to use or if not available to use a default.