[SOLVED] bash: get all possible substring substitution combinations (having fun)
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
this works... it takes a few seconds with 4 occurrences strings and few minutes with 5 occurrences strings... did not test with more occurrences (i guess it would take hours)...
Please elaborate this execution time statement.
How many input lines with 4 occurrences? 5 occurrences? Does the real-world input file ever have strings with more than 5 occurrences?
How many input lines with 4 occurrences? 5 occurrences? Does the real-world input file ever have strings with more than 5 occurrences?
i can't really tell... i only know the file is huge, changes frequently, and may contain many occurrences. even more, "[A-Z-]0" is not the only regex that should be subject of substitutions.
(un)fortunately, the is no chance that such a script can be used in the production system... but in the meantime, i probably understood how to make it work without the use of thermodynamics...
let's say a string has n occurrences. i just need to write numbers from 1 to (2^n-1) in base 2, using a n digits format. if n=2, i have the following base 2 numbers:
1-> 01
2-> 10
3-> 11
each of those strings indicates which occurrences must be substituted (1) and which must not (0).
i'm trying now to write it down, i hope i can post it tonight...
@grail
thanks for the ruby script, but i'm already paying someone to make a c program... i'm trying to solve the problem with bash just to learn some more... i will obviously post the c version as soon as i get it...
#!/bin/bash
string="A0~-D09101E012-0A0-0"
n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) # count occurrences
expectedN=$(echo "2 ^ $n - 1" | bc)
echo "$n occurrences, $expectedN expected output lines"
# substitute all of the occurrences, so i can save 1 cycle in the next loop
echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out
maps=$(seq 1 $(( $expectedN-1 )) | awk '{print "ibase=10;obase=2;" $1}' | bc | xargs printf "%0${n}d\n")
for map in $maps; do
s=$string
j=0 # substituted
for (( i=0; i<${#map}; i++ )); do
if (( ${map:$i:1} > 0 )); then
let k=$i+1-$j # add 1 to the char position (they go from 0 to n-1, i need to tell awk the position from 1 to n) and subtract previous substitutions number
s=$(echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$k'", $2); print $1"~"out}')
let j+=1 # substituted
fi
done
echo "$s" >> out
done
wc -l out
cat out
i'm trying to solve the problem with bash just to learn some more...
I wrote a solution to this problem, a bash script which uses awk to do the heavy lifting. I'd like to have some measure of performance of your script and post mine if it is competitive. Could you create a meaningful input file (~100 lines) which could be used to compare execution times?
what if i just repeat the same operation 100 times?
I built an input file with 100 identical lines.
Each line is: A0~-D09101E012-0A0-0
The result is:
Code:
daniel@daniel-desktop:~$ bash /home/daniel/Desktop/LQfiles/dbm1057.bin
Execution started:
Tue Feb 4 16:06:34 EST 2014
Execution ended:
Tue Feb 4 16:06:36 EST 2014
The input file contains 100 lines.
The output file contains 3100 lines.
Normal end of job.
The results look right to my eye, but you should judge for yourself.
#!/bin/bash
# Daniel B. Martin Feb14
#
# To execute this program, launch a terminal session and enter:
# bash /home/daniel/Desktop/LQfiles/dbm1057.bin
#
# This program inspired ...
# http://www.linuxquestions.org/questions/programming-9/
# bash-get-all-possible-substring-substitution-combinations-4175493478/
# File identification
Path=$(cut -d'.' -f1 <<< ${0})
InFile=$Path"inp.txt"
Work1=$Path"w1.txt"
Work2=$Path"w2.txt"
Work3=$Path"w3.txt"
OutFile=$Path"out.txt"
# Have: a file with ~300k strings like this: A0~-D091010E12-0
# Want: all the possibile strings generated by the substitution of
# '[A-Z-]0' with '\1o', but only AFTER the ~ char.
# The first field (~ delimited) is a suffix that must be printed
# with every output line...
# A0~-D09101E012-0
# In this case, the regexp '[A-Z-]0' matches 3 substrings:
# A0~-D09101E012-0
# ^^ ^^ ^^
# D0 E0 -0
# There would be 7 output lines from this 1 input line.
# A0~-Do9101E012-0
# ^
# A0~-D09101Eo12-0
# ^
# A0~-D09101E012-o
# ^
# A0~-Do9101Eo12-0
# ^ ^
# A0~-Do9101E012-o
# ^ ^
# A0~-D09101Eo12-o
# ^ ^
# A0~-Do9101Eo12-o
# ^ ^ ^
echo "Execution started:"
date
# Create a binary table (in mirror form)
# This is done only once, no matter how large the InFile may be.
awk '{printf("%d ", ($0*1.0000)%2);
printf("%d ", ($0*0.5000)%2);
printf("%d ", ($0*0.2500)%2);
printf("%d ", ($0*0.1250)%2);
printf("%d\n", ($0*0.0625) )}' <(seq -w 1 31) >$Work2
rm $OutFile # Blow away any leftover output file.
# Bash loop ...
while read InLine # Read one line from the InFile.
do
# Identify potential substitution locations.
awk -F "" '{for (j=index($0,"~");j<=NF;j++)
{if (match(substr($0,j,2),/[A-Z-]0/))
print j,substr($0,j,2),p,$0}}' <<<$InLine >$Work1
# Work1 looks like this:
# 5 D0 A0~-D09101E012-0
# 11 E0 A0~-D09101E012-0
# 15 -0 A0~-D09101E012-0
# Trim the binary table to suit this input line.
nc=$(wc -l <$Work1)
# Test for no substitutions required.
if [ $nc -eq 0 ]; then
echo $InLine >>$OutFile
continue
fi
let nr=-1+2**$nc
cut -d" " -f1-$nc <$Work2 |rev |head -$nr >$Work3
# Work3 contains a table which looks like this:
# 0 0 1
# 0 1 0
# 0 1 1
# 1 0 0
# 1 0 1
# 1 1 0
# 1 1 1
# FNR==NR is true only when reading the first input file.
awk 'FNR==NR {p[NR]=$1; str=$3; next}
{s=str; for (j=1;j<=NF;j++) if ($j>0) {s=substr(s,1,p[j])"o"substr(s,p[j]+2)};
print s}' $Work1 $Work3 >>$OutFile
done <$InFile # End of bash loop
#echo "OutFile ..."; cat $OutFile; echo "End Of File"
echo "Execution ended:"
date
echo
echo "The input file contains" $(wc -l <$InFile) "lines."
echo "The output file contains" $(wc -l <$OutFile) "lines."
echo; echo "Normal end of job."; echo; exit
Please test it on your computer to verify execution time and output validity.
Quote:
can you please explain this?
Code:
awk 'FNR==NR {p[NR]=$1; str=$3; next}
{s=str; for (j=1;j<=NF;j++) if ($j>0) {s=substr(s,1,p[j])"o"substr(s,p[j]+2)};
print s}' $Work1 $Work3 >>$OutFile
This part ...
Code:
FNR==NR {p[NR]=$1; str=$3; next}
... reads the first of two input files, Work1, which provides a list of substitution positions (columns) and the input line which is to be modified by those substitutions. The positions are saved in the awk array p (for position) and the input line (such as A0~-D09101E012-0) is saved in the awk variable str (for string).
The rest of this awk reads the second input file Work3 which guides the code to perform all possible permutations. For a three-substitution input line, Work3 looks like this ...
Code:
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Reading top-to-bottom, you will recognize each row as binary 1, 2, ..., 7.
s=str copies the A0~-D09101E012-0 string before any substitutions are made.
for (j=1;j<=NF;j++) sets up a loop to apply each row (such as 1 1 0) from Work3.
if ($j>0) says a substitution is made only for a 1 in the row (such as 1 1 0).
s=substr(s,1,p[j])"o"substr(s,p[j]+2) makes one substitution. It is done by taking a "left hand" substring (such as A0~-D09101E), appending an "o", and appending a "right hand" substring (such as 12-0).
print s is performed after the j-loop has ended. It writes the awk variable s (the finished product) to the output file.
Whew!
Quote:
i'll test it against the c code the friend of mine is going to sell me...
My price beats his!
Daniel B. Martin
Last edited by danielbmartin; 02-04-2014 at 06:01 PM.
Reason: Cosmetic improvement
got it... i guessed that some "magic" awk script would have solved it, and it did...
a couple more of questions:
1) the script goes crazy if the first part of the record (the string before "~") contains spaces...
i guess it's because of $Work1 format:
# Work1 looks like this:
# 5 D0 A0~-D09101E012-0
i solved it by substituting all spaces with "_" before executing the script and applying the opposite substitution to the output file...
2) about $Work2: do you think it's worth to create a HUGE file, save it somewhere and reuse it everytime the script runs?
s=substr(s,1,p[j])"o"substr(s,p[j]+2)[/B] makes one substitution. It is done by taking a "left hand" substring (such as A0~-D09101E), appending an "o", and appending a "right hand" substring (such as 12-0).
this said, i'd expect the script to work with a regex like "[A-Z]0[A-Z]", but it doesn't seem to... it's late, i'm going to sleep... tomorrow i'll think about it... if you have suggestions (NOT the solution), they're always welcome...
2) about $Work2: do you think it's worth to create a HUGE file, save it somewhere and reuse it everytime the script runs?
Work2 is built only once per execution and therefore contributes little to overall execution time. Each line in the InFile causes a new Work3 to be created, and it is always a subset of Work2. There are only five possible "versions" of Work3. An input line with two substitutions always creates the same Work3 ...
Code:
0 1
1 0
1 1
... and an input line with three substitutions always creates the same Work3 ...
Code:
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
... and so forth. This suggests the possibility of building five versions at the start of program execution and selecting the appropriate one for each InFile line. However, the time saving would be small because it saves only the time to perform the subsetting. That task is handled by this code snippet:
Code:
# Trim the binary table to suit this input line.
nc=$(wc -l <$Work1)
let nr=-1+2**$nc
cut -d" " -f1-$nc <$Work2 |rev |head -$nr >$Work3
There's not much processing here. No loops, no function calls, nothing complicated. I think the rev can be eliminated; it was coded that way for esthetic reasons.
I think as most of the hard yards are done in awk, which can also easily do arithmetic and most if not all of your other commands, it would been cleaner and make more sense to do the lot in awk as opposed
to asking bash to collate everything. You can also probably do away with all the temp files as well if you use a single interface.
I know you have someone doing the C option, but here is the tidied up ruby that now handles repetition, ie "A0" appears 2 or more times (thanks Daniel for the ideas)
Code:
#!/usr/bin/env ruby
# String array used to store regex matches
sa_scan = []
# Loop over the file passed in as the first argument to script
# 'if' at the closing brace ensures we received at least 1 argument and
# that the file exists
IO.foreach(ARGV[0]){
|line|
# Separate line using '~' into a string array and create a copy of the
# original string
sa_split = line.split('~')
s_copy = sa_split[1].clone
# Loop over regex items found and store the index at where they are in
# the original string into a number array
sa_split[1].scan(/[A-Z-]0/){|m|
na_scan << s_copy.index(m) + 1
# Replace regex found by underscores in case of repeated patterns
s_copy.sub!(m, '__')
}
# Set n to size of array. If no matches found print original entry
n = na_scan.size
puts line if n == 0
# Loop over number of entries found
1.upto(n){|i|
# Loop over the possible combinations
na_scan.combination(i){|a|
s_original = sa_split[1].clone
# Replace indexed item with new character
a.each{|s|
s_original[s] = 'o'
}
# Join back new string with original and insert '~' where necessary
puts [sa_split[0],s_original,sa_split[2..-1]].flatten.join('~')
}
}
# Clear array to be re-used
na_scan.clear
} if ! ARGV.empty? && File.exists?(ARGV[0])
This one now reads from a file and performed the 100 lines Daniel used as per below times:
Code:
real 0m0.837s
user 0m0.177s
sys 0m0.650s
This was run on a VM.
The other beauty here is it would be simple enough to also supply a second argument on the line, first being the file name, to be the regex to use or if not available to use a default.
Last edited by grail; 02-06-2014 at 10:17 PM.
Reason: Added correction found by Daniel; Added comments
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.