[SOLVED] bash: get all possible substring substitution combinations (having fun)

masavini · 02-02-2014, 06:11 PM

hi,
i have a file with ~300k strings like this:
A0~-D091010E12-0

i'd like get all the possibile strings generated by the substitution of '[A-Z-]0' with '\1o', but only AFTER the ~ char. the first field (~ delimited) is a suffix that must be printed with every output line...

in this case, the regexp '[A-Z-]0' matches 3 substrings:

Code:

A0~-D09101E012-0
    ^^    ^^  ^^
    D0    E0  -0

the output should look like this (^ are written just to clarify what substitutions have been made for each line):

Code:

A0~-Do9101E012-0
     ^
A0~-D09101Eo12-0
           ^
A0~-D09101E012-o
               ^
A0~-Do9101Eo12-0
     ^     ^
A0~-Do9101E012-o
     ^         ^
A0~-D09101Eo12-o
           ^   ^
A0~-Do9101Eo12-o
     ^     ^   ^

i've been struggling for the whole afternoon trying to get this output, but the best i could do is just this:

Code:

#! /bin/bash

string="A0~-D09101E012-0"

echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out	# substitute all of the occurrences

n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) 	# count occurrences

for i in $(seq 1 $n); do
	s=$(echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$i'", $2); print $1"~"out}')	# generate "1-substitution" lines
	echo $s  >> out
	for j in $(seq 1 $n); do
		echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$j'", $2); print $1"~"out}' >> out	# generate "2-substitutions" lines
	done
done

# how can i create cycles for "12-occurrences" strings?! do i need 12 nested for cycles?!?!

sort -u out

this is the output:

Code:

A0~-D09101E012-o
               ^
A0~-D09101Eo12-0
           ^
A0~-D09101Eo12-o
           ^   ^
A0~-Do9101E012-0
     ^
A0~-Do9101E012-o
     ^         ^
A0~-Do9101Eo12-0
     ^     ^
A0~-Do9101Eo12-o
     ^     ^   ^

the script seems to work for this particular example, but i can't figure out how to adapt it for n occurrences strings... and i'm suspecting the way i approached the problem is completely wrong...

hints?

grail · 02-03-2014, 01:41 AM

So are you saying that not only do you wish to place the 'o' in the correct spot but the output must demonstrate all the places the input was changed?

I am also not sure I follow the logic as you show that the 'o' at the end is changed 4 times?? What is the benefit here?

Your original output made some sense up to the point where it shows the individual changes. After this point you are showing the combinations possible (perhaps permutations, been a while)
To what end?

As for:

Quote:

# how can i create cycles for "12-occurrences" strings?! do i need 12 nested for cycles?!?!

You probably need to look at some form of recursion, but this is not generally a task well suited for bash.

NevemTeve · 02-03-2014, 03:06 AM

You should do this in C; it would be ten times easier to write the code, and the program would run one thousand time faster.

Nephew · 02-03-2014, 03:31 AM

I think for n occurrences, there has to be n for loops.
Your example works because you have 3 occurrences and you have 3 for loops.
If n is known or fixed, then you can use this method, but if n changes a lot then I guess you have to seek another method.

masavini · 02-03-2014, 03:38 AM

n changes, so no chances to achieve my goal with this method... thanks anyway...

@grail

Quote:

So are you saying that not only do you wish to place the 'o' in the correct spot but the output must demonstrate all the places the input was changed?

no, no... i don't want the output to demonstrate where the substitution have been applied, i only put it in the first post to clarify what i need...

pan64 · 02-03-2014, 04:07 AM

First pass:
I would try to mark all those spaces with a special sign, for example ! (if it cannot occur in the original text).
That would be a normal search/replace operation without problems (cut at the delimiter first)
Second pass
You can count the number of ! sign, that will give you information about the number of strings generated from that source line. You can try to generate those strings one by one (one cycle) walking char by char (second loop) and replace ! based on its position (=nth !) with the (nth digit)?0

of the first loop counter.

I hope this helps
I would use perl or c, not awk or shell.
Also avoid chains like awk|grep|wc, that can be implemented in a single awk (those constructs may seriously slow down the execution)

masavini · 02-03-2014, 05:27 AM

Quote:

Originally Posted by pan64

I would use perl or c, not awk or shell.

it would be great, but i'm a total newbie... i can only do simple awk/grep/sed scripts...
i hoped it could be done with some awk "magic" commands, but it sounds like it does not...

grail · 02-03-2014, 05:39 AM

So I am still not following if we need all changes to be made or only a set number, but all could be handled as:

Code:

$ echo "A0~-D09101E012-0" | sed -r 's/([A-Z-])0([^~]|$)/\1o\2/g'
A0~-Do9101Eo12-o

This seems to get the desired output but not sure if this is the idea?

masavini · 02-03-2014, 06:03 AM

Quote:

Originally Posted by grail

So I am still not following if we need all changes to be made or only a set number, but all could be handled as:

Code:

$ echo "A0~-D09101E012-0" | sed -r 's/([A-Z-])0([^~]|$)/\1o\2/g'
A0~-Do9101Eo12-o

This seems to get the desired output but not sure if this is the idea?

sorry grail, it seems i've not been clear enough... the input file is a csv "~" separated.
i need to process the 2nd field, so that every record that contains "([A-Z-])0" is the subject for a '\1o' sustitution.

if the 2nd field of a particular record contains "[A-Z-]0" only once, the command is very easy and it's the one you wrote.
but if the regexp is found twice, the process of that record should create a 3 lines output:
a line where only the 1st occurrence is subsituted
a line where only the 2nd occurrence is subsituted
a line where both the occurrences are substituted.

if a record is "A0~-D091010E12-0" (3 occurrences), the output should be 7 lines:

Code:

A0~-Do9101E012-0
A0~-D09101Eo12-0
A0~-D09101E012-o
A0~-Do9101Eo12-0
A0~-Do9101E012-o
A0~-D09101Eo12-o
A0~-Do9101Eo12-o

pan64 · 02-03-2014, 06:43 AM

Quote:

Originally Posted by masavini

it would be great, but i'm a total newbie... i can only do simple awk/grep/sed scripts...
i hoped it could be done with some awk "magic" commands, but it sounds like it does not...

You can try it also with awk, that should work, just try to understand the logic I suggested.

grail · 02-03-2014, 09:28 AM

Well if it helps, your solution set is - - (2^n) - 1

So if your field has 10 changes this will equate to 1023 lines being displayed.

If this is really what you need, have fun as it should be a good challenge.

As mentioned earlier you will more than likely need to go to something like Perl, Ruby or C to get a decent solution that works in a reasonable time frame.

masavini · 02-03-2014, 03:28 PM

wanna have fun?

this works... it takes a few seconds with 4 occurrences strings and few minutes with 5 occurrences strings... did not test with more occurrences (i guess it would take hours)...

obviously i'm not proud of this script, but i thought that if my brain can't take me to the solution, entropy always will.

Code:

#! /bin/bash

string="A0~-D09101E012-0A0"    # 4 occurrences of "[A-Z-]0"

echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out	# substitute all of the occurrences

n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) 	# count occurrences
for i in $(seq 1 $n); do
	s=$(echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$i'", $2); print $1"~"out}')	# generate "1-substitution" lines
	echo $s  >> out
	if [[ $s =~ [A-Z]0 ]]; then
		for j in $(seq 1 $n); do
			echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$j'", $2); print $1"~"out}' >> out	# generate "2-substitutions" lines
		done
	fi
done

expectedN=$(echo "2 ^ $n - 1" | bc)
echo "$n occurrences, $expectedN expected output lines"

while (( $(sort -u out | wc -l ) < $expectedN )); do
	echo -ne "$(sort -u out | wc -l ) combinations found so far   \r"
	randomCycles=$(( $(rand -M $(( $n - 2 ))) + 3 ))	# if n=3, i always get 3. if n=4, i can get 3 or 4. if n=5, i can get 3, 4 or 5.
	s=$string
	for i in $(seq 1 $randomCycles); do
		randomPosition=$(( $(rand -M $(( $n - $i + 1 ))) + 1 ))	# get a random position. if n=3, i always get 1. if n=5 and i=3, i can get 1, 2 or 3.
		s=$(echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$randomPosition'", $2); print $1"~"out}')    # substitute the random position
	done
	echo "$s" >> out
done

sort -u out

take care of your heart: just don't laugh too much...

danielbmartin · 02-03-2014, 03:47 PM

Quote:

Originally Posted by masavini

wanna have fun?

Yes.

I tried to run your code but was not successful.

Code:

 rand: command not found

Daniel B. Martin

masavini · 02-03-2014, 06:12 PM

Code:

$ rand --version
Random numbers generator for GNU/Linux, version 1.0.4, May  7 2009
Copyright (c) 2008  Guduleasa Alexandru Ionut <gulyan89@yahoo.com>
Licence: GPL v3 or any later

i'm using ubuntu 13.04, can't remember if rand was included among installation packages...

Code:

$ rand -M 3

generates a random integer between 0 and 2... nothing else...

grail · 02-04-2014, 07:55 AM

So the following does the initial requirement, however it does not currently allow for a repeated combination:

Code:

#!/usr/bin/env ruby

s_input = 'A0~-D09201E012-0A0'

sa_split = s_input.split('~')
sa_scan  = sa_split[1].scan(/[A-Z-]0/)

1.upto(sa_scan.size){|i|
    sa_scan.combination(i).to_a.each{|a|
        s_original = sa_split[1].clone
        a.each{|s|
            s_original.sub!(s,s[0]+'o')
        }
        puts [sa_split[0],s_original,sa_split[2..-1]].flatten.join('~')
    }   
}

There is also little to no error checking, so this would be something else to implement

As you can see, in languages such as bash and awk this level of complexity would take considerably more work

Here is a reference for you to look up any of the commands used:

http://www.ruby-doc.org/core