[SOLVED] bash: get all possible substring substitution combinations (having fun)
hi,
i have a file with ~300k strings like this:
A0~-D09101E012-0
i'd like to get all the possible strings generated by substituting '\1o' for '([A-Z-])0', but only AFTER the ~ char. the first field (~ delimited) is a prefix that must be printed with every output line...
in this case, the regexp '[A-Z-]0' matches 3 substrings:
Code:
A0~-D09101E012-0
    ^^    ^^  ^^
    D0    E0  -0
the output should look like this (^ are written just to clarify what substitutions have been made for each line):
the script seems to work for this particular example, but i can't figure out how to adapt it for n occurrences strings... and i'm suspecting the way i approached the problem is completely wrong...
So are you saying that not only do you wish to place the 'o' in the correct spot but the output must demonstrate all the places the input was changed?
I am also not sure I follow the logic as you show that the 'o' at the end is changed 4 times?? What is the benefit here?
Your original output made some sense up to the point where it shows the individual changes. After this point you are showing the combinations possible (perhaps permutations, been a while)
To what end?
As for:
Quote:
# how can i create cycles for "12-occurrences" strings?! do i need 12 nested for cycles?!?!
You probably need to look at some form of recursion, but this is not generally a task well suited for bash.
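For what it's worth, the recursion mentioned here can be sketched in bash after all, slow as it may be on 300k lines. This is only a rough sketch: `combos` and `expand_recursive` are made-up names, and it assumes every line contains exactly one `~` delimiter and that `[A-Z]` matches only uppercase ASCII in your locale. At each match the function branches twice: once leaving the `0` alone, once turning it into `o`.

```bash
#!/usr/bin/env bash
# Recursive sketch (hypothetical helper names, not from the original script).
# combos DONE REST: DONE is already decided, REST is still to be scanned.
combos() {
    local done_part=$1 rest=$2
    if [[ $rest =~ [A-Z-]0 ]]; then
        local m=${BASH_REMATCH[0]}      # leftmost match, e.g. "D0"
        local before=${rest%%"$m"*}     # text before that match
        local after=${rest#*"$m"}       # text after that match
        combos "$done_part$before$m" "$after"           # keep the 0
        combos "$done_part$before${m:0:1}o" "$after"    # use o instead
    else
        printf '%s\n' "$done_part$rest"                 # nothing left to decide
    fi
}

# Split "prefix~field", expand the field, drop the untouched variant.
expand_recursive() {
    local line=$1
    local prefix=${line%%"~"*} field=${line#*"~"}
    combos "" "$field" | while IFS= read -r v; do
        [[ $v == "$field" ]] || printf '%s~%s\n' "$prefix" "$v"
    done
}
```

Calling `expand_recursive 'A0~-D09101E012-0'` should print seven lines, one per non-empty choice of the three occurrences.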
I think for n occurrences, there has to be n for loops.
Your example works because you have 3 occurrences and you have 3 for loops.
If n is known or fixed, then you can use this method, but if n changes a lot then I guess you have to seek another method.
First pass:
I would try to mark all those places with a special sign, for example ! (if it cannot occur in the original text).
That would be a normal search/replace operation without problems (cut at the delimiter first).
Second pass:
Count the number of ! signs; that tells you how many strings will be generated from that source line. Then generate those strings one by one (outer loop), walking char by char (inner loop) and replacing each ! based on its position (the nth !) with o or 0, depending on the nth binary digit of the outer loop counter.
I hope this helps
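The two passes above could look roughly like this in bash. Treat it as a sketch under assumptions: `expand_line` is a made-up name, `!` must never occur in the data, and each line is expected to contain one `~` delimiter. The outer loop counter is read as a bitmask, so bit k decides whether the k-th mark becomes `o` or goes back to `0`.

```bash
#!/usr/bin/env bash
# Two-pass sketch: mark candidates with "!", then let the bits of a
# counter pick which marks become "o". expand_line is a hypothetical name.
expand_line() {
    local line=$1
    local prefix=${line%%"~"*}    # first field, printed unchanged
    local field=${line#*"~"}      # second field, the one to expand
    local marked only n mask i k ch out

    # First pass: every [A-Z-]0 becomes [A-Z-]!
    marked=$(printf '%s\n' "$field" | LC_ALL=C sed 's/\([A-Z-]\)0/\1!/g')

    # Count the marks by deleting every other character.
    only=${marked//[^!]/}
    n=${#only}

    # Second pass: masks 1 .. 2^n - 1 (mask 0 would be the untouched string).
    for (( mask = 1; mask < (1 << n); mask++ )); do
        out='' k=0
        for (( i = 0; i < ${#marked}; i++ )); do
            ch=${marked:i:1}
            if [[ $ch == '!' ]]; then
                if (( (mask >> k) & 1 )); then ch=o; else ch=0; fi
                k=$(( k + 1 ))
            fi
            out+=$ch
        done
        printf '%s~%s\n' "$prefix" "$out"
    done
}
```

For `expand_line 'A0~-D09101E012-0'` this walks masks 1 to 7, starting with `A0~-Do9101E012-0` and ending with `A0~-Do9101Eo12-o`. No recursion or nested loops per occurrence are needed, since the mask does the bookkeeping.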
I would use perl or c, not awk or shell.
Also avoid chains like awk|grep|wc, that can be implemented in a single awk (those constructs may seriously slow down the execution)
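As an illustration of that last point, a counting chain such as awk | grep -o | wc -l can probably be folded into one awk call, because `gsub` returns the number of replacements it made. A small sketch (`count_hits` is a made-up name; replacing each match with `&` leaves the field unchanged):

```bash
#!/usr/bin/env bash
# Count matches of [A-Z-]0 in the 2nd "~"-separated field with one awk call.
# count_hits is a hypothetical helper name.
count_hits() {
    printf '%s\n' "$1" | awk -F '~' '{ print gsub(/[A-Z-]0/, "&", $2) }'
}
```

`count_hits 'A0~-D09101E012-0'` should report 3, with one process instead of three per line.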
it would be great, but i'm a total newbie... i can only do simple awk/grep/sed scripts...
i hoped it could be done with some awk "magic" commands, but it sounds like it does not...
So I am still not following if we need all changes to be made or only a set number, but all could be handled as:
Code:
$ echo "A0~-D09101E012-0" | sed -r 's/([A-Z-])0([^~]|$)/\1o\2/g'
A0~-Do9101Eo12-o
This seems to get the desired output but not sure if this is the idea?
sorry grail, it seems i've not been clear enough... the input file is a csv "~" separated.
i need to process the 2nd field, so that every record that contains "([A-Z-])0" is subject to a '\1o' substitution.
if the 2nd field of a particular record contains "[A-Z-]0" only once, the command is very easy and it's the one you wrote.
but if the regexp is found twice, the processing of that record should create a 3-line output:
a line where only the 1st occurrence is substituted
a line where only the 2nd occurrence is substituted
a line where both occurrences are substituted.
if a record is "A0~-D09101E012-0" (3 occurrences), the output should be 7 lines:
Quote:
it would be great, but i'm a total newbie... i can only do simple awk/grep/sed scripts...
i hoped it could be done with some awk "magic" commands, but it sounds like it does not...
You can try it also with awk, that should work, just try to understand the logic I suggested.
Well if it helps, the size of your solution set is (2^n) - 1, where n is the number of occurrences in the field.
So if your field has 10 changes this will equate to 1023 lines being displayed.
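The reasoning behind that count: each of the n occurrences is either substituted or left alone, which gives 2^n variants, and the one variant with nothing substituted is not wanted. In bash the number can be computed with plain shell arithmetic (a tiny sketch; `expected_lines` is a made-up name):

```bash
#!/usr/bin/env bash
# Number of output lines for a field with n occurrences: 2^n - 1.
# expected_lines is a hypothetical helper name.
expected_lines() {
    echo $(( (1 << $1) - 1 ))
}
```

So `expected_lines 3` gives 7 and `expected_lines 10` gives 1023, which is why fields with many occurrences blow up quickly.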
If this is really what you need, have fun as it should be a good challenge.
As mentioned earlier you will more than likely need to go to something like Perl, Ruby or C to get a decent solution that works in a reasonable time frame.
this works... it takes a few seconds with 4 occurrences strings and few minutes with 5 occurrences strings... did not test with more occurrences (i guess it would take hours)...
obviously i'm not proud of this script, but i thought that if my brain can't take me to the solution, entropy always will.
Code:
#!/bin/bash
string="A0~-D09101E012-0A0" # 4 occurrences of "[A-Z-]0"
echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out # substitute all of the occurrences
n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) # count occurrences
for i in $(seq 1 $n); do
    s=$(echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$i'", $2); print $1"~"out}') # generate "1-substitution" lines
    echo "$s" >> out
    if [[ $s =~ [A-Z]0 ]]; then
        for j in $(seq 1 $n); do
            echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$j'", $2); print $1"~"out}' >> out # generate "2-substitutions" lines
        done
    fi
done
expectedN=$(echo "2 ^ $n - 1" | bc)
echo "$n occurrences, $expectedN expected output lines"
while (( $(sort -u out | wc -l ) < $expectedN )); do
    echo -ne "$(sort -u out | wc -l ) combinations found so far \r"
    randomCycles=$(( $(rand -M $(( $n - 2 ))) + 3 )) # if n=3, i always get 3. if n=4, i can get 3 or 4. if n=5, i can get 3, 4 or 5.
    s=$string
    for i in $(seq 1 $randomCycles); do
        randomPosition=$(( $(rand -M $(( $n - $i + 1 ))) + 1 )) # get a random position. if n=3, i always get 1. if n=5 and i=3, i can get 1, 2 or 3.
        s=$(echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$randomPosition'", $2); print $1"~"out}') # substitute the random position
    done
    echo "$s" >> out
done
sort -u out
take care of your heart: just don't laugh too much...
$ rand --version
Random numbers generator for GNU/Linux, version 1.0.4, May 7 2009
Copyright (c) 2008 Guduleasa Alexandru Ionut <gulyan89@yahoo.com>
Licence: GPL v3 or any later
i'm using ubuntu 13.04, can't remember if rand was included among installation packages...
Code:
$ rand -M 3
generates a random integer between 0 and 2... nothing else...