LinuxQuestions.org
Help answer threads with 0 replies.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 02-02-2014, 06:11 PM   #1
masavini
Member
 
Registered: Jun 2008
Posts: 285

Rep: Reputation: 6
bash: get all possible substring substitution combinations (having fun)


hi,
i have a file with ~300k strings like this:
A0~-D091010E12-0

i'd like get all the possibile strings generated by the substitution of '[A-Z-]0' with '\1o', but only AFTER the ~ char. the first field (~ delimited) is a suffix that must be printed with every output line...

in this case, the regexp '[A-Z-]0' matches 3 substrings:
Code:
A0~-D09101E012-0
    ^^    ^^  ^^
    D0    E0  -0
the output should look like this (^ are written just to clarify what substitutions have been made for each line):
Code:
A0~-Do9101E012-0
     ^
A0~-D09101Eo12-0
           ^
A0~-D09101E012-o
               ^
A0~-Do9101Eo12-0
     ^     ^
A0~-Do9101E012-o
     ^         ^
A0~-D09101Eo12-o
           ^   ^
A0~-Do9101Eo12-o
     ^     ^   ^
i've been struggling for the whole afternoon trying to get this output, but the best i could do is just this:
Code:
#! /bin/bash

string="A0~-D09101E012-0"

echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out	# substitute all of the occurrences

n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) 	# count occurrences

for i in $(seq 1 $n); do
	s=$(echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$i'", $2); print $1"~"out}')	# generate "1-substitution" lines
	echo $s  >> out
	for j in $(seq 1 $n); do
		echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$j'", $2); print $1"~"out}' >> out	# generate "2-substitutions" lines
	done
done

# how can i create cycles for "12-occurrences" strings?! do i need 12 nested for cycles?!?!

sort -u out
this is the output:
Code:
A0~-D09101E012-o
               ^
A0~-D09101Eo12-0
           ^
A0~-D09101Eo12-o
           ^   ^
A0~-Do9101E012-0
     ^
A0~-Do9101E012-o
     ^         ^
A0~-Do9101Eo12-0
     ^     ^
A0~-Do9101Eo12-o
     ^     ^   ^
the script seems to work for this particular example, but i can't figure out how to adapt it for n occurrences strings... and i'm suspecting the way i approached the problem is completely wrong...

hints?

Last edited by masavini; 02-03-2014 at 03:40 PM.
 
Old 02-03-2014, 01:41 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
So are you saying that not only do you wish to place the 'o' in the correct spot but the output must demonstrate all the places the input was changed?

I am also not sure I follow the logic as you show that the 'o' at the end is changed 4 times?? What is the benefit here?

Your original output made some sense up to the point where it shows the individual changes. After this point you are showing the combinations possible (perhaps permutations, been a while)
To what end?

As for:
Quote:
# how can i create cycles for "12-occurrences" strings?! do i need 12 nested for cycles?!?!
You probably need to look at some form of recursion, but this is not generally a task well suited for bash.
 
Old 02-03-2014, 03:06 AM   #3
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869
You should do this in C; it would be ten times easier to write the code, and the program would run one thousand time faster.
 
Old 02-03-2014, 03:31 AM   #4
Nephew
LQ Newbie
 
Registered: Jun 2013
Posts: 14

Rep: Reputation: Disabled
I think for n occurrences, there has to be n for loops.
Your example works because you have 3 occurrences and you have 3 for loops.
If n is known or fixed, then you can use this method, but if n changes a lot then I guess you have to seek another method.
 
Old 02-03-2014, 03:38 AM   #5
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
n changes, so no chances to achieve my goal with this method... thanks anyway...

@grail
Quote:
So are you saying that not only do you wish to place the 'o' in the correct spot but the output must demonstrate all the places the input was changed?
no, no... i don't want the output to demonstrate where the substitution have been applied, i only put it in the first post to clarify what i need...
 
Old 02-03-2014, 04:07 AM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308
First pass:
I would try to mark all those spaces with a special sign, for example ! (if it cannot occur in the original text).
That would be a normal search/replace operation without problems (cut at the delimiter first)
Second pass
You can count the number of ! sign, that will give you information about the number of strings generated from that source line. You can try to generate those strings one by one (one cycle) walking char by char (second loop) and replace ! based on its position (=nth !) with the (nth digit)?0 of the first loop counter.

I hope this helps
I would use perl or c, not awk or shell.
Also avoid chains like awk|grep|wc, that can be implemented in a single awk (those constructs may seriously slow down the execution)
 
Old 02-03-2014, 05:27 AM   #7
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by pan64 View Post
I would use perl or c, not awk or shell.
it would be great, but i'm a total newbie... i can only do simple awk/grep/sed scripts...
i hoped it could be done with some awk "magic" commands, but it sounds like it does not...
 
Old 02-03-2014, 05:39 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
So I am still not following if we need all changes to be made or only a set number, but all could be handled as:
Code:
$ echo "A0~-D09101E012-0" | sed -r 's/([A-Z-])0([^~]|$)/\1o\2/g'
A0~-Do9101Eo12-o
This seems to get the desired output but not sure if this is the idea?
 
Old 02-03-2014, 06:03 AM   #9
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by grail View Post
So I am still not following if we need all changes to be made or only a set number, but all could be handled as:
Code:
$ echo "A0~-D09101E012-0" | sed -r 's/([A-Z-])0([^~]|$)/\1o\2/g'
A0~-Do9101Eo12-o
This seems to get the desired output but not sure if this is the idea?
sorry grail, it seems i've not been clear enough... the input file is a csv "~" separated.
i need to process the 2nd field, so that every record that contains "([A-Z-])0" is the subject for a '\1o' sustitution.

if the 2nd field of a particular record contains "[A-Z-]0" only once, the command is very easy and it's the one you wrote.
but if the regexp is found twice, the process of that record should create a 3 lines output:
a line where only the 1st occurrence is subsituted
a line where only the 2nd occurrence is subsituted
a line where both the occurrences are substituted.

if a record is "A0~-D091010E12-0" (3 occurrences), the output should be 7 lines:
Code:
A0~-Do9101E012-0
A0~-D09101Eo12-0
A0~-D09101E012-o
A0~-Do9101Eo12-0
A0~-Do9101E012-o
A0~-D09101Eo12-o
A0~-Do9101Eo12-o

Last edited by masavini; 02-03-2014 at 06:05 AM.
 
Old 02-03-2014, 06:43 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308Reputation: 7308
Quote:
Originally Posted by masavini View Post
it would be great, but i'm a total newbie... i can only do simple awk/grep/sed scripts...
i hoped it could be done with some awk "magic" commands, but it sounds like it does not...
You can try it also with awk, that should work, just try to understand the logic I suggested.
 
Old 02-03-2014, 09:28 AM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
Well if it helps, your solution set is - - (2^n) - 1

So if your field has 10 changes this will equate to 1023 lines being displayed.

If this is really what you need, have fun as it should be a good challenge.

As mentioned earlier you will more than likely need to go to something like Perl, Ruby or C to get a decent solution that works in a reasonable time frame.
 
Old 02-03-2014, 03:28 PM   #12
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
wanna have fun?

this works... it takes a few seconds with 4 occurrences strings and few minutes with 5 occurrences strings... did not test with more occurrences (i guess it would take hours)...

obviously i'm not proud of this script, but i thought that if my brain can't take me to the solution, entropy always will.
Code:
#! /bin/bash

string="A0~-D09101E012-0A0"    # 4 occurrences of "[A-Z-]0"

echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out	# substitute all of the occurrences

n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) 	# count occurrences
for i in $(seq 1 $n); do
	s=$(echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$i'", $2); print $1"~"out}')	# generate "1-substitution" lines
	echo $s  >> out
	if [[ $s =~ [A-Z]0 ]]; then
		for j in $(seq 1 $n); do
			echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$j'", $2); print $1"~"out}' >> out	# generate "2-substitutions" lines
		done
	fi
done

expectedN=$(echo "2 ^ $n - 1" | bc)
echo "$n occurrences, $expectedN expected output lines"

while (( $(sort -u out | wc -l ) < $expectedN )); do
	echo -ne "$(sort -u out | wc -l ) combinations found so far   \r"
	randomCycles=$(( $(rand -M $(( $n - 2 ))) + 3 ))	# if n=3, i always get 3. if n=4, i can get 3 or 4. if n=5, i can get 3, 4 or 5.
	s=$string
	for i in $(seq 1 $randomCycles); do
		randomPosition=$(( $(rand -M $(( $n - $i + 1 ))) + 1 ))	# get a random position. if n=3, i always get 1. if n=5 and i=3, i can get 1, 2 or 3.
		s=$(echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$randomPosition'", $2); print $1"~"out}')    # substitute the random position
	done
	echo "$s" >> out
done

sort -u out
take care of your heart: just don't laugh too much...

Last edited by masavini; 02-03-2014 at 03:30 PM.
 
Old 02-03-2014, 03:47 PM   #13
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by masavini View Post
wanna have fun?
Yes.

I tried to run your code but was not successful.
Code:
 rand: command not found
Daniel B. Martin
 
Old 02-03-2014, 06:12 PM   #14
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Code:
$ rand --version
Random numbers generator for GNU/Linux, version 1.0.4, May  7 2009
Copyright (c) 2008  Guduleasa Alexandru Ionut <gulyan89@yahoo.com>
Licence: GPL v3 or any later
i'm using ubuntu 13.04, can't remember if rand was included among installation packages...

Code:
$ rand -M 3
generates a random integer between 0 and 2... nothing else...
 
Old 02-04-2014, 07:55 AM   #15
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
So the following does the initial requirement, however it does not currently allow for a repeated combination:
Code:
#!/usr/bin/env ruby

s_input = 'A0~-D09201E012-0A0'

sa_split = s_input.split('~')
sa_scan  = sa_split[1].scan(/[A-Z-]0/)

1.upto(sa_scan.size){|i|
    sa_scan.combination(i).to_a.each{|a|
        s_original = sa_split[1].clone
        a.each{|s|
            s_original.sub!(s,s[0]+'o')
        }
        puts [sa_split[0],s_original,sa_split[2..-1]].flatten.join('~')
    }   
}
There is also little to no error checking, so this would be something else to implement

As you can see, in languages such as bash and awk this level of complexity would take considerably more work

Here is a reference for you to look up any of the commands used:

http://www.ruby-doc.org/core

Last edited by grail; 02-04-2014 at 07:57 AM.
 
1 members found this post helpful.
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] substring using awk and sed ( binary ) iffarrukh Programming 7 05-15-2012 11:22 PM
awk substring command casperdaghost Linux - Newbie 1 07-31-2010 07:26 AM
Using Grep/Awk/Sed to get a substring from a command johnjust Programming 5 01-12-2010 08:02 PM
Bash - Substring Extraction and Substitution on same variable jax8 Programming 5 04-26-2009 06:20 AM
using awk substring function on a file in a bash script matt007 Programming 3 06-17-2008 08:17 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 09:23 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration