LinuxQuestions.org
Review your favorite Linux distribution.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 02-04-2014, 09:51 AM   #16
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660

Quote:
Originally Posted by masavini View Post
this works... it takes a few seconds with 4 occurrences strings and few minutes with 5 occurrences strings... did not test with more occurrences (i guess it would take hours)...
Please elaborate this execution time statement.

How many input lines with 4 occurrences? 5 occurrences? Does the real-world input file ever have strings with more than 5 occurrences?

Daniel B. Martin
 
Old 02-04-2014, 12:11 PM   #17
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by danielbmartin View Post
How many input lines with 4 occurrences? 5 occurrences? Does the real-world input file ever have strings with more than 5 occurrences?
i can't really tell... i only know the file is huge, changes frequently, and may contain many occurrences. even more, "[A-Z-]0" is not the only regex that should be subject of substitutions.

(un)fortunately, the is no chance that such a script can be used in the production system... but in the meantime, i probably understood how to make it work without the use of thermodynamics...

let's say a string has n occurrences. i just need to write numbers from 1 to (2^n-1) in base 2, using a n digits format. if n=2, i have the following base 2 numbers:
1-> 01
2-> 10
3-> 11

each of those strings indicates which occurrences must be substituted (1) and which must not (0).

i'm trying now to write it down, i hope i can post it tonight...

@grail
thanks for the ruby script, but i'm already paying someone to make a c program... i'm trying to solve the problem with bash just to learn some more... i will obviously post the c version as soon as i get it...

Last edited by masavini; 02-04-2014 at 12:12 PM.
 
Old 02-04-2014, 12:41 PM   #18
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
got it!

Code:
#!/bin/bash

string="A0~-D09101E012-0A0-0"

n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) 	# count occurrences

expectedN=$(echo "2 ^ $n - 1" | bc)
echo "$n occurrences, $expectedN expected output lines"

# substitute all of the occurrences, so i can save 1 cycle in the next loop
echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out

maps=$(seq 1 $(( $expectedN-1 )) | awk '{print "ibase=10;obase=2;" $1}' | bc | xargs printf "%0${n}d\n")
for map in $maps; do
	s=$string
	j=0	# substituted
	for (( i=0; i<${#map}; i++ )); do
		if (( ${map:$i:1} > 0 )); then
			let k=$i+1-$j    # add 1 to the char position (they go from 0 to n-1, i need to tell awk the position from 1 to n) and subtract previous substitutions number
			s=$(echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$k'", $2); print $1"~"out}')
			let j+=1	# substituted
		fi
	done
	echo "$s" >> out
done

wc -l out
cat out
i can't tell how proud i am of this script!!

Last edited by masavini; 02-04-2014 at 12:42 PM.
 
Old 02-04-2014, 01:00 PM   #19
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by masavini View Post
i'm trying to solve the problem with bash just to learn some more...
I wrote a solution to this problem, a bash script which uses awk to do the heavy lifting. I'd like to have some measure of performance of your script and post mine if it is competitive. Could you create a meaningful input file (~100 lines) which could be used to compare execution times?

Daniel B. Martin
 
Old 02-04-2014, 02:24 PM   #20
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
what if i just repeat the same operation 100 times?
Code:
#!/bin/bash

string="A0~-D09101E012-0A0-0"

start=$(date +%s)

for c in $(seq 1 100); do

	n=$(echo "$string" | awk -F '~' '{print $2}' | grep -o "[A-Z-]0" | wc -l) 	# count occurrences
	
	expectedN=$(echo "2 ^ $n - 1" | bc)
	
	echo "$string" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "g", $2); print $1"~"out}' > out$c	# substitute all of the occurrences
	
	maps=$(seq 1 $(( $expectedN-1 )) | awk '{print "ibase=10;obase=2;" $1}' | bc | xargs printf "%0${n}d\n")
	for map in $maps; do
		s=$string
		j=0	# substituted
		for (( i=0; i<${#map}; i++ )); do
			if (( ${map:$i:1} > 0 )); then
				let k=$i-$j+1
				s=$(echo "$s" | awk -F '~' '/[A-Z-]0/{out = gensub(/([A-Z-])0/, "\\1o", "'$k'", $2); print $1"~"out}')
				let j+=1	# substituted
			fi
		done
		echo "$s" >> out$c
	done
done

end=$(date +%s)

echo "executed in $(( $end - $start ))s"

cat out* > output
wc -l output
executed in 26s
3100 output
 
Old 02-04-2014, 03:12 PM   #21
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by masavini View Post
what if i just repeat the same operation 100 times?
I built an input file with 100 identical lines.
Each line is: A0~-D09101E012-0A0-0

The result is:
Code:
daniel@daniel-desktop:~$ bash /home/daniel/Desktop/LQfiles/dbm1057.bin
Execution started:
Tue Feb  4 16:06:34 EST 2014
Execution ended:
Tue Feb  4 16:06:36 EST 2014

The input file contains 100 lines.
The output file contains 3100 lines.

Normal end of job.
The results look right to my eye, but you should judge for yourself.

Daniel B. Martin
 
Old 02-04-2014, 03:18 PM   #22
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
I tried to post the code and my computer went nuts. I'll try again.

Is there a limit to the size of a post on this forum?

Daniel B. Martin
 
Old 02-04-2014, 03:19 PM   #23
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Code:
#!/bin/bash
#   Daniel B. Martin   Feb14
#
#   To execute this program, launch a terminal session and enter:
#   bash /home/daniel/Desktop/LQfiles/dbm1057.bin
#
#  This program inspired ...
#    http://www.linuxquestions.org/questions/programming-9/
#      bash-get-all-possible-substring-substitution-combinations-4175493478/
# File identification
   Path=$(cut -d'.' -f1 <<< ${0})
 InFile=$Path"inp.txt"
  Work1=$Path"w1.txt"
  Work2=$Path"w2.txt"
  Work3=$Path"w3.txt"
OutFile=$Path"out.txt"

# Have: a file with ~300k strings like this: A0~-D091010E12-0

# Want: all the possibile strings generated by the substitution of
#   '[A-Z-]0' with '\1o', but only AFTER the ~ char.
# The first field (~ delimited) is a suffix that must be printed
# with every output line...

#  A0~-D09101E012-0
# In this case, the regexp '[A-Z-]0' matches 3 substrings:
#  A0~-D09101E012-0
#      ^^    ^^  ^^
#      D0    E0  -0
 
# There would be 7 output lines from this 1 input line.
# A0~-Do9101E012-0
#      ^
# A0~-D09101Eo12-0
#            ^
# A0~-D09101E012-o
#                ^
# A0~-Do9101Eo12-0
#      ^     ^
# A0~-Do9101E012-o
#      ^         ^
# A0~-D09101Eo12-o
#            ^   ^
# A0~-Do9101Eo12-o
#     ^     ^   ^


echo "Execution started:"
date

# Create a binary table (in mirror form)
# This is done only once, no matter how large the InFile may be.

awk '{printf("%d ",  ($0*1.0000)%2);
      printf("%d ",  ($0*0.5000)%2);
      printf("%d ",  ($0*0.2500)%2);
      printf("%d ",  ($0*0.1250)%2);
      printf("%d\n", ($0*0.0625)  )}' <(seq -w 1 31) >$Work2

rm $OutFile  # Blow away any leftover output file.

# Bash loop ...
while read InLine  # Read one line from the InFile.
do

# Identify potential substitution locations.
awk -F "" '{for (j=index($0,"~");j<=NF;j++)
     {if (match(substr($0,j,2),/[A-Z-]0/))
       print j,substr($0,j,2),p,$0}}' <<<$InLine >$Work1
 
# Work1 looks like this:
# 5 D0  A0~-D09101E012-0
# 11 E0  A0~-D09101E012-0
# 15 -0  A0~-D09101E012-0

# Trim the binary table to suit this input line.
nc=$(wc -l <$Work1)
# Test for no substitutions required.
if [ $nc -eq 0 ]; then
  echo $InLine >>$OutFile
  continue
fi

let nr=-1+2**$nc
cut -d" " -f1-$nc <$Work2 |rev |head -$nr >$Work3

# Work3 contains a table which looks like this:
# 0 0 1
# 0 1 0
# 0 1 1
# 1 0 0
# 1 0 1
# 1 1 0
# 1 1 1
 
# FNR==NR is true only when reading the first input file.
awk 'FNR==NR {p[NR]=$1; str=$3; next}
 {s=str; for (j=1;j<=NF;j++) if ($j>0) {s=substr(s,1,p[j])"o"substr(s,p[j]+2)};
    print s}' $Work1 $Work3 >>$OutFile
done <$InFile  # End of bash loop

#echo "OutFile ..."; cat $OutFile; echo "End Of File"


echo "Execution ended:"
date
echo
echo "The input file contains" $(wc -l <$InFile) "lines."
echo "The output file contains" $(wc -l <$OutFile) "lines."



echo; echo "Normal end of job."; echo; exit
Daniel B. Martin
 
Old 02-04-2014, 04:44 PM   #24
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
2 seconds?!?

absolutely great... amazing...

can you please explain this?
Quote:
Code:
awk 'FNR==NR {p[NR]=$1; str=$3; next}
 {s=str; for (j=1;j<=NF;j++) if ($j>0) {s=substr(s,1,p[j])"o"substr(s,p[j]+2)};
    print s}' $Work1 $Work3 >>$OutFile
i can understand what it does, but not how...
if i'm not wrong, this avoids the loop within the base2-string.

great... i'll test it against the c code the friend of mine is going to sell me...
 
Old 02-04-2014, 05:41 PM   #25
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
i gave my script a 2nd chance... 21 seconds!

and yours took 1s on my shiny fast ultrabook...
 
Old 02-04-2014, 05:43 PM   #26
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by masavini View Post
2 seconds?!?
Please test it on your computer to verify execution time and output validity.

Quote:
can you please explain this?
Code:
awk 'FNR==NR {p[NR]=$1; str=$3; next}
 {s=str; for (j=1;j<=NF;j++) if ($j>0) {s=substr(s,1,p[j])"o"substr(s,p[j]+2)};
    print s}' $Work1 $Work3 >>$OutFile
This part ...
Code:
FNR==NR {p[NR]=$1; str=$3; next}
... reads the first of two input files, Work1, which provides a list of substitution positions (columns) and the input line which is to be modified by those substitutions. The positions are saved in the awk array p (for position) and the input line (such as A0~-D09101E012-0) is saved in the awk variable str (for string).

The rest of this awk reads the second input file Work3 which guides the code to perform all possible permutations. For a three-substitution input line, Work3 looks like this ...
Code:
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Reading top-to-bottom, you will recognize each row as binary 1, 2, ..., 7.


s=str copies the A0~-D09101E012-0 string before any substitutions are made.

for (j=1;j<=NF;j++) sets up a loop to apply each row (such as 1 1 0) from Work3.

if ($j>0) says a substitution is made only for a 1 in the row (such as 1 1 0).

s=substr(s,1,p[j])"o"substr(s,p[j]+2) makes one substitution. It is done by taking a "left hand" substring (such as A0~-D09101E), appending an "o", and appending a "right hand" substring (such as 12-0).

print s is performed after the j-loop has ended. It writes the awk variable s (the finished product) to the output file.

Whew!

Quote:
i'll test it against the c code the friend of mine is going to sell me...
My price beats his!

Daniel B. Martin

Last edited by danielbmartin; 02-04-2014 at 06:01 PM. Reason: Cosmetic improvement
 
Old 02-04-2014, 06:15 PM   #27
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
got it... i guessed that some "magic" awk script would have solved it, and it did...

a couple more of questions:
1) the script goes crazy if the first part of the record (the string before "~") contains spaces...
i guess it's because of $Work1 format:
# Work1 looks like this:
# 5 D0 A0~-D09101E012-0

i solved it by substituting all spaces with "_" before executing the script and applying the opposite substitution to the output file...

2) about $Work2: do you think it's worth to create a HUGE file, save it somewhere and reuse it everytime the script runs?

Last edited by masavini; 02-04-2014 at 06:26 PM.
 
Old 02-04-2014, 06:41 PM   #28
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by danielbmartin;5111596
[B
s=substr(s,1,p[j])"o"substr(s,p[j]+2)[/B] makes one substitution. It is done by taking a "left hand" substring (such as A0~-D09101E), appending an "o", and appending a "right hand" substring (such as 12-0).
this said, i'd expect the script to work with a regex like "[A-Z]0[A-Z]", but it doesn't seem to... it's late, i'm going to sleep... tomorrow i'll think about it... if you have suggestions (NOT the solution), they're always welcome...
 
Old 02-04-2014, 07:19 PM   #29
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by masavini View Post
2) about $Work2: do you think it's worth to create a HUGE file, save it somewhere and reuse it everytime the script runs?
Work2 is built only once per execution and therefore contributes little to overall execution time. Each line in the InFile causes a new Work3 to be created, and it is always a subset of Work2. There are only five possible "versions" of Work3. An input line with two substitutions always creates the same Work3 ...
Code:
0 1
1 0
1 1
... and an input line with three substitutions always creates the same Work3 ...
Code:
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
... and so forth. This suggests the possibility of building five versions at the start of program execution and selecting the appropriate one for each InFile line. However, the time saving would be small because it saves only the time to perform the subsetting. That task is handled by this code snippet:
Code:
# Trim the binary table to suit this input line.
nc=$(wc -l <$Work1)
let nr=-1+2**$nc
cut -d" " -f1-$nc <$Work2 |rev |head -$nr >$Work3
There's not much processing here. No loops, no function calls, nothing complicated. I think the rev can be eliminated; it was coded that way for esthetic reasons.

Daniel B. Martin
 
Old 02-05-2014, 12:46 AM   #30
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
I think as most of the hard yards are done in awk, which can also easily do arithmetic and most if not all of your other commands, it would been cleaner and make more sense to do the lot in awk as opposed
to asking bash to collate everything. You can also probably do away with all the temp files as well if you use a single interface.

I know you have someone doing the C option, but here is the tidied up ruby that now handles repetition, ie "A0" appears 2 or more times (thanks Daniel for the ideas)
Code:
#!/usr/bin/env ruby

# String array used to store regex matches
sa_scan = []

# Loop over the file passed in as the first argument to script
# 'if' at the closing brace ensures we received at least 1 argument and
# that the file exists
IO.foreach(ARGV[0]){
	|line|

	# Separate line using '~' into a string array and create a copy of the
	# original string
	sa_split = line.split('~')
	s_copy = sa_split[1].clone

	# Loop over regex items found and store the index at where they are in
	# the original string into a number array
	sa_split[1].scan(/[A-Z-]0/){|m|
		na_scan << s_copy.index(m) + 1
		
		# Replace regex found by underscores in case of repeated patterns
		s_copy.sub!(m, '__')
	}

	# Set n to size of array.  If no matches found print original entry
	n = na_scan.size
    puts line if n == 0

	# Loop over number of entries found
	1.upto(n){|i|
		# Loop over the possible combinations
		na_scan.combination(i){|a|
			s_original = sa_split[1].clone
			# Replace indexed item with new character
			a.each{|s|
				s_original[s] = 'o'
			}
			# Join back new string with original and insert '~' where necessary
			puts [sa_split[0],s_original,sa_split[2..-1]].flatten.join('~')
		}
	}

	# Clear array to be re-used
    na_scan.clear
} if ! ARGV.empty? && File.exists?(ARGV[0])
This one now reads from a file and performed the 100 lines Daniel used as per below times:
Code:
real	0m0.837s
user	0m0.177s
sys	0m0.650s
This was run on a VM.

The other beauty here is it would be simple enough to also supply a second argument on the line, first being the file name, to be the regex to use or if not available to use a default.

Last edited by grail; 02-06-2014 at 10:17 PM. Reason: Added correction found by Daniel; Added comments
 
1 members found this post helpful.
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] substring using awk and sed ( binary ) iffarrukh Programming 7 05-15-2012 11:22 PM
awk substring command casperdaghost Linux - Newbie 1 07-31-2010 07:26 AM
Using Grep/Awk/Sed to get a substring from a command johnjust Programming 5 01-12-2010 08:02 PM
Bash - Substring Extraction and Substitution on same variable jax8 Programming 5 04-26-2009 06:20 AM
using awk substring function on a file in a bash script matt007 Programming 3 06-17-2008 08:17 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 01:26 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration