[SOLVED] Elimination of repeated letter pairs

danielbmartin · 08-16-2012, 03:38 PM

Have: a file of character strings, one string per line.

Want: a way to eliminate repeated letter pairs.

Example: AARDVARK has a repeated letter pair (AR).
AARDVARK should be transformed to ADVK.

This grep ...

Code:

grep -e '\(..\).*\1' < $InFile > $Work1

... identifies strings which have repeated letter pairs. They are candidates for the second step, elimination of repeated letter pairs. How may this be done?

Daniel B. Martin

rigor · 08-16-2012, 05:30 PM

It can be done with awk, perl, and a variety of things.

So what command/languages/capabilities are you willing to use in the solution to your problem?

danielbmartin · 08-16-2012, 06:22 PM

Quote:

Originally Posted by kakaka

It can be done with awk, perl, and a variety of things.

So what command/languages/capabilities are you willing to use in the solution to your problem?

I'm learning awk and would certainly like to use it. I know nothing of perl yet, so let's steer clear of that. Thanks for your interest.

Daniel B. Martin

ip_address · 08-16-2012, 07:43 PM

how about this:

SCRIPT:

Code:

#!/bin/bash

#variable
var="AARDVARK"

#variable length
var_len=${#var}

#array index
p=0;

#create possible combinations and search them against original word
for ((subt=1;subt<="$var_len"-2;subt++));do
	subt1=$((subt+1))
	for ((counter=0;counter<"$var_len"-"$subt";counter++));do
		var1=${var:counter:subt1}
		num=`echo "$var" | grep -o "$var1" | wc -l`
		if (( $num>1 )); then
			p=$((p+1))
			a[$p]=`echo "$var" | sed "s/$var1//g"`
		fi
	done
done

#take the unique values
echo ${a[@]} | tr ' ' '\n' | sort | uniq

RESULT:

Code:

ADVK

grail · 08-16-2012, 08:54 PM

Your description does not seem to match the output from the way I understand it

AARDVARK has a repeated letter pair (AR). - - This part I understand

And if I were to remove all A and R letters I would end up with - - DVK

As you do not keep the first R I am curious why one of the first A's has been kept in your example output? (ADVK)

ip_address · 08-16-2012, 09:13 PM

Quote:

Originally Posted by grail

Your description does not seem to match the output from the way I understand it

AARDVARK has a repeated letter pair (AR). - - This part I understand

And if I were to remove all A and R letters I would end up with - - DVK

As you do not keep the first R I am curious why one of the first A's has been kept in your example output? (ADVK)

@grail .. why you want to remove all A and R letters when danielbmartin is requesting a way to eliminate repeated letter pairs only

Quote:

Originally Posted by danielbmartin

Have: a file of character strings, one string per line.

Want: a way to eliminate repeated letter pairs.

Daniel B. Martin

In word: "AARDVARK" only repeated letter pair is "AR" and it occurred twice. Hence all the occurrences of "AR" were removed and result became "ADVK"

danielbmartin · 08-16-2012, 09:19 PM

Quote:

Originally Posted by grail

... if I were to remove all A and R letters I would end up with - - DVK

Emphasis on the word pairs. If the input string is
AARDVARK and we remove those AR pairs
the remaining string is ADVK.

Daniel B. Martin

grail · 08-16-2012, 10:15 PM

Thanks for the information folks ... missed that little twist

May I ask another question, what if the pairs are intertwined?

Example: AARDVARD ... so here there is also the duplicate pair of RD prior to AR being removed

So if AR is removed first we get - - ADVD
However if all pairs considered prior to any removal we would get - - AV

Sorry if I have my cloth ears on this morning

danielbmartin · 08-16-2012, 10:30 PM

Quote:

Originally Posted by grail

... what if the pairs are intertwined?

Example: AARDVARD ... so here there is also the duplicate pair of RD prior to AR being removed

So if AR is removed first we get - - ADVD
However if all pairs considered prior to any removal we would get - - AV

That's a complication I hadn't considered. For simplicity, let's say that intertwined (or we might say overlapped) letter pairs may be ignored.

Daniel B. Martin

ip_address · 08-16-2012, 11:03 PM

@grail .. you are right. It depends on the rules formed to remove those pairs. For example, for the word "PAARKEAAR", following combinations can form (not restricted to pairs only):

Quote:

AA
AAR
AR

It will depend on the rules formed that which combination should be removed first. One rule could be removing the longest one, other could be removing the first encountered combination. It depends on author how he/she wants to form rules.

The code that i submitted before will generate separate words after removing one particular combination at a time. Hence for word "PAARKEAAR", code will spit out three results:

Quote:

1) PAKEA (after substituting AA with "")

2) PKE (after substituting AAR with "")

and

3) PRKER (after substituting AR with "")

That being said code can be easily modified for pairs only, rules can be integrated and then it can substitute all the pairs and result only one word as final answer.

grail · 08-16-2012, 11:29 PM

Ok ... thanks. So I am sure there will be a stumper word for each solution, but here goes:

Code:

awk -F "" '{for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1))gsub($i$(i+1),"")}1' file

And the ruby looks clumsy (still working on this language)

Code:

ruby -pe '$_.gsub!($_.scan(/(..).*\1/)[0][0],"") while $_.scan(/(..).*\1/).length == 1' file

firstfire · 08-17-2012, 04:15 AM

Hi.

In sed:

Code:

$ echo BFAARBFDVARKARBBFF | sed -r ':a; s/([^<>]{2})(.*)\1/<\1>\2%/; p; ta; s/<+..>+//g; s/%//g' 
<BF>AARBFDVARKARB%F
<<BF>>AAR%DVARKARB%F
<<BF>>A<AR>%DVARK%B%F
<<BF>>A<<AR>>%DV%K%B%F
<<BF>>A<<AR>>%DV%K%B%F
ADVKBF

This script assumes that 'ABAAABBB' -> 'AABB'.

If one prefers 'ABAAABBB' -> '', then % symbol (and the last substitution command) should be removed.
The p(rint) command is only there to illustrate how the script works.

pixellany · 08-17-2012, 05:36 AM

that is REALLY clever!!

firstfire · 08-17-2012, 05:53 AM

I just realized that we have only pairs of characters, so there is no need in closing marker '>':

Code:

$ echo BFAARBFDVARKARBBFF | sed -r ':a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; s/:+..//g; s/%//g'
:BFAARBFDVARKARB%F
::BFAAR%DVARKARB%F
::BFA:AR%DVARK%B%F
::BFA::AR%DV%K%B%F
::BFA::AR%DV%K%B%F
ADVKBF

Quote:

Originally Posted by pixellany

that is REALLY clever!!

If you tell it to me, then thank you, I'm really glad you like it!

danielbmartin · 08-17-2012, 09:02 AM

Quote:

Originally Posted by firstfire

Code:

sed -r ':a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; s/:+..//g; s/%//g'

Brilliant! I'm working to understand it.

There are three segments to this sed.

Code:

#1... :a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta;

#2... s/:+..//g;

#3...s/%//g

#1 has a loop which identifies the first instance of a repeated letter pair, prefixes it with a colon, and replaces the second instance with a percent sign. An intermediate result is printed, the operation is tested, and based on the test, another iteration of the loop labeled "a" is performed.

#2 eliminates all colons and the following two characters (i.e. the first instance of the repeated letter pair).

#3 eliminates all percent signs which are "ghosts" of the second instance of the letter pair.

Step #1 clearly does all the "heavy lifting". Steps #2 and #3 are cosmetic clean-ups.

I can see what step #1 does but am not clear on how it works.
Please review, correct, and expand this narrative.

Daniel B. Martin