[SOLVED] Elimination of repeated letter pairs

dugan · 08-17-2012, 09:17 AM

Python solution:

Code:

token = "AARDVARK"


def pairs(token):
    for i in xrange(len(token)):
        pair = token[i:i + 2]
        if len(pair) == 2:
            yield pair


def is_repeated(pair, token):
    reduced = token.replace(pair, '')
    return len(reduced) < len(token) - 2


repeated_pairs = {pair for pair in pairs(token)
        if is_repeated(pair, token)}

for pair in repeated_pairs:
    token = token.replace(pair, '')

print token

danielbmartin · 08-17-2012, 10:56 AM

[QUOTE=grail;4756321]

Code:

awk -F "" '{for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1))gsub($i$(i+1),"")}1' file

Nice!

I'm not ignoring your ruby solution, but that language is beyond my scope of knowledge.

Quote:

... I am sure there will be a stumper word for each solution ...

I'm using a large data file to test your awk and also the sed solution proposed by firstfire. I have not found any glitches.

Daniel B. Martin

grail · 08-17-2012, 11:37 AM

I am not sure where you are confused Daniel? Your explanation seems pretty bang on to me.

firstfire · 08-17-2012, 02:26 PM

Hi.

Excellent, Daniel! I could not explain it better, especially taking into account my bad english.

I'll try though..

Code:

s/([^:]{2})(.*)\1/:\1\2%/

The "robot head" [^:] is any non-colon character. Since we need two such characters, we write [^:]{2} and enclose it in parentheses to reference later.

Regular expression ([^:]{2})(.*)\1 matches any text beginning and ending with the same pair of characters, and this pair may be referenced by \1. The text in between these two pairs may be referenced using \2.

Now, if we found some matching text, we replace it by :\1\2%, that is replace second pair of characters by % and prepend (to the matched substring) colon to mark characters to be removed later. We can not just remove first pair of characters (\1) because remaining text may still contain \1 somewhere else (e.g. if there are 3 repeated pairs and we remove two of them on the first iteration, then there will be no way to find third pair on the second iteration). I use % in place of removed pairs to preserve the "structure" of the string between iterations. This way we get

Code:

ABAABB -> :ABA%B -> AB

instead of

Code:

ABAABB -> :ABAB ->::AB -> empty

(if we would just removed second pair)

Finally, if the substitution was successful, we jump to label a, otherwise there are no more repeated pairs and we proceed to cleanup.

Hope that helps.

danielbmartin · 08-17-2012, 04:59 PM

Quote:

Originally Posted by firstfire

... Hope that helps.

Yes, thank you, this is the level of detail I needed.

Thanks to your sed and also the excellent awk contributed by grail this thread could be marked SOLVED. However, I'll hold it open for a bit longer because I have a follow-on question for both of you.

In the best case each line in the output file would show the modified character string, the repeated letter pair, and the original (unmodified) string.

Example:

Code:

ADVK  AR  AARDVARK

This may be called "the icing on the cake" so don't bother with it unless the answer is easy.

Daniel B. Martin

pixellany · 08-17-2012, 07:01 PM

You have just about described the whole thing. "part 1" is basically a sed replace command (s = replace). It finds an instance in which a letter pair is repeated, and then tags the 2nd member of the pair in such a way that it will not be detected again. The "ta" command says: If s did a replacement, then branch back to the beginning of the loop and do it again. The last time thru, with nothing left to change, the "ta" will be bypasses, and then the various tags can be removed.

firstfire · 08-18-2012, 03:03 AM

Hi.

Formatting may be done as follows

Code:

$ echo BFAARBFDVARKARBBFF | sed -r 'h; :a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; G; :b;s/:+(..)(.*)\n/\2\n\1 /;tb; s/%//g; s/\n/ /;'
:BFAARBFDVARKARB%F
::BFAAR%DVARKARB%F
::BFA:AR%DVARK%B%F
::BFA::AR%DV%K%B%F
::BFA::AR%DV%K%B%F
ADVKBF AR BF BFAARBFDVARKARBBFF

but this is not elegant at all and hardly may be called a one-liner.

EDIT:
Well, here is another approach which leads to the desired output directly

Code:

$ echo BFAARBFDVARKARBBFF | sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta; G; l; s/%//g; s/\n/ /; s/ +/ /g'
%AARBFDVARKARB%F BF
%A%BFDVARK%B%F BF AR
%A%%DVARK%B%F % AR BF
%A%%DV%K%B%F % % BF AR
%A%%DV%K%B%F % % BF AR
%A%%DV%K%B%F % % BF AR\nBFAARBFDVARKARBBFF$
ADVKBF BF AR BFAARBFDVARKARBBFF

Again, p and l commands are optional.

grail · 08-18-2012, 03:26 AM

Code:

# Awk
$ echo BFAARBFDVARKARBBFF | awk -F "" '{printf "%s ",$0;for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1)){printf "%s ",$i$(i+1);gsub($i$(i+1),"")}}1'
BFAARBFDVARKARBBFF BF AR ADVKBF

#Ruby
$ echo BFAARBFDVARKARBBFF | ruby -pe 's = [$_.chomp];until s.nil?; print s[0] + " ";s = $_.scan(/(..).*\1/)[0];$_.gsub!(s[0],"") unless s.nil?;end'
BFAARBFDVARKARBBFF BF AR ADVKBF

danielbmartin · 08-18-2012, 04:44 PM

Quote:

Originally Posted by firstfire

Code:

 sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta; G; l; s/%//g; s/\n/ /; s/ +/ /g'

Thanks, once again, for your interest in this subject. This latest version provides the desired function... but... my output file contains unwanted ^M characters in the output file. Example:

Code:

a%dv%k^M ar
a%dv%k^M ar
a%dv%k\r ar\naardvark\r$
advk^M ar aardvark

They may be eliminated with a subsequent "cleanup" ...

Code:

|tr -d '\b\r'

... but I don't understand where they came from. Is it possible to prevent them from being created in the first place?

My program using your code:

Code:

echo
echo "Method of LQ member firstfire"
sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta;
   G; l;
   s/%//g;
   s/\n/ /;
   s/ +/ /g' < $Work01 > $Work07

My program using your code followed by a "cleanup":

Code:

echo
echo "Method of LQ member firstfire"
echo " .. and remove unwanted ^M strings"
sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta;
   G; l;
   s/%//g;
   s/\n/ /;
   s/ +/ /g' < $Work01  \
|tr -d '\b\r' \
> $Work08

Daniel B. Martin

firstfire · 08-18-2012, 11:25 PM

Hi, Daniel.

Your input file uses DOS line endings '\r\n'. You can remove \r together with %'s using

Code:

s/[%\r]//g

Or use dos2unix utility to convert input data.

danielbmartin · 08-19-2012, 01:57 PM

Thank you, grail, for this most recent version. Recast into shorter lines, it is:

Code:

echo "Method of LQ Guru grail"
awk -F "" '{printf "%s ", $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {printf "%s ",$i$(i+1);gsub($i$(i+1),"")}}1' < $Work01 > $Work13

This works nicely, it produces the desired character strings, but (minor nitpick) not in the desired order.
It produces aardvark ar advk but I prefer advk ar aardvark

Okay, no big deal, I'll fix it (or so I thought). This is my attempt which fails.

Code:

echo "Method of LQ Guru grail with extension,"
echo " modified to reverse the order of output strings."
awk -F "" '{
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {printf "%s ",gsub($i$(i+1),"");$i$(i+1)};$0}1' < $Work01 > $Work14

Please show where this went wrong.

Daniel B. Martin

danielbmartin · 08-19-2012, 02:00 PM

Quote:

Originally Posted by firstfire

Your input file uses DOS line endings '\r\n'.

Yes, of course, the problem was in the data and not in the code. That's fixed now.

I'll hold this thread open, awaiting a response from grail regarding his solution.

Daniel B. Martin

grail · 08-19-2012, 08:54 PM

The easiest solution is to save $0 at the start and remove the 1 at the end, then use a single print to deliver both:

Code:

awk -F "" '{orig = $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {printf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,orig}' < $Work01 > $Work13

Edit: oops ... just realised this is not what you wanted. You want the answer before the parts removed and then the question.
So let us try again:

Code:

awk -F "" '{orig = $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13

danielbmartin · 08-19-2012, 09:20 PM

Quote:

Originally Posted by grail

The easiest solution is to save $0 at the start and remove the 1 at the end, then use a single print to deliver both ...

I used this ...

Code:

echo "Method of LQ Guru grail,"
echo " modified to reverse order of output words."
awk -F "" '{orig = $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13

... and got this ...

Code:

: cmd. line:3:    {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}
awk: cmd. line:3:                     ^ syntax error

Daniel B. Martin

grail · 08-19-2012, 09:37 PM

My bad ... forgot about the returning function rule, ie it needs brackets

Code:

awk -F "" '{orig = $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {parts = parts sprintf("%s ",$i$(i+1));gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13

And I added the concatenation i forgot as well .... on fire this morning