LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Elimination of repeated letter pairs (https://www.linuxquestions.org/questions/programming-9/elimination-of-repeated-letter-pairs-4175422484/)

dugan 08-17-2012 09:17 AM

Python solution:
Code:

token = "AARDVARK"


def pairs(token):
    for i in xrange(len(token)):
        pair = token[i:i + 2]
        if len(pair) == 2:
            yield pair


def is_repeated(pair, token):
    reduced = token.replace(pair, '')
    return len(reduced) < len(token) - 2


repeated_pairs = {pair for pair in pairs(token)
        if is_repeated(pair, token)}

for pair in repeated_pairs:
    token = token.replace(pair, '')

print token


danielbmartin 08-17-2012 10:56 AM

[QUOTE=grail;4756321]
Code:

awk -F "" '{for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1))gsub($i$(i+1),"")}1' file
Nice!

I'm not ignoring your ruby solution, but that language is beyond my scope of knowledge.

Quote:

... I am sure there will be a stumper word for each solution ...
I'm using a large data file to test your awk and also the sed solution proposed by firstfire. I have not found any glitches.

Daniel B. Martin

grail 08-17-2012 11:37 AM

I am not sure where you are confused Daniel? Your explanation seems pretty bang on to me.

firstfire 08-17-2012 02:26 PM

Hi.

Excellent, Daniel! I could not explain it better, especially taking into account my bad english.

I'll try though..
Code:

s/([^:]{2})(.*)\1/:\1\2%/
The "robot head" [^:] is any non-colon character. Since we need two such characters, we write [^:]{2} and enclose it in parentheses to reference later.

Regular expression ([^:]{2})(.*)\1 matches any text beginning and ending with the same pair of characters, and this pair may be referenced by \1. The text in between these two pairs may be referenced using \2.

Now, if we found some matching text, we replace it by :\1\2%, that is replace second pair of characters by % and prepend (to the matched substring) colon to mark characters to be removed later. We can not just remove first pair of characters (\1) because remaining text may still contain \1 somewhere else (e.g. if there are 3 repeated pairs and we remove two of them on the first iteration, then there will be no way to find third pair on the second iteration). I use % in place of removed pairs to preserve the "structure" of the string between iterations. This way we get
Code:

ABAABB -> :ABA%B -> AB
instead of
Code:

ABAABB -> :ABAB ->::AB -> empty
(if we would just removed second pair)

Finally, if the substitution was successful, we jump to label a, otherwise there are no more repeated pairs and we proceed to cleanup.

Hope that helps.

danielbmartin 08-17-2012 04:59 PM

Quote:

Originally Posted by firstfire (Post 4757006)
... Hope that helps.

Yes, thank you, this is the level of detail I needed.

Thanks to your sed and also the excellent awk contributed by grail this thread could be marked SOLVED. However, I'll hold it open for a bit longer because I have a follow-on question for both of you.

In the best case each line in the output file would show the modified character string, the repeated letter pair, and the original (unmodified) string.

Example:
Code:

ADVK  AR  AARDVARK
This may be called "the icing on the cake" so don't bother with it unless the answer is easy.

Daniel B. Martin

pixellany 08-17-2012 07:01 PM

You have just about described the whole thing. "part 1" is basically a sed replace command (s = replace). It finds an instance in which a letter pair is repeated, and then tags the 2nd member of the pair in such a way that it will not be detected again. The "ta" command says: If s did a replacement, then branch back to the beginning of the loop and do it again. The last time thru, with nothing left to change, the "ta" will be bypasses, and then the various tags can be removed.

firstfire 08-18-2012 03:03 AM

Hi.

Formatting may be done as follows
Code:

$ echo BFAARBFDVARKARBBFF | sed -r 'h; :a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; G; :b;s/:+(..)(.*)\n/\2\n\1 /;tb; s/%//g; s/\n/ /;'
:BFAARBFDVARKARB%F
::BFAAR%DVARKARB%F
::BFA:AR%DVARK%B%F
::BFA::AR%DV%K%B%F
::BFA::AR%DV%K%B%F
ADVKBF AR BF BFAARBFDVARKARBBFF

but this is not elegant at all and hardly may be called a one-liner.

EDIT:
Well, here is another approach which leads to the desired output directly
Code:

$ echo BFAARBFDVARKARBBFF | sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta; G; l; s/%//g; s/\n/ /; s/ +/ /g'
%AARBFDVARKARB%F BF
%A%BFDVARK%B%F BF AR
%A%%DVARK%B%F % AR BF
%A%%DV%K%B%F % % BF AR
%A%%DV%K%B%F % % BF AR
%A%%DV%K%B%F % % BF AR\nBFAARBFDVARKARBBFF$
ADVKBF BF AR BFAARBFDVARKARBBFF

Again, p and l commands are optional.

grail 08-18-2012 03:26 AM

Code:

# Awk
$ echo BFAARBFDVARKARBBFF | awk -F "" '{printf "%s ",$0;for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1)){printf "%s ",$i$(i+1);gsub($i$(i+1),"")}}1'
BFAARBFDVARKARBBFF BF AR ADVKBF

#Ruby
$ echo BFAARBFDVARKARBBFF | ruby -pe 's = [$_.chomp];until s.nil?; print s[0] + " ";s = $_.scan(/(..).*\1/)[0];$_.gsub!(s[0],"") unless s.nil?;end'
BFAARBFDVARKARBBFF BF AR ADVKBF


danielbmartin 08-18-2012 04:44 PM

Quote:

Originally Posted by firstfire (Post 4757365)
Code:

sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta; G; l; s/%//g; s/\n/ /; s/ +/ /g'

Thanks, once again, for your interest in this subject. This latest version provides the desired function... but... my output file contains unwanted ^M characters in the output file. Example:
Code:

a%dv%k^M ar
a%dv%k^M ar
a%dv%k\r ar\naardvark\r$
advk^M ar aardvark

They may be eliminated with a subsequent "cleanup" ...
Code:

|tr -d '\b\r'
... but I don't understand where they came from. Is it possible to prevent them from being created in the first place?

My program using your code:
Code:

echo
echo "Method of LQ member firstfire"
sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta;
  G; l;
  s/%//g;
  s/\n/ /;
  s/ +/ /g' < $Work01 > $Work07

My program using your code followed by a "cleanup":
Code:

echo
echo "Method of LQ member firstfire"
echo " .. and remove unwanted ^M strings"
sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta;
  G; l;
  s/%//g;
  s/\n/ /;
  s/ +/ /g' < $Work01  \
|tr -d '\b\r' \
> $Work08

Daniel B. Martin

firstfire 08-18-2012 11:25 PM

Hi, Daniel.

Your input file uses DOS line endings '\r\n'. You can remove \r together with %'s using
Code:

s/[%\r]//g
Or use dos2unix utility to convert input data.

danielbmartin 08-19-2012 01:57 PM

Thank you, grail, for this most recent version. Recast into shorter lines, it is:
Code:

echo "Method of LQ Guru grail"
awk -F "" '{printf "%s ", $0;
  for(i=1;i < NF-2;i++)
  if(substr($0,i+2) ~ $i$(i+1))
  {printf "%s ",$i$(i+1);gsub($i$(i+1),"")}}1' < $Work01 > $Work13

This works nicely, it produces the desired character strings, but (minor nitpick) not in the desired order.
It produces aardvark ar advk but I prefer advk ar aardvark

Okay, no big deal, I'll fix it (or so I thought). This is my attempt which fails.
Code:

echo "Method of LQ Guru grail with extension,"
echo " modified to reverse the order of output strings."
awk -F "" '{
  for(i=1;i < NF-2;i++)
  if(substr($0,i+2) ~ $i$(i+1))
  {printf "%s ",gsub($i$(i+1),"");$i$(i+1)};$0}1' < $Work01 > $Work14

Please show where this went wrong.

Daniel B. Martin

danielbmartin 08-19-2012 02:00 PM

Quote:

Originally Posted by firstfire (Post 4758038)
Your input file uses DOS line endings '\r\n'.

Yes, of course, the problem was in the data and not in the code. That's fixed now.

I'll hold this thread open, awaiting a response from grail regarding his solution.

Daniel B. Martin

grail 08-19-2012 08:54 PM

The easiest solution is to save $0 at the start and remove the 1 at the end, then use a single print to deliver both:
Code:

awk -F "" '{orig = $0;
  for(i=1;i < NF-2;i++)
  if(substr($0,i+2) ~ $i$(i+1))
  {printf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,orig}' < $Work01 > $Work13

Edit: oops ... just realised this is not what you wanted. You want the answer before the parts removed and then the question.
So let us try again:
Code:

awk -F "" '{orig = $0;
  for(i=1;i < NF-2;i++)
  if(substr($0,i+2) ~ $i$(i+1))
  {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13


danielbmartin 08-19-2012 09:20 PM

Quote:

Originally Posted by grail (Post 4758622)
The easiest solution is to save $0 at the start and remove the 1 at the end, then use a single print to deliver both ...

I used this ...
Code:

echo "Method of LQ Guru grail,"
echo " modified to reverse order of output words."
awk -F "" '{orig = $0;
  for(i=1;i < NF-2;i++)
  if(substr($0,i+2) ~ $i$(i+1))
  {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13

... and got this ...
Code:

: cmd. line:3:    {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}
awk: cmd. line:3:                    ^ syntax error

Daniel B. Martin

grail 08-19-2012 09:37 PM

My bad ... forgot about the returning function rule, ie it needs brackets :)
Code:

awk -F "" '{orig = $0;
  for(i=1;i < NF-2;i++)
  if(substr($0,i+2) ~ $i$(i+1))
  {parts = parts sprintf("%s ",$i$(i+1));gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13

And I added the concatenation i forgot as well .... on fire this morning :(


All times are GMT -5. The time now is 10:11 PM.