LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   Elimination of repeated letter pairs (http://www.linuxquestions.org/questions/programming-9/elimination-of-repeated-letter-pairs-4175422484/)

danielbmartin 08-16-2012 03:38 PM

Elimination of repeated letter pairs
 
Have: a file of character strings, one string per line.

Want: a way to eliminate repeated letter pairs.

Example: AARDVARK has a repeated letter pair (AR).
AARDVARK should be transformed to ADVK.

This grep ...
Code:

grep -e '\(..\).*\1' < $InFile > $Work1
... identifies strings which have repeated letter pairs. They are candidates for the second step, elimination of repeated letter pairs. How may this be done?

Daniel B. Martin

kakaka 08-16-2012 05:30 PM

It can be done with awk, perl, and a variety of things.

So what command/languages/capabilities are you willing to use in the solution to your problem?

danielbmartin 08-16-2012 06:22 PM

Quote:

Originally Posted by kakaka (Post 4756074)
It can be done with awk, perl, and a variety of things.

So what command/languages/capabilities are you willing to use in the solution to your problem?

I'm learning awk and would certainly like to use it. I know nothing of perl yet, so let's steer clear of that. Thanks for your interest.

Daniel B. Martin

ip_address 08-16-2012 07:43 PM

how about this:

SCRIPT:

Code:

#!/bin/bash

#variable
var="AARDVARK"

#variable length
var_len=${#var}

#array index
p=0;

#create possible combinations and search them against original word
for ((subt=1;subt<="$var_len"-2;subt++));do
        subt1=$((subt+1))
        for ((counter=0;counter<"$var_len"-"$subt";counter++));do
                var1=${var:counter:subt1}
                num=`echo "$var" | grep -o "$var1" | wc -l`
                if (( $num>1 )); then
                        p=$((p+1))
                        a[$p]=`echo "$var" | sed "s/$var1//g"`
                fi
        done
done

#take the unique values
echo ${a[@]} | tr ' ' '\n' | sort | uniq


RESULT:

Code:

ADVK

grail 08-16-2012 08:54 PM

Your description does not seem to match the output from the way I understand it :(

AARDVARK has a repeated letter pair (AR). - - This part I understand

And if I were to remove all A and R letters I would end up with - - DVK

As you do not keep the first R I am curious why one of the first A's has been kept in your example output? (ADVK)

ip_address 08-16-2012 09:13 PM

Quote:

Originally Posted by grail (Post 4756215)
Your description does not seem to match the output from the way I understand it :(

AARDVARK has a repeated letter pair (AR). - - This part I understand

And if I were to remove all A and R letters I would end up with - - DVK

As you do not keep the first R I am curious why one of the first A's has been kept in your example output? (ADVK)

@grail .. why you want to remove all A and R letters when danielbmartin is requesting a way to eliminate repeated letter pairs only


Quote:

Originally Posted by danielbmartin (Post 4755963)
Have: a file of character strings, one string per line.

Want: a way to eliminate repeated letter pairs.

Daniel B. Martin


In word: "AARDVARK" only repeated letter pair is "AR" and it occurred twice. Hence all the occurrences of "AR" were removed and result became "ADVK"

danielbmartin 08-16-2012 09:19 PM

Quote:

Originally Posted by grail (Post 4756215)
... if I were to remove all A and R letters I would end up with - - DVK

Emphasis on the word pairs. If the input string is
AARDVARK and we remove those AR pairs
the remaining string is ADVK.

Daniel B. Martin

grail 08-16-2012 10:15 PM

Thanks for the information folks ... missed that little twist ;) May I ask another question, what if the pairs are intertwined?

Example: AARDVARD ... so here there is also the duplicate pair of RD prior to AR being removed

So if AR is removed first we get - - ADVD
However if all pairs considered prior to any removal we would get - - AV

Sorry if I have my cloth ears on this morning :(

danielbmartin 08-16-2012 10:30 PM

Quote:

Originally Posted by grail (Post 4756286)
... what if the pairs are intertwined?

Example: AARDVARD ... so here there is also the duplicate pair of RD prior to AR being removed

So if AR is removed first we get - - ADVD
However if all pairs considered prior to any removal we would get - - AV

That's a complication I hadn't considered. For simplicity, let's say that intertwined (or we might say overlapped) letter pairs may be ignored.

Daniel B. Martin

ip_address 08-16-2012 11:03 PM

@grail .. you are right. It depends on the rules formed to remove those pairs. For example, for the word "PAARKEAAR", following combinations can form (not restricted to pairs only):

Quote:

AA
AAR
AR
It will depend on the rules formed that which combination should be removed first. One rule could be removing the longest one, other could be removing the first encountered combination. It depends on author how he/she wants to form rules.

The code that i submitted before will generate separate words after removing one particular combination at a time. Hence for word "PAARKEAAR", code will spit out three results:

Quote:

1) PAKEA (after substituting AA with "")

2) PKE (after substituting AAR with "")

and

3) PRKER (after substituting AR with "")
That being said code can be easily modified for pairs only, rules can be integrated and then it can substitute all the pairs and result only one word as final answer.

grail 08-16-2012 11:29 PM

Ok ... thanks. So I am sure there will be a stumper word for each solution, but here goes:
Code:

awk -F "" '{for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1))gsub($i$(i+1),"")}1' file
And the ruby looks clumsy (still working on this language)
Code:

ruby -pe '$_.gsub!($_.scan(/(..).*\1/)[0][0],"") while $_.scan(/(..).*\1/).length == 1' file

firstfire 08-17-2012 04:15 AM

Hi.

In sed:
Code:

$ echo BFAARBFDVARKARBBFF | sed -r ':a; s/([^<>]{2})(.*)\1/<\1>\2%/; p; ta; s/<+..>+//g; s/%//g'
<BF>AARBFDVARKARB%F
<<BF>>AAR%DVARKARB%F
<<BF>>A<AR>%DVARK%B%F
<<BF>>A<<AR>>%DV%K%B%F
<<BF>>A<<AR>>%DV%K%B%F
ADVKBF

This script assumes that 'ABAAABBB' -> 'AABB'.

If one prefers 'ABAAABBB' -> '', then % symbol (and the last substitution command) should be removed.
The p(rint) command is only there to illustrate how the script works.

pixellany 08-17-2012 05:36 AM

that is REALLY clever!!

firstfire 08-17-2012 05:53 AM

I just realized that we have only pairs of characters, so there is no need in closing marker '>':
Code:

$ echo BFAARBFDVARKARBBFF | sed -r ':a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; s/:+..//g; s/%//g'
:BFAARBFDVARKARB%F
::BFAAR%DVARKARB%F
::BFA:AR%DVARK%B%F
::BFA::AR%DV%K%B%F
::BFA::AR%DV%K%B%F
ADVKBF

Quote:

Originally Posted by pixellany (Post 4756540)
that is REALLY clever!!

If you tell it to me, then thank you, I'm really glad you like it! :)

danielbmartin 08-17-2012 09:02 AM

Quote:

Originally Posted by firstfire (Post 4756551)
Code:

sed -r ':a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; s/:+..//g; s/%//g'

Brilliant! I'm working to understand it.

There are three segments to this sed.
Code:

#1... :a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta;

#2... s/:+..//g;

#3...s/%//g

#1 has a loop which identifies the first instance of a repeated letter pair, prefixes it with a colon, and replaces the second instance with a percent sign. An intermediate result is printed, the operation is tested, and based on the test, another iteration of the loop labeled "a" is performed.

#2 eliminates all colons and the following two characters (i.e. the first instance of the repeated letter pair).

#3 eliminates all percent signs which are "ghosts" of the second instance of the letter pair.

Step #1 clearly does all the "heavy lifting". Steps #2 and #3 are cosmetic clean-ups.

I can see what step #1 does but am not clear on how it works.
Please review, correct, and expand this narrative.

Daniel B. Martin


All times are GMT -5. The time now is 01:31 AM.