LinuxQuestions.org
Support LQ: Use code LQ3 and save $3 on Domain Registration
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
LinkBack Search this Thread
Old 08-16-2012, 03:38 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,032

Rep: Reputation: 275Reputation: 275Reputation: 275
Elimination of repeated letter pairs


Have: a file of character strings, one string per line.

Want: a way to eliminate repeated letter pairs.

Example: AARDVARK has a repeated letter pair (AR).
AARDVARK should be transformed to ADVK.

This grep ...
Code:
grep -e '\(..\).*\1' < $InFile > $Work1
... identifies strings which have repeated letter pairs. They are candidates for the second step, elimination of repeated letter pairs. How may this be done?

Daniel B. Martin
 
Old 08-16-2012, 05:30 PM   #2
kakaka
Member
 
Registered: Sep 2003
Posts: 382

Rep: Reputation: 86
It can be done with awk, perl, and a variety of things.

So what command/languages/capabilities are you willing to use in the solution to your problem?
 
Old 08-16-2012, 06:22 PM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,032

Original Poster
Rep: Reputation: 275Reputation: 275Reputation: 275
Quote:
Originally Posted by kakaka View Post
It can be done with awk, perl, and a variety of things.

So what command/languages/capabilities are you willing to use in the solution to your problem?
I'm learning awk and would certainly like to use it. I know nothing of perl yet, so let's steer clear of that. Thanks for your interest.

Daniel B. Martin
 
Old 08-16-2012, 07:43 PM   #4
ip_address
Member
 
Registered: Apr 2012
Distribution: RedHat
Posts: 42

Rep: Reputation: 2
how about this:

SCRIPT:

Code:
#!/bin/bash

#variable
var="AARDVARK"

#variable length
var_len=${#var}

#array index
p=0;

#create possible combinations and search them against original word
for ((subt=1;subt<="$var_len"-2;subt++));do
	subt1=$((subt+1))
	for ((counter=0;counter<"$var_len"-"$subt";counter++));do
		var1=${var:counter:subt1}
		num=`echo "$var" | grep -o "$var1" | wc -l`
		if (( $num>1 )); then
			p=$((p+1))
			a[$p]=`echo "$var" | sed "s/$var1//g"`
		fi
	done
done

#take the unique values
echo ${a[@]} | tr ' ' '\n' | sort | uniq

RESULT:

Code:
ADVK
 
Old 08-16-2012, 08:54 PM   #5
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,195

Rep: Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796
Your description does not seem to match the output from the way I understand it

AARDVARK has a repeated letter pair (AR). - - This part I understand

And if I were to remove all A and R letters I would end up with - - DVK

As you do not keep the first R I am curious why one of the first A's has been kept in your example output? (ADVK)
 
Old 08-16-2012, 09:13 PM   #6
ip_address
Member
 
Registered: Apr 2012
Distribution: RedHat
Posts: 42

Rep: Reputation: 2
Quote:
Originally Posted by grail View Post
Your description does not seem to match the output from the way I understand it

AARDVARK has a repeated letter pair (AR). - - This part I understand

And if I were to remove all A and R letters I would end up with - - DVK

As you do not keep the first R I am curious why one of the first A's has been kept in your example output? (ADVK)
@grail .. why you want to remove all A and R letters when danielbmartin is requesting a way to eliminate repeated letter pairs only


Quote:
Originally Posted by danielbmartin View Post
Have: a file of character strings, one string per line.

Want: a way to eliminate repeated letter pairs.

Daniel B. Martin

In word: "AARDVARK" only repeated letter pair is "AR" and it occurred twice. Hence all the occurrences of "AR" were removed and result became "ADVK"
 
Old 08-16-2012, 09:19 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,032

Original Poster
Rep: Reputation: 275Reputation: 275Reputation: 275
Quote:
Originally Posted by grail View Post
... if I were to remove all A and R letters I would end up with - - DVK
Emphasis on the word pairs. If the input string is
AARDVARK and we remove those AR pairs
the remaining string is ADVK.

Daniel B. Martin
 
Old 08-16-2012, 10:15 PM   #8
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,195

Rep: Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796
Thanks for the information folks ... missed that little twist May I ask another question, what if the pairs are intertwined?

Example: AARDVARD ... so here there is also the duplicate pair of RD prior to AR being removed

So if AR is removed first we get - - ADVD
However if all pairs considered prior to any removal we would get - - AV

Sorry if I have my cloth ears on this morning
 
Old 08-16-2012, 10:30 PM   #9
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,032

Original Poster
Rep: Reputation: 275Reputation: 275Reputation: 275
Quote:
Originally Posted by grail View Post
... what if the pairs are intertwined?

Example: AARDVARD ... so here there is also the duplicate pair of RD prior to AR being removed

So if AR is removed first we get - - ADVD
However if all pairs considered prior to any removal we would get - - AV
That's a complication I hadn't considered. For simplicity, let's say that intertwined (or we might say overlapped) letter pairs may be ignored.

Daniel B. Martin
 
Old 08-16-2012, 11:03 PM   #10
ip_address
Member
 
Registered: Apr 2012
Distribution: RedHat
Posts: 42

Rep: Reputation: 2
@grail .. you are right. It depends on the rules formed to remove those pairs. For example, for the word "PAARKEAAR", following combinations can form (not restricted to pairs only):

Quote:
AA
AAR
AR
It will depend on the rules formed that which combination should be removed first. One rule could be removing the longest one, other could be removing the first encountered combination. It depends on author how he/she wants to form rules.

The code that i submitted before will generate separate words after removing one particular combination at a time. Hence for word "PAARKEAAR", code will spit out three results:

Quote:
1) PAKEA (after substituting AA with "")

2) PKE (after substituting AAR with "")

and

3) PRKER (after substituting AR with "")
That being said code can be easily modified for pairs only, rules can be integrated and then it can substitute all the pairs and result only one word as final answer.
 
Old 08-16-2012, 11:29 PM   #11
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,195

Rep: Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796Reputation: 1796
Ok ... thanks. So I am sure there will be a stumper word for each solution, but here goes:
Code:
awk -F "" '{for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1))gsub($i$(i+1),"")}1' file
And the ruby looks clumsy (still working on this language)
Code:
ruby -pe '$_.gsub!($_.scan(/(..).*\1/)[0][0],"") while $_.scan(/(..).*\1/).length == 1' file
 
1 members found this post helpful.
Old 08-17-2012, 04:15 AM   #12
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 608

Rep: Reputation: 352Reputation: 352Reputation: 352Reputation: 352
Hi.

In sed:
Code:
$ echo BFAARBFDVARKARBBFF | sed -r ':a; s/([^<>]{2})(.*)\1/<\1>\2%/; p; ta; s/<+..>+//g; s/%//g' 
<BF>AARBFDVARKARB%F
<<BF>>AAR%DVARKARB%F
<<BF>>A<AR>%DVARK%B%F
<<BF>>A<<AR>>%DV%K%B%F
<<BF>>A<<AR>>%DV%K%B%F
ADVKBF
This script assumes that 'ABAAABBB' -> 'AABB'.

If one prefers 'ABAAABBB' -> '', then % symbol (and the last substitution command) should be removed.
The p(rint) command is only there to illustrate how the script works.
 
1 members found this post helpful.
Old 08-17-2012, 05:36 AM   #13
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728
that is REALLY clever!!
 
Old 08-17-2012, 05:53 AM   #14
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 608

Rep: Reputation: 352Reputation: 352Reputation: 352Reputation: 352
I just realized that we have only pairs of characters, so there is no need in closing marker '>':
Code:
$ echo BFAARBFDVARKARBBFF | sed -r ':a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; s/:+..//g; s/%//g'
:BFAARBFDVARKARB%F
::BFAAR%DVARKARB%F
::BFA:AR%DVARK%B%F
::BFA::AR%DV%K%B%F
::BFA::AR%DV%K%B%F
ADVKBF
Quote:
Originally Posted by pixellany View Post
that is REALLY clever!!
If you tell it to me, then thank you, I'm really glad you like it!
 
1 members found this post helpful.
Old 08-17-2012, 09:02 AM   #15
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,032

Original Poster
Rep: Reputation: 275Reputation: 275Reputation: 275
Quote:
Originally Posted by firstfire View Post
Code:
sed -r ':a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; s/:+..//g; s/%//g'
Brilliant! I'm working to understand it.

There are three segments to this sed.
Code:
#1... :a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta;

#2... s/:+..//g;

#3...s/%//g
#1 has a loop which identifies the first instance of a repeated letter pair, prefixes it with a colon, and replaces the second instance with a percent sign. An intermediate result is printed, the operation is tested, and based on the test, another iteration of the loop labeled "a" is performed.

#2 eliminates all colons and the following two characters (i.e. the first instance of the repeated letter pair).

#3 eliminates all percent signs which are "ghosts" of the second instance of the letter pair.

Step #1 clearly does all the "heavy lifting". Steps #2 and #3 are cosmetic clean-ups.

I can see what step #1 does but am not clear on how it works.
Please review, correct, and expand this narrative.

Daniel B. Martin
 
  


Reply

Tags
awk, grep, regex, sed


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Getting the no repeated pairs out of a a string charlitos Programming 2 02-17-2009 02:52 AM
LXer: 25 Arguments for the Elimination of Copy Protection LXer Syndicated Linux News 0 10-20-2008 07:50 AM
Script to move directories based on first letter to a new directory of that letter tworkemon Linux - Newbie 8 01-30-2007 07:18 PM
+performance +gcc +elimination C++Boar Programming 6 09-02-2004 11:48 PM
Cups elimination beowulf405 Linux - Newbie 5 06-30-2004 09:16 PM


All times are GMT -5. The time now is 12:01 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration