LinuxQuestions.org
LinuxAnswers - the LQ Linux tutorial section.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 08-17-2012, 09:17 AM   #16
dugan
Senior Member
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 4,614

Rep: Reputation: 1415Reputation: 1415Reputation: 1415Reputation: 1415Reputation: 1415Reputation: 1415Reputation: 1415Reputation: 1415Reputation: 1415Reputation: 1415

Python solution:
Code:
token = "AARDVARK"


def pairs(token):
    for i in xrange(len(token)):
        pair = token[i:i + 2]
        if len(pair) == 2:
            yield pair


def is_repeated(pair, token):
    reduced = token.replace(pair, '')
    return len(reduced) < len(token) - 2


repeated_pairs = {pair for pair in pairs(token)
        if is_repeated(pair, token)}

for pair in repeated_pairs:
    token = token.replace(pair, '')

print token

Last edited by dugan; 08-17-2012 at 12:11 PM. Reason: The correct answer this time.
 
Old 08-17-2012, 10:56 AM   #17
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,083

Original Poster
Rep: Reputation: 287Reputation: 287Reputation: 287
[QUOTE=grail;4756321]
Code:
awk -F "" '{for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1))gsub($i$(i+1),"")}1' file
Nice!

I'm not ignoring your ruby solution, but that language is beyond my scope of knowledge.

Quote:
... I am sure there will be a stumper word for each solution ...
I'm using a large data file to test your awk and also the sed solution proposed by firstfire. I have not found any glitches.

Daniel B. Martin

Last edited by danielbmartin; 08-17-2012 at 10:57 AM. Reason: Fix a cosmetic blemish
 
Old 08-17-2012, 11:37 AM   #18
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,476

Rep: Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888
I am not sure where you are confused Daniel? Your explanation seems pretty bang on to me.
 
Old 08-17-2012, 02:26 PM   #19
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 623

Rep: Reputation: 364Reputation: 364Reputation: 364Reputation: 364
Hi.

Excellent, Daniel! I could not explain it better, especially taking into account my bad english.

I'll try though..
Code:
s/([^:]{2})(.*)\1/:\1\2%/
The "robot head" [^:] is any non-colon character. Since we need two such characters, we write [^:]{2} and enclose it in parentheses to reference later.

Regular expression ([^:]{2})(.*)\1 matches any text beginning and ending with the same pair of characters, and this pair may be referenced by \1. The text in between these two pairs may be referenced using \2.

Now, if we found some matching text, we replace it by :\1\2%, that is replace second pair of characters by % and prepend (to the matched substring) colon to mark characters to be removed later. We can not just remove first pair of characters (\1) because remaining text may still contain \1 somewhere else (e.g. if there are 3 repeated pairs and we remove two of them on the first iteration, then there will be no way to find third pair on the second iteration). I use % in place of removed pairs to preserve the "structure" of the string between iterations. This way we get
Code:
ABAABB -> :ABA%B -> AB
instead of
Code:
ABAABB -> :ABAB ->::AB -> empty
(if we would just removed second pair)

Finally, if the substitution was successful, we jump to label a, otherwise there are no more repeated pairs and we proceed to cleanup.

Hope that helps.
 
1 members found this post helpful.
Old 08-17-2012, 04:59 PM   #20
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,083

Original Poster
Rep: Reputation: 287Reputation: 287Reputation: 287
Quote:
Originally Posted by firstfire View Post
... Hope that helps.
Yes, thank you, this is the level of detail I needed.

Thanks to your sed and also the excellent awk contributed by grail this thread could be marked SOLVED. However, I'll hold it open for a bit longer because I have a follow-on question for both of you.

In the best case each line in the output file would show the modified character string, the repeated letter pair, and the original (unmodified) string.

Example:
Code:
ADVK  AR  AARDVARK
This may be called "the icing on the cake" so don't bother with it unless the answer is easy.

Daniel B. Martin
 
Old 08-17-2012, 07:01 PM   #21
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728
You have just about described the whole thing. "part 1" is basically a sed replace command (s = replace). It finds an instance in which a letter pair is repeated, and then tags the 2nd member of the pair in such a way that it will not be detected again. The "ta" command says: If s did a replacement, then branch back to the beginning of the loop and do it again. The last time thru, with nothing left to change, the "ta" will be bypasses, and then the various tags can be removed.
 
Old 08-18-2012, 03:03 AM   #22
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 623

Rep: Reputation: 364Reputation: 364Reputation: 364Reputation: 364
Hi.

Formatting may be done as follows
Code:
$ echo BFAARBFDVARKARBBFF | sed -r 'h; :a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; G; :b;s/:+(..)(.*)\n/\2\n\1 /;tb; s/%//g; s/\n/ /;'
:BFAARBFDVARKARB%F
::BFAAR%DVARKARB%F
::BFA:AR%DVARK%B%F
::BFA::AR%DV%K%B%F
::BFA::AR%DV%K%B%F
ADVKBF AR BF BFAARBFDVARKARBBFF
but this is not elegant at all and hardly may be called a one-liner.

EDIT:
Well, here is another approach which leads to the desired output directly
Code:
$ echo BFAARBFDVARKARBBFF | sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta; G; l; s/%//g; s/\n/ /; s/ +/ /g'
%AARBFDVARKARB%F BF
%A%BFDVARK%B%F BF AR
%A%%DVARK%B%F % AR BF
%A%%DV%K%B%F % % BF AR
%A%%DV%K%B%F % % BF AR
%A%%DV%K%B%F % % BF AR\nBFAARBFDVARKARBBFF$
ADVKBF BF AR BFAARBFDVARKARBBFF
Again, p and l commands are optional.

Last edited by firstfire; 08-18-2012 at 03:47 AM.
 
1 members found this post helpful.
Old 08-18-2012, 03:26 AM   #23
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,476

Rep: Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888
Code:
# Awk
$ echo BFAARBFDVARKARBBFF | awk -F "" '{printf "%s ",$0;for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1)){printf "%s ",$i$(i+1);gsub($i$(i+1),"")}}1'
BFAARBFDVARKARBBFF BF AR ADVKBF

#Ruby
$ echo BFAARBFDVARKARBBFF | ruby -pe 's = [$_.chomp];until s.nil?; print s[0] + " ";s = $_.scan(/(..).*\1/)[0];$_.gsub!(s[0],"") unless s.nil?;end'
BFAARBFDVARKARBBFF BF AR ADVKBF
 
1 members found this post helpful.
Old 08-18-2012, 04:44 PM   #24
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,083

Original Poster
Rep: Reputation: 287Reputation: 287Reputation: 287
Quote:
Originally Posted by firstfire View Post
Code:
 sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta; G; l; s/%//g; s/\n/ /; s/ +/ /g'
Thanks, once again, for your interest in this subject. This latest version provides the desired function... but... my output file contains unwanted ^M characters in the output file. Example:
Code:
a%dv%k^M ar
a%dv%k^M ar
a%dv%k\r ar\naardvark\r$
advk^M ar aardvark
They may be eliminated with a subsequent "cleanup" ...
Code:
|tr -d '\b\r'
... but I don't understand where they came from. Is it possible to prevent them from being created in the first place?

My program using your code:
Code:
echo
echo "Method of LQ member firstfire"
sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta;
   G; l;
   s/%//g;
   s/\n/ /;
   s/ +/ /g' < $Work01 > $Work07
My program using your code followed by a "cleanup":
Code:
echo
echo "Method of LQ member firstfire"
echo " .. and remove unwanted ^M strings"
sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta;
   G; l;
   s/%//g;
   s/\n/ /;
   s/ +/ /g' < $Work01  \
|tr -d '\b\r' \
> $Work08
Daniel B. Martin
 
Old 08-18-2012, 11:25 PM   #25
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 623

Rep: Reputation: 364Reputation: 364Reputation: 364Reputation: 364
Hi, Daniel.

Your input file uses DOS line endings '\r\n'. You can remove \r together with %'s using
Code:
s/[%\r]//g
Or use dos2unix utility to convert input data.
 
1 members found this post helpful.
Old 08-19-2012, 01:57 PM   #26
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,083

Original Poster
Rep: Reputation: 287Reputation: 287Reputation: 287
Thank you, grail, for this most recent version. Recast into shorter lines, it is:
Code:
echo "Method of LQ Guru grail"
awk -F "" '{printf "%s ", $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {printf "%s ",$i$(i+1);gsub($i$(i+1),"")}}1' < $Work01 > $Work13
This works nicely, it produces the desired character strings, but (minor nitpick) not in the desired order.
It produces aardvark ar advk but I prefer advk ar aardvark

Okay, no big deal, I'll fix it (or so I thought). This is my attempt which fails.
Code:
echo "Method of LQ Guru grail with extension,"
echo " modified to reverse the order of output strings."
awk -F "" '{
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {printf "%s ",gsub($i$(i+1),"");$i$(i+1)};$0}1' < $Work01 > $Work14
Please show where this went wrong.

Daniel B. Martin
 
Old 08-19-2012, 02:00 PM   #27
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,083

Original Poster
Rep: Reputation: 287Reputation: 287Reputation: 287
Quote:
Originally Posted by firstfire View Post
Your input file uses DOS line endings '\r\n'.
Yes, of course, the problem was in the data and not in the code. That's fixed now.

I'll hold this thread open, awaiting a response from grail regarding his solution.

Daniel B. Martin
 
Old 08-19-2012, 08:54 PM   #28
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,476

Rep: Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888
The easiest solution is to save $0 at the start and remove the 1 at the end, then use a single print to deliver both:
Code:
awk -F "" '{orig = $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {printf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,orig}' < $Work01 > $Work13
Edit: oops ... just realised this is not what you wanted. You want the answer before the parts removed and then the question.
So let us try again:
Code:
awk -F "" '{orig = $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13

Last edited by grail; 08-19-2012 at 09:02 PM.
 
Old 08-19-2012, 09:20 PM   #29
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,083

Original Poster
Rep: Reputation: 287Reputation: 287Reputation: 287
Quote:
Originally Posted by grail View Post
The easiest solution is to save $0 at the start and remove the 1 at the end, then use a single print to deliver both ...
I used this ...
Code:
echo "Method of LQ Guru grail,"
echo " modified to reverse order of output words."
awk -F "" '{orig = $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13
... and got this ...
Code:
: cmd. line:3:    {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig}
awk: cmd. line:3:                     ^ syntax error
Daniel B. Martin
 
Old 08-19-2012, 09:37 PM   #30
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,476

Rep: Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888
My bad ... forgot about the returning function rule, ie it needs brackets
Code:
awk -F "" '{orig = $0;
   for(i=1;i < NF-2;i++) 
   if(substr($0,i+2) ~ $i$(i+1))
   {parts = parts sprintf("%s ",$i$(i+1));gsub($i$(i+1),"")}print $0,parts orig}' < $Work01 > $Work13
And I added the concatenation i forgot as well .... on fire this morning
 
  


Reply

Tags
awk, grep, regex, sed


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Getting the no repeated pairs out of a a string charlitos Programming 2 02-17-2009 02:52 AM
LXer: 25 Arguments for the Elimination of Copy Protection LXer Syndicated Linux News 0 10-20-2008 07:50 AM
Script to move directories based on first letter to a new directory of that letter tworkemon Linux - Newbie 8 01-30-2007 07:18 PM
+performance +gcc +elimination C++Boar Programming 6 09-02-2004 11:48 PM
Cups elimination beowulf405 Linux - Newbie 5 06-30-2004 09:16 PM


All times are GMT -5. The time now is 11:19 PM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration