Python solution:
Code:
token = "AARDVARK" |
[QUOTE=grail;4756321]
Code:
awk -F "" '{for(i=1;i < NF-2;i++)if(substr($0,i+2) ~ $i$(i+1))gsub($i$(i+1),"")}1' file
I'm not ignoring your Ruby solution, but that language is beyond my scope of knowledge.
Quote:
Daniel B. Martin |
I am not sure where you are confused, Daniel. Your explanation seems pretty much bang on to me.
|
Hi.
Excellent, Daniel! I could not explain it better, especially taking into account my bad English. I'll try, though.
Code:
s/([^:]{2})(.*)\1/:\1\2%/
The regular expression ([^:]{2})(.*)\1 matches any text that begins and ends with the same pair of characters. The pair can be referenced as \1, and the text between the two pairs as \2. When a match is found, we replace it with :\1\2%, that is, we replace the second pair with % and prepend a colon to the matched substring to mark the characters to be removed later. We cannot simply remove the first pair (\1) right away, because the remaining text may still contain \1 somewhere else (e.g. if there are 3 repeated pairs and we remove two of them on the first iteration, there will be no way to find the third pair on the second iteration). I use % in place of removed pairs to preserve the "structure" of the string between iterations. This way we get
Code:
ABAABB -> :ABA%B -> AB
rather than
Code:
ABAABB -> :ABAB -> ::AB -> empty
Finally, if the substitution was successful, we jump to label a; otherwise there are no more repeated pairs and we proceed to cleanup. Hope that helps. |
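For readers more at home in Python than sed, the mark-and-loop idea can be sketched roughly as follows. This is only an illustration, not the sed script itself: the name remove_repeated_pairs is made up here, the cleanup step (dropping each colon-marked pair, then the % placeholders) is assumed from the description above, and the input is assumed to contain no ':' or '%' characters.

```python
import re

# Same pattern as the sed command: a pair without ':', any text, the same pair again.
PAIR = re.compile(r"([^:]{2})(.*)\1")


def remove_repeated_pairs(text):
    """Hypothetical helper mirroring the sed loop; assumes no ':' or '%' in text."""
    while True:
        # Mark the first pair with ':' and replace the second pair with '%'.
        text, count = PAIR.subn(r":\1\2%", text, count=1)
        if count == 0:  # nothing replaced: like sed's "ta" falling through
            break
    text = re.sub(r":+..", "", text)  # assumed cleanup: drop the marked pairs
    return text.replace("%", "")      # assumed cleanup: drop the placeholders


print(remove_repeated_pairs("AARDVARK"))  # ADVK
print(remove_repeated_pairs("ABAABB"))    # AB
```

The loop replaces one match per pass, which plays the role of the branch back to label a.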
Quote:
Thanks to your sed and also the excellent awk contributed by grail, this thread could be marked SOLVED. However, I'll hold it open for a bit longer because I have a follow-on question for both of you. In the best case each line in the output file would show the modified character string, the repeated letter pair, and the original (unmodified) string. Example:
Code:
ADVK AR AARDVARK
Daniel B. Martin |
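The requested layout (modified string, repeated pair, original string) can be sketched in Python with a plain left-to-right scan. This is an illustrative stand-in, not a translation of anyone's posted solution: report_repeated_pairs is a hypothetical name, and it matches pairs literally rather than as regular expressions.

```python
def report_repeated_pairs(text):
    """Hypothetical helper: return 'modified pairs original' for one input line."""
    original = text
    pairs = []
    i = 0
    # A pair at position i can only repeat if at least two more characters follow it.
    while i + 4 <= len(text):
        pair = text[i:i + 2]
        if pair in text[i + 2:]:
            pairs.append(pair)
            text = text.replace(pair, "")  # remove every occurrence of the pair
        i += 1
    # Join the modified string, any repeated pairs, and the saved original.
    return " ".join([text] + pairs + [original])


print(report_repeated_pairs("AARDVARK"))  # ADVK AR AARDVARK
```

Saving the original before any removal is what makes it possible to print all three columns in a single step.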
You have just about described the whole thing. "Part 1" is basically a sed replace command (s = substitute). It finds an instance in which a letter pair is repeated, and then tags the 2nd member of the pair in such a way that it will not be detected again. The "ta" command says: if s did a replacement, then branch back to label a (the beginning of the loop) and do it again. The last time thru, with nothing left to change, the "ta" will be bypassed, and then the various tags can be removed.
|
Hi.
Formatting may be done as follows:
Code:
$ echo BFAARBFDVARKARBBFF | sed -r 'h; :a; s/([^:]{2})(.*)\1/:\1\2%/; p; ta; G; :b;s/:+(..)(.*)\n/\2\n\1 /;tb; s/%//g; s/\n/ /;'
EDIT: Well, here is another approach which leads to the desired output directly:
Code:
$ echo BFAARBFDVARKARBBFF | sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta; G; l; s/%//g; s/\n/ /; s/ +/ /g' |
Code:
# Awk |
Quote:
Code:
a%dv%k^M ar
Code:
|tr -d '\b\r'
My program using your code:
Code:
echo
Code:
echo |
Hi, Daniel.
Your input file uses DOS line endings ('\r\n'). You can remove \r together with the %'s using:
Code:
s/[%\r]//g |
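The same character-class cleanup can be tried out in Python. This is a small sketch only; strip_markers is a hypothetical name, and the sample string imitates the a%dv%k^M output quoted above.

```python
import re


def strip_markers(line):
    """Hypothetical helper: drop '%' placeholders and carriage returns in one pass."""
    return re.sub(r"[%\r]", "", line)


print(strip_markers("a%dv%k\r ar"))  # advk ar
```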
Thank you, grail, for this most recent version. Recast into shorter lines, it is:
Code:
echo "Method of LQ Guru grail"
It produces
aardvark ar advk
but I prefer
advk ar aardvark
Okay, no big deal, I'll fix it (or so I thought). This is my attempt, which fails.
Code:
echo "Method of LQ Guru grail with extension,"
Daniel B. Martin |
Quote:
I'll hold this thread open, awaiting a response from grail regarding his solution.
Daniel B. Martin |
The easiest solution is to save $0 at the start and remove the 1 at the end, then use a single print to deliver both:
Code:
awk -F "" '{orig = $0;
So let us try again:
Code:
awk -F "" '{orig = $0; |
Quote:
Code:
echo "Method of LQ Guru grail,"
Code:
: cmd. line:3: {parts = sprintf "%s ",$i$(i+1);gsub($i$(i+1),"")}print $0,parts orig} |
My bad ... I forgot that sprintf is a function that returns a value, i.e. its arguments need parentheses :)
Code:
awk -F "" '{orig = $0; |