Excellent, Daniel! I could not have explained it better, especially considering my poor English.
I'll try, though.
Code:
s/([^:]{2})(.*)\1/:\1\2%/
The "robot head" [^:] is any non-colon character. Since we need two such characters, we write [^:]{2} and enclose it in parentheses so that we can reference it later.
The regular expression ([^:]{2})(.*)\1 matches any text that begins and ends with the same pair of characters. The pair can be referenced as \1, and the text between the two pairs as \2.
Now, if we find matching text, we replace it with :\1\2%. That is, we replace the second pair of characters with % and prepend a colon to the matched substring to mark the characters to be removed later. We cannot simply remove the first pair of characters (\1) right away, because the remaining text may still contain \1 somewhere else (e.g. if there are three repeated pairs and we remove two of them on the first iteration, there is no way to find the third pair on the second iteration). I use % in place of the removed pair to preserve the "structure" of the string between iterations. This way we get
Code:
ABAABB -> :ABA%B -> AB
instead of
Code:
ABAABB -> :ABAB -> ::AB -> empty
(which is what we would get if we just removed the second pair)
Finally, if the substitution was successful, we jump to label a; otherwise there are no more repeated pairs and we proceed to cleanup.
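To see the loop described above as one piece, here is a minimal sketch. Only the s command comes from this thread; the function name strip_repeated_pairs is made up for illustration, and the two cleanup commands after `ta` are my assumption about what "cleanup" could look like. It assumes GNU sed and has only been checked against the two example words used in this thread, so overlapping pairs may behave differently.

```shell
# Hypothetical helper; the cleanup step is an assumption, not the original script.
strip_repeated_pairs() {
  sed -E ':a
# Mark the first pair with a leading colon, replace the second pair with %.
s/([^:]{2})(.*)\1/:\1\2%/
# If the substitution succeeded, loop back and look for another repeated pair.
ta
# Assumed cleanup: drop each run of colons plus the marked pair, then drop the % placeholders.
s/:+[^:]{2}//g
s/%//g'
}

printf '%s\n' ABAABB AARDVARK | strip_repeated_pairs
```

For ABAABB the intermediate pattern space is :ABA%B, as in the walk-through above, and the cleanup leaves AB.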
Yes, thank you, this is the level of detail I needed.
Thanks to your sed and also the excellent awk contributed by grail this thread could be marked SOLVED. However, I'll hold it open for a bit longer because I have a follow-on question for both of you.
In the best case each line in the output file would show the modified character string, the repeated letter pair, and the original (unmodified) string.
Example:
Code:
ADVK AR AARDVARK
This may be called "the icing on the cake" so don't bother with it unless the answer is easy.
You have just about described the whole thing. "Part 1" is basically a sed substitute command (s = substitute). It finds a letter pair that is repeated and tags the second occurrence of the pair so that it will not be detected again. The "ta" command says: if s made a replacement, branch back to the beginning of the loop and do it again. The last time through, with nothing left to change, the "ta" is bypassed, and the various tags can then be removed.
Thanks, once again, for your interest in this subject. This latest version provides the desired function... but... my output file contains unwanted ^M characters. Example:
Code:
a%dv%k^M ar
a%dv%k^M ar
a%dv%k\r ar\naardvark\r$
advk^M ar aardvark
They may be eliminated with a subsequent "cleanup" ...
Code:
|tr -d '\b\r'
... but I don't understand where they came from. Is it possible to prevent them from being created in the first place?
My program using your code:
Code:
echo
echo "Method of LQ member firstfire"
sed -r 'h; :a; s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/; p; ta;
G; l;
s/%//g;
s/\n/ /;
s/ +/ /g' < $Work01 > $Work07
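A possible explanation for the ^M characters: the `l` output above shows a \r at the end of the original word, which suggests the input file has DOS-style CRLF line endings. In that case sed does not create the carriage returns; it carries them along, and because (.*) captures the trailing \r as part of \3, the pair appended after \3 lands behind it. Below is a minimal sketch of the same commands with the \r stripped first and the debugging `p` and `l` left out. The function name annotate_pairs is made up, the CRLF input is an assumption simulated with printf, and `\r` in the regex relies on GNU sed.

```shell
# Hypothetical helper; assumes GNU sed and input lines that may end in CR.
annotate_pairs() {
  sed -E '
# Remove a trailing carriage return before doing anything else.
s/\r$//
# Save the original word in the hold space.
h
:a
# Replace both copies of a repeated pair with % and append the pair at the end.
s/([^% ]{2})(.*)\1(.*)/%\2%\3 \1/
ta
# Append the saved original, then tidy up markers, the newline and extra spaces.
G
s/%//g
s/\n/ /
s/ +/ /g'
}

# Simulated CRLF input (assumption about what the input file contains).
printf 'aardvark\r\n' | annotate_pairs
```

With the carriage return gone before the loop starts, no later `tr -d` cleanup should be needed. Running `file` or `od -c` on the input file would confirm whether it really has CRLF line endings.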
My program using your code followed by a "cleanup":
This works nicely; it produces the desired character strings, but (minor nitpick) not in the desired order.
It produces "aardvark ar advk" but I prefer "advk ar aardvark".
Okay, no big deal, I'll fix it (or so I thought). This is my attempt, which fails.
Code:
echo "Method of LQ Guru grail with extension,"
echo " modified to reverse the order of output strings."
awk -F "" '{
for(i=1;i < NF-2;i++)
if(substr($0,i+2) ~ $i$(i+1))
{printf "%s ",gsub($i$(i+1),"");$i$(i+1)};$0}1' < $Work01 > $Work14
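One thing that may explain the failure: in awk, gsub returns the number of replacements it made, not the modified string, so the printf above prints a count. Here is a minimal sketch of one way to get the "modified, pair, original" order. It is not grail's method; the function name list_pairs is made up, and it uses index and substr rather than gsub so that the pair is treated as plain text and not as a regular expression. It removes every occurrence of a repeated pair and does not re-check earlier positions after a removal.

```shell
# Hypothetical helper; a sketch under the assumptions stated above.
list_pairs() {
  awk '{
    orig = $0; s = $0; pairs = ""
    i = 1
    # A repeat needs a pair at i and another pair starting at i+2 or later.
    while (i + 3 <= length(s)) {
      p = substr(s, i, 2)
      if (index(substr(s, i + 2), p) > 0) {
        # Remove every occurrence of the pair p from s.
        out = ""; rest = s
        while ((k = index(rest, p)) > 0) {
          out = out substr(rest, 1, k - 1)
          rest = substr(rest, k + 2)
        }
        s = out rest
        pairs = pairs (pairs == "" ? "" : " ") p
      } else {
        i++
      }
    }
    # Modified string first, then the pair(s), then the original.
    print s, pairs, orig
  }'
}

printf 'aardvark\n' | list_pairs
```

A word with no repeated pair is printed with an empty middle field, so it shows up as the word, two spaces, and the word again.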