RegEx remove duplicate words - How?
Hello
I want to remove repetitive duplicate words in a text. Like in the following example 'The the'. Quote:
Code:
([a-zA-Z]+)\s+\1
Thanks |
I'm not sure a regex will do the job; I'm a bit rusty on my language theory. This would be easy to do with a for loop or tail recursion. Something like this:
Pseudocode example of a for loop doing it: Code:
words = split_into_words(stdin); |
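The rest of the pseudocode above is cut off, so here is a rough Python sketch of what a loop-based approach could look like. The function name `drop_repeated_words` is made up for illustration, and the choices to compare case-insensitively and to split on whitespace are assumptions, not something the post specifies.

```python
def drop_repeated_words(text: str) -> str:
    """Drop any word that repeats the word immediately before it."""
    kept = []
    for word in text.split():
        # Assumption: compare case-insensitively so "The the" counts as a repeat.
        if kept and kept[-1].lower() == word.lower():
            continue
        kept.append(word)
    # Note: joining with single spaces collapses the original whitespace.
    return " ".join(kept)


print(drop_repeated_words("The the quick quick brown fox"))
```

This keeps the first spelling of each repeated word and treats punctuation as part of the word, so "end. End" would not be seen as a repeat.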
Another Example:
Code:
text=(one One one oNe ONe two two three three four four Four) |
What you posted is just a regexp; I don't really know how that should work.
You need a programming language, like sed/awk/perl/python/whatever, to do the job. Perl regexps are really powerful; they have upper/lower case conversions too. The solution may also depend on other things, like the size of the text. So please post your program, not only a useless part of it. |
Quote:
Here is the link: https://regex101.com/r/f0AKe5/1 |
Is this a homework question?
All a regular expression does is match characters. You need something that takes action when a match occurs, hence my initial thoughts and pan64's post. Formally, regular expressions need to be implementable with a deterministic finite state automaton (DFA). If you don't have a CS background, that isn't as complicated as it sounds. What this means is that a regex has no memory and takes no actions. All a regex processor does is take an input and attempt to run it through the equivalent of a DFA. If it ends up in a stop state, you have a match; if not, you don't. Hopefully this makes sense to you. |
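To illustrate the "a regex only matches, something else has to act" point, here is a small Python sketch using the pattern from post #1. The helper name `find_repeats` and the sample sentence are assumptions for illustration only.

```python
import re

# The pattern from post #1, unchanged.
PATTERN = re.compile(r"([a-zA-Z]+)\s+\1")


def find_repeats(text):
    """Report where the pattern matches; the text itself is not modified."""
    return [(m.span(), m.group(0)) for m in PATTERN.finditer(text)]


sample = "it was was fine"
print(find_repeats(sample))  # [((3, 10), 'was was')]
print(sample)                # still "it was was fine"
```

Removing the duplicate is a separate step that the surrounding program (sed, Perl, Python, ...) has to perform with the match information.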
An RE is case-sensitive by default. Your sample in post #1 works for "the the" but not for "The the".
Furthermore, it wrongly matches "the theme". |
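Both problems are easy to reproduce. The following Python sketch checks the post #1 pattern against the two cases mentioned, then tries one possible repair (word boundaries plus a case-insensitive flag). The repaired pattern is an assumption about how one might fix it, not necessarily what anyone in the thread used.

```python
import re

naive = re.compile(r"([a-zA-Z]+)\s+\1")

print(bool(naive.search("the the")))        # True
print(bool(naive.search("The the")))        # False: "The" and "the" differ in case
print(naive.search("the theme").group(0))   # "the the": matches inside "theme"

# One possible repair: \b anchors whole words, IGNORECASE ignores case.
fixed = re.compile(r"\b([a-z]+)\b\s+\1\b", re.IGNORECASE)

print(bool(fixed.search("The the")))        # True
print(bool(fixed.search("the theme")))      # False
```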
You are on the right track, but as noted by others you need a little more than just the regex.
If you use your regular expression with sed and a replacement expression it should work with only a little touch up. Something like this example I just wrote... Code:
$ cat example.txt
UPDATE: As MadeInGermany points out, you also need to make it case-insensitive, which is easy with a simple sed option. My own BACKREFERENCE_EXPN, slightly different from yours, handles the word boundary problem - also an exercise for the student!
UPDATE2: Your specification is somewhat vague in saying "repetitive duplicate words". Duplicate means exactly two, whereas repetitive means two or more. Another exercise for the student - handle any number of repeated words. |
Resolved finally :).
I got some clues from these posts:
https://stackoverflow.com/questions/...displaying-the
http://shrenoid.com/hackerrank-prblm...iwords-solutn/
https://www.regular-expressions.info/modifiers.html
So the RegEx to find repetitive words is
Code:
(?i)\b([a-z]+)\b(?:\s+\1\b)+
Thanks, everybody. |
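For anyone landing here later: the regex only finds the repeats, so it still needs a substitution step to remove them. Here is a minimal Python sketch, assuming you want to keep the first occurrence of each run; the function name `collapse_repeats` is made up for this example.

```python
import re

# The final regex from this thread, unchanged.
REPEATED = re.compile(r"(?i)\b([a-z]+)\b(?:\s+\1\b)+")


def collapse_repeats(text: str) -> str:
    """Replace each run of repeated words with its first occurrence."""
    return REPEATED.sub(r"\1", text)


print(collapse_repeats("The the theme of of of the talk"))
print(collapse_repeats("one One one oNe ONe two two three three four four Four"))
```

Because the character class is `[a-z]`, words containing digits, apostrophes, or non-ASCII letters are not treated as repeats by this pattern.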
This is THE book (imho ;) ) for regexes http://regex.info/book.html (& that exercise is used as an example, partly because the author used it to check the book before sending to the printers ...)