LinuxQuestions.org

ddenial 06-11-2019 09:27 AM

RegEx remove duplicate words - How?
 
Hello

I want to remove repeated words from a text, such as 'The the' in the following example.

Quote:

You’re editing a document and would like to check it for any incorrectly repeated words. You want to find these doubled words despite capitalization differences, such as with The the. You also want to allow differing amounts of whitespace between words, even if this causes the words to extend across more than one line.

I can't figure it out. The only thing I came up with is this
Code:

([a-zA-Z]+)\s+\1
But it's not working. I'd appreciate any help.

Thanks

tyler2016 06-11-2019 12:46 PM

I'm not sure a regex alone will do the job; I'm a bit rusty on my language theory. This would be easy to do with a for loop or tail recursion. Something like this:

Pseudocode example of a for loop doing it:

Code:

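words = split_into_words(stdin);
// compare each word with the one that follows it
// note: == is case sensitive; lowercase both sides to catch "The the"
for(i = 0; i < (words.length - 1); i += 1)
{
  if(words[i] == words[i+1])
  {
      delete(words[i+1]);   // drop the repeated word
      i = i - 1;            // re-check the same index, so triples collapse too
  }
}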
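
A runnable take on the loop above, as a minimal Python sketch. The function name `drop_adjacent_duplicates` is made up for illustration, and it assumes that splitting on whitespace is good enough (punctuation stays attached to its word). The comparison is lowercased so that "The the" counts as a repeat.

```python
def drop_adjacent_duplicates(text):
    """Collapse runs of the same word (ignoring case), keeping the first spelling."""
    kept = []
    for word in text.split():          # split on any whitespace, including newlines
        if kept and kept[-1].lower() == word.lower():
            continue                   # same word as the previous one: skip it
        kept.append(word)
    return " ".join(kept)


print(drop_adjacent_duplicates("The the quick  quick\nbrown fox"))
```

Building a new list avoids deleting from a list while indexing into it. The trade-off is that all whitespace is normalized to single spaces.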


teckk 06-11-2019 01:12 PM

Another Example:
Code:

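# Sample input as a bash array
text=(one One one oNe ONe two two three three four four Four)

a=""   # previous word seen
# ${text[@],,} lowercases every element (needs bash 4 or newer)
for i in "${text[@],,}"; do
    # print a word only when it differs from the one before it
    if [ "$i" != "$a" ]; then
        echo "$i"
    fi
    a="$i"
done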
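
For comparison, the same idea can be sketched in Python with `itertools.groupby`. This is an illustration only, not teckk's code: it groups on the lowercased word but, unlike the bash loop, keeps the first spelling of each run instead of printing everything in lowercase.

```python
from itertools import groupby

words = "one One one oNe ONe two two three three four four Four".split()

# groupby() bundles consecutive items whose key (the lowercased word) is equal;
# taking the first item of each bundle keeps that word's original spelling.
unique = [next(group) for _, group in groupby(words, key=str.lower)]

print(unique)
```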


pan64 06-11-2019 01:39 PM

What you posted is just a regexp; I don't really know how that is supposed to work on its own.
You need a programming language, like sed/awk/perl/python/whatever, to do the job. Perl regexps are really powerful; they have upper/lower case conversions too.
The solution may also depend on other things, like the size of the text.
So please post your program, not only a useless part of it.

ddenial 06-11-2019 01:49 PM

I'm not using any programming language. I'm just using the online regex tester regex101. As for the flavor, it says PCRE (PHP), which is the default.

Here is the link: https://regex101.com/r/f0AKe5/1

tyler2016 06-11-2019 02:34 PM

Is this a homework question?

All a regular expression does is match characters. You need something that takes action when a match occurs, hence my initial thoughts and pan64's post. Formally, a regular expression has to be implementable as a deterministic finite automaton (DFA). If you don't have a CS background, that isn't as complicated as it sounds: a regex in the formal sense has no memory and takes no actions. (Backreferences such as \1, which engines like PCRE support, go beyond that formal definition, but they still only match.) All a regex processor does is take an input and run it through the equivalent of a DFA. If it ends up in an accepting state, you have a match; if not, you don't.

Hopefully this makes sense to you.
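
The split between matching and acting can be sketched in Python. This is only an illustration that assumes Python's `re` module; the sample sentence is made up, and the pattern is the one from post #1.

```python
import re

pattern = re.compile(r"([a-zA-Z]+)\s+\1")   # the pattern from post #1
text = "Paris in the the spring"

# Matching only reports where the pattern was found; the text is untouched.
match = pattern.search(text)
print(match.group(0), match.span())

# Acting on the match is a separate step performed by the host tool,
# here re.sub(), which keeps the first copy of the word (group 1).
print(pattern.sub(r"\1", text))
```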

MadeInGermany 06-11-2019 02:47 PM

A regex is case-sensitive by default. Your sample in post #1 works for "the the" but not for "The the".
Furthermore, it wrongly matches "the theme".
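
Both problems are easy to reproduce. A quick check, assuming Python's `re` module treats this particular pattern the same way the PCRE tester does:

```python
import re

first_try = re.compile(r"([a-zA-Z]+)\s+\1")   # pattern from post #1

for sample in ["the the", "The the", "the theme"]:
    m = first_try.search(sample)
    # Show what, if anything, the pattern matched in each sample.
    print(repr(sample), "->", repr(m.group(0)) if m else None)
```

The second sample finds nothing because `\1` must repeat the captured text exactly, capital letter included. The third matches because nothing in the pattern requires the repeated word to end where the backreference ends.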

astrogeek 06-11-2019 03:00 PM

You are on the right track, but as others have noted, you need a little more than just the regex.

If you use your regular expression with sed and a replacement expression it should work with only a little touch up.

Something like this example I just wrote...

Code:

$ cat example.txt
A sentence with repeated    repeated words separated by one or more spaces in the mix mix.

$ sed -r 's/(BACKREFERENCE_EXPN)\s*\1/REPLACE_EXPN/g' example.txt
A sentence with repeated words separated by one or more spaces in the mix.

I left the BACKREFERENCE_EXPN and the replacement REPLACE_EXPN as an exercise for you. Hint: Not far from what you already have!

UPDATE: As MadeInGermany points out, you also need to make it case insensitive, which is easy with a simple sed option. My own BACKREFERENCE_EXPN, slightly different from yours, handles the word boundary problem - also an exercise for the student!

UPDATE2: Your specification is somewhat vague by saying "repetitive duplicate words". Duplicate means exactly two, whereas repetitive means two or more. Another exercise for the student - handle any number of repeated words.

ddenial 06-11-2019 10:17 PM

Resolved finally :).

I got some clues from these posts
https://stackoverflow.com/questions/...displaying-the
http://shrenoid.com/hackerrank-prblm...iwords-solutn/
https://www.regular-expressions.info/modifiers.html

So the RegEx to find repetitive words is
Code:

(?i)\b([a-z]+)\b(?:\s+\1\b)+
Here is the RegEx101 link: https://regex101.com/r/f0AKe5/3

Thanks, everybody.
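
To go from finding the repeats to removing them, the final pattern can be paired with a replacement. A minimal sketch, assuming Python's `re` module rather than the PCRE flavor used on regex101; the sample text is made up for illustration.

```python
import re

# ddenial's final pattern: a word, then one or more whitespace-separated repeats.
doubled = re.compile(r"(?i)\b([a-z]+)\b(?:\s+\1\b)+")

text = "The the quick quick\nquick brown fox chased the theme park."

# Replace each run of repeats with the first occurrence (group 1).
print(doubled.sub(r"\1", text))
```

Because `\s+` also matches newlines, repeats that straddle a line break are caught as well, and the trailing `\b` keeps "the theme" intact.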

chrism01 06-11-2019 11:56 PM

This is THE book (imho ;) ) for regexes http://regex.info/book.html (& that exercise is used as an example, partly because the author used it to check the book before sending to the printers ...)


All times are GMT -5.