LinuxQuestions.org
Old 06-11-2019, 09:27 AM   #1
ddenial
Member
 
Registered: Dec 2016
Distribution: CentOS, Fedora, Ubuntu
Posts: 177

Rep: Reputation: 35
RegEx remove duplicate words - How?


Hello

I want to remove repetitive duplicate words in a text. Like in the following example 'The the'.

Quote:
You're editing a document and would like to check it for any incorrectly repeated words. You want to find these doubled words despite capitalization differences, such as with "The the". You also want to allow differing amounts of whitespace between words, even if this causes the words to extend across more than one line.
I can't figure it out. The only thing I came up with is this:
Code:
([a-zA-Z]+)\s+\1
But it's not working. I'd appreciate any help.

Thanks
 
Old 06-11-2019, 12:46 PM   #2
tyler2016
Member
 
Registered: Sep 2018
Distribution: Debian, CentOS, FreeBSD
Posts: 204

Rep: Reputation: Disabled
I'm not sure if a regex will do the job; I'm a bit rusty on my language theory. This would be easy to do with a for loop or tail recursion. Something like this:

Pseudocode example of a for loop doing it:

Code:
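words = split_into_words(stdin);
for(i=0; i < (words.length - 1); i+=1)
{
   // compare each word with the one that follows it
   if(words[i] == words[i+1])
   {
      delete(words[i+1]);
      // stay on the same word in case it is repeated again
      i = i - 1;
   }
}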
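
The loop idea above can be sketched in Python as follows. This is only an illustration, not the poster's code: the function name `drop_adjacent_duplicates` and the sample sentence are made up, and the comparison is lowercased on the assumption that "The the" should count as a repeat, as the original question asks.

```python
def drop_adjacent_duplicates(words):
    """Return the words with case-insensitive adjacent repeats removed."""
    result = []
    for word in words:
        # Keep a word only if it differs from the last word that was kept.
        if not result or result[-1].lower() != word.lower():
            result.append(word)
    return result


if __name__ == "__main__":
    sample = "The the quick quick QUICK brown fox"  # hypothetical input
    print(" ".join(drop_adjacent_duplicates(sample.split())))
```

Note that splitting on whitespace throws away the original spacing and line breaks, and a word with trailing punctuation ("mix.") is not treated as equal to the bare word ("mix").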

Last edited by tyler2016; 06-11-2019 at 12:49 PM.
 
Old 06-11-2019, 01:12 PM   #3
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 2,167

Rep: Reputation: 437
Another Example:
Code:
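# Sample input: consecutive repeats in mixed case
text=(one One one oNe ONe two two three three four four Four)

a=""
# "${text[@],,}" lowercases every element (needs bash 4 or later)
for i in "${text[@],,}"; do
    # Print a word only when it differs from the previous one
    if [ "$i" != "$a" ]; then
        echo "$i"
    fi
    a="$i"
done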
 
Old 06-11-2019, 01:39 PM   #4
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 12,830

Rep: Reputation: 4033
What you posted is just a regexp; I don't really know how that should work.
You need a programming language, like sed/awk/perl/python/whatever, to do the job. Perl regexps are really powerful; they have upper/lower case conversions too.
The solution also may depend on other things, like the size of the text.
So please post your program, not only a useless part of it.
 
Old 06-11-2019, 01:49 PM   #5
ddenial
Member
 
Registered: Dec 2016
Distribution: CentOS, Fedora, Ubuntu
Posts: 177

Original Poster
Rep: Reputation: 35
Quote:
Originally Posted by pan64
What you posted is just a regexp; I don't really know how that should work.
You need a programming language, like sed/awk/perl/python/whatever, to do the job. Perl regexps are really powerful; they have upper/lower case conversions too.
The solution also may depend on other things, like the size of the text.
So please post your program, not only a useless part of it.
I'm not using any programming language. I'm just using the online regex tester regex101. As for the flavor, it says PCRE (PHP), which is the default.

Here is the link: https://regex101.com/r/f0AKe5/1
 
Old 06-11-2019, 02:34 PM   #6
tyler2016
Member
 
Registered: Sep 2018
Distribution: Debian, CentOS, FreeBSD
Posts: 204

Rep: Reputation: Disabled
Is this a homework question?

All a regular expression does is match characters. You need something that takes action when a match occurs, hence my initial thoughts and pan64's post. Formally, a regular expression can be implemented as a deterministic finite state automaton (DFA). If you don't have a CS background, that isn't as complicated as it sounds. What this means is that a formal regex has no memory and takes no actions (practical engines such as PCRE add backreferences like \1, which go beyond a pure DFA, but they still only match). All a regex processor does is take an input and attempt to run it through the equivalent of a DFA. If it ends up in an accepting state, you have a match; if not, you don't.

Hopefully this makes sense to you.
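
To make the match-versus-action distinction concrete, here is a small Python sketch. The pattern is the one from post #1; the sample sentence and variable names are made up for illustration. `finditer` only reports where the pattern occurs, while `sub` is the step that actually changes the text.

```python
import re

# Pattern from post #1; the sample text is hypothetical.
PATTERN = re.compile(r"([a-zA-Z]+)\s+\1")
text = "it was the the end"

# Matching alone only reports positions; the text itself is untouched.
matches = [(m.start(), m.group(0)) for m in PATTERN.finditer(text)]
print(matches)  # [(7, 'the the')]

# Substitution is the "action": each match is replaced by its first group.
print(PATTERN.sub(r"\1", text))  # it was the end
```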
 
Old 06-11-2019, 02:47 PM   #7
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,133

Rep: Reputation: 512
An RE is case sensitive by default. Your sample in post #1 works for "the the" but not for "The the".
Furthermore, it wrongly matches "the theme", because nothing forces the repeated text to end at a word boundary.
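
Both problems are easy to reproduce. The following Python sketch runs the pattern from post #1 against the three strings mentioned here, then tries one possible repair (an assumption for illustration, not necessarily what anyone in the thread used): the `re.IGNORECASE` flag plus `\b` word-boundary anchors.

```python
import re

naive = re.compile(r"([a-zA-Z]+)\s+\1")

print(bool(naive.search("the the")))        # True: exact repeat
print(bool(naive.search("The the")))        # False: case differs
print(naive.search("the theme").group(0))   # 'the the': partial-word match

# One possible repair: ignore case and anchor both words at word boundaries.
fixed = re.compile(r"\b([a-z]+)\s+\1\b", re.IGNORECASE)

print(bool(fixed.search("The the")))        # True
print(fixed.search("the theme"))            # None
```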
 
Old 06-11-2019, 03:00 PM   #8
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=14, FreeBSD_12{.0|.1}
Posts: 5,157
Blog Entries: 11

Rep: Reputation: 3077
You are on the right track, but as noted by others you need a little more than just the regex.

If you use your regular expression with sed and a replacement expression it should work with only a little touch up.

Something like this example I just wrote...

Code:
$ cat example.txt
A sentence with repeated    repeated words separated by one or more spaces in the mix mix.

$ sed -r 's/(BACKREFERENCE_EXPN)\s*\1/REPLACE_EXPN/g' example.txt
A sentence with repeated words separated by one or more spaces in the mix.
I left the BACKREFERENCE_EXPN and the replacement REPLACE_EXPN as an exercise for you. Hint: Not far from what you already have!

UPDATE: As MadeInGermany points out, you also need to make it case insensitive, which is easy with a simple sed option. My own BACKREFERENCE_EXPN, slightly different from yours, handles the word boundary problem - also an exercise for the student!

UPDATE2: Your specification is somewhat vague by saying "repetitive duplicate words". Duplicate means exactly two, whereas repetitive means two or more. Another exercise for the student - handle any number of repeated words.
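
The difference between "exactly two" and "two or more" can be seen in a short Python sketch. The two patterns and the sample text are illustrative assumptions, not the sed expressions left as exercises above: one pattern matches a single repeat, the other wraps the repeat in a group followed by `+` so that a whole run is matched at once.

```python
import re

text = "go go go now"  # hypothetical input with a triple repeat

# Matches a word followed by exactly one repeat of itself.
pair_only = re.compile(r"\b([a-z]+)\s+\1\b", re.IGNORECASE)
# Matches a word followed by one or more repeats of itself.
any_run = re.compile(r"\b([a-z]+)(?:\s+\1\b)+", re.IGNORECASE)

print(pair_only.sub(r"\1", text))  # go go now
print(any_run.sub(r"\1", text))    # go now
```

With the pair-only pattern, the first two words are collapsed into one and scanning resumes after them, so the third copy survives.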

Last edited by astrogeek; 06-11-2019 at 03:24 PM. Reason: tpoys
 
Old 06-11-2019, 10:17 PM   #9
ddenial
Member
 
Registered: Dec 2016
Distribution: CentOS, Fedora, Ubuntu
Posts: 177

Original Poster
Rep: Reputation: 35
Resolved, finally.

I got some clues from these posts
https://stackoverflow.com/questions/...displaying-the
http://shrenoid.com/hackerrank-prblm...iwords-solutn/
https://www.regular-expressions.info/modifiers.html

So the RegEx to find repetitive words is
Code:
(?i)\b([a-z]+)\b(?:\s+\1\b)+
Here is the RegEx101 link: https://regex101.com/r/f0AKe5/3
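
For anyone who wants to actually remove the repeats rather than just find them, here is a minimal Python sketch using the same pattern. The function name `remove_repeated_words` and the sample text are made up for illustration; Python's `re` module accepts the inline `(?i)` flag at the start of a pattern, and `\s+` also matches newlines, so repeats that span a line break are caught.

```python
import re

# Pattern from the post above: a word followed by one or more repeats,
# compared case-insensitively and separated by any whitespace.
DUPLICATES = re.compile(r"(?i)\b([a-z]+)\b(?:\s+\1\b)+")


def remove_repeated_words(text):
    """Replace each run of repeated words with its first occurrence."""
    return DUPLICATES.sub(r"\1", text)


sample = "Paris in the\nTHE spring, and and   AND so on."  # hypothetical text
print(remove_repeated_words(sample))  # Paris in the spring, and so on.
```

The replacement keeps the first word's capitalization and drops the whitespace inside the run, so a line break between the repeated words disappears along with the duplicate.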

Thanks, everybody.
 
Old 06-11-2019, 11:56 PM   #10
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.10, Centos 7.5
Posts: 17,645

Rep: Reputation: 2482
This is THE book (imho) for regexes: http://regex.info/book.html (and that exercise is used as an example, partly because the author used it to check the book before sending it to the printers ...)
 