LinuxQuestions.org
Old 05-02-2011, 11:07 AM   #1
cocostaec
LQ Newbie
 
Registered: May 2011
Posts: 9

Rep: Reputation: 0
finding and removing duplicate consecutive words


I have a problem, and in my opinion a big one: I want to find and remove duplicate consecutive words from a text file. I tried working with arrays, but that was very difficult. Then I tried sed, and somebody suggested this command:
Code:
sed ':f;N;$!bf; s/\b\(.*\)\n\1\b/\1\n/g; s/\b\(.*\)\b\1\b/\1/g'
It works fine, but if I have three consecutive identical words it only removes the first one and the last two remain intact. Any suggestions? Please help me.

thanks
 
Old 05-02-2011, 11:09 AM   #2
penguiniator
Member
 
Registered: Feb 2004
Location: Olympia, WA
Distribution: SolydK
Posts: 442
Blog Entries: 3

Rep: Reputation: 60
Run it twice?
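Running the command twice covers three repeats, but longer runs would need more passes. A minimal sketch of the same idea, assuming GNU sed (the -E option and the \b, \w and \s escapes are GNU extensions): let sed repeat the substitution itself with a label and the t command, which branches back for as long as the last substitution changed something. The function name squeeze_repeats is made up for this illustration.

```shell
# squeeze_repeats is a hypothetical helper name; GNU sed is assumed.
squeeze_repeats() {
  # :a;$!{N;ba}   append every remaining line, so the pattern space holds the whole input
  # :b ... tb     redo the substitution until it no longer matches anything
  # The substitution keeps the first word and drops the whitespace and the repeat after it.
  sed -E ':a;$!{N;ba};:b;s/\b(\w+)\s+\1\b/\1/g;tb' "$@"
}

printf 'ana are are are are mere\nmere\nion are prune prune.\n' | squeeze_repeats
```

For this input the sketch prints "ana are mere" on the first line and "ion are prune." on the second. Because the repeat is removed together with the whitespace in front of it, a line break between two identical words disappears along with the second word.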
 
Old 05-02-2011, 11:26 AM   #3
cepheus11
Member
 
Registered: Nov 2010
Location: Germany
Distribution: Gentoo
Posts: 269

Rep: Reputation: 83
I think you can put the regex in "greedy" mode, it will then match the longest possible substring. But I'm no regex expert. Try searching for "lazy" vs. "greedy" in regular expressions.
 
Old 05-02-2011, 11:57 AM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
1) Please use [code][/code] tags around your code, to preserve formatting and to improve readability.

2) Give us a real example of the input text and desired output, so we can properly test possible solutions.

3) This may be something better done with awk.

Edit: Played around a bit and came up with this:

Code:
awk 'BEGIN{ RS="[[:space:]]+" ; ORS="" ; x="" } ( $0 != x ) { print } { print RT; x=$0 }'
This requires GNU awk, as it uses the non-standard RT variable (grail). It seems to work well in my tests, removing any number of duplicated words, but I can't guarantee that it will work perfectly in every situation.

To explain the code:

The BEGIN block sets the record separator to strings of whitespace (including newlines), so that each word is processed individually.

The output record separator is nulled because RT will be used instead. Also declare a variable "x".

Then test to see if the current word equals x (the previous word). Only print if it doesn't match.

Outside the test, re-insert the whitespace string that delimited the word (that's what RT holds), so that as much of the original formatting as possible is kept. In other words, this will remove the duplicate words, but not the space around them.

Finally, store the current word as x for the next loop.

Edit2: If you want the separating space to be removed as well, you can move RT into the test block. This means that trailing space will not get printed after duplicate words.
Code:
awk 'BEGIN{ RS="[[:space:]]+" ; ORS="" } ( $0 != x ) { print $0 RT }{ x=$0 }'
This might do things like remove newlines, however.

Last edited by David the H.; 05-02-2011 at 01:13 PM. Reason: as stated
 
Old 05-03-2011, 12:10 AM   #5
cocostaec
LQ Newbie
 
Registered: May 2011
Posts: 9

Original Poster
Rep: Reputation: 0
For example, my input text looks like this:
"ana are are are are mere
mere
ion are prune prune."
and the output that I want is:
"ana are mere
ion are prune."
 
Old 05-03-2011, 06:19 AM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Thank you for the example text. It's exposed a fatal flaw in the solutions so far: punctuation.

Testing whether a simple string is the same as the one before it isn't too hard, but since "prune" and "prune." are not equal strings, both of them will remain.

I'm trying to think of some way to work around this, but I'm worried that it may end up being almost too complex to handle.

So far I've tried stripping punctuation from both strings for testing, but that leaves only the first word in the file, and so the final period is lost. Reversing it to print the final instance of a word is a much more complex operation.

And what should it do if two duplicates of a word have different punctuation, such as one with a comma and one with a period?
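To make the comparison problem concrete, here is a small sketch in plain awk. It compares two words once as typed and once with any trailing punctuation removed. The function name compare_words and the wording of its output are made up for the illustration.

```shell
# compare_words is a hypothetical helper: $1 and $2 are the two words to compare.
compare_words() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    # Concatenating "" forces a string comparison even for number-like words.
    plain = ((a "") == (b "")) ? "same" : "different"
    sub(/[[:punct:]]+$/, "", a)   # drop trailing punctuation from each word
    sub(/[[:punct:]]+$/, "", b)
    stripped = ((a "") == (b "")) ? "same" : "different"
    print "as typed: " plain ", trailing punctuation ignored: " stripped
  }'
}

compare_words prune prune.
```

This prints "as typed: different, trailing punctuation ignored: same". It only shows the test itself; it does not answer which punctuation mark should survive when the duplicates carry different ones.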
 
Old 05-03-2011, 06:33 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684
This is close, but it has the same issue with punctuation: it now prints only the first occurrence, so in this case "prune" followed by a space gets printed rather than "prune" followed by the period:
Code:
awk 'BEGIN{RS="[ \n[:punct:]]"}{ORS = RT}!_[$0]++' file
 
Old 05-03-2011, 07:26 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684
So I thought about it a bit, and it seems that if we are editing, say, a novel, where punctuation falls in all sorts of places, this may be too tricky.
However, using the current example, the following works (I also tested by adding a word after the last punctuation mark and it seemed OK too):
Code:
awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(!_[$0]++){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0}END{if(x ~ /[[:punct:]]/ && !y)print x}' file
My solution was to append the previous separator to the next record and then check for punctuation and print if found.
 
Old 05-03-2011, 07:28 AM   #9
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
Quote:
Originally Posted by grail
Code:
awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(!_[$0]++){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0}END{if(x ~ /[[:punct:]]/ && !y)print x}' file
This doesn't work for me:
Code:
markus@samsung:~/Programmierung/awk$ awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(!_[$0]++){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0}END{if(x ~ /[[:punct:]]/ && !y)print x}' bsp.txt
"ana are mere
ion prune."
markus@samsung:~/Programmierung/awk$
I thought only consecutive repeats should be removed, not every repeat in the file.

Markus

Last edited by markush; 05-03-2011 at 07:34 AM. Reason: made a mistake
 
Old 05-03-2011, 08:38 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
I think I may have (mostly) cracked it.

The following awk script breaks up any words that end in punctuation marks into two variables. Then it doesn't print anything immediately, but stores the values until the next word is tested. Only if the next word fails to match does the stored value+punctuation get printed.

Since this doesn't work on the final word of the file, a simple statement to print that is added at the end.

Caveats I know of so far are:

1) Only punctuation at the end of the word is tested for. If there's anything at the beginning or in the middle of the words that's different, they won't match.

2) Only the punctuation at the end of the final word in the series, if any, is printed.

3) It will delete any line breaks, along with all other whitespace, that come in the middle of a run of consecutive words.

Code:
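#!/usr/bin/awk -f
# Requires GNU awk: RT and gensub() are gawk extensions.

BEGIN {
    RS = "[[:space:]]+"    # one word per record
    ORS = pw = ""          # no automatic separator; no previous word yet
}

{
    # Split a word ending in punctuation into the word (w) and its punctuation (p).
    # gensub() takes (regexp, replacement, how, target); 1 = replace the first match.
    if ( $0 ~ /[[:punct:]]$/ ) {
        w = gensub(/(.*[^[:punct:]])[[:punct:]]+$/, "\\1", 1, $0)
        p = gensub(/.*[^[:punct:]]([[:punct:]]+)$/, "\\1", 1, $0)
    }
    else {
        w = $0
        p = ""
    }

    # Print the stored word only once a different word follows it.
    if ( w != pw ) {
        print (pw)(pp)(pRT)
    }

    # Store the current word, punctuation and separator for the next record.
    pw = w
    pp = p
    pRT = RT
}

# The final word is still pending here, so print it.
END { print $0 }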
I'd appreciate some testing, and any hints on how to improve it. Thanks!

(Ignore my previous edit if you saw it. I was wrong, and have reinstated the original.)

Last edited by David the H.; 05-03-2011 at 09:47 AM. Reason: As stated
 
Old 05-03-2011, 10:06 AM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684
Hey Markus ... thanks for the heads up .. I was off in my own little world
Simple enough edit though:
Code:
awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(a != $0){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0;a=$0}END{if(x ~ /[[:punct:]]/ && !y)print x}' f1
Edit: Here is another way to look at it (and slightly cleaner):
Code:
awk 'BEGIN{RS="[[:space:]]+";ORS=""}match($0,/^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/,f){if(x != f[2]){print y$0;z = FNR}x = f[2];y = RT}END{if(z != FNR)print f[3]"\n"}' file
Edit 2: David - your code has an issue if the first word (ana in this case) is repeated ... I missed your caveats, ignore me.

Last edited by grail; 05-03-2011 at 10:34 AM.
 
Old 05-03-2011, 10:57 AM   #12
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Grail, your revised version is pretty good. It seems to keep the first punctuation mark it finds, whereas mine keeps the last. Who can say which is better? But there's still a bit of trouble at the end of yours.
Code:
$ cat fileA.txt
ana ana are:
are, are are mere--
mere mere,
ion are prune, prune, prune.

$ grails_script.sh fileA.txt
ana are: mere--
ion are prune,.

$ davids_script.sh fileA.txt
ana are mere,
ion are prune.


Edit: here's another limitation I found, in your match function.
It may break if there's punctuation in the middle of the string.
Code:
$ cat fileA.txt
ana ana are:
are, are are mere--
me.re me.ntos,
ion are prune, prune, prune.

$ grails_script.sh fileA.txt
ana are: mere--
ion are prune,.
mere--, me.re, and me.ntos should rightly be treated as separate entities, I think. But the match regex means it only compares the characters before the first punctuation mark.

Last edited by David the H.; 05-03-2011 at 11:28 AM. Reason: as stated
 
Old 05-03-2011, 11:42 AM   #13
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
As a semi-off-topic question, I've been trying to re-format your one-liner into a stand-alone awk script (so I can understand it better), but it's not working.
Code:
#!/usr/bin/awk -f

BEGIN{
RS="[[:space:]]+"
ORS=""
}

match( $0 , /^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/ , f )

{
if ( x != f[2] ) {
     print y$0
     z=FNR
     }

x=f[2]
y=RT
}

END{
if ( z != FNR ) print f[3]"\n"
}
I can't see where I've changed anything significant, but it doesn't remove anything and the output is all compacted.
 
Old 05-03-2011, 11:45 PM   #14
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684
It is your match line. I find with awk it is a good habit to treat the opening curly brace of every block the way you do with BEGIN / END, i.e. have it start immediately after the test.
Otherwise, the lone opening brace starts a separate block that is run on every record, which is not the desired effect.
The change is:
Code:
match( $0 , /^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/ , f ){
I am going to hold off on further solutions until the OP provides more details, as some of the output my earlier script produces is correct based on the original request.
There are also plenty of scenarios I came up with where, depending on what is required, the output is completely wrong.
 
Old 05-04-2011, 10:53 AM   #15
cocostaec
LQ Newbie
 
Registered: May 2011
Posts: 9

Original Poster
Rep: Reputation: 0
Thanks David, it works well, but there is still a small problem. For example, for the input file:
"ana ana are are are mere
mere si portocale
ion are prune prune,prune.
"
the output will be:
"ana are mere si portocale
ion are prune"
The second line is concatenated with the first (the \n is removed) and the dot after "prune" is missing (no matter, it is not so important).
Thanks
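One way to keep the line breaks is to work line by line instead of word by word. The following is only a sketch under stated assumptions, not a replacement for the scripts above: it uses plain awk (no RT or gensub), keeps the first occurrence of a repeated word together with its punctuation, ignores only trailing punctuation when comparing, and rebuilds each line with single spaces. The name dedupe_words is made up for the illustration.

```shell
# dedupe_words is a hypothetical helper; it reads stdin or the files given as arguments.
dedupe_words() {
  awk '
  NF == 0 { print; next }               # keep blank lines as they are
  {
    out = ""
    for (i = 1; i <= NF; i++) {
      core = $i ""                      # "" forces string comparison
      sub(/[[:punct:]]+$/, "", core)    # ignore trailing punctuation when comparing
      if (core != "" && core == prev)   # consecutive repeat, also across lines: drop it
        continue
      out = (out == "" ? $i : out " " $i)
      prev = core
    }
    if (out != "") print out            # a line made only of repeats disappears
  }' "$@"
}

printf 'ana are are are are mere\nmere\nion are prune prune.\n' | dedupe_words
```

For this input it prints "ana are mere" and then "ion are prune" on a separate line. Known limits of the sketch: the period after a dropped repeat is lost, runs of spaces or tabs inside a line are collapsed, and a token such as "prune,prune." with no space in it is treated as a single word.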

Last edited by cocostaec; 05-04-2011 at 11:34 AM.
 
  

