finding and removing duplicate consecutive words

cocostaec · 05-02-2011, 10:07 AM

I have a problem, in my opinion a big one. I want to find and remove duplicate consecutive words from a text file. I've tried working with arrays, but that is very difficult. Then I tried sed, and somebody gave me this command as a hint:
sed ':f;N;$!bf; s/\b\(.*\)\n\1\b/\1\n/g; s/\b\(.*\)\b\1\b/\1/g'
It works fine, but if I have 3 consecutive identical words it only removes the first one and the last two remain intact. Any suggestions? Please help me.

thanks

penguiniator · 05-02-2011, 10:09 AM

Run it twice?
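
Running it twice covers three in a row, but a longer run needs more passes, because the g flag only finds non-overlapping matches in a single sweep. One way around that is to let sed repeat a single substitution until nothing changes. A minimal sketch, assuming the duplicates sit on one line and are separated by single spaces (the function name dedupe_line is made up for illustration, and this is not the command from the first post):

```shell
# Collapse runs of identical space-separated words on each line.
# Assumes single spaces between words; duplicates that are split
# across two lines are not handled by this sketch.
dedupe_line() {
  sed -e 's/^/ /' -e 's/$/ /' \
      -e ':again' \
      -e 's/ \([^ ][^ ]*\) \1 / \1 /' \
      -e 't again' \
      -e 's/^ //' -e 's/ $//'
}

printf 'ana are are are are mere\n' | dedupe_line
```

The line is padded with a space at each end so that every word is delimited by spaces, which keeps "are" from matching the start of a longer word such as "arena". The t command jumps back to the label for as long as the substitution keeps succeeding.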

cepheus11 · 05-02-2011, 10:26 AM

I think you can put the regex in "greedy" mode, it will then match the longest possible substring. But I'm no regex expert. Try searching for "lazy" vs. "greedy" in regular expressions.

David the H. · 05-02-2011, 10:57 AM

1) Please use [code][/code] tags around your code, to preserve formatting and to improve readability.

2) Give us a real example of the input text and desired output, so we can properly test possible solutions.

3) This may be something better done with awk.

Edit: Played around a bit and came up with this:

Code:

awk 'BEGIN{ RS="[[:space:]]+" ; ORS="" ; x="" } ( $0 != x ) { print } { print RT; x=$0 }'

This requires GNU awk, as it uses the non-standard RT variable (grail). It seems to work well in my tests, removing any number of duplicated words, but I can't guarantee that it will work perfectly in every situation.

To explain the code:

The BEGIN block sets the record separator to strings of whitespace (including newlines), so that each word is processed individually.

The output record separator is nulled because RT will be used instead. Also declare a variable "x".

Then test to see if the current word equals x (the previous word). Only print if it doesn't match.

Outside the test, re-insert the whitespace string that delimited the word (that's what RT holds), so that as much of the original formatting as possible is kept. In other words, this will remove the duplicate words, but not the space around them.

Finally, store the current word as x for the next loop.
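
For an awk without RT, the same compare-with-the-previous-word idea can be sketched with ordinary field splitting. This is only an approximation of the one-liner above: it assumes that collapsing each run of spaces to a single space is acceptable, and the function name dedupe_words is invented for the example.

```shell
# Portable (non-gawk) sketch: drop a word when it equals the word before it,
# even if the previous word was the last one on the line above.
dedupe_words() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      word = $i ""                      # force string (not numeric) comparison
      if (word != prev)
        out = (out == "" ? word : out " " word)
      prev = word
    }
    # keep blank input lines, but drop lines that held only duplicates
    if (out != "" || NF == 0) print out
  }' "$@"
}

printf 'ana are are are are mere\nmere\nion are prune prune\n' | dedupe_words
```

Because prev is carried over from one line to the next, a line consisting only of a repeat of the previous line's last word disappears completely.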

Edit2: If you want the separating space to be removed as well, you can move RT into the test block. This means that trailing space will not get printed after duplicate words.

Code:

awk 'BEGIN{ RS="[[:space:]]+" ; ORS="" } ( $0 != x ) { print $0 RT }{ x=$0 }'

This might do things like remove newlines, however.

cocostaec · 05-02-2011, 11:10 PM

For example, my input text looks like the following:
"ana are are are are mere
mere
ion are prune prune."
and the output that I want is:
"ana are mere
ion are prune."

David the H. · 05-03-2011, 05:19 AM

Thank you for the example text. It's exposed a fatal flaw in the solutions so far: punctuation.

Testing whether a simple string is the same as the one before it isn't too hard, but since "prune" and "prune." are not equal strings, both of them will remain.

I'm trying to think of some way to work around this, but I'm worried that it may end up being almost too complex to handle.

So far I've tried stripping punctuation from both strings for testing, but that leaves only the first word in the file, and so the final period is lost. Reversing it to print the final instance of a word is a much more complex operation.

And what should it do if two duplicates of a word have different punctuation, such as one with a comma and one with a period?

grail · 05-03-2011, 05:33 AM

This is close, but it has the same issue with punctuation: it now prints only the first occurrence, so in this case the prune followed by a space gets printed, not the prune followed by .":

Code:

awk 'BEGIN{RS="[ \n[:punct:]]"}{ORS = RT}!_[$0]++' file

grail · 05-03-2011, 06:26 AM

So I thought a bit, and it seems that if we are editing, say, a novel where punctuation falls in all sorts of places, it may be too tricky.
However, using the current example, the following works (I also tested by adding a word after the last punctuation and it seemed OK too):

Code:

awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(!_[$0]++){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0}END{if(x ~ /[[:punct:]]/ && !y)print x}' file

My solution was to append the previous separator to the next record and then check for punctuation and print if found.

markush · 05-03-2011, 06:28 AM

Quote:

Originally Posted by grail

Code:

awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(!_[$0]++){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0}END{if(x ~ /[[:punct:]]/ && !y)print x}' file

This doesn't work for me:

Code:

markus@samsung:~/Programmierung/awk$ awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(!_[$0]++){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0}END{if(x ~ /[[:punct:]]/ && !y)print x}' bsp.txt
"ana are mere
ion prune."
markus@samsung:~/Programmierung/awk$

I thought only consecutive duplicates should be removed, not all duplicates.
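
The difference comes down to what the test remembers. !_[$0]++ remembers every word seen so far, so a word is dropped wherever it reappears, while comparing against only the previous word drops just the immediate repeats. A small sketch with one word per line (the function names are made up for illustration, and plain awk is enough here):

```shell
# Drops every later occurrence of a word, wherever it appears.
drop_all_repeats() { awk '!seen[$0]++'; }

# Drops a word only when it repeats the line directly before it.
drop_consecutive_repeats() {
  awk 'NR == 1 || ($0 "") != prev { print } { prev = $0 "" }'
}

printf 'ion\nare\nare\nion\nare\n' | drop_all_repeats           # ion, are
printf 'ion\nare\nare\nion\nare\n' | drop_consecutive_repeats   # ion, are, ion, are
```

That is why the second "are" in "ion are prune" vanished in the output above: it had already been counted on the first line.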

Markus

David the H. · 05-03-2011, 07:38 AM

I think I may have (mostly) cracked it.

The following awk script breaks up any words that end in punctuation marks into two variables. Then it doesn't print anything immediately, but stores the values until the next word is tested. Only if the next word fails to match does the stored value+punctuation get printed.

Since this doesn't work on the final word of the file, a simple statement to print that is added at the end.

Caveats I know of so far are:

1) Only punctuation at the end of the word is tested for. If there's anything at the beginning or in the middle of the words that's different, they won't match.

2) Only the punctuation at the end of the final word in the series, if any, is printed.

3) It will delete any line breaks, along with all other whitespace, that come in the middle of a run of consecutive words.

Code:

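#!/usr/bin/awk -f
# Requires GNU awk (RT, gensub).

BEGIN{
     RS="[[:space:]]+"     # each whitespace-delimited word is one record
     ORS=pw=""             # no output separator; pw = previous word
}

{
# Split any trailing punctuation (p) off the word (w).
# gensub takes (regexp, replacement, how, target); how=1 replaces the first match.
if ( $0 ~ /.*[[:punct:]]$/ ){
     w=gensub(/(.*[^[:punct:]])[[:punct:]]+$/,"\\1",1,$0)
     p=gensub(/.*[^[:punct:]]([[:punct:]]+)$/,"\\1",1,$0)
     }

else {
     w=$0
     p=""
     }

# A new word has started: print the stored word, its punctuation and its spacing.
if ( w != pw ) {
     print (pw)(pp)(pRT)
     }

# Remember the current word for the next record.
pw=w
pp=p
pRT=RT
}

# The last word is still pending when the input ends.
END{ print $0 }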

I'd appreciate some testing, and any hints on how to improve it. Thanks!

(Ignore my previous edit if you saw it. I was wrong, and have reinstated the original.)

grail · 05-03-2011, 09:06 AM

Hey Markus ... thanks for the heads up .. I was off in my own little world

Simple enough edit though:

Code:

awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(a != $0){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0;a=$0}END{if(x ~ /[[:punct:]]/ && !y)print x}' f1

Edit: Here is another way to look at it (and slightly cleaner):

Code:

awk 'BEGIN{RS="[[:space:]]+";ORS=""}match($0,/^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/,f){if(x != f[2]){print y$0;z = FNR}x = f[2];y = RT}END{if(z != FNR)print f[3]"\n"}' file

Edit 2: David - your code has an issue if the first word (ana in this case) is repeated

missed your caveats ... ignore me

David the H. · 05-03-2011, 09:57 AM

Grail, your revised version is pretty good. It seems to keep the first punctuation mark it finds, whereas mine keeps the last. Who can say which is better? But there's still a bit of trouble at the end of yours.

Code:

$ cat fileA.txt
ana ana are:
are, are are mere--
mere mere,
ion are prune, prune, prune.

$ grails_script.sh fileA.txt
ana are: mere--
ion are prune,.

$ davids_script.sh fileA.txt
ana are mere,
ion are prune.

Edit: here's another limitation I found, in your match function.
It may break if there's punctuation in the middle of the string.

Code:

$ cat fileA.txt
ana ana are:
are, are are mere--
me.re me.ntos,
ion are prune, prune, prune.

$ grails_script.sh fileA.txt
ana are: mere--
ion are prune,.

mere--, me.re, and me.ntos should rightly be treated as separate entities, I think. But the match regex means it only compares the characters before the first punctuation mark.
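
One way to keep me.re and me.ntos apart while still matching "prune," with "prune." is to compare a key that has punctuation stripped only from its two ends. A rough sketch of that comparison, not a fix for either script: the function name same_word is hypothetical, and the punctuation set is spelled out by hand (an assumption, extend it as needed) so that it does not depend on [[:punct:]] support.

```shell
# Exit status 0 if the two arguments are the same word once leading and
# trailing punctuation is ignored; punctuation inside the word still counts.
same_word() {
  awk -v a="$1" -v b="$2" '
    function key(s) {
      sub(/^[.,;:!?-]+/, "", s)   # strip punctuation at the start
      sub(/[.,;:!?-]+$/, "", s)   # strip punctuation at the end
      return s ""                 # return a string so the comparison is textual
    }
    BEGIN { exit !(key(a) == key(b)) }'
}

same_word 'prune,' 'prune.'   && echo 'prune, and prune. count as the same word'
same_word 'me.re'  'me.ntos,' || echo 'me.re and me.ntos, stay different'
```

Which punctuation mark to keep in the output when two duplicates disagree is still a separate decision.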

David the H. · 05-03-2011, 10:42 AM

As a semi-off-topic question, I've been trying to re-format your one-liner into a stand-alone awk script (so I can understand it better), but it's not working.

Code:

#!/usr/bin/awk -f

BEGIN{
RS="[[:space:]]+"
ORS=""
}

match( $0 , /^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/ , f )

{
if ( x != f[2] ) {
     print y$0
     z=FNR
     }

x=f[2]
y=RT
}

END{
if ( z != FNR ) print f[3]"\n"
}

I can't see where I've changed anything significant, but it doesn't remove anything and the output is all compacted.

grail · 05-03-2011, 10:45 PM

It is your match line. I find with awk it is a good habit to treat the start of all curly braces as you do with BEGIN / END, i.e. have them start immediately after the test.
Otherwise, the lone opening curly brace is enacted on all lines, which is not the desired effect.
The change is:

Code:

match( $0 , /^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/ , f ){
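
The rule behind this: in awk a pattern and its action must start on the same line. A pattern on a line by itself is a complete rule with the default action (print the record), and a { ... } block on a line by itself has no pattern, so it runs for every record. A tiny demonstration with made-up input and invented function names:

```shell
# Pattern and brace on separate lines: two independent rules.
split_rules() {
  awk '
/a/
{ print "action:", $0 }'
}

# Brace on the pattern line: one rule, action only for matching records.
joined_rule() {
  awk '/a/ { print "action:", $0 }'
}

printf 'a\nb\n' | split_rules    # prints three lines: a / action: a / action: b
printf 'a\nb\n' | joined_rule    # prints one line:    action: a
```

In the script above, the lone match(...) line therefore printed every matching word itself, with the empty ORS, and the block below it ran for every record. That fits the compacted output with nothing removed.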

I am going to hold off on further solutions until the OP provides more details, because some of the outcomes my earlier script produces are correct going by the original request.
There are also plenty of scenarios I came up with where, depending on what is required, the output is completely wrong.

cocostaec · 05-04-2011, 09:53 AM

Thanks David, it works well, but there is still a little problem. For example, for the input file:
"ana ana are are are mere
mere si portocale
ion are prune prune,prune.
"
the output will be:
"ana are mere si portocale
ion are prune"
The second line gets concatenated with the first (the \n is removed) and the dot after "prune" is missing (no matter, it is not so important).

Thanks