finding and removing duplicate consecutive words

crts · 05-04-2011, 11:38 AM

Hi,

I'd like to suggest an alternative sed:

Code:

sed -rn ':a $! {N;ba}; s/([^ [:punct:]\n]+)([ [:punct:]\n]+\1)+/\1/gp' file

I ran the following test:

Code:

$ cat file
ana ana are are are mere,mere ,
, ,mere si portocale
portocale.
ion are prune prune, prune?prune,,prune.
banana are prune prune, prune?prune,,prune.

$ sed -rn ':a $! {N;ba}; s/([^ [:punct:]\n]+)([ [:punct:]\n]+\1)*/\1/gp' file
ana are mere si portocale.
ion are prune.
banana are prune.

Seems ok, if the concatenation is not a big issue.
The awks had trouble with the above format:

Code:

$ awk 'BEGIN{RS="[[:space:]]+";ORS=""}match($0,/^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/,f){if(x != f[2]){print y$0;z = FNR}x = f[2];y = RT}END{if(z != FNR)print f[3]"\n"}' file
ana are ,mere si portocale
ion are prune banana are prune

$ ./david_awk.scr file
ana are mere,mere ,, ,mere si portocale.
ion are prune, prune?prune,,prune.
banana are prune, prune?prune,,prune.$
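
One caveat with the sed one-liner above: as far as I can tell, nothing anchors the match to whole words, so an input like "banana ana" would collapse to "banana", because the trailing "ana" of the first word counts as a repeat. A commented variant is sketched below. It assumes GNU sed (for -E, \n inside brackets, and the \< \> word anchors), drops -n and the p flag so that input without duplicates is still printed, and the function name dedupe_words is made up for illustration.

```shell
dedupe_words() {
  # Assumes GNU sed: -E (same as -r), \n inside brackets, \< and \>.
  sed -E '
    # Slurp the whole input into the pattern space so that
    # duplicates spanning a line break can be matched too.
    :a
    $!{
      N
      ba
    }
    # \<(word)        a whole word, captured as \1
    # (seps \1\>)+    one or more repeats of that word, each preceded
    #                 by spaces, punctuation or newlines
    # The whole run is replaced by a single copy of the word.
    s/\<([^ [:punct:]\n]+)([ [:punct:]\n]+\1\>)+/\1/g
  ' "$@"
}
```

Usage: dedupe_words file, or pipe text into it. As with the original, everything between the duplicates (including line breaks) is dropped.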

David the H. · 05-04-2011, 03:13 PM

Sigh, thanks grail. I could've sworn I tried that. I really hate awk's bracketing syntax. It never seems to do what I think it should do.

Quote:

Originally Posted by cocostaec

the second line will be concatenated with the first (\n is removed) and the dot from "prune" is missing (no matter, it is not so important)
thanks

Yes, that was my third caveat above. You'd have to add some kind of function to count and store newlines between each word, and print them back at the right time. I made a few attempts at doing something like that, but no matter what I did I couldn't get it to work right, so I kind of gave up on it.
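
A different way to keep the line breaks, sketched here rather than taken from the scripts in this thread: read the input line by line with plain awk and carry the last word seen across lines, so no newlines need to be counted or stored. The assumptions are loud ones: words are whitespace-delimited, they are compared after stripping everything except ASCII letters and digits, the first instance of a duplicate is the one kept (not the last), and the name dedupe_keep_lines is made up for illustration.

```shell
dedupe_keep_lines() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      # Comparison key: the field with everything except ASCII
      # letters and digits removed.
      key = $i
      gsub(/[^A-Za-z0-9]/, "", key)
      # Skip a field whose key equals the previous word, even if
      # that word was on an earlier line.
      if (key != "" && key == prev) continue
      if (key != "") prev = key
      out = out (out == "" ? "" : " ") $i
    }
    # One output line per input line, so line breaks survive.
    print out
  }' "$@"
}
```

A line made up only of duplicates comes out empty, and runs like "prune?prune,,prune." are treated as a single field, so this shares the space-delimited limitation mentioned above.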

But I'm not getting the missing period at the end. It works for me. Indeed, all the END code does is print the final field as-is, so it shouldn't remove anything.

@crts Good job. Of course it all depends on the assumptions and trade-offs you make in what to remove and what to keep. Our awk scripts have assumed that each word is space-delimited, with optional punctuation at the end.

Thinking about it a bit, simply adding punctuation to the record separator makes for a short, crisp script that does a pretty good job. It prints only the final instance of a duplicate and its trailing punctuation, if any.

Of course all formatting inside the string of dupes is still lost. Also, since all punctuation is now part of the delimiter, beginning punctuation is handled as part of the previous separator. A string like

--foo :foo, +foo? foo-foo,

will all be condensed into one word, "foo", with "--" being printed in front of it as part of the previous separator, and "," afterwards as part of the trailing separator.

Code:

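#!/usr/bin/awk -f
# Needs gawk: uses a regex RS and the RT variable.

BEGIN{
     # Any run of whitespace and/or punctuation separates two words.
     RS="[[:space:][:punct:]]+"
     ORS=pw=""
}

{
     # A new word has started: print the previous word and its separator.
     if ( $0 != pw ) {
          print pw pRT
     }

     pw=$0
     pRT=RT
}

# Print the last word and whatever follows it.
END{ print $0RT }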

In the end, there's only so much simple scripts like these can accomplish, and it would take a rather complex program to handle all possible situations.

cocostaec · 05-07-2011, 03:29 AM

Let's say that it works fine (I hope the \n won't be a problem), thanks. Now I want to expand the problem: I want to search for blocks of identical consecutive strings. I think that sed and awk won't work in this situation. I've tried working with "for" but I don't know how to handle the blocks. I think I have to compare the first 2 words, then the first 2 words with the next 2 words, and so on; then the second word with the third, then words 2 and 3 with words 4 and 5, and so on. It seems to be difficult...

markush · 05-07-2011, 01:25 PM

I'd recommend taking a look at Perl.

Referring to the for-loop: you will have to use two for-loops. The outer one counts the number of words in a block, whereas the inner one examines the possible blocks. With n being the number of words in the whole text, you may take the following as pseudocode (words are numbered from 1):

Code:

for (words=2; words<=n/2; words++)
   for (start=1; start<=n-2*words+1; start++)
       compare block1 (from start to start+words-1) with block2 (from start+words to start+2*words-1)
   next start
next words

This will become more complicated if you want blocks to match when their words are in the same order but the separators between the words differ.
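
The two-loop idea above can be sketched in plain awk, which is enough as long as all words fit in memory. This is only a minimal sketch: the function name find_repeated_blocks is made up for illustration, words are assumed to be whitespace-delimited and compared exactly (punctuation included), and it only reports where a block is immediately repeated instead of removing anything.

```shell
find_repeated_blocks() {
  awk '
    # Collect every word of the input into w[1..n].
    { for (i = 1; i <= NF; i++) w[++n] = $i }

    END {
      # len: number of words per block; start: first word of block 1.
      for (len = 2; len <= n / 2; len++)
        for (start = 1; start <= n - 2 * len + 1; start++) {
          # Block 1 is w[start .. start+len-1],
          # block 2 is w[start+len .. start+2*len-1].
          same = 1
          for (k = 0; k < len; k++)
            if (w[start + k] != w[start + len + k]) { same = 0; break }
          if (same) {
            block = w[start]
            for (k = 1; k < len; k++) block = block " " w[start + k]
            # Report the position of the first word and the block itself.
            print start ": " block
          }
        }
    }' "$@"
}
```

For the input "a b a b c" followed by a second line "d c d x", this prints "1: a b" and "5: c d". The second hit spans the line break, since the words are numbered across the whole text.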

BTW, you should use the report-button and ask the moderators to move this thread into the Programming-section of LQ.

Markus