Finding and removing duplicate consecutive words
I have a problem, and in my opinion a big one: I want to find and remove duplicate consecutive words from a text file. I tried working with arrays, but that is very difficult. Then I tried sed, and somebody gave me this hint:
sed ':f;N;$!bf; s/\b\(.*\)\n\1\b/\1\n/g; s/\b\(.*\)\b\1\b/\1/g'
It works fine, but if I have 3 consecutive identical words it only removes the first one and the last two remain intact. Any suggestions? Please help me. Thanks |
Run it twice?
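Running it twice generalises to looping until nothing changes any more, which sed can do with a label and the `t` command. A minimal sketch, assuming a sed that accepts `-E` with back-references (GNU and BSD sed both do) and treating a "word" as a run of letters and digits. `squeeze_repeats` is a hypothetical helper name, and unlike the original command it works line by line, so a run that continues across a line break is not collapsed:

```shell
# Sketch only: collapse runs of identical consecutive words on each line.
# squeeze_repeats is a made-up name, not something from this thread.
squeeze_repeats() {
  sed -E \
    -e ':again' \
    -e 's/(^|[^[:alnum:]])([[:alnum:]]+)[[:space:]]+\2([^[:alnum:]]|$)/\1\2\3/' \
    -e 't again'
}

# Each pass removes one repeat; "t again" jumps back while a substitution succeeded.
printf 'ana are are are are mere mere ion are prune prune.\n' | squeeze_repeats
# prints: ana are mere ion are prune.
```

The character classes on either side of the word stand in for `\b`, so "are area" is left alone and punctuation after the last copy survives.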
|
I think you can put the regex in "greedy" mode, it will then match the longest possible substring. But I'm no regex expert. Try searching for "lazy" vs. "greedy" in regular expressions.
|
1) Please use [code][/code] tags around your code, to preserve formatting and to improve readability.
2) Give us a real example of the input text and desired output, so we can properly test possible solutions.
3) This may be something better done with awk.
Edit: Played around a bit and came up with this: Code:
To explain the code: The BEGIN block sets the record separator to strings of whitespace (including newlines), so that each word is processed individually. The output record separator is nulled because RT will be used instead. Also declare a variable "x".
Then test to see if the current word equals x (the previous word). Only print if it doesn't match. Outside the test, re-insert the whitespace string that delimited the word (that's what RT holds), so that as much of the original formatting as possible is kept. In other words, this will remove the duplicate words, but not the space around them. Finally, store the current word as x for the next loop.
Edit2: If you want the separating space to be removed as well, you can move RT into the test block. This means that trailing space will not get printed after duplicate words. Code:
awk 'BEGIN{ RS="[[:space:]]+" ; ORS="" } ( $0 != x ) { print $0 RT }{ x=$0 }' |
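The record-separator approach explained above can be spelled out as a small shell function. This is a sketch, not the poster's script: it assumes GNU awk (RT is a gawk extension), and `dedupe_words` and its `keep_space` argument are hypothetical names. Passing 1 re-prints the whitespace after dropped duplicates (the first variant described), passing 0 drops it (the Edit2 variant):

```shell
# Sketch only; needs gawk because RT is not available in other awks.
# Example call: printf 'ana are are mere\n' | dedupe_words 0
dedupe_words() {
  gawk -v keep_space="$1" '
    BEGIN { RS = "[[:space:]]+"; ORS = "" }   # one word per record, no automatic newline
    {
      is_new = (NR == 1 || $0 != prev)        # differs from the previous word?
      if (is_new) print $0
      if (is_new || keep_space) print RT      # RT = the whitespace that ended this record
      prev = $0
    }
  '
}
```

Because the comparison is on whole records, "prune" and "prune." still count as different words in this sketch.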
For example, my input text looks like this:
"ana are are are are mere mere ion are prune prune." and the output that I want is: "ana are mere ion are prune." |
Thank you for the example text. It's exposed a fatal flaw in the solutions so far: punctuation.
Testing whether a simple string is the same as the one before it isn't too hard, but since "prune" and "prune." are not equal strings, both of them will remain. I'm trying to think of some way to work around this, but I'm worried that it may end up being almost too complex to handle. So far I've tried stripping punctuation from both strings for testing, but that keeps only the first word of each run, and so the final period is lost. Reversing it to print the final instance of a word is a much more complex operation. And what should it do if two duplicates of a word have different punctuation, such as one with a comma and one with a period? |
This is close, but it has the same issue with punctuation: it only prints the first occurrence, so in this case "prune" followed by a space gets printed instead of "prune" followed by ".": Code:
awk 'BEGIN{RS="[ \n[:punct:]]"}{ORS = RT}!_[$0]++' file |
So I thought a bit, and it seems that if we are editing, say, a novel, where punctuation falls in all sorts of places, it may be too tricky.
However, using the current example, the following works (I also tested by adding a word after the last punctuation and it seemed OK too): Code:
awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(!_[$0]++){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0}END{if(x ~ /[[:punct:]]/ && !y)print x}' file |
Quote:
Code:
markus@samsung:~/Programmierung/awk$ awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(!_[$0]++){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0}END{if(x ~ /[[:punct:]]/ && !y)print x}' bsp.txt
Markus |
I think I may have (mostly) cracked it.
The following awk script breaks up any words that end in punctuation marks into two variables. Then it doesn't print anything immediately, but stores the values until the next word is tested. Only if the next word fails to match does the stored value+punctuation get printed. Since this doesn't work on the final word of the file, a simple statement to print that is added at the end.
Caveats I know of so far are:
1) Only punctuation at the end of the word is tested for. If there's anything at the beginning or in the middle of the words that's different, they won't match.
2) Only the punctuation at the end of the final word in the series, if any, is printed.
3) It will delete any line breaks, along with all other whitespace, that come in the middle of a run of consecutive words.
Code:
#!/usr/bin/awk -f
(Ignore my previous edit if you saw it. I was wrong, and have reinstated the original.) |
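The store-and-print-later idea described above (hold each word back until the next one is known to differ, then flush the last one at END) might look roughly like this. It is a sketch under the same assumptions as the description, not the script from the post: it needs GNU awk for RT, only trailing punctuation is stripped before comparing, and `dedupe_keep_last` plus the `held_*` variable names are hypothetical:

```shell
# Sketch only; needs gawk. Keeps the LAST word of each run, with its punctuation.
dedupe_keep_last() {
  gawk '
    BEGIN { RS = "[[:space:]]+"; ORS = "" }
    {
      core = $0
      sub(/[[:punct:]]+$/, "", core)            # compare words without trailing punctuation
      if (NR > 1 && core != held_core)          # the run ended, so flush the held word
        printf "%s%s", held_word, held_sep
      held_word = $0; held_core = core; held_sep = RT
    }
    END { if (NR > 0) printf "%s%s", held_word, held_sep }   # the final word is still held
  '
}
```

On the example text this yields "ana are mere ion are prune." because the held "prune" is replaced by "prune." before anything is printed. Only the separator after the last word of a run is kept, which matches caveat 3.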
Hey Markus ... thanks for the heads up .. I was off in my own little world :)
Simple enough edit though: Code:
awk 'BEGIN{RS="[ \n[:punct:]]+";ORS=""}{if(a != $0){print x$0;y=1}if(x ~ /[[:punct:]]/ && !y)print x;x=RT;y=0;a=$0}END{if(x ~ /[[:punct:]]/ && !y)print x}' f1
Code:
awk 'BEGIN{RS="[[:space:]]+";ORS=""}match($0,/^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/,f){if(x != f[2]){print y$0;z = FNR}x = f[2];y = RT}END{if(z != FNR)print f[3]"\n"}' file |
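The switch from `!_[$0]++` to `a != $0` in the first command above changes what counts as a duplicate: the array remembers every word seen so far, so a later, non-consecutive repeat (the second "are" in the example text) is dropped too, while the comparison only looks at the previous word. A small illustration, assuming one word per line to keep it independent of RS and RT; both function names are hypothetical:

```shell
# Sketch only: both helpers expect one word per line.
drop_all_repeats() {
  awk '!seen[$0]++'            # true only the first time a word is ever seen
}

drop_consecutive_repeats() {
  awk 'NR == 1 || $0 != prev { print }
       { prev = $0 }'          # remember only the previous word
}

printf 'are\nare\nmere\nare\n' | drop_all_repeats          # are, mere
printf 'are\nare\nmere\nare\n' | drop_consecutive_repeats  # are, mere, are
```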
Grail, your revised version is pretty good. It seems to keep the first punctuation mark it finds, whereas mine keeps the last. Who can say which is better? But there's still a bit of trouble at the end of yours.
Code:
$ cat fileA.txt
Edit: Here's another limitation I found, in your match function. It may break if there's punctuation in the middle of the string. Code:
cat fileA.txt |
As a semi-off-topic question, I've been trying to re-format your one-liner into a stand-alone awk script (so I can understand it better), but it's not working.
Code:
#!/usr/bin/awk -f
:scratch: |
It is your match line. I find with awk it is a good habit to treat the start of all curly braces as you do with BEGIN / END, i.e. have them start immediately after the test.
Otherwise the block that begins with the lone opening brace is run on all lines, which is not the desired effect. Change is: Code:
match( $0 , /^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/ , f ){
With that change the output is correct. Also, I came up with plenty of scenarios where, depending on what is required, the output is completely wrong :( |
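The brace-placement point can be checked with a two-line input. In this sketch (the function names and the `/a/` pattern are made up for illustration) the only difference between the two programs is whether the opening brace follows the pattern on the same line:

```shell
# Pattern and action on the same line: the action runs only for matching records.
same_line() {
  awk '/a/ { print "hit: " $0 }'
}

# Opening brace on the next line: awk now sees two rules. The bare pattern
# prints matching records unchanged, and the block runs for every record.
next_line() {
  awk '/a/
{ print "hit: " $0 }'
}

printf 'a\nb\n' | same_line   # hit: a
printf 'a\nb\n' | next_line   # a, hit: a, hit: b
```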
Thanks David, it works well, but there is still a little problem. For example, for the input file:
"ana ana are are are mere mere si portocale ion are prune prune,prune. " the output will be: "ana are mere si portocale ion are prune". The second line will be concatenated with the first (\n is removed) and the dot from "prune" is missing (no matter, it is not so important :D). Thanks |