Sigh, thanks grail. I could've sworn I tried that. I really hate awk's bracketing syntax. It never seems to do what I think it should do.
Quote:
Originally Posted by cocostaec
the second line will be concatenated with the first (\n is removed) and the dot from "prune" is missing (no matter, it is not so important)
thanks
Yes, that was my third caveat above. You'd have to add some kind of function to count and store newlines between each word, and print them back at the right time. I made a few attempts at doing something like that, but no matter what I did I couldn't get it to work right, so I kind of gave up on it.
But I'm not getting the missing period at the end. It works for me. Indeed, all the END code does is print the final field as-is, so it shouldn't remove anything.
@crts Good job. Of course it all depends on the assumptions and trade-offs you make in what to remove and what to keep. Our awk scripts have assumed that each word is space-delimited, with optional punctuation at the end.
Thinking about it a bit, simply adding punctuation to the record separator makes for a short, crisp script that does a pretty good job. It prints only the final instance of a duplicate and its trailing punctuation, if any.
Of course all formatting inside the string of dupes is still lost. Also, since all punctuation is now part of the delimiter, beginning punctuation is handled as part of the previous separator. A string like
--foo :foo, +foo? foo-foo,
will all be condensed into one word, "foo", with "--" being printed in front of it as part of the previous separator, and "," afterwards as part of the trailing separator.
Code:
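#!/usr/bin/awk -f
# Needs gawk: RT (the text that matched RS) is a gawk extension.
BEGIN{
    # Any run of whitespace and/or punctuation ends a record,
    # so each record is a single bare word.
    RS="[[:space:][:punct:]]+"
    # No extra output separator; no previous word yet.
    ORS=pw=""
}
{
    # When the word changes, print the previous word and the
    # separator that followed it.
    if ( $0 != pw ) {
        print (pw)(pRT)
    }
    # Remember this word and its separator for the next record.
    pw=$0
    pRT=RT
}
# Print the last word and its separator.
END{ print $0RT }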
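If gawk isn't available (RT is gawk-only), roughly the same idea can be sketched in plain POSIX awk by walking each line with match(). This is only a rough sketch under a couple of assumptions that are not part of the script above: the function name dedupe_words is made up for illustration, and a "word" is taken to be a run of ASCII letters and digits, with everything else (including the newline) treated as separator.

```shell
#!/bin/sh
# dedupe_words: hypothetical helper name, not from the original script.
# Collapses runs of a repeated word, keeping the last instance and the
# separator that follows it. Assumes words are runs of ASCII letters/digits.
dedupe_words() {
    LC_ALL=C awk '
    {
        line = $0 "\n"      # put back the newline awk stripped from the record
        while (line != "") {
            # Take a leading word, if the line starts with one.
            word = ""
            if (match(line, /^[A-Za-z0-9]+/)) {
                word = substr(line, 1, RLENGTH)
                line = substr(line, RLENGTH + 1)
            }
            # Take the separator that follows. This always matches,
            # because the line ends in "\n".
            match(line, /^[^A-Za-z0-9]+/)
            sep = substr(line, 1, RLENGTH)
            line = substr(line, RLENGTH + 1)

            # No word (line began with a separator): extend the pending
            # separator so a separator can span a line break.
            if (word == "") { psep = psep sep; continue }

            # Word changed: flush the previous word and its separator.
            if (word != pw) printf "%s%s", pw, psep
            pw = word
            psep = sep
        }
    }
    # Flush the last word and its separator.
    END { printf "%s%s", pw, psep }
    '
}

# Usage: printf "%s\n" "--foo :foo, +foo? foo-foo," | dedupe_words
```

Fed the sample string above, this should give "--foo," just as described. Like the gawk version, it is case-sensitive and throws away all formatting inside a run of dupes. It also always ends the output with a newline, even if the input's last line had none.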
In the end, there's only so much simple scripts like these can accomplish, and it would take a rather complex program to handle all possible situations.