Linux - Newbie: This Linux forum is for members that are new to Linux.
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
I have a problem, in my opinion a big one: I want to find and remove duplicate consecutive words from a text file. I've tried working with an array, but it is very difficult. Then I tried sed; somebody gave me this sed command as a hint:
sed ':f;N;$!bf; s/\b\(.*\)\n\1\b/\1\n/g; s/\b\(.*\)\b\1\b/\1/g'
It works fine, but if I have 3 consecutive identical words it only removes the first one and the last two remain intact. Any suggestions? Please help me.
I think you can put the regex in "greedy" mode; it will then match the longest possible substring. But I'm no regex expert. Try searching for "lazy" vs. "greedy" in regular expressions.
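For the case of three or more identical words in a row, one possible sketch is to keep repeating a "word followed by the same word" substitution until nothing changes. This assumes GNU sed, since `\w`, `\s` and `\b` are GNU extensions, and the function name `dedupe_words_sed` is made up for illustration.

```shell
# Hypothetical helper: collapse runs of a repeated word (GNU sed assumed).
dedupe_words_sed() {
    # :a;$!{N;ba}   read the whole input into the pattern space, so that
    #               \s can also match a newline between two copies of a word
    # :b; s...; tb  drop one duplicate per pair, and loop while anything changed
    sed -E ':a;$!{N;ba};:b;s/\b(\w+)\s+\1\b/\1/g;tb' "$@"
}

printf 'one two two two three\n' | dedupe_words_sed   # prints: one two three
```

Note that `\w` only covers letters, digits and underscore, so words containing apostrophes or hyphens are compared piecewise, and the whitespace between the duplicates (including a line break) is removed along with the duplicate.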
This requires GNU awk, as it uses the non-standard RT variable (grail). It seems to work well in my tests, removing any number of duplicated words, but I can't guarantee that it will work perfectly in every situation.
To explain the code:
The BEGIN block sets the record separator to strings of whitespace (including newlines), so that each word is processed individually.
The output record separator is nulled because RT will be used instead. Also declare a variable "x".
Then test to see if the current word equals x (the previous word). Only print if it doesn't match.
Outside the test, re-insert the whitespace string that delimited the word (that's what RT holds), so that as much of the original formatting as possible is kept. In other words, this will remove the duplicate words, but not the space around them.
Finally, store the current word as x for the next loop.
Edit2: If you want the separating space to be removed as well, you can move RT into the test block. This means that trailing space will not get printed after duplicate words.
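The code this explanation refers to is not shown above, so here is a minimal sketch that follows the description; it is not necessarily the original one-liner. It assumes GNU awk is installed as `gawk`, and the two shell function names are made up for illustration.

```shell
# Keeps all original whitespace: RT is printed for every word, duplicate or not.
dedupe_keep_space() {
    gawk 'BEGIN { RS = "[[:space:]]+"; ORS = ""; x = "" }
          {
              if ($0 != x) print $0   # print the word only if it differs from the previous one
              print RT                # re-insert the whitespace that ended this record
              x = $0                  # remember the word for the next record
          }' "$@"
}

# "Edit2" variant: RT moves into the test, so the space after a duplicate is dropped too.
dedupe_drop_space() {
    gawk 'BEGIN { RS = "[[:space:]]+"; ORS = ""; x = "" }
          {
              if ($0 != x) print $0 RT
              x = $0
          }' "$@"
}

# Usage: dedupe_drop_space file.txt
```

Both variants compare whole whitespace-delimited strings, so any attached punctuation is part of the comparison.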
For example, my input text looks like the following:
"ana are are are are mere
mere
ion are prune prune."
and the output that I want is:
"ana are mere
ion are prune."
Thank you for the example text. It's exposed a fatal flaw in the solutions so far: punctuation.
Testing whether a simple string is the same as the one before it isn't too hard, but since "prune" and "prune." are not equal strings, both of them will remain.
I'm trying to think of some way to work around this, but I'm worried that it may end up being almost too complex to handle.
So far I've tried stripping punctuation from both strings for testing, but that keeps only the first instance of each repeated word, and so the final period is lost. Reversing it to print the final instance of a word is a much more complex operation.
And what should it do if two duplicates of a word have different punctuation, such as one with a comma and one with a period?
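To make the problem concrete, here is a small sketch of the "strip punctuation before comparing" attempt described above. It assumes GNU awk installed as `gawk`; the function name and the `key`/`prev` variables are illustrative and not taken from any script in this thread.

```shell
dedupe_first_wins() {
    gawk 'BEGIN { RS = "[[:space:]]+"; ORS = "" }
          {
              key = $0
              sub(/[[:punct:]]+$/, "", key)            # comparison key: word minus trailing punctuation
              if (NR == 1 || key != prev) print $0 RT   # first occurrence wins, printed as written
              prev = key
          }' "$@"
}

# Usage: dedupe_first_wins file.txt
```

With the sample text above, "prune" and "prune." now compare equal, so only the first "prune" and the space after it are printed and the closing period is dropped.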
This is close, but it has the same issue with punctuation: it now prints only the first occurrence, so in this case "prune" with a space gets printed before "prune" with a period:
So I thought a bit, and it seems that if we are editing, say, a novel, where punctuation falls in all sorts of places, it may be too tricky.
However, using the current example, the following works (I also tested by adding a word after the last punctuation and it seemed OK too):
The following awk script breaks up any words that end in punctuation marks into two variables. Then it doesn't print anything immediately, but stores the values until the next word is tested. Only if the next word fails to match does the stored value+punctuation get printed.
Since this doesn't work on the final word of the file, a simple statement to print that is added at the end.
Caveats I know of so far are:
1) Only punctuation at the end of the word is tested for. If there's anything at the beginning or in the middle of the words that's different, they won't match.
2) Only the punctuation at the end of the final word in the series, if any, is printed.
3) It will delete any line breaks, along with all other whitespace, that come in the middle of a run of consecutive words.
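The script itself is not shown here, so the following is only a sketch of the hold-and-print-later idea as described. It assumes GNU awk installed as `gawk`; the function and variable names (`dedupe_last_wins`, `held_word`, `held_punct`, `held_sep`) are made up for illustration.

```shell
dedupe_last_wins() {
    gawk 'BEGIN { RS = "[[:space:]]+"; ORS = "" }
          {
              word = $0; punct = ""
              # Split trailing punctuation off into its own variable.
              if (match(word, /[[:punct:]]+$/)) {
                  punct = substr(word, RSTART)
                  word  = substr(word, 1, RSTART - 1)
              }
              # A different word arrived: flush the held one with its punctuation and separator.
              if (NR > 1 && word != held_word) print held_word held_punct held_sep
              # Hold the current word; within a run this overwrites the earlier duplicate.
              held_word = word; held_punct = punct; held_sep = RT
          }
          END { if (NR > 0) print held_word held_punct held_sep }' "$@"
}

# Usage: dedupe_last_wins file.txt
```

Because each duplicate overwrites the held values, the punctuation and whitespace that survive are those of the last word in a run, which is where caveats 2 and 3 come from.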
Grail, your revised version is pretty good. It seems to keep the first punctuation mark it finds, whereas mine keeps the last. Who can say which is better? But there's still a bit of trouble at the end of yours.
Code:
$ cat fileA.txt
ana ana are:
are, are are mere--
mere mere,
ion are prune, prune, prune.
$ grails_script.sh fileA.txt
ana are: mere--
ion are prune,.
$ davids_script.sh fileA.txt
ana are mere,
ion are prune.
Edit: here's another limitation I found, in your match function.
It may break if there's punctuation in the middle of the string.
Code:
$ cat fileA.txt
ana ana are:
are, are are mere--
me.re me.ntos,
ion are prune, prune, prune.
$ grails_script.sh fileA.txt
ana are: mere--
ion are prune,.
mere--, me.re, and me.ntos should rightly be treated as separate entities, I think. But the match regex means it only compares the characters before the first punctuation mark.
Last edited by David the H.; 05-03-2011 at 10:28 AM.
Reason: as stated
As a semi-off-topic question, I've been trying to re-format your one-liner into a stand-alone awk script (so I can understand it better), but it's not working.
Code:
#!/usr/bin/awk -f
BEGIN{
RS="[[:space:]]+"
ORS=""
}
match( $0 , /^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/ , f )
{
if ( x != f[2] ) {
print y$0
z=FNR
}
x=f[2]
y=RT
}
END{
if ( z != FNR ) print f[3]"\n"
}
I can't see where I've changed anything significant, but it doesn't remove anything and the output is all compacted.
It is your match line. I find with awk it is a good habit to treat the start of all curly braces as you do with BEGIN / END, i.e. have them start immediately after the test.
Otherwise, the block after the lone opening brace is run for every record, which is not the desired effect.
Change is:
Code:
match( $0 , /^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/ , f ){
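A stripped-down illustration of why the brace position matters, in plain awk (the two function names are made up for this example): a pattern alone on a line is a complete rule with the default action of printing the record, and a block on the next line is a second rule with no pattern, so it runs for every record.

```shell
brace_on_next_line() {
    # Two rules: /a/ with the default print, then an unconditional block.
    awk '/a/
    { print "block: " $0 }'
}

brace_on_same_line() {
    # One rule: the block runs only for records matching /a/.
    awk '/a/ {
        print "block: " $0
    }'
}

printf 'a\nb\n' | brace_on_next_line   # prints three lines: a, block: a, block: b
printf 'a\nb\n' | brace_on_same_line   # prints one line: block: a
```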
I am going to hold off on further solutions until the OP provides more details. For some of the outcomes from my earlier script, the output is correct based on the original request. There are also plenty of scenarios I came up with where, depending on what is required, the output is completely wrong.
Thanks David, it works well, but there is still a little problem. For example, for the input file:
"ana ana are are are mere
mere si portocale
ion are prune prune,prune.
"
the output will be:
"ana are mere si portocale
ion are prune"
The second line is concatenated with the first (the \n is removed), and the dot after "prune" is missing (no matter, it is not so important).
Thanks