LinuxQuestions.org
Old 05-04-2011, 11:38 AM   #16
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020


Hi,

I'd like to suggest an alternative sed:
Code:
sed -rn ':a $! {N;ba}; s/([^ [:punct:]\n]+)([ [:punct:]\n]+\1)*/\1/gp' file
I ran the following test:
Code:
$ cat file
ana ana are are are mere,mere ,
, ,mere si portocale
portocale.
ion are prune prune, prune?prune,,prune.
banana are prune prune, prune?prune,,prune.

$ sed -rn ':a $! {N;ba}; s/([^ [:punct:]\n]+)([ [:punct:]\n]+\1)*/\1/gp' file
ana are mere si portocale.
ion are prune.
banana are prune.
Seems ok, if the concatenation is not a big issue.
The awks had trouble with the above format:
Code:
$ awk 'BEGIN{RS="[[:space:]]+";ORS=""}match($0,/^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/,f){if(x != f[2]){print y$0;z = FNR}x = f[2];y = RT}END{if(z != FNR)print f[3]"\n"}' file
ana are ,mere si portocale
ion are prune banana are prune

$ ./david_awk.scr file
ana are mere,mere ,, ,mere si portocale.
ion are prune, prune?prune,,prune.
banana are prune, prune?prune,,prune.$
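
For comparison, here is a rough Python port of the sed substitution above. It is only a sketch under a few assumptions: `string.punctuation` stands in for `[:punct:]`, the names `SEP`, `WORD`, `DUPES` and `collapse` are made up for this example, and two guards (a lookbehind and a lookahead) are added so that a match can only start and end on a whole word. The sed command in the post has no such guards.

```python
import re
import string

# Separator class: space, newline and ASCII punctuation (an approximation
# of sed's [ [:punct:]\n]). Anything else counts as part of a word.
SEP = "[ \\n" + re.escape(string.punctuation) + "]"
WORD = "[^ \\n" + re.escape(string.punctuation) + "]"

# A word, followed one or more times by separators plus the same word.
# The lookbehind/lookahead keep the match from starting or ending mid-word.
DUPES = re.compile(rf"(?<!{WORD})({WORD}+)(?:{SEP}+\1(?!{WORD}))+")

def collapse(text):
    """Replace each run of a repeated word with a single copy of it."""
    return DUPES.sub(r"\1", text)

sample = (
    "ana ana are are are mere,mere ,\n"
    ", ,mere si portocale\n"
    "portocale.\n"
    "ion are prune prune, prune?prune,,prune.\n"
    "banana are prune prune, prune?prune,,prune.\n"
)
print(collapse(sample), end="")
```

Like the sed version, this drops every separator inside a run of duplicates, so lines that are joined by a repeated word end up concatenated.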
 
Old 05-04-2011, 03:13 PM   #17
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Sigh, thanks grail. I could've sworn I tried that. I really hate awk's bracketing syntax. It never seems to do what I think it should do.

Quote:
Originally Posted by cocostaec
the second line will be concatenated with the first (\n is removed) and the dot after "prune" is missing (no matter, it is not so important)
thanks
Yes, that was my third caveat above. You'd have to add some kind of function to count and store newlines between each word, and print them back at the right time. I made a few attempts at doing something like that, but no matter what I did I couldn't get it to work right, so I kind of gave up on it.

But I'm not getting the missing period at the end. It works for me. Indeed, all the END code does is print the final field as-is, so it shouldn't remove anything.
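
The newline bookkeeping mentioned above (count the newlines between duplicate words, then print them back later) could look roughly like this in Python. This is a sketch, not a port of either awk script: the function names are invented for the example, words are taken to be runs of `\w` characters (so an underscore counts as a word character, unlike `[:punct:]`), and "the right time" is assumed to mean "right after the separator that follows the last duplicate".

```python
import re

def _flush(sep, swallowed):
    # Re-attach newlines that sat between dropped duplicates.
    if swallowed:
        return sep.rstrip(" \t") + "\n" * swallowed
    return sep

def squeeze_repeats(text):
    """Collapse runs of the same word, keeping the separator that follows
    the last copy and restoring newlines swallowed inside the run."""
    out = [re.match(r"\W*", text).group()]   # leading separators, kept as-is
    prev, pending, swallowed = None, "", 0
    for m in re.finditer(r"(\w+)(\W*)", text):
        word, sep = m.groups()
        if word == prev:
            # Duplicate: drop the separator before it, but count its newlines.
            swallowed += pending.count("\n")
            pending = sep
            continue
        out.append(_flush(pending, swallowed))
        out.append(word)
        prev, pending, swallowed = word, sep, 0
    out.append(_flush(pending, swallowed))
    return "".join(out)

print(squeeze_repeats("si portocale\nportocale.\nion are prune prune, prune."))
```

With this approach the output always contains as many newlines as the input, although a newline that originally sat between two duplicates moves to after the run, which can leave an empty line behind.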



@crts Good job. Of course it all depends on the assumptions and trade-offs you make in what to remove and what to keep. Our awk scripts have assumed that each word is space-delimited, with optional punctuation at the end.


Thinking about it a bit, simply adding punctuation to the record separator makes for a short, crisp script that does a pretty good job. It prints only the final instance of a duplicate and its trailing punctuation, if any.

Of course all formatting inside the string of dupes is still lost. Also, since all punctuation is now part of the delimiter, beginning punctuation is handled as part of the previous separator. A string like

--foo :foo, +foo? foo-foo,

will all be condensed into one word, "foo", with "--" being printed in front of it as part of the previous separator, and "," afterwards as part of the trailing separator.

Code:
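#!/usr/bin/awk -f
# Needs GNU awk: a regex RS and the RT variable are gawk extensions.

BEGIN{
RS="[[:space:][:punct:]]+"
ORS=pw=""
}

# Main rule: print the previous word and its separator
# only when the current word differs from it.
{
if ( $0 != pw ) {
     print (pw)(pRT)
     }

pw=$0
pRT=RT
}

END{ print $0RT }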
In the end, there's only so much simple scripts like these can accomplish, and it would take a rather complex program to handle all possible situations.
 
Old 05-07-2011, 03:29 AM   #18
cocostaec
LQ Newbie
 
Registered: May 2011
Posts: 9

Original Poster

Let's say that it works fine (I hope the \n won't be a problem), thanks. Now I want to expand the problem: I want to search for blocks of identical consecutive strings. I think that sed and awk won't work in this situation. I've tried working with "for", but I don't know how to handle the blocks. I think I have to compare the first 2 words with the next 2 words and so on, then words 2 and 3 with words 4 and 5 and so on. It seems to be difficult...
 
Old 05-07-2011, 01:25 PM   #19
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

I'd recommend taking a look at Perl.

Regarding the for-loop: you will need two nested loops. The outer one sets the number of words in a block, and the inner one examines each possible pair of adjacent blocks. With n = the number of words in the whole text, you may take the following as pseudocode:
Code:
for (words=2; words<=n/2; words++) 
   for (start=1; start<=n-2*words+1; start++)
       compare block1 (from start to start+words-1) with block2 (from start+words to start+2*words-1) 
   next start
next words
This will become more complicated if you assume that the words in the blocks are in the same order but with different separators between them.
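
As a rough illustration of the two nested loops, here is a minimal Python sketch. It assumes that words are runs of `\w` characters and that everything between them is a separator to be ignored, which sidesteps the "different separators" complication. The function name `find_repeated_blocks` is made up for this example, and indices are 0-based, unlike the pseudocode.

```python
import re

def find_repeated_blocks(text):
    """Return (start, size) pairs where a block of `size` words is
    immediately followed by an identical block (0-based word indices)."""
    words = re.findall(r"\w+", text)   # separators are thrown away here
    n = len(words)
    hits = []
    for size in range(2, n // 2 + 1):             # outer loop: block length
        for start in range(n - 2 * size + 1):     # inner loop: block position
            first = words[start:start + size]
            second = words[start + size:start + 2 * size]
            if first == second:
                hits.append((start, size))
    return hits

print(find_repeated_blocks("ion are prune, ion are prune. banana"))
# prints [(0, 3)]
```

This only reports where the repeated blocks are. Removing them while keeping the original separators would need the word positions in the text as well, for example from `re.finditer`.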

BTW, you should use the report-button and ask the moderators to move this thread into the Programming-section of LQ.

Markus

Last edited by markush; 05-07-2011 at 01:27 PM.
 
  

