LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 05-19-2011, 09:28 AM   #1
cocostaec
LQ Newbie
 
Registered: May 2011
Posts: 9

Rep: Reputation: 0
finding and removing block of identical strings


i have a problem in finding block of identical strings...i solved the problem in finding consecutive identical words and now i want to expand the code in order to find and remove consecutive identical block of strings...
for example the awk code removing consecutive identical word is:
Code:
Code:
#!/usr/bin/awk -f
BEGIN{
RS="[[:space:]]+";
ORS=""
}
match($0,/^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/,f){
if(x != f[2])
{
print y$0;z = FNR
}
x = f[2];
y = RT
}
END{
if(z != FNR)
print f[3]"\n"
}
input:
"ana are mere mere
mere si portocale
ion are prune prune."

output:
"ana are mere si portocale
ion are prune."

and now i want to expand the code to do the following:
input:
"ana are ana are mere
ion are prune ion are prune"

output:
"ana are mere
ion are prune"

pls help me
thanks
 
Old 05-19-2011, 09:40 AM   #2
Ramurd
Member
 
Registered: Mar 2009
Location: Rotterdam, the Netherlands
Distribution: Slackwarelinux
Posts: 703

Rep: Reputation: 111Reputation: 111
And if the input is:

"ana are ure mere ion are prune"

would the output be:
"ana are ure mere ion pr"

?
 
Old 05-19-2011, 10:52 AM   #3
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757
Hi,

I recall the original problem. Will the strings also be spanning over several lines? If not you can try this:
Code:
sed -r 's/([^[:blank:]]+)(.+) +\1\2/\1\2/g' file
I haven't tested it for all possible combinations. So play around with the data a bit and let me know on which occasion it fails.
 
Old 05-19-2011, 10:59 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
As the complexity of this is growing large I would suggest moving to something like perl so you can have more control. Just my 2 cents
 
Old 05-19-2011, 01:10 PM   #5
cocostaec
LQ Newbie
 
Registered: May 2011
Posts: 9

Original Poster
Rep: Reputation: 0
no...only consecutive identical block of strings are considered errors and must be removed...
 
Old 05-20-2011, 06:03 AM   #6
cocostaec
LQ Newbie
 
Registered: May 2011
Posts: 9

Original Poster
Rep: Reputation: 0
if i have the input file

"ana ana are ana are
ion merge ion merge."

using
Code:
sed -r 's/([^[:blank:]]+)(.+) +\1\2/\1\2/g' file
the output will be

"ana are ana are
ion merge."

is sed recursive?

thanks
 
Old 05-20-2011, 06:29 AM   #7
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757
Quote:
is sed recursive?
No, by default sed is not recursive. Recursion has to be enforced. I noticed another problem with the previous sed if the patterns occur more than twice. This takes care of this and handles the recursion:
Code:
sed -r ':a s/([^[:blank:]]+)(.+) +\1\2/\1\2/;t a' file
Notice, that the g-flag is no longer needed.
 
Old 05-20-2011, 08:01 AM   #8
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3940Reputation: 3940Reputation: 3940Reputation: 3940Reputation: 3940Reputation: 3940Reputation: 3940Reputation: 3940Reputation: 3940Reputation: 3940Reputation: 3940
Here is an efficient solution.

There is a subtle "tail recursion" element to this problem which can be exploited: every occurrence of 'n' consecutive strings can be reduced to occurrences of <i>two</i> consecutive strings (that is to say, "2-tuples"), each tuple of which occurs more than one time.

For example, in the string "paris in the the spring in the paris in the spring" contains many repeating 2-tuples, and, when we observe a repetition of more-than-2 words, another way to describe this is to say that the tuples overlap. For instance, "paris in the the spring mother in the spring" contains these tuples, all of which are found more than once in the text:
  • paris in
  • in the
  • the the
  • the spring.

In order for a word to appear in a group that is "an 'n'-tuple," ("a phrase that is 'n' wods long") it by-definition also appears in '(n-1)' consecutive 2-tuples, all of which repeat.

You can solve the problem as follows: (not with a bash-script, please!)
  1. Parse the text into a flat-file where each record contains a 2-tuple and the ordinal position of that tuple in the text.
  2. Sort the file by the tuples and, by comparing the now-adjacent records to one another, eliminate the tuples that don't repeat.
  3. This file is "your basic answer." If you want to eliminate duplicate phrases, here they are.
  4. If you want to eliminate longer phrases, Re-sort the file (of repeating tuples...) by the "position" column, and, by comparing the now-adjacent records to one another, eliminate "consecutive runs" of position numbers that are sufficiently long. These are where the tuples overlap enough times to build up a "long-enough string."

No memory is required. You could have done this using punched cards. (And in fact, early cryptographers did this using punched cards.)
 
Old 05-22-2011, 01:41 PM   #9
cocostaec
LQ Newbie
 
Registered: May 2011
Posts: 9

Original Poster
Rep: Reputation: 0
crts can you explain a little the sed code?
thanks
 
Old 05-22-2011, 03:06 PM   #10
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757Reputation: 757
Quote:
Originally Posted by cocostaec View Post
crts can you explain a little the sed code?
thanks
Ok, the key is to understand how the RegEx works. Suppose you have the following text:
Code:
some text some text repeating itself
Now let's have a look at the first part of the RegEx in the substitution command:
Code:
([^[:blank:]]+)(.+) +
The brackets simply indicate to store the matches in backreferences. The match of the first match is stored in '\1' and the second one in '\2'.
So the first brackets will match 'some' in the above example. The second pair of brackets will match any character.
In order to see where the matching process will stop we have to look at the backreferences, too.
Code:
([^[:blank:]]+)(.+) +\1\2
The matching will stop when the first backreference - in this case the word 'some' - is encountered if it is preceeded by a space and followed by the match of the second backreference.
The second backreference is in this case ' text'. So the complete RegEx matches finally:
Code:
(some)( text) (some)( text)
  \1     \2     \1     \2
This whole pattern is the replaced with the backreferences '\1\2' which are in this case 'some text'.
The next thing you need to understand is the conditional jump command after the substitution command:
't a'
This command will jump to point ':a' only if the preceeding 's///' command has made a substitution in the pattern buffer. Suppose you have the following pattern:
Code:
some text some text some text repeating itself
After the 's///' command finishes the first time your pattern will look like this:
Code:
some text some text repeating itself
Since a substitution was made the script will jump to point ':a' and execute the 's///' command again.
The pattern now looks like:
Code:
some text repeating itself
A substitution was made so the 't' command jumps back to 'a' again. This time, however, the RegEx does not match anything. Therefore the no substitution is made and the 't' command does not jump. The cycle ends and the next line is read into the pattern buffer.

The RegEx also works if you have two consecutive words. But it works in a non-obvious manner.
I tried to keep the explanation as simple as possible.

I also reviewed your other thread again. It raised the issue of punctuation. So the 'sed' *might* be extended like
Code:
sed -r ':a s/([^[:blank:]]+)(.+)[ [:punct:]]+\1\2/\1\2/;t a' file
But your sample data you provided in this thread does not indicate the need for this extension.

Hope this helps.

Last edited by crts; 05-22-2011 at 03:44 PM.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
finding and remove block of identical strings cocostaec Linux - Newbie 2 05-20-2011 06:24 AM
[SOLVED] Can not match strings that appear to be identical dnoob Programming 4 02-27-2011 11:42 AM
Grep -- finding files that contain two separate strings Fliggerty Linux - Newbie 4 12-10-2010 08:40 AM
block strings with pf. posible? what about other firewalls ? da1 *BSD 1 06-21-2007 08:29 AM
Is it possible to block text strings with IP tables? abefroman Linux - Security 27 06-29-2005 05:36 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 05:43 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration