finding and removing block of identical strings

cocostaec · 05-19-2011, 09:28 AM

i have a problem in finding block of identical strings...i solved the problem in finding consecutive identical words and now i want to expand the code in order to find and remove consecutive identical block of strings...
for example the awk code removing consecutive identical word is:
Code:

Code:

#!/usr/bin/awk -f
BEGIN{
RS="[[:space:]]+";
ORS=""
}
match($0,/^([[:punct:]]*)([^[:punct:]]+)([[:punct:]]*)$/,f){
if(x != f[2])
{
print y$0;z = FNR
}
x = f[2];
y = RT
}
END{
if(z != FNR)
print f[3]"\n"
}

input:
"ana are mere mere
mere si portocale
ion are prune prune."

output:
"ana are mere si portocale
ion are prune."

and now i want to expand the code to do the following:
input:
"ana are ana are mere
ion are prune ion are prune"

output:
"ana are mere
ion are prune"

pls help me
thanks

Ramurd · 05-19-2011, 09:40 AM

And if the input is:

"ana are ure mere ion are prune"

would the output be:
"ana are ure mere ion pr"

?

crts · 05-19-2011, 10:52 AM

Hi,

I recall the original problem. Will the strings also be spanning over several lines? If not you can try this:

Code:

sed -r 's/([^[:blank:]]+)(.+) +\1\2/\1\2/g' file

I haven't tested it for all possible combinations. So play around with the data a bit and let me know on which occasion it fails.

grail · 05-19-2011, 10:59 AM

As the complexity of this is growing large I would suggest moving to something like perl so you can have more control. Just my 2 cents

cocostaec · 05-19-2011, 01:10 PM

no...only consecutive identical block of strings are considered errors and must be removed...

cocostaec · 05-20-2011, 06:03 AM

if i have the input file

"ana ana are ana are
ion merge ion merge."

using

Code:

sed -r 's/([^[:blank:]]+)(.+) +\1\2/\1\2/g' file

the output will be

"ana are ana are
ion merge."

is sed recursive?

thanks

crts · 05-20-2011, 06:29 AM

Quote:

is sed recursive?

No, by default sed is not recursive. Recursion has to be enforced. I noticed another problem with the previous sed if the patterns occur more than twice. This takes care of this and handles the recursion:

Code:

sed -r ':a s/([^[:blank:]]+)(.+) +\1\2/\1\2/;t a' file

Notice, that the g-flag is no longer needed.

sundialsvcs · 05-20-2011, 08:01 AM

Here is an efficient solution.

There is a subtle "tail recursion" element to this problem which can be exploited: every occurrence of 'n' consecutive strings can be reduced to occurrences of <i>two</i> consecutive strings (that is to say, "2-tuples"), each tuple of which occurs more than one time.

For example, in the string "paris in the the spring in the paris in the spring" contains many repeating 2-tuples, and, when we observe a repetition of more-than-2 words, another way to describe this is to say that the tuples overlap. For instance, "paris in the the spring mother in the spring" contains these tuples, all of which are found more than once in the text:

paris in
in the
the the
the spring.

In order for a word to appear in a group that is "an 'n'-tuple," ("a phrase that is 'n' wods long") it by-definition also appears in '(n-1)' consecutive 2-tuples, all of which repeat.

You can solve the problem as follows: (not with a bash-script, please!)

Parse the text into a flat-file where each record contains a 2-tuple and the ordinal position of that tuple in the text.
Sort the file by the tuples and, by comparing the now-adjacent records to one another, eliminate the tuples that don't repeat.
This file is "your basic answer." If you want to eliminate duplicate phrases, here they are.
If you want to eliminate longer phrases, Re-sort the file (of repeating tuples...) by the "position" column, and, by comparing the now-adjacent records to one another, eliminate "consecutive runs" of position numbers that are sufficiently long. These are where the tuples overlap enough times to build up a "long-enough string."

No memory is required. You could have done this using punched cards. (And in fact, early cryptographers did this using punched cards.)

cocostaec · 05-22-2011, 01:41 PM

crts can you explain a little the sed code?
thanks

crts · 05-22-2011, 03:06 PM

Quote:

Originally Posted by cocostaec

crts can you explain a little the sed code?
thanks

Ok, the key is to understand how the RegEx works. Suppose you have the following text:

Code:

some text some text repeating itself

Now let's have a look at the first part of the RegEx in the substitution command:

Code:

([^[:blank:]]+)(.+) +

The brackets simply indicate to store the matches in backreferences. The match of the first match is stored in '\1' and the second one in '\2'.
So the first brackets will match 'some' in the above example. The second pair of brackets will match any character.
In order to see where the matching process will stop we have to look at the backreferences, too.

Code:

([^[:blank:]]+)(.+) +\1\2

The matching will stop when the first backreference - in this case the word 'some' - is encountered if it is preceeded by a space and followed by the match of the second backreference.
The second backreference is in this case ' text'. So the complete RegEx matches finally:

Code:

(some)( text) (some)( text)
  \1     \2     \1     \2

This whole pattern is the replaced with the backreferences '\1\2' which are in this case 'some text'.
The next thing you need to understand is the conditional jump command after the substitution command:
't a'
This command will jump to point ':a' only if the preceeding 's///' command has made a substitution in the pattern buffer. Suppose you have the following pattern:

Code:

some text some text some text repeating itself

After the 's///' command finishes the first time your pattern will look like this:

Code:

some text some text repeating itself

Since a substitution was made the script will jump to point ':a' and execute the 's///' command again.
The pattern now looks like:

Code:

some text repeating itself

A substitution was made so the 't' command jumps back to 'a' again. This time, however, the RegEx does not match anything. Therefore the no substitution is made and the 't' command does not jump. The cycle ends and the next line is read into the pattern buffer.

The RegEx also works if you have two consecutive words. But it works in a non-obvious manner.
I tried to keep the explanation as simple as possible.

I also reviewed your other thread again. It raised the issue of punctuation. So the 'sed' *might* be extended like

Code:

sed -r ':a s/([^[:blank:]]+)(.+)[ [:punct:]]+\1\2/\1\2/;t a' file

But your sample data you provided in this thread does not indicate the need for this extension.

Hope this helps.