Quote:
Originally Posted by cocostaec
crts can you explain a little the sed code?
thanks
Ok, the key is to understand how the RegEx works. Suppose you have the following text:
Code:
some text some text repeating itself
Now let's have a look at the first part of the RegEx in the substitution command:
Code:
([^[:blank:]]+)(.+) +
The parentheses simply tell sed to store whatever they match in backreferences. The match of the first pair is stored in '\1' and the match of the second pair in '\2'.
So the first pair will match 'some' in the above example, because '[^[:blank:]]+' matches a run of non-blank characters. The second pair, '(.+)', will match any sequence of one or more characters.
In order to see where the matching process will stop we have to look at the backreferences, too.
Code:
([^[:blank:]]+)(.+) +\1\2
The matching will stop when the first backreference - in this case the word 'some' - is encountered, provided it is preceded by a space and followed by the match of the second backreference.
The second backreference is in this case ' text' (including the leading space). So the complete RegEx finally matches:
Code:
(some)( text) (some)( text)
  \1    \2      \1    \2
This whole match is then replaced with the backreferences '\1\2', which are in this case 'some text'.
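To see this single substitution on its own, here is a minimal sketch run on the sample line. It assumes GNU sed (or another sed that accepts backreferences in extended regular expressions). I use '-E', which GNU sed treats the same as '-r', and the variable names are only for illustration.

```shell
# Sample line from the explanation above
line='some text some text repeating itself'

# One pass of the substitution only, without the ':a' / 't a' loop
result=$(printf '%s\n' "$line" | sed -E 's/([^[:blank:]]+)(.+) +\1\2/\1\2/')

printf '%s\n' "$result"
# prints: some text repeating itself
```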
The next thing you need to understand is the conditional jump command after the substitution command:
't a'
This command will jump to the label ':a' only if the preceding 's///' command has made a substitution in the pattern buffer. Suppose you have the following pattern:
Code:
some text some text some text repeating itself
After the 's///' command finishes the first time your pattern will look like this:
Code:
some text some text repeating itself
Since a substitution was made the script will jump to point ':a' and execute the 's///' command again.
The pattern now looks like:
Code:
some text repeating itself
A substitution was made, so the 't' command jumps back to ':a' again. This time, however, the RegEx does not match anything. Therefore no substitution is made and the 't' command does not jump. The cycle ends and the next line is read into the pattern buffer.
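The effect of the loop can be checked with a small sketch like the one below. It is only an illustration under a few assumptions: GNU sed (or a compatible sed), '-E' in place of '-r', and the script split into separate '-e' parts so that the label is read the same way by different sed versions. The variable names are made up.

```shell
# Line with the phrase repeated three times
line='some text some text some text repeating itself'

# Without the loop: only one repetition is removed
once=$(printf '%s\n' "$line" | sed -E 's/([^[:blank:]]+)(.+) +\1\2/\1\2/')

# With the loop: ':a' sets the label, 's///' removes one repetition,
# and 't a' jumps back to ':a' as long as 's///' changed something
looped=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/([^[:blank:]]+)(.+) +\1\2/\1\2/' -e 't a')

printf 'once:   %s\n' "$once"
printf 'looped: %s\n' "$looped"
# once:   some text some text repeating itself
# looped: some text repeating itself
```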
The RegEx also works if you have two consecutive words. But it works in a non-obvious manner.
I tried to keep the explanation as simple as possible.
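For the curious, here is a small sketch of the two-consecutive-words case, again assuming GNU sed with '-E'. The input line is a made-up example. Since '(.+)' needs at least one character, the first pair of parentheses cannot take the whole word here, so the word ends up split between '\1' and '\2'. Put back together as '\1\2' it is still the complete word, which is why the result looks right.

```shell
# A single word repeated twice, separated by one space
line='hello hello world'

# '\1' and '\2' each hold a part of 'hello'; together they rebuild the word
dedup=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/([^[:blank:]]+)(.+) +\1\2/\1\2/' -e 't a')

printf '%s\n' "$dedup"
# prints: hello world
```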
I also reviewed your other thread again. It raised the issue of punctuation. So the 'sed' command *might* be extended like this:
Code:
sed -r ':a s/([^[:blank:]]+)(.+)[ [:punct:]]+\1\2/\1\2/;t a' file
But the sample data you provided in this thread does not indicate the need for this extension.
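As a rough comparison of the two variants, the sketch below runs both on a line where the repetition is separated by a comma. The input line is invented for illustration and is not taken from your data. GNU sed with '-E' is assumed, and the script is split into '-e' parts as a portability precaution.

```shell
# Repetition separated by a comma and a space (hypothetical input)
line='some text, some text repeating itself'

# Original RegEx: only spaces are allowed between the repetitions
plain=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/([^[:blank:]]+)(.+) +\1\2/\1\2/' -e 't a')

# Extended RegEx: spaces and punctuation are allowed between the repetitions
punct=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/([^[:blank:]]+)(.+)[ [:punct:]]+\1\2/\1\2/' -e 't a')

printf 'plain: %s\n' "$plain"
printf 'punct: %s\n' "$punct"
# plain: some text, some text repeating itself
# punct: some text repeating itself
```

Note that the comma disappears together with the repeated phrase, because the separator is not part of '\1\2'.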
Hope this helps.