Hi,
I have been very busy and had no spare time for explanations until now. I finally have some time to go into the details of the 'sed' solution. I will split it up first and then rebuild the most important part step by step. The main part is the first 's' command together with its loop, i.e. everything from ':b' up to and including 'tb':
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g;tb;s/\n+/\n/gp'
The other parts are not so interesting at the moment. The first part, ':a N;$! ba;', simply reads the whole file into the pattern space. The last substitution command, 's/\n+/\n/gp', replaces consecutive newlines with a single newline and prints the result. That is needed because the main part produces empty lines, which we do not want.
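If you want to see these two helper parts in isolation, here is a small sketch. It assumes GNU sed (for '-r' and for '\n' in the replacement) and writes the read loop as ':a;N;$!ba', which is the same loop without the optional spaces. The sample input and the variable names are made up for illustration.

```shell
# Slurp the whole input into the pattern space, then make the embedded
# newlines visible by turning each of them into a '+'.
joined=$(printf 'one\ntwo\nthree\n' | sed -n ':a;N;$!ba;s/\n/+/gp')
echo "$joined"      # one+two+three

# The final substitution on its own: squeeze runs of newlines into one.
squeezed=$(printf 'a\n\n\nb\n' | sed -nr ':a;N;$!ba;s/\n+/\n/gp')
echo "$squeezed"    # prints "a" and "b" on two lines, no blank lines between
```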
Let us now try to understand how the main part works. We will build it up step by step. For that we will use the following simplified data set:
Code:
$ cat simple-file
Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer
Now let us try to match the first pair of names, i.e. a first name followed by a surname:
Code:
sed -nr ':a N;$! ba;:b s/([^ ]+ +[^ \n]+)/|\1|/p' simple-file
Notice the parentheses. They mark a group that can be back-referenced. That means whatever is matched inside these parentheses is stored in a *special* buffer. The content of this buffer can be accessed with a backreference, in this case '\1'. Try the above example to see what is stored in '\1': it will appear between the two '|' characters.
So we see that the RegEx '[^ ]+ +[^ \n]+' will match "Janice Flavor". Hopefully it is obvious why, but I am not sure how deep your sed knowledge is at this point, so here is the breakdown.
The first character class, '[^ ]+', matches one or more characters that are NOT a space. It must be followed by one or more spaces. The next character class, '[^ \n]+', matches one or more characters that are NEITHER a space NOR a newline. This is important since 'Flavor' is followed by a newline at this point.
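To check what the group captures without scrolling through the whole output, you could run something like the following. It is only a sketch: it assumes GNU sed, feeds the sample data through a pipe instead of 'simple-file', and keeps just the first output line with 'head'.

```shell
data='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# Slurp the input, wrap the first match in '|', and keep only the first line.
first=$(printf '%s\n' "$data" |
  sed -nr ':a;N;$!ba;s/([^ ]+ +[^ \n]+)/|\1|/p' |
  head -n 1)
echo "$first"    # |Janice Flavor|
```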
So now we have matched 'Janice Flavor'. Our next objective is to identify the *other* Janices and retrieve their second names. Remember what I said about backreferences? Any pattern that is matched inside () is stored in a *special* buffer. You have 9 of those buffers. You can access them with '\N', where N is a digit from 1 to 9 (do not confuse this with '\n', the newline). For example, \1 refers to the content of the first pair of parentheses and \2 to the content of the second pair, counted by their opening parenthesis.
Let us capture 'Janice' in a *special* buffer:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)/|\1|\2|/p' simple-file
As you see, groups can be nested! The outer pair of parentheses still holds 'Janice Flavor'. The inner pair holds 'Janice' alone.
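A quick way to convince yourself how nested groups are numbered is a one-liner on a single name. This is just an illustration with GNU sed; the third group and the '1=[...]' labels are my additions and are not part of the solution.

```shell
# Groups are numbered by the position of their opening parenthesis:
# \1 is the outer group, \2 and \3 are the two inner ones.
numbered=$(echo 'Janice Flavor' |
  sed -r 's/(([^ ]+) +([^ ]+))/1=[\1] 2=[\2] 3=[\3]/')
echo "$numbered"    # 1=[Janice Flavor] 2=[Janice] 3=[Flavor]
```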
Let us refine our RegEx a bit more:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)\n\2/|\1|\2|/p' simple-file
Notice the '\n\2' at the end of the pattern. Until now we have only used backreferences on the right-hand side of the substitution command, but we can also use them on the left-hand side. Now our RegEx looks for a first and a second name that are followed by a newline and then the same first name again. We do not match 'Janice Flavor' anymore because she is followed by 'Linda'. 'Janice Taylor', however, is followed by 'Janice Wafer' on the next line. So our RegEx does match.
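Here is a sketch that shows where the match lands now. It assumes GNU sed, pipes in the sample data, and uses 'tail' to keep only the last output line, which is the one that changes.

```shell
data='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# \2 on the left-hand side must repeat exactly what group 2 captured,
# so the match can only start at a name whose next line starts the same way.
last=$(printf '%s\n' "$data" |
  sed -nr ':a;N;$!ba;s/(([^ ]+) +[^ \n]+)\n\2/|\1|\2|/p' |
  tail -n 1)
echo "$last"    # |Janice Taylor|Janice| Wafer
```

The first two lines pass through untouched, and ' Wafer' is left over because the pattern stops right after the repeated first name.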
When we substitute, we do not need the backreference \2 since 'Janice' is already part of \1. It would be nicer if we could obtain 'Wafer'. Well, once again we use another group () that we can back-reference:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)\n\2 +([^ \n]+)/|\1|\3|/p' simple-file
After we have matched 'Janice' there can be one or more spaces before 'Wafer'. We match 'Wafer' itself by matching any characters that are NEITHER a newline NOR a space. We exclude the space in order to accommodate possible trailing spaces. Our first pair of parentheses matches 'Janice Taylor' and the third pair matches 'Wafer'. Those are our substitutes.
Now let us see if we can work around the interfering 'Linda'. We want 'Janice Flavor' as our first match. 'Flavor' can be followed by any characters, which includes 'Linda Brown' and some newlines, until we meet 'Janice' again in the third line. So let us put '.*\n' after our first pair of parentheses (in place of the bare '\n') to account for that:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+).*\n\2 +([^ \n]+)/|\1|\3|/p' simple-file
It finally gets interesting! Notice that the third group does NOT match 'Taylor'. RegEx quantifiers are GREEDY, i.e. '.*\n\2' will look for the longest possible match! And that is
Code:
\nLinda Brown\nJanice Taylor\nJanice
So the third group will match 'Wafer'. We are getting closer to our goal.
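You can watch the greedy match swallow everything in between with a sketch like this one (GNU sed assumed, sample data piped in instead of read from 'simple-file'):

```shell
data='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# '.*' runs as far as it can, so '\n\2' ends up being the LAST "\nJanice".
greedy=$(printf '%s\n' "$data" |
  sed -nr ':a;N;$!ba;s/(([^ ]+) +[^ \n]+).*\n\2 +([^ \n]+)/|\1|\3|/p')
echo "$greedy"    # |Janice Flavor|Wafer|
```

Since the match spans the whole pattern space, 'Linda Brown' and 'Janice Taylor' are gone from the output. That is the information the next step brings back.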
Our next step is to preserve 'Linda' and basically everything that has been matched by '.*\n'. Yes, once more we use another group that we can backreference:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)(.*\n)\2 +([^ \n]+)/\1 \4 \3/p' simple-file
(Group 3 is '(.*\n)'; group 4 is the final '([^ \n]+)'.)
Notice that 'Wafer' is now matched by the 4th group and therefore must be back-referenced by \4. \3 holds our previously lost information. I also no longer use '|' on the RHS as a visual aid, since it would interfere in the next step if we kept it.
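To see exactly what this step produces, including the trailing space and the empty line it leaves behind, you can pipe the result through a second 'sed -n l', which marks every line end with '$'. Again a sketch that assumes GNU sed and feeds the sample data through a pipe.

```shell
data='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# \3 puts back everything that '.*\n' swallowed; 'sed -n l' makes
# line ends visible so trailing blanks and empty lines stand out.
kept=$(printf '%s\n' "$data" |
  sed -nr ':a;N;$!ba;s/(([^ ]+) +[^ \n]+)(.*\n)\2 +([^ \n]+)/\1 \4 \3/p' |
  sed -n l)
echo "$kept"
# Janice Flavor Wafer $
# Linda Brown$
# Janice Taylor$
# $
```

The lone '$' at the end is the empty line where 'Janice Wafer' used to be.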
We still need to get 'Taylor' between 'Flavor' and 'Wafer'. Therefore we will extend our RegEx to match 'Janice Flavor' and anything else that follows on that same line:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^\n]*)(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/;tb;p' simple-file
Two things happen here. First, we use ' *([^\n]*)' to match anything that follows 'Janice Flavor' on the same line. The '*' quantifier matches zero or more occurrences of the preceding pattern, so while 'Janice Flavor' is still alone on the first line, the additional pattern matches nothing. Once 'Wafer' has been appended by the first run of the 's' command, it will match 'Wafer'. Also notice that our backreferences have shifted again: the rest of the line is now \3, the preserved middle part is \4 and the new surname is \5.
Second, in order to force the 's' command to execute again, we use the conditional jump command 't'. It jumps back to the label ':b' only if an 's' command has made a substitution since the last input line was read or since the last 't' jump was taken. If our RegEx does not find any more matches, the 't' command does not jump, the print command ('p') executes, and sed finally exits.
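Here is the loop run on the sample data as a sketch. It assumes GNU sed and pipes the data in. The 'tr -s' and the second 'sed' are my additions, purely for display: they squeeze repeated spaces and drop trailing ones, because the exact number of spaces left by each pass is not the point here.

```shell
data='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# Pass 1 pulls 'Wafer' up to the first line, pass 2 pulls 'Taylor' in
# front of it, pass 3 finds nothing, so 't' falls through to 'p'.
merged=$(printf '%s\n' "$data" |
  sed -nr ':a;N;$!ba;:b;s/(([^ ]+) +[^ \n]+) *([^\n]*)(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/;tb;p' |
  tr -s ' ' | sed 's/ *$//')
echo "$merged"
# Janice Flavor Taylor Wafer
# Linda Brown
```

With this data the leftover empty lines all end up at the very end of the output, and the shell's '$(...)' drops trailing newlines, so you do not see them here. Run the sed command directly to see them.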
That's basically it. As I said at the beginning of the post, our RegEx produces some empty lines. This can be taken care of by using 's/\n+/\n/g' before we print the pattern space. There are some minor differences between this solution and the one I provided earlier; they are there to account for possible trailing spaces. As it turns out, you also do not need the global flag in the first 's' command.
One final note. My main point in my previous post was:
Don't do it this way.
Use awk instead.
The right tool for the right job can spare you some headaches.
Since I do like a good brain teaser every now and then I thought of this cumbersome sed solution.
But normally I would not have posted it.
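For comparison, this is roughly what an awk version can look like. It is a minimal sketch and not necessarily the script from my previous post. It assumes exactly two whitespace-separated fields per line and that names should come out in the order they first appear. The array names 'names' and 'order' and the counter 'n' are made up for this example.

```shell
data='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

grouped=$(printf '%s\n' "$data" | awk '
  {
    if ($1 in names) {
      names[$1] = names[$1] " " $2   # known first name: append the surname
    } else {
      names[$1] = $2                 # new first name: start its list
      order[++n] = $1                # and remember when we first saw it
    }
  }
  END { for (i = 1; i <= n; i++) print order[i], names[order[i]] }
')
echo "$grouped"
# Janice Flavor Taylor Wafer
# Linda Brown
```

No loops over the whole file, no backreferences, no cleanup of empty lines afterwards.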
I hope this clears things up a bit.
PS:
Earlier you said that you are doing a sed tutorial but you did not say which one.
To be sure that you are doing the right one, this is the tutorial to start with:
http://www.grymoire.com/Unix/Sed.html