[SOLVED] Combining lines based on key

theNbomr · 12-10-2011, 11:43 AM

So, would a clean definition of your requirement be 'no branching/looping constructs allowed'? That would make things quite a bit more challenging for most problems. I haven't inherited your background, so I won't pretend to understand how you see that as helpful. I do wonder if it isn't just a bit severe; it certainly limits one of your stated goals, being 'learn Linux'.
Now I'm going to have to actually figure out what the posted sed solution does. 8-(

--- rod.

danielbmartin · 12-11-2011, 11:21 AM

Quote:

Originally Posted by theNbomr

So, would a clean definition of your requirement be 'no branching/looping constructs allowed'?

Let's say "preferred" rather than "required." I may pose a question (such as the first post in this thread) to which I already have a Rexx solution which uses loops. Therefore the question is not, "how can this be done?" It is, "how may this be done with sed or grep?"

Quote:

Originally Posted by theNbomr

That would make things quite a bit more challenging for most problems.

For some problems, anyway. Sometimes I ask for advice thinking "there might be a clever option which does this but I haven't sussed it out of the manual." I *never* post a question without having first made a sincere effort to solve on my own.

Quote:

Originally Posted by theNbomr

... it certainly limits one of your stated goals, being 'learn Linux'.

I've got to start somewhere and have chosen to start by developing a competence with sed, grep, cut, paste, sort, uniq, nl, rev, comm. Not expertise, but competence.

Quote:

Originally Posted by theNbomr

Now I'm going to have to actually figure out what the posted sed solution does.

I've been picking it apart hoping to figure it out but haven't made much progress. sed may be compared to the APL language (part of my distant past) in this respect: the function is impressive, the syntax is daunting, the code is not self-documenting, the learning curve is difficult... but once you master it, coding is fun!

crts · 12-11-2011, 07:59 PM

Hi,

I have been very busy and had no spare time for explanations until now. I finally have some time to go into the details of the 'sed' solution. I will split it up first and then rebuild the most important part step by step. I have marked the main part in bold; it is the loop that runs from ':b' to 'tb':

Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g;tb;s/\n+/\n/gp'

The other parts are not so interesting at the moment. The first part

Code:

:a N;$! ba

simply reads the whole file into its pattern-buffer. The last substitution command

Code:

s/\n+/\n/gp

replaces every run of consecutive newlines with a single newline. This is needed because the bold part leaves empty lines behind, which we do not want.
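Both helper pieces can be tried on their own. The following is only a small sketch, assuming GNU sed; the input (two words separated by empty lines) is made up for the demonstration, and the label is written ':a;N' with semicolons, a form GNU sed also accepts.

```shell
# Made-up input: "one", two empty lines, "two"
# Slurp the whole input, then make the embedded newlines visible as '|'
joined=$(printf 'one\n\n\ntwo\n' | sed -nr ':a;N;$!ba;s/\n/|/gp')
printf '%s\n' "$joined"      # one|||two

# Slurp the whole input, then squeeze each run of newlines into one
squeezed=$(printf 'one\n\n\ntwo\n' | sed -nr ':a;N;$!ba;s/\n+/\n/gp')
printf '%s\n' "$squeezed"    # "one" and "two" on adjacent lines
```

The first command shows that after the ':a N;$! ba' loop the pattern space holds every line of the input at once, with only the final newline missing.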
Let us now try to understand how the bold part works. We will build it up step by step, using the following simplified data set:

Code:

$ cat simple-file
Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer

Now let us try to identify the first two names:

Code:

sed -nr ':a N;$! ba;:b s/([^ ]+ +[^ \n]+)/|\1|\1/p' simple-file

Notice the parentheses. They mark a group that can be back-referenced: whatever text is matched inside these parentheses is stored in a *special* buffer. The content of this buffer can be accessed by backreferences, in this case with '\1'. Try the above example to see what is stored inside '\1'. Whatever is stored in '\1' will appear between the '|' characters.
So we see that the RegEx

Code:

([^ ]+ +[^ \n]+)

will match "Janice Flavor". Hopefully it is obvious why, but I am not sure how deep your sed knowledge is at this point, so here are the details.
The first character-class

Code:

[^ ]+

matches one or more characters that are NOT a space. It must be followed by one or more spaces. The next character-class matches one or more characters that are NEITHER a space NOR a newline. This is important since 'Flavor' is followed by a newline at this point.
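To see why the newline has to be excluded, the two variants can be compared side by side. This is a sketch assuming GNU sed; the variable names are made up, the data is fed through a pipe instead of simple-file, and a trailing '.*' is added only to throw away the rest of the pattern space so that the captured group alone is printed.

```shell
names='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# Second class excludes newlines: the group stops at the end of line 1
stops_at_eol=$(printf '%s\n' "$names" | sed -nr ':a;N;$!ba;s/([^ ]+ +[^ \n]+).*/\1/p')
printf '%s\n' "$stops_at_eol"   # Janice Flavor

# Second class excludes only spaces: the group runs on into line 2
runs_over=$(printf '%s\n' "$names" | sed -nr ':a;N;$!ba;s/([^ ]+ +[^ ]+).*/\1/p')
printf '%s\n' "$runs_over"      # "Janice Flavor", then "Linda" on the next line
```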
So now we have matched 'Janice Flavor'. Our next objective is to somehow identify the *other* Janices and retrieve their second name. Remember what I said about backreferences? Any pattern that is matched inside () is stored in a *special* buffer. You have 9 of those buffers. You can access them with a backslash followed by a digit from 1 to 9 (not to be confused with the newline escape '\n'), e.g. \1 refers to the content of the first pair of parentheses and \2 to the content of the second pair.
Let us capture 'Janice' in a *special* buffer:

Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)/|\1|\2|/p' simple-file

As you see, the groups can be nested! The first (outer) pair of parentheses still holds 'Janice Flavor'. The second (inner) pair holds 'Janice' alone.
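The numbering follows the opening parentheses from left to right, which can be checked with a labelled replacement. A sketch assuming GNU sed; the 'outer=' and 'inner=' labels and the trailing '.*' (which discards the remaining lines) are additions for the demonstration only.

```shell
names='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# \1 is the outer group (opened first), \2 the inner group
result=$(printf '%s\n' "$names" |
  sed -nr ':a;N;$!ba;s/(([^ ]+) +[^ \n]+).*/outer=[\1] inner=[\2]/p')
printf '%s\n' "$result"   # outer=[Janice Flavor] inner=[Janice]
```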
Let us refine our RegEx a bit more:

Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)\n\2/|\1|\2|/p' simple-file

Notice the new '\n\2' at the end of the RegEx. Until now we have only used backreferences on the right-hand side of the substitution command, but we can also use them on the left-hand side. Now our RegEx looks for a first and a second name, followed by a newline and then the same first name again. We do not match 'Janice Flavor' anymore because she is followed by 'Linda'. 'Janice Taylor', however, is followed by 'Janice Wafer' on the next line, so our RegEx does match.
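Running this step on the sample data makes the shift of the match visible. A sketch assuming GNU sed, with the data fed through a pipe instead of simple-file:

```shell
names='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# The backreference \2 on the left-hand side demands the same first
# name at the start of the following line
result=$(printf '%s\n' "$names" |
  sed -nr ':a;N;$!ba;s/(([^ ]+) +[^ \n]+)\n\2/|\1|\2|/p')
printf '%s\n' "$result"
# Janice Flavor
# Linda Brown
# |Janice Taylor|Janice| Wafer
```

The first two lines pass through untouched; the match starts at 'Janice Taylor' and swallows the newline plus the second 'Janice', which is why ' Wafer' ends up on the same line.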
When we substitute we do not need the back-reference \2, since 'Janice' is already part of \1. It would be nice if we could obtain 'Wafer' as well. Once again we use another group () that we can back-reference:

Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)\n\2 +([^ \n]+)/|\1|\3|/p' simple-file

After we have matched 'Janice' there can be one or more spaces before 'Wafer'. We match 'Wafer' itself by matching any characters that are NEITHER a newline NOR a space. We exclude the space in order to accommodate possible trailing spaces. Our first pair of parentheses matches 'Janice Taylor' and the third pair matches 'Wafer'. Those are our substitutes.
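The effect of excluding the space can be checked with an input that really has trailing spaces. A sketch assuming GNU sed; the two-line input with three spaces after 'Wafer' is made up for the test.

```shell
# Made-up input: 'Wafer' is followed by three trailing spaces
result=$(printf 'Janice Taylor\nJanice Wafer   \n' |
  sed -nr ':a;N;$!ba;s/(([^ ]+) +[^ \n]+)\n\2 +([^ \n]+)/|\1|\3|/p')

# Square brackets make the leftover spaces visible
printf '[%s]\n' "$result"   # [|Janice Taylor|Wafer|   ]
```

The third group holds 'Wafer' only; the trailing spaces stay outside the match and are left where they were.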

Now let us see if we can work around the interfering 'Linda'. We want 'Janice Flavor' as our first match. 'Flavor' can be followed by any characters, which includes 'Linda Brown' and some newlines, until we meet 'Janice' again in the third line. So let us add '.*\n' after our first pair of parentheses to account for that:

Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+).*\n\2 +([^ \n]+)/|\1|\3|/p' simple-file

It finally gets interesting! Notice that you do NOT match 'Taylor' with your third group. RegExes are GREEDY, i.e. '.*\n\2' will look for the longest possible match! And that is

Code:

\nLinda Brown\nJanice Taylor\nJanice

So the third group will match 'Wafer'. We are getting closer to our goal.
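The greedy behaviour is easy to confirm on the sample data. A sketch assuming GNU sed, with the data fed through a pipe instead of simple-file:

```shell
names='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

# '.*' is greedy: it runs up to the LAST line that starts with the
# first name, so the third group captures 'Wafer', not 'Taylor'
result=$(printf '%s\n' "$names" |
  sed -nr ':a;N;$!ba;s/(([^ ]+) +[^ \n]+).*\n\2 +([^ \n]+)/|\1|\3|/p')
printf '%s\n' "$result"   # |Janice Flavor|Wafer|
```

Since the match now covers the whole pattern space, everything in between (including 'Linda Brown') disappears from the output at this stage.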
Our next step is to preserve 'Linda' and basically everything that has been matched by '.*\n'. Yes, once more we use another group that we can backreference:

Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)(.*\n)\2 +([^ \n]+)/\1 \4 \3/p' simple-file
                                           ^ 3. br   ^ 4. br

Notice that 'Wafer' is now matched by the 4th group and therefore must be back-referenced by \4. \3 holds our previously lost information. I also no longer use the '|' on the RHS as a visual aid, since they would interfere in the next step if we kept them.
We still need to get 'Taylor' between 'Flavor' and 'Wafer'. To do that we will extend our RegEx to match 'Janice Flavor' and anything else that follows on that same line:

Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^\n]*)(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/;tb;p' simple-file

Two things happen here. We use ' *([^\n]*)' to match anything after 'Janice Flavor' on the same line. The '*' quantifier matches zero or more occurrences of the pattern, so while 'Janice Flavor' is still alone on the first line the additional pattern matches nothing. Once the first run of the 's' command has added 'Wafer' to that line, the additional pattern matches 'Wafer' on the next run. Also notice that our back-references have shifted again.
In order to force the 's' command to execute again we use the conditional jump command 't'. It jumps back to the label ':b' only if the previous 's' command has changed the pattern space. If our RegEx does not find any more matches, the 't' command does not jump, the print command ('p') executes and sed finally exits.
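The ':b ... tb' idiom is easier to watch on a much smaller problem. The following sketch is not part of the solution; it uses a made-up input and a deliberately non-global 's' command so that the loop has to run several times.

```shell
# Made-up input: 'a' and 'b' separated by four spaces.
# Each pass replaces ONE pair of spaces by a single space; 'tb' jumps
# back to ':b' as long as the last 's' changed something.
result=$(printf 'a    b\n' | sed ':b;s/  / /;tb')
printf '%s\n' "$result"   # a b
```

Three passes shrink the gap from four spaces to one; the fourth 's' finds no pair of spaces, so 't' does not jump and sed prints the line.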
That's basically it. As I said at the beginning of the post, our RegEx produces some empty lines. This can be taken care of by using

Code:

s/\n+/\n/g

before we print the pattern space. There are some minor differences between this solution and the one I provided earlier. This is to account for possible trailing spaces. As it turns out, you also do not need the global flag in the first 's' command.
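Putting the pieces together gives something like the following sketch. It assumes GNU sed, feeds the sample data through a pipe instead of simple-file, and adds a second sed call (not part of the solution above) that only tidies the result: it squeezes repeated spaces, trims trailing spaces and drops any empty line that is left at the end.

```shell
names='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

result=$(printf '%s\n' "$names" |
  # slurp, loop the main substitution, squeeze newlines, print
  sed -nr ':a;N;$!ba;:b;s/(([^ ]+) +[^ \n]+) *([^\n]*)(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/;tb;s/\n+/\n/g;p' |
  # cosmetic cleanup only
  sed -r 's/ +/ /g;s/ +$//;/^$/d')
printf '%s\n' "$result"
# Janice Flavor Taylor Wafer
# Linda Brown
```

Because each pass picks up the last remaining line for the key and inserts it directly after the first second name, the names end up in their original order.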

One final note. My main point in my previous post was:
Don't do it this way.
Use awk instead.
The right tool for the right job can spare you some headache
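For comparison, an awk version could look roughly like this. It is only a sketch under assumptions: every line has exactly two whitespace-separated fields (key first), and keys should come out in the order they first appear. The array names 'order' and 'vals' are made up.

```shell
names='Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer'

result=$(printf '%s\n' "$names" | awk '
  # first time this key is seen: remember its position and value
  !($1 in vals) { order[++n] = $1; vals[$1] = $2; next }
  # key seen before: append the second field
  { vals[$1] = vals[$1] " " $2 }
  END { for (i = 1; i <= n; i++) print order[i], vals[order[i]] }
')
printf '%s\n' "$result"
# Janice Flavor Taylor Wafer
# Linda Brown
```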

Since I do like a good brain teaser every now and then I thought of this cumbersome sed solution.
But normally I would not have posted it.

I hope this clears things up a bit.

PS:
Earlier you said that you are doing a sed tutorial but you did not say which one.
To be sure that you are doing the right one, this is the tutorial to start with:
http://www.grymoire.com/Unix/Sed.html

danielbmartin · 12-12-2011, 09:08 AM

Quote:

Originally Posted by crts

I finally have some time to go into some details of the 'sed' solution...

Wow! Thank you for this detailed breakdown. It illustrates that sed has multiple levels of functionality comparable to the frequently-referenced layers of an onion. The sed you constructed and explained introduced me to layers I'd never known. That's great!

Daniel B. Martin

timetraveler · 12-12-2011, 02:47 PM

Quote:

Originally Posted by David the H.

But do all of them have it installed by default?

I can't think of one that does not, can you?

Quote:

Originally Posted by David the H.

Can you walk up to any random Linux computer and be certain that your perl script will run on it?

Yes, see above.

Your line of thinking used to apply a long time ago but no longer.