LinuxQuestions.org


danielbmartin 12-06-2011 09:57 AM

Combining lines based on key
 
In this contrived example the key field is the first name.

Input file:
Doris Fletcher
Jane Baker
Jane Simmons
Janice Taylor
Linda Archer
Linda Brown
Linda Green
Mary Carter

Desired output file:
Doris Fletcher
Jane Baker Simmons
Janice Taylor
Linda Archer Brown Green
Mary Carter

I am improving self-written REXX programs by replacing REXX code with Linux commands. This provides several benefits:
- more concise programs
- shorter execution times
- learn Linux (learn by doing)

The desired function is already working in REXX, so an awk or Perl solution is not sought. I hope to find a Linux command (or combination of commands) which does this task.

Please advise.

Daniel B. Martin

TB0ne 12-06-2011 10:41 AM

Quote:

Originally Posted by danielbmartin (Post 4543491)
In this contrived example the key field is the first name.

Input file:
Doris Fletcher
Jane Baker
Jane Simmons
Janice Taylor
Linda Archer
Linda Brown
Linda Green
Mary Carter

Desired output file:
Doris Fletcher
Jane Baker Simmons
Janice Taylor
Linda Archer Brown Green
Mary Carter

I am improving self-written REXX programs by replacing REXX code with Linux commands. This provides several benefits:
- more concise programs
- shorter execution times
- learn Linux (learn by doing)

The desired function is already working in REXX, so an awk or Perl solution is not sought. I hope to find a Linux command (or combination of commands) which does this task.

Without using awk or writing a shell script, I'm not sure how you'd do it. I'd use awk to break each line up into two variables (FN and LN), compare the FN with the previous value, and if it's the same, output LN on the same line. If it's NOT the same, start a new line and put both on it.
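For readers following along, the compare-with-the-previous-key idea described above might look roughly like the sketch below. It assumes the input is already sorted (or at least grouped) by first name and that every line has exactly two fields. The function name combine_adjacent is made up for illustration.

```shell
# combine_adjacent: hypothetical helper name, not from the thread.
# Assumes the input is grouped by first field and has two fields per line.
combine_adjacent() {
    awk '
        # Same key as the previous line: append the last name to the open line.
        NR > 1 && $1 == prev { printf " %s", $2; next }
        # New key: close the previous output line (if any) and start a new one.
        { if (NR > 1) printf "\n"; printf "%s %s", $1, $2; prev = $1 }
        # Terminate the final output line.
        END { if (NR > 0) printf "\n" }
    ' "$@"
}

printf '%s\n' 'Doris Fletcher' 'Jane Baker' 'Jane Simmons' 'Janice Taylor' \
    'Linda Archer' 'Linda Brown' 'Linda Green' 'Mary Carter' | combine_adjacent
```

Because only the previous key is remembered, unsorted input would leave repeated keys on separate lines; piping through sort first avoids that.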

Since you want to 'learn by doing', reference the shell scripting tutorial at http://tldp.org/LDP/abs/html/. Also, when asking for advice, it's probably best to avoid telling people what you don't want to hear, since we're all just trying to help each other. Perl could probably do this with a one-liner, and (if not), the code would be VERY tight and fast.

danielbmartin 12-06-2011 12:30 PM

Quote:

Originally Posted by TB0ne (Post 4543533)
Also, when asking for advice, it's probably best to avoid telling people what you don't want to hear, since we're all just trying to help each other.

Telling people what I don't want to hear is intended as a courtesy to the reader. Otherwise he may devote time to creating a solution which won't be used. That annoys the person who was "just trying to help."

Daniel B. Martin

TB0ne 12-06-2011 12:52 PM

Quote:

Originally Posted by danielbmartin (Post 4543614)
Telling people what I don't want to hear is intended as a courtesy to the reader. Otherwise he may devote time to creating a solution which won't be used. That annoys the person who was "just trying to help."

Only if you come back and post, saying "I didn't use your solution, because it wasn't exactly what I wanted". And since you're doing this to 'learn by doing', none of us here are going to create your solution, since that would (obviously), defeat the purpose of you learning anything. Ruling out obvious solutions would tend to indicate a homework-assignment.

Perl was created exactly for such things. You wanted Linux commands to do this...awk would be it, since it would split the line based on whatever field delimiter you see fit, in this case, a space. Since you have the means to assign the first/last name fields to variables, and you've already GOT working logic, it should be simple for you to use these things (along with the bash tutorial), to get done what you'd like. A bash script would be Linux commands, so it would seem your original query has been answered.

danielbmartin 12-06-2011 01:18 PM

Quote:

Originally Posted by TB0ne (Post 4543627)
Ruling out obvious solutions would tend to indicate a homework-assignment.

I assure you, this is *not* homework! I am well into retirement (17 years, now) and dabble in programming as a hobby, hoping to keep my brain from atrophying. Any LQ member who has lingering doubts is invited to contact me off-forum. I will respond with details about my employment history, detail which should convince you that I am in compliance with LQ forum rules.

TB0ne 12-06-2011 01:58 PM

Quote:

Originally Posted by danielbmartin (Post 4543642)
I assure you, this is *not* homework! I am well into retirement (17 years, now) and dabble in programming as a hobby, hoping to keep my brain from atrophying. Any LQ member who has lingering doubts is invited to contact me off-forum. I will respond with details about my employment history, detail which should convince you that I am in compliance with LQ forum rules.

Not really needed, but the phrasing of your question and conditions set forth does tend to point in the 'homework' direction.

Regardless...the awk command is what you need to easily do this. Cut can also be used, and you've got man pages for both. These commands/man pages plus the scripting guide should be all you need.

David the H. 12-06-2011 03:04 PM

From what I see, you appear to be assuming that there is an adequate solution for your problem that doesn't use awk or perl. You also don't seem to recognize that awk is one of the core utilities found by default on all *nix boxes and is used ubiquitously in scripting.

Indeed, awk is exactly what any linux/unix user would tell you to use first off, because your request is exactly the kind of thing that it excels at above all other unix tools. As it stands, the three solutions I would suggest are an awk script, a perl script, or a bash script, probably in that order (although I'm most proficient at bash personally and would probably start with that myself).

Whichever language is used, I believe the simplest solution is to populate an associative array/hash with the first field as the index string, and then tack the second field onto that entry as subsequent hits are made. Then you can simply follow up by printing out the whole array at the end.
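As a rough illustration of the associative-array approach just described, here is one way it could be sketched in awk. The function name combine_by_key and the array names (names, order) are invented for this example. The extra order array is an assumption added so the output keeps the order in which keys first appear, since awk does not define the traversal order of "for (k in array)".

```shell
# combine_by_key: hypothetical helper name, not from the thread.
combine_by_key() {
    awk '
        # First time this key is seen: remember its position.
        !($1 in names) { order[++n] = $1; names[$1] = "" }
        # Tack the second field onto the entry for this key.
        { names[$1] = names[$1] " " $2 }
        # Print every key with its collected values, in first-seen order.
        END { for (i = 1; i <= n; i++) print order[i] names[order[i]] }
    ' "$@"
}

printf '%s\n' 'Janice Flavor' 'Doris Fletcher' 'Janice Taylor' | combine_by_key
```

Unlike a compare-with-previous-line loop, this does not need sorted input, at the cost of holding every key in memory until the end.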

Other than that, none of the other commonly-available tools will do exactly what you want, although it might be possible to cobble together a working solution by chaining together multiple commands. But why bother when we have awk at hand? Of course there may also be some lesser-known tool floating around that does exactly this, but you'd be just as likely to find them on your own as me, if you tried searching for them.

Tinkster 12-06-2011 07:19 PM

Moved: This thread is more suitable in <PROGRAMMING> and has been moved accordingly to help your thread/question get the exposure it deserves.


And I have a strong feeling of deja-vu :} reading this thread.

If you're on bash4 you're lucky, because you can use the first
column as the subscript for an array (older bash versions only
allow numeric subscripts). Your reluctance to utilise awk still
baffles me; it's not like using awk on Linux is that different
from using REXX on zOS, OS/2 or even the Amiga. It's there,
it's free, does what you ask, and does it quickly (and easily).
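A minimal sketch of the bash 4 idea mentioned above, for the curious. It assumes a bash of version 4 or later is installed and found on the PATH. The wrapper name combine_bash4 is made up, and the inner script is run through "bash -c" so that it uses bash even when the calling shell is something else.

```shell
# combine_bash4: hypothetical helper name; requires bash >= 4 on the PATH.
combine_bash4() {
    bash -c '
        declare -A names                 # associative array (bash 4 and later)
        while read -r key val; do
            names[$key]+=" $val"         # append the rest of the line to this key
        done
        for key in "${!names[@]}"; do
            printf "%s%s\n" "$key" "${names[$key]}"
        done
    ' | sort                             # key order is unspecified, so sort the output
}

printf '%s\n' 'Linda Green' 'Jane Baker' 'Linda Archer' | combine_bash4
```

The values for each key stay in input order; only the keys themselves are sorted by the final sort.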


Cheers,
Tink

crts 12-06-2011 11:11 PM

Hi,

is 'sed' a viable alternative?
Code:

$ cat file
Janice Flavor
Doris Fletcher
Jane Baker
Jane Simmons
Janice Taylor
Linda Archer
Linda Brown
Janice Wafer
Linda Green
Janice Joice
Mary Carter

$ sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g;tb;s/\n+/\n/gp' file
Janice Flavor Taylor Wafer Joice
Doris Fletcher
Jane Baker Simmons 
Linda Archer Brown Green
Mary Carter

It will produce the desired output even if the input is unsorted.

I am actually not really serious about doing such tasks with 'sed'. As others have already pointed out, 'awk' is far more appropriate for this kind of thing.

danielbmartin 12-07-2011 07:09 AM

Quote:

Originally Posted by crts (Post 4543941)
is 'sed' a viable alternative?
It will produce the desired output even if the input is unsorted.

I am actually not really serious about doing such tasks with 'sed'. As others have already pointed out, 'awk' is far more appropriate for this kind of thing.

Wow! This is the type of solution I asked for but I may have bitten off more than I can chew. As a Linux newbie I have used sed but only timidly, being awed by the power of this command. Please give an overview explanation of your code. Guided by this, I will tiptoe through the manual to get a better understanding. Perhaps this experience will overcome my reluctance to delve into awk. Thank you, thank you!

catkin 12-07-2011 07:24 AM

Quote:

Originally Posted by danielbmartin (Post 4543491)
I hope to find a Linux command (or combination of commands) which does this task.

Hello Daniel :)

Good to learn you are still going with this project, especially as I have fond memories of ReXX from VM/CMS days and partly wrote (not finished) a ReXX interpreter on UNIX as an exercise to learn C, UNIX and emacs.

I was going to ask if you regarded a bash script as a "combination of commands" but crts' sed fulfils your "a command" criterion.

Incidentally I find awk a lot easier than sed because it's more of a programming language -- especially if you do everything in the BEGIN section and use getline to read all the lines instead of using awk's pattern matching! :D

@crts: that's great :)

danielbmartin 12-07-2011 10:18 AM

Quote:

Originally Posted by catkin (Post 4544207)
... I have fond memories of ReXX from VM/CMS days ...

As do I. Seventeen years ago I retired after a long career as a mainframe engineer/programmer working for a major computer manufacturer. Knew nothing of PCs, nothing of Linux. During my working years I became proficient with REXX and CMS Pipelines.

Two years ago I installed Ubuntu at the recommendation of a friend. I was enchanted by the similarity of Linux commands to CMS Pipelines. I've made a choice to write code using Linux commands (those few which I have learned) in a style which is frankly imitative of CMS Pipelines. This includes an abhorrence of explicit loops. Someday I may depart from this style, but for the time being I am not using Bash or Perl or awk.

Cedrik 12-07-2011 11:02 AM

Quote:

Originally Posted by crts (Post 4543941)
Code:

$ sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g;tb;s/\n+/\n/gp' file

I quit ! :eek: :p

Nominal Animal 12-07-2011 02:08 PM

Although the OP is not interested in awk solutions, I would personally use a combination of awk and sort in Linux:
Code:

awk '{ for (i = 2; i <= NF; i++) list[$1] = list[$1] " " $i } END { for (i in list) printf("%s%s\n", i, list[i]) }' file | sort
The final sort is needed because the list traversal order is undefined. The input does not need to be sorted, but the output might be unsorted. This should work well in any awk variant available in Linux (gawk, mawk).

On an embedded linux there might not be any awk available, so I would first sort the input, then combine consecutive lines using a simple POSIX shell loop:
Code:

sort file | sh -c '
    currkey=""
    currval=""
    while read key val ; do
        if [ "$key" = "$currkey" ]; then
            currval="$currval $val"
        else
            [ -n "$currkey$currval" ] && echo "$currkey $currval"
            currkey="$key"
            currval="$val"
        fi
    done
    [ -n "$currkey$currval" ] && echo "$currkey $currval"
'

or, written as a standalone utility script,
Code:

#!/bin/sh
if [ $# -lt 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 [ -h | --help ]"
    echo "      $0 file(s)..."
    echo ""
    echo "This script will combine all records with the same initial field."
    echo "Duplicates are not removed. The input is considered unsorted."
    echo "The output is always sorted."
    echo ""
    exit 0
fi
sort "$@" | (
  currkey=""
  currval=""
  while read key val ; do
      if [ ":$key" = ":$currkey" ]; then
          currval="$currval $val"
      else
          [ -n "$currkey$currval" ] && echo "$currkey $currval"
          currkey="$key"
          currval="$val"
      fi
  done
  [ -n "$currkey$currval" ] && echo "$currkey $currval"
)

It is pretty common nowadays to use dash (a POSIX shell) instead of Bash, when resources are tight (or the script is simple and minimal execution time is desired; dash loads faster than bash). For example, most initial ramdisks used by Linux distributions use shell scripts written for dash. Some Linux distributions may still have sh symlinked to bash, however, so it might be prudent to specify dash explicitly instead of just using generic sh.

timetraveler 12-07-2011 07:18 PM

Well you have to use some shell to run sed, so do you mean you won't use a bash shell?
Just curious which shell meets your requirements.

Can you post your Rexx code to perform this task?

danielbmartin 12-07-2011 09:33 PM

Quote:

Originally Posted by timetraveler (Post 4544686)
Can you post your Rexx code to perform this task?

Code:


RecsWritten = 0
Key = subword(InRec.1,01,01)
OutRec = InRec.1
do j = 2 to InRec.0
  NextKey = subword(InRec.j,01,01)
  if NextKey = Key then
    do
      OutRec = OutRec subword(InRec.j,02,01)  /* Extend the output record */
    end
                  else
    do
      rc = LineOut(OutFile,OutRec)          /* Write completed record  */
      RecsWritten = RecsWritten + 1
      OutRec = InRec.j
      Key = NextKey
    end
end j
rc = LineOut(OutFile,OutRec)  /* Flush buffer  */
RecsWritten = RecsWritten + 1


danielbmartin 12-07-2011 09:46 PM

Quote:

Originally Posted by timetraveler (Post 4544686)
Well you have to use some shell to run sed, so do you mean you won't use a bash shell?

This thread was originally posted on the Newbie forum and moved to the Programming forum by a moderator. I am a newbie. I am so uneducated in Linux that I don't even know what a shell is, so it's difficult to answer your question. As stated earlier in this thread, I have a number of self-written REXX programs. I'm improving them by substituting small chunks of Linux commands for large chunks of REXX code. Does this mean my REXX program is a shell? It is still a REXX program which executes by invoking the REGINA interpreter.

theNbomr 12-07-2011 10:44 PM

I won't dispute the OP's wish to preclude Perl and AWK solutions (although the problem clearly wants a solution that uses associative arrays), however I am curious how Perl solutions are seen as something other than Linux, while a sed solution is not. Hard to interpret the requirements based on any logic I can derive from that.

--- rod.

timetraveler 12-08-2011 12:52 AM

Quote:

Originally Posted by danielbmartin
Code:


RecsWritten = 0
... rexx code ...


Sure I could have looked up rexx code but this way I can connect the dots between rexx and the other examples given here. Thanks for sharing.

Here's a linux command to do same:
(perl one-liners count as linux commands, but you don't have to use them)

perl -lane '$n{$F[0]} = $n{$F[0]} . " $F[1]";if(eof){print "$_$n{$_}" for keys %n}' names

Linda Archer Brown Green
Jane Baker Simmons
Mary Carter
Janice Taylor
Doris Fletcher

If there's a good rexx compiler/interpreter on linux then rexx counts too. That's the gnu/linux way and that's a huge part of the gnu/linux attraction, for many. Lots of choices.

For some, bash shell programming is their favorite system programming language. Others like Python, others Perl, etc. Shell programming implies awk, sed, tail, head, cut, etc.
Perl and Python (and others) can do those things natively. Limit yourself or don't limit yourself, gnu/linux lets you have it any way you want.

timetraveler 12-08-2011 01:03 AM

Quote:

Originally Posted by danielbmartin
This thread was originally posted on the Newbie forum and moved to the Programming forum by a moderator. I am a newbie. I am so uneducated in Linux that I don't even know what a shell is, so it's difficult to answer your question. As stated earlier in this thread, I have a number of self-written REXX programs. I'm improving them by substituting small chunks of Linux commands for large chunks of REXX code. Does this mean my REXX program is a shell? It is still a REXX program which executes by invoking the REGINA interpreter.

I don't know Rexx but I'm not convinced you're improving on them if they already worked. It seems to me that you're using your knowledge of Rexx as a launch pad and bridge to learning more about gnu/linux. That's a great way to do it and you're probably already seeing plenty of similarities.

The shell is the software that gives you a gnu/linux command line. There are several shells around. Bash is probably the most common. There is also tcsh, csh, zsh, ksh and others. Sed, awk, etc. are separate and distinct from the shell but are run from a shell.

The Regina interpreter might be a shell but I don't know. It probably is not but instead is run from a shell. Most likely you are using bash inside your terminal program.

Main thing is to have some fun exploring gnu/linux and use whatever tools you like.

David the H. 12-08-2011 01:14 AM

A shell is a command-line interpreter, an interface into your system. Shells also generally have their own scripting language and can act as interpreters for scripts written in it. I'm not too familiar with REXX, but I believe it's more of a stand-alone interpreted language that can be easily used for scripting tasks. I don't know if it offers a shell interface per se, but in the end much of the functionality is probably very similar.


@theNbomr: sed and awk are core programs found in all *nix systems, as (I believe) specified by POSIX. Perl, OTOH, is an optional, multi-platform language, and can't be guaranteed to exist on any given system. So unlike the first two I can understand eliminating it as not a specifically "linux" solution.

danielbmartin 12-08-2011 09:48 AM

Quote:

Originally Posted by timetraveler (Post 4544860)
It seems to me that you're using your knowledge of Rexx as a launch pad and bridge to learning more about gnu/linux.

Yes. Moreover, the similarities between Linux commands and CMS Pipelines provide motivation.

Quote:

Originally Posted by timetraveler (Post 4544860)
I don't know Rexx but I'm not convinced you're improving on them if they already worked.

Execution time was not the original subject of this thread but it is worth mentioning. This is an example.

I had a self-written REXX program which operates on the voter registration list from the county where I live. (This file is a public record, readily downloadable by anyone.) The program sifts the data, slices and dices it, sorts, reformats, etc. This program worked, i.e. it generated the desired result. Then I discovered that I could replace large chunks of REXX code with smaller chunks of Linux commands.

Now we get to the punch line, and that is execution time.
Same input file, 500,000+ records.
Same output file, 220,000+ records.
Execution time for the original REXX-only version: 9+ hours (an overnight run).
Execution time for the new mixed REXX+Linux version: 1 minute.

A breathtaking improvement! As a consequence, execution time for this program is now of small concern. It is still a REXX program but now the Linux commands do all the heavy lifting.

This dramatic reduction in execution time provides the motivation to learn more Linux commands and rework more of my REXX programs.

Daniel B. Martin

timetraveler 12-08-2011 08:13 PM

Quote:

Originally Posted by danielbmartin

....execution time.
Same input file, 500,000+ records.
Same output file, 220,000+ records.
Execution time for the original REXX-only version: 9+ hours (an overnight run).
Execution time for the new mixed REXX+Linux version: 1 minute.

Nice improvement. Your gnu/linux exploration started paying dividends right away it seems.

timetraveler 12-08-2011 08:19 PM

Quote:

Originally Posted by David the H.
....perl, OTOH, is an optional, multi-platform language, and can't be guaranteed to exist on any given system. So unlike the first two I can understand eliminating it as not a specifically "linux" solution.

Can you name one linux distro that doesn't contain perl? I can't think of one. But if the OP doesn't want to try perl it's his choice. Linux is all about choices.

Reuti 12-09-2011 06:29 AM

Oh, REXX - I used it at the time my employer changed from EXEC2 to it. Besides Regina there is also ooREXX which was open sourced by IBM several years ago and it includes a compiler. Maybe the original REXX script can execute faster in precompiled form too.

David the H. 12-09-2011 09:29 AM

Quote:

Originally Posted by timetraveler (Post 4545585)
Can you name one linux distro that doesn't contain perl? I can't think of one. But if the OP doesn't want to try perl it's his choice. Linux is all about choices.

Pretty much every distribution includes perl in their repositories, sure. But do all of them have it installed by default? Can you walk up to any random Linux computer and be certain that your perl script will run on it?

Full agreement here on the second point. :cool:

ntubski 12-09-2011 10:00 AM

Quote:

Originally Posted by danielbmartin (Post 4545204)
Now we get to the punch line, and that is execution time.
Same input file, 500,000+ records.
Same output file, 220,000+ records.
Execution time for the original REXX-only version: 9+ hours (an overnight run).
Execution time for the new mixed REXX+Linux version: 1 minute.

Seems excessive, did you use some bad algorithms (eg bubblesort) in the REXX-only version?

theNbomr 12-09-2011 10:19 AM

That is a factor of ~500 in speed. The older hardware supporting REXX (I assume) could easily account for all of that difference. But without knowing anything about the hardware, it is hard to make any realistic comparison. However, just having an OS that will run on commodity hardware must be a good thing.
I think it would be a rare distro that doesn't include Perl out of the box. I'm not sure the OP meant to limit the discussion to POSIX-only distros, and as I stated earlier I can respect his wish for 'Linux-only' solutions. I just don't know how he defines that, either conceptually, or by some defined standard.

--- rod.

danielbmartin 12-10-2011 09:50 AM

Quote:

Originally Posted by ntubski (Post 4545939)
Seems excessive, did you use some bad algorithms (eg bubblesort) in the REXX-only version?

Bubble sort is awful; I used a QuickSort routine.

The long execution time of the REXX-only version may be attributed to:
1) Regina is interpreted, not compiled, and sorting large files takes a long time.
2) Regina I/O is painfully slow compared to Linux.

danielbmartin 12-10-2011 09:59 AM

Quote:

Originally Posted by theNbomr (Post 4545963)
... I can respect his wish for 'Linux-only' solutions. I just don't know how he defines that, either conceptually, or by some defined standard.

Perhaps I'm using incorrect terminology if I say "Linux-only." Allow me to clarify by repeating part of a previous post in this thread.

Seventeen years ago I retired from a mainframe engineer/programmer job. During my working years I became proficient with REXX and CMS Pipelines.

Two years ago I installed Ubuntu on my home PC. (Good-bye Microsoft! Good-bye forever!!) I was enchanted by the similarity of Linux commands to CMS Pipelines. I've made a choice to write code using Linux commands (those few which I have learned) in a style which is frankly imitative of CMS Pipelines. This includes an abhorrence of explicit loops. Someday I may depart from this style, but for the time being I am not using Bash or Perl or awk.

theNbomr 12-10-2011 11:43 AM

So, would a clean definition of your requirement be 'no branching/looping constructs allowed'? That would make things quite a bit more challenging for most problems. I haven't inherited your background, so I won't pretend to understand how you see that as helpful. I do wonder if it isn't just a bit severe; it certainly limits one of your stated goals, being 'learn Linux'.
Now I'm going to have to actually figure out what the posted sed solution does. 8-(

--- rod.

danielbmartin 12-11-2011 11:21 AM

Quote:

Originally Posted by theNbomr (Post 4546641)
So, would a clean definition of your requirement be 'no branching/looping constructs allowed'?

Let's say "preferred" rather than "required." I may pose a question (such as the first post in this thread) to which I already have a Rexx solution which uses loops. Therefore the question is not, "how can this be done?" It is, "how may this be done with sed or grep?"

Quote:

Originally Posted by theNbomr (Post 4546641)
That would make things quite a bit more challenging for most problems.

For some problems, anyway. Sometimes I ask for advice thinking "there might be a clever option which does this but I haven't sussed it out of the manual." I *never* post a question without having first made a sincere effort to solve on my own.

Quote:

Originally Posted by theNbomr (Post 4546641)
... it certainly limits one of your stated goals, being 'learn Linux'.

I've got to start somewhere and have chosen to start by developing a competence with sed, grep, cut, paste, sort, uniq, nl, rev, comm. Not expertise, but competence.


Quote:

Originally Posted by theNbomr (Post 4546641)
Now I'm going to have to actually figure out what the posted sed solution does.

I've been picking it apart hoping to figure it out but haven't made much progress. sed may be compared to the APL language (part of my distant past) in this respect: the function is impressive, the syntax is daunting, the code is not self-documenting, the learning curve is difficult... but once you master it, coding is fun!

crts 12-11-2011 07:59 PM

Hi,

I have been very busy and did not find any spare time to deal with explanations. I finally have some time to go into some details of the 'sed' solution. I will split it up first and then rebuild the most important part step by step. The main part is the loop in the middle, from ':b' up to and including 'tb':
Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g;tb;s/\n+/\n/gp'
The other parts are not so interesting at the moment. The first part
Code:

:a N;$! ba
simply reads the whole file into its pattern-buffer. The last substitution command
Code:

s/\n+/\n/gp
replaces multiple, consecutive newlines with just one newline. That is because the main part will produce empty lines which we do not want.
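To see these two outer pieces in isolation, a small sketch like the following may help. It assumes GNU sed (for the -r option and for '\n' inside the expression). The function name squeeze_newlines is made up for this example.

```shell
# squeeze_newlines: hypothetical helper name, not part of the original one-liner.
squeeze_newlines() {
    # ':a N;$! ba' keeps appending the next line until the last line is reached,
    # so the whole input ends up in the pattern space.
    # 's/\n+/\n/gp' then collapses every run of newlines into one and prints.
    sed -nr ':a N;$! ba;s/\n+/\n/gp'
}

printf 'first\nsecond\n\n\nthird\n' | squeeze_newlines
```

The two empty lines between 'second' and 'third' should disappear, leaving three lines of output.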
Let us now try to understand how the main part works. We will build it up step by step. Therefore we will use the following simplified data set:
Code:

$ cat simple-file
Janice Flavor
Linda Brown
Janice Taylor
Janice Wafer

Now let us try to identify the first two names:
Code:

sed -nr ':a N;$! ba;:b s/([^ ]+ +[^ \n]+)/|\1|\1/p' simple-file
Notice the parentheses. They mark a group that can be back-referenced. That means whatever is matched inside these parentheses will be stored in a *special* buffer. The content of this buffer can be accessed by backreferences, in this case with '\1'. Try the above example to see what is stored inside '\1'. Whatever is stored in '\1' will appear between '|'.
So we see that the RegEx
Code:

([^ ]+ +[^ \n]+)
will match "Janice Flavor". Hopefully it is obvious why; I am not sure how deep your sed knowledge is at this point.
The first character-class
Code:

[^ ]+
matches one or more characters that are NOT space. Then it should be followed by at least one (or more) space(s). The next character-class will match at least one or more characters that are NEITHER space NOR newlines. This is important since 'Flavor' is followed by a newline at this point.
So now we have matched 'Janice Flavor'. Our next objective is to somehow identify the *other* Janices and retrieve their second name. Remember what I said about backreferences? Any pattern that is matched inside () is stored in a *special* buffer. You have 9 of those buffers. You can access them with a backslash followed by a digit from 1 to 9, e.g. \1 refers to the content matched inside the first pair of parentheses and \2 refers to the content of the second pair.
Let us capture 'Janice' in a *special* buffer:
Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)/|\1|\2|/p'
As you see, the groups can be nested! The outer pair of parentheses still holds 'Janice Flavor'. The inner pair holds 'Janice' alone.
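A quick way to check the nesting on a single line, without reading a whole file into the pattern space, is sketched below. It assumes GNU sed. The function name show_groups is invented for illustration.

```shell
# show_groups: hypothetical helper name, used only for this illustration.
show_groups() {
    # \1 is the outer group (first and second name), \2 the inner group (first name).
    sed -r 's/(([^ ]+) +[^ \n]+)/|\1|\2|/'
}

echo 'Janice Flavor' | show_groups
```

This should print |Janice Flavor|Janice|, showing that groups are numbered by the position of their opening parenthesis.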
Let us refine our RegEx a bit more:
Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)\n\2/|\1|\2|/p'
Notice the '\n\2' at the end of the pattern. Until now we have only used backreferences on the right-hand side of the substitution command. But we can also use them on the left-hand side. Now our RegEx looks for a first and a second name, followed by a newline and then the first name again. We do not match 'Janice Flavor' anymore because she is followed by 'Linda'. 'Janice Taylor', however, is followed by 'Janice Wafer' on the next line. So our RegEx does match.
When we substitute we do not need the back-reference \2 since 'Janice' is already in \1. It would be nice if we could obtain 'Wafer'. Well, once again we use another group () that we can back-reference:
Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)\n\2 +([^ \n]+)/|\1|\3|/p'
After we matched 'Janice' there can be one or more spaces until 'Wafer'. We match 'Wafer' itself by matching any character that is NEITHER a newline NOR a space. We negate space in order to accommodate possible trailing spaces. Our first pair of parentheses matches 'Janice Taylor' and the third pair matches 'Wafer'. Those are our substitutes.
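The same pattern can be tried on just two lines, which makes the effect of the left-hand-side backreference easier to see. This is only a sketch, assuming GNU sed. The function name merge_pair is made up, and a single N is used here instead of the ':a N;$! ba' loop because there are only two lines.

```shell
# merge_pair: hypothetical helper name; joins two lines when the first names match.
merge_pair() {
    # N appends the next line, so the pattern space holds "line1\nline2".
    # \2 on the left-hand side demands that line 2 starts with the same first name.
    sed -nr 'N;s/(([^ ]+) +[^ \n]+)\n\2 +([^ \n]+)/\1 \3/p'
}

printf 'Jane Baker\nJane Simmons\n' | merge_pair   # first names match
printf 'Jane Baker\nMary Carter\n' | merge_pair    # no match, so nothing is printed
```

The first call should print 'Jane Baker Simmons'. The second prints nothing because -n suppresses the default output and the 'p' flag only fires on a successful substitution.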

Now let us see if we can work around the interfering 'Linda'. We want 'Janice Flavor' as our first match. 'Flavor' can be followed by any characters, which includes 'Linda Brown' and some newlines, until we meet 'Janice' again in the third line. So let us add '.*\n' after our first pair of parentheses to account for that:
Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+).*\n\2 +([^ \n]+)/|\1|\3|/p' simple-file
It finally gets interesting! Notice that you do NOT match 'Taylor' with your third group. RegExes are GREEDY. That is, '.*\n\2' will look for the longest possible match! And that is
Code:

Linda Brown\nJanice Taylor\nJanice
So the third group will match 'Wafer'. We are getting closer to our goal.
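The greedy behaviour can be reproduced without creating simple-file by feeding the same four lines through printf. This is a sketch under the assumption of GNU sed. The function name greedy_demo is hypothetical, and the unused ':b' label is left out.

```shell
# greedy_demo: hypothetical helper name; the data is the simplified set from above.
greedy_demo() {
    printf '%s\n' 'Janice Flavor' 'Linda Brown' 'Janice Taylor' 'Janice Wafer' |
        # '.*' grabs as much as it can, so '\n\2' binds to the LAST line starting
        # with 'Janice', and the third group captures the name on that line.
        sed -nr ':a N;$! ba;s/(([^ ]+) +[^ \n]+).*\n\2 +([^ \n]+)/|\1|\3|/p'
}

greedy_demo
```

The output should be the single line |Janice Flavor|Wafer|. Everything between 'Flavor' and the last 'Janice' was swallowed by '.*', which is why the next step has to capture it in its own group.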
Our next step is to preserve 'Linda' and basically everything that has been matched by '.*\n'. Yes, once more we use another group that we can backreference:
Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)(.*\n)\2 +([^ \n]+)/\1 \4 \3/p'
                                          ^ 3. br  ^ 4. br

Notice that 'Wafer' is now matched by the 4th group and therefore must be back-referenced by \4. \3 holds our previously lost information. I also do not use the '|' on the RHS as a visual aid since they would interfere in the next step if we kept them.
We still need to get 'Taylor' between 'Flavor' and 'Wafer'. Therefore we will extend our RegEx to match 'Janice Flavor' and anything else that follows on that same line:
Code:

sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^\n]*)(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/;tb;p'
Two things happen here. We use ' *([^\n]*)' to match anything after 'Janice Flavor'. We are using the '*' quantifier for that, which matches zero or more occurrences of the pattern. So if 'Janice Flavor' is still alone on the first line the additional pattern will match nothing. Once 'Wafer' has been added by the first run of the 's' command, the additional pattern will match 'Wafer' on the next run. Also notice that our back-references have shifted again.
In order to force the 's' command to execute again we use the conditional jump 't' command. This will jump back to point ':b' only if the previous 's' command has made any changes to the pattern space. If our RegEx does not find any more matches then we are finished and the 't' command does not jump and the print command ('p') will execute and sed will finally exit.
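The label-and-'t' loop is easier to follow on a trivial substitution. The sketch below is unrelated to the name data and only demonstrates the control flow. The function name commas_to_spaces is made up for the example.

```shell
# commas_to_spaces: hypothetical helper name, used only to demonstrate the loop.
commas_to_spaces() {
    # ':b' defines a label. 's/,/ /' has no 'g' flag, so it changes one comma per run.
    # 'tb' jumps back to ':b' only if that 's' command changed the pattern space,
    # so the loop repeats until no comma is left.
    sed ':b;s/,/ /;tb'
}

echo 'a,b,c,d' | commas_to_spaces
```

This should print 'a b c d'. Each pass replaces one comma, and the loop ends on the first pass where the substitution finds nothing to change.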
That's basically it. As I said at the beginning of the post, our RegEx produces some empty lines. This can be taken care of by using
Code:

s/\n+/\n/g
before we print the pattern space. There are some minor differences between this solution and the one I provided earlier. This is to account for possible trailing spaces. As it turns out, you also do not need the global flag in the first 's' command.


One final note. My main point in my previous post was:
Don't do it this way.
Use awk instead.
The right tool for the right job can spare you some headache :)

Since I do like a good brain teaser every now and then I thought of this cumbersome sed solution.
But normally I would not have posted it.

I hope this clears things up a bit.

PS:
Earlier you said that you are doing a sed tutorial but you did not say which one.
To be sure that you are doing the right one, this is the tutorial to start with:
http://www.grymoire.com/Unix/Sed.html

danielbmartin 12-12-2011 09:08 AM

Quote:

Originally Posted by crts (Post 4547533)
I finally have some time to go into some details of the 'sed' solution...

Wow! Thank you for this detailed breakdown. It illustrates that sed has multiple levels of functionality comparable to the frequently-referenced layers of an onion. The sed you constructed and explained introduced me to layers I'd never known. That's great!

Daniel B. Martin

timetraveler 12-12-2011 02:47 PM

Quote:

Originally Posted by David the H.
But do all of them have it installed by default?

I can't think of one that does not, can you?

Quote:

Originally Posted by David the H.
Can you walk up to any random Linux computer and be certain that your perl script will run on it?

Yes, see above.

Your line of thinking used to apply a long time ago but no longer.


All times are GMT -5.