[SOLVED] Combining lines based on key

danielbmartin · 12-06-2011, 09:57 AM

In this contrived example the key field is the first name.

Input file:
Doris Fletcher
Jane Baker
Jane Simmons
Janice Taylor
Linda Archer
Linda Brown
Linda Green
Mary Carter

Desired output file:
Doris Fletcher
Jane Baker Simmons
Janice Taylor
Linda Archer Brown Green
Mary Carter

I am improving self-written REXX programs by replacing REXX code with Linux commands. This provides several benefits:
- more concise programs
- shorter execution times
- learn Linux (learn by doing)

The desired function is already working in REXX, so an awk or Perl solution is not sought. I hope to find a Linux command (or combination of commands) which do this task.

Please advise.

Daniel B. Martin

TB0ne · 12-06-2011, 10:41 AM

Quote:

Originally Posted by danielbmartin

In this contrived example the key field is the first name.

Input file:
Doris Fletcher
Jane Baker
Jane Simmons
Janice Taylor
Linda Archer
Linda Brown
Linda Green
Mary Carter

Desired output file:
Doris Fletcher
Jane Baker Simmons
Janice Taylor
Linda Archer Brown Green
Mary Carter

I am improving self-written REXX programs by replacing REXX code with Linux commands. This provides several benefits:
- more concise programs
- shorter execution times
- learn Linux (learn by doing)

The desired function is already working in REXX, so an awk or Perl solution is not sought. I hope to find a Linux command (or combination of commands) which do this task.

Without using awk or writing a shell script, not sure how you'd do it. I'd use awk to break each line up into two variables (FN and LN), compare the FN with the previous value, and if it's the same, output LN on the same line. If it's NOT the same, start a new line, and put both on it.
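A minimal sketch of the comparison described above, assuming the input is already sorted so that equal first names sit on adjacent lines, and that each line holds exactly two space-separated fields. The function name combine_adjacent and the variable names fn, ln, and prev are illustrative only and do not come from the thread.

```shell
# Hypothetical helper: merge adjacent lines that share the same first field.
# Assumes sorted input with exactly two fields per line.
combine_adjacent() {
    awk '
        {
            fn = $1; ln = $2
            if (NR > 1 && fn == prev) {
                # Same first name as the previous line: stay on this output line.
                printf " %s", ln
            } else {
                # New first name: close the previous output line, start a new one.
                if (NR > 1) printf "\n"
                printf "%s %s", fn, ln
            }
            prev = fn
        }
        END { if (NR > 0) printf "\n" }
    '
}

# Example run on part of the sample data.
printf '%s\n' 'Doris Fletcher' 'Jane Baker' 'Jane Simmons' 'Janice Taylor' | combine_adjacent
```

Because only the previous key is remembered, this needs no array, but unsorted input would have to go through sort first.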

Since you want to 'learn by doing', reference the shell scripting tutorial at http://tldp.org/LDP/abs/html/. Also, when asking for advice, it's probably best to avoid telling people what you don't want to hear, since we're all just trying to help each other. Perl could probably do this with a one-liner, and (if not), the code would be VERY tight and fast.

danielbmartin · 12-06-2011, 12:30 PM

Quote:

Originally Posted by TB0ne

Also, when asking for advice, it's probably best to avoid telling people what you don't want to hear, since we're all just trying to help each other.

Telling people what I don't want to hear is intended as a courtesy to the reader. Otherwise he may devote time to creating a solution which won't be used. That annoys the person who was "just trying to help."

Daniel B. Martin

TB0ne · 12-06-2011, 12:52 PM

Quote:

Originally Posted by danielbmartin

Telling people what I don't want to hear is intended as a courtesy to the reader. Otherwise he may devote time to creating a solution which won't be used. That annoys the person who was "just trying to help."

Only if you come back and post, saying "I didn't use your solution, because it wasn't exactly what I wanted". And since you're doing this to 'learn by doing', none of us here are going to create your solution, since that would (obviously), defeat the purpose of you learning anything. Ruling out obvious solutions would tend to indicate a homework-assignment.

Perl was created exactly for such things. You wanted Linux commands to do this...awk would be it, since it would split the line based on whatever field delimiter you see fit, in this case, a space. Since you have the means to assign the first/last name fields to variables, and you've already GOT working logic, it should be simple for you to use these things (along with the bash tutorial), to get done what you'd like. A bash script would be Linux commands, so it would seem your original query has been answered.

danielbmartin · 12-06-2011, 01:18 PM

Quote:

Originally Posted by TB0ne

Ruling out obvious solutions would tend to indicate a homework-assignment.

I assure you, this is *not* homework! I am well into retirement (17 years, now) and dabble in programming as a hobby, hoping to keep my brain from atrophying. Any LQ member who has lingering doubts is invited to contact me off-forum. I will respond with details about my employment history, detail which should convince you that I am in compliance with LQ forum rules.

TB0ne · 12-06-2011, 01:58 PM

Quote:

Originally Posted by danielbmartin

I assure you, this is *not* homework! I am well into retirement (17 years, now) and dabble in programming as a hobby, hoping to keep my brain from atrophying. Any LQ member who has lingering doubts is invited to contact me off-forum. I will respond with details about my employment history, detail which should convince you that I am in compliance with LQ forum rules.

Not really needed, but the phrasing of your question and conditions set forth does tend to point in the 'homework' direction.

Regardless...the awk command is what you need to easily do this. Cut can also be used, and you've got man pages for both. These commands/man pages plus the scripting guide should be all you need.

David the H. · 12-06-2011, 03:04 PM

From what I see, you appear to be assuming that there is an adequate solution for your problem that doesn't use awk or perl. You also don't seem to recognize that awk is one of the core utilities found by default on all *nix boxes and is used ubiquitously in scripting.

Indeed, awk is exactly what any linux/unix user would tell you to use first off, because your request is exactly the kind of thing that it excels at above all other unix tools. As it stands, the three solutions I would suggest are an awk script, a perl script, or a bash script, probably in that order (although I'm most proficient at bash personally and would probably start with that myself).

Whichever language is used, I believe the simplest solution is to populate an associative array/hash with the first field as the index string, and then tack the second field onto that entry as subsequent hits are made. Then you can simply follow up by printing out the whole array at the end.

Other than that, none of the other commonly-available tools will do exactly what you want, although it might be possible to cobble together a working solution by chaining together multiple commands. But why bother when we have awk at hand? Of course there may also be some lesser-known tool floating around that does exactly this, but you'd be just as likely to find them on your own as me, if you tried searching for them.
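For illustration, one way such a chain of commands might look: sort the input, then let a short sed script fold each line into the previous one whenever both start with the same first field. This is only a sketch under assumptions. The function name combine_sorted is made up here, keys and values are assumed to be separated by single spaces, and LC_ALL=C is used so that sort keeps lines with equal first fields next to each other.

```shell
# Hypothetical helper: sort, then fold adjacent lines that share a first field.
# Assumes single-space separators; LC_ALL=C keeps equal keys adjacent after sort.
combine_sorted() {
    LC_ALL=C sort "$@" |
        sed -e ':a' -e '$!N' \
            -e 's/^\([^ ]*\)\( .*\)\n\1 \(.*\)$/\1\2 \3/' \
            -e 'ta' -e 'P' -e 'D'
}

# Example run on unsorted input.
printf '%s\n' 'Linda Green' 'Jane Simmons' 'Linda Archer' 'Jane Baker' | combine_sorted
```

The sed part appends the next line (N), merges it into the current one if the keys match and tries again (ta), and otherwise prints the finished line (P) and carries on with the remainder (D).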

Tinkster · 12-06-2011, 07:19 PM

Moved: This thread is more suitable in <PROGRAMMING> and has been moved accordingly to help your thread/question get the exposure it deserves.

And I have a strong feeling of deja-vu :} reading this thread.

If you're on bash4 you're lucky, because you can use the first
column as the subscript for an array (older bash versions only
allow numeric subscripts). Your reluctance to utilise awk still
baffles me; it's not like using awk on Linux is that different
from using REXX on zOS, OS/2 or even the Amiga. It's there,
it's free, does what you ask, and does it quickly (and easily).
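
A minimal sketch of the bash 4 idea, assuming bash 4 or later is installed as "bash". The function name combine_by_key and the array names are illustrative only. The body runs under "bash -c" so that it behaves the same even when the calling shell is dash or another sh.

```shell
# Hypothetical helper: group lines by first column with a bash 4 associative array.
combine_by_key() {
    bash -c '
        declare -A names          # first column -> accumulated remaining text
        order=()                  # first columns, in the order first seen
        while read -r key val; do
            [ -n "$key" ] || continue
            if [[ -z ${names[$key]+set} ]]; then
                order+=("$key")
                names[$key]=$val
            else
                names[$key]+=" $val"
            fi
        done
        for key in "${order[@]}"; do
            printf "%s %s\n" "$key" "${names[$key]}"
        done
    '
}

# Example run: the input does not need to be sorted.
printf '%s\n' 'Linda Archer' 'Jane Baker' 'Linda Brown' | combine_by_key
```

The separate order array is there because bash does not promise any particular order when iterating over the keys of an associative array; output follows the first appearance of each key.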

Cheers,
Tink

crts · 12-06-2011, 11:11 PM

Hi,

is 'sed' a viable alternative?

Code:

$ cat file
Janice Flavor
Doris Fletcher
Jane Baker
Jane Simmons
Janice Taylor
Linda Archer
Linda Brown
Janice Wafer
Linda Green
Janice Joice
Mary Carter 

$ sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g;tb;s/\n+/\n/gp' file
Janice Flavor Taylor Wafer Joice 
Doris Fletcher
Jane Baker Simmons  
Linda Archer Brown Green 
Mary Carter

It will produce the desired output even if the input is unsorted.

I am actually not really serious about doing such tasks with 'sed'. As others have already pointed out, 'awk' is far more appropriate for this kind of thing.
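
For readers who want to follow the one-liner, here is the same script spread over several lines with comments. The comments are one reading of the expression, not an explanation supplied by its author, and the wrapper name combine_with_sed is made up for this sketch. It needs GNU sed because of the -r option.

```shell
# Hypothetical wrapper around the one-liner above, split up and commented.
# Requires GNU sed (-r enables extended regular expressions).
combine_with_sed() {
    sed -nr '
        # Step 1: append every following line, so the whole input
        # ends up in the pattern space as one newline-separated string.
        :a
        N
        $!ba
        # Step 2: find a "key value" pair whose key shows up again at the
        # start of a later line.
        #   \2 = the key            \1 = key plus its first value
        #   \3 = the next word already on that line, if any
        #   \4 = everything up to the newline in front of the later line
        #   \5 = the value found on the later line
        # The later value is moved up behind \1 and the later line is emptied.
        # Repeat (tb) until no substitution succeeds any more.
        :b
        s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g
        tb
        # Step 3: squeeze the empty lines left behind and print the result.
        s/\n+/\n/gp
    ' "$@"
}

# Example run; some output lines may carry trailing spaces.
printf '%s\n' 'Jane Baker' 'Linda Archer' 'Jane Simmons' | combine_with_sed
```

Since everything is held in the pattern space at once and the substitution is retried from the top after each success, this gets slow on large files, which is one more reason to prefer awk here.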

danielbmartin · 12-07-2011, 07:09 AM

Quote:

Originally Posted by crts

is 'sed' a viable alternative?
It will produce the desired output even if the input is unsorted.

I am actually not really serious about doing such tasks with 'sed'. As others have already pointed out, 'awk' is far more appropriate for this kind of thing.

Wow! This is the type of solution I asked for but I may have bitten off more than I can chew. As a Linux newbie I have used sed but only timidly, being awed by the power of this command. Please give an overview explanation of your code. Guided by this, I will tiptoe through the manual to get a better understanding. Perhaps this experience will overcome my reluctance to delve into awk. Thank you, thank you!

catkin · 12-07-2011, 07:24 AM

Quote:

Originally Posted by danielbmartin

I hope to find a Linux command (or combination of commands) which do this task.

Hello Daniel

Good to learn you are still going with this project, especially as I have fond memories of ReXX from VM/CMS days and partly wrote (not finished) a ReXX interpreter on UNIX as an exercise to learn C, UNIX and emacs.

I was going to ask if you regarded a bash script as a "combination of commands" but crts' sed fulfils your "a command" criterion.

Incidentally I find awk a lot easier than sed because it's more of a programming language -- especially if you do everything in the BEGIN section and use getline to read all the lines instead of using awk's pattern matching!

@crts: that's great

danielbmartin · 12-07-2011, 10:18 AM

Quote:

Originally Posted by catkin

... I have fond memories of ReXX from VM/CMS days ...

As do I. Seventeen years ago I retired after a long career as a mainframe engineer/programmer working for a major computer manufacturer. Knew nothing of PCs, nothing of Linux. During my working years I became proficient with REXX and CMS Pipelines.

Two years ago I installed Ubuntu at the recommendation of a friend. I was enchanted by the similarity of Linux commands to CMS Pipelines. I've made a choice to write code using Linux commands (those few which I have learned) in a style which is frankly imitative of CMS Pipelines. This includes an abhorrence of explicit loops. Someday I may depart from this style, but for the time being I am not using Bash or Perl or awk.

Cedrik · 12-07-2011, 11:02 AM

Quote:

Originally Posted by crts

Code:

$ sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g;tb;s/\n+/\n/gp' file

I quit !

Nominal Animal · 12-07-2011, 02:08 PM

Although the OP is not interested in awk solutions, I would personally use a combination of awk and sort in Linux:

Code:

awk '{ for (i = 2; i <= NF; i++) list[$1] = list[$1] " " $i } END { for (i in list) printf("%s%s\n", i, list[i]) }' file | sort

The final sort is needed because the list traversal order is undefined. The input does not need to be sorted, but the output might be unsorted. This should work well in any awk variant available in Linux (gawk, mawk).

On an embedded linux there might not be any awk available, so I would first sort the input, then combine consecutive lines using a simple POSIX shell loop:

Code:

sort file | sh -c '
    currkey=""
    currval=""
    while read key val ; do
        if [ "$key" = "$currkey" ]; then
            currval="$currval $val"
        else
            [ -n "$currkey$currval" ] && echo "$currkey $currval"
            currkey="$key"
            currval="$val"
        fi
    done
    [ -n "$currkey$currval" ] && echo "$currkey $currval"
'

or, written as a standalone utility script,

Code:

#!/bin/sh
if [ $# -lt 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 [ -h | --help ]"
    echo "       $0 file(s)..."
    echo ""
    echo "This script will combine all records with the same initial field."
    echo "Duplicates are not removed. The input is considered unsorted."
    echo "The output is always sorted."
    echo ""
    exit 0
fi
sort "$@" | (
  currkey=""
  currval=""
  while read key val ; do
      if [ ":$key" = ":$currkey" ]; then
          currval="$currval $val"
      else
          [ -n "$currkey$currval" ] && echo "$currkey $currval"
          currkey="$key"
          currval="$val"
      fi
  done
  [ -n "$currkey$currval" ] && echo "$currkey $currval"
)

It is pretty common nowadays to use dash (a POSIX shell) instead of Bash, when resources are tight (or the script is simple and minimal execution time is desired; dash loads faster than bash). For example, most initial ramdisks used by Linux distributions use shell scripts written for dash. Some Linux distributions may still have sh symlinked to bash, however, so it might be prudent to specify dash explicitly instead of just using generic sh.

timetraveler · 12-07-2011, 07:18 PM

Well you have to use some shell to run sed, so do you mean you won't use a bash shell?
Just curious which shell meets your requirements.

Can you post your Rexx code to perform this task?