Combining lines based on key
In this contrived example the key field is the first name.
Input file:
Doris Fletcher
Jane Baker
Jane Simmons
Janice Taylor
Linda Archer
Linda Brown
Linda Green
Mary Carter

Desired output file:
Doris Fletcher
Jane Baker Simmons
Janice Taylor
Linda Archer Brown Green
Mary Carter

I am improving self-written REXX programs by replacing REXX code with Linux commands. This provides several benefits:
- more concise programs
- shorter execution times
- learn Linux (learn by doing)

The desired function is already working in REXX, so an awk or Perl solution is not sought. I hope to find a Linux command (or combination of commands) which does this task. Please advise.

Daniel B. Martin |
Since you want to 'learn by doing', reference the shell scripting tutorial at http://tldp.org/LDP/abs/html/. Also, when asking for advice, it's probably best to avoid telling people what you don't want to hear, since we're all just trying to help each other. Perl could probably do this with a one-liner, and (if not), the code would be VERY tight and fast. |
Perl was created exactly for such things. You wanted Linux commands to do this...awk would be it, since it would split each line based on whatever field delimiter you see fit, in this case, a space. Since you have the means to assign the first/last name fields to variables, and you've already GOT working logic, it should be simple for you to use these things (along with the bash tutorial), to get done what you'd like. A bash script would be Linux commands, so it would seem your original query has been answered. |
Regardless...the awk command is what you need to easily do this. Cut can also be used, and you've got man pages for both. These commands/man pages plus the scripting guide should be all you need. |
From what I see, you appear to be assuming that there is an adequate solution for your problem that doesn't use awk or perl. You also don't seem to recognize that awk is one of the core utilities found by default on all *nix boxes and is used ubiquitously in scripting.
Indeed, awk is exactly what any linux/unix user would tell you to use first off, because your request is exactly the kind of thing that it excels at above all other unix tools. As it stands, the three solutions I would suggest are an awk script, a perl script, or a bash script, probably in that order (although I'm most proficient at bash personally and would probably start with that myself). Whichever language is used, I believe the simplest solution is to populate an associative array/hash with the first field as the index string, and then tack the second field onto that entry as subsequent hits are made. Then you can follow up by printing out the whole array at the end. Other than that, none of the other commonly-available tools will do exactly what you want, although it might be possible to cobble together a working solution by chaining together multiple commands. But why bother when we have awk at hand? Of course there may also be some lesser-known tool floating around that does exactly this, but you'd be just as likely to find it on your own as I would, if you tried searching for it. |
Moved: This thread is more suitable in <PROGRAMMING> and has been moved accordingly to help your thread/question get the exposure it deserves.
And I have a strong feeling of deja-vu :} reading this thread. If you're on bash4 you're lucky, because you can use the first column as the subscript for an array (older versions of bash only allow numeric subscripts). Your reluctance to utilise awk still baffles me; it's not like using awk on Linux is that different from using REXX on z/OS, OS/2 or even the Amiga. It's there, it's free, does what you ask, and does it quickly (and easily). Cheers, Tink |
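For anyone curious what the bash 4 approach mentioned above might look like, here is a minimal sketch. It assumes bash 4 or later and whitespace-separated fields; the function name combine_by_key is made up for illustration and does not come from any post in this thread. Because bash does not guarantee any iteration order for associative arrays, the result is piped through sort.

```shell
#!/usr/bin/env bash
# Sketch only: requires bash 4+ for associative arrays (declare -A / local -A).
# combine_by_key is a hypothetical name chosen for this example.

combine_by_key() {
    local -A names=()      # key: first field, value: collected remaining fields
    local key rest
    while read -r key rest; do
        [ -n "$key" ] || continue            # skip blank lines
        names[$key]+="${rest:+ $rest}"       # append " rest" only if rest is non-empty
    done
    for key in "${!names[@]}"; do
        printf '%s%s\n' "$key" "${names[$key]}"
    done | LC_ALL=C sort                     # array order is unspecified, so sort
}

# Demo with the sample data from the first post.
printf '%s\n' 'Doris Fletcher' 'Jane Baker' 'Jane Simmons' 'Janice Taylor' \
    'Linda Archer' 'Linda Brown' 'Linda Green' 'Mary Carter' | combine_by_key
```

The input does not need to be sorted for this to work, since every line is filed under its key regardless of where it appears.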
Hi,
is 'sed' a viable alternative? Code:
$ cat file

I am actually not really serious about doing such tasks with 'sed'. As others have already pointed out, 'awk' is far more appropriate for this kind of thing. |
Good to learn you are still going with this project, especially as I have fond memories of ReXX from VM/CMS days and partly wrote (not finished) a ReXX interpreter on UNIX as an exercise to learn C, UNIX and emacs. I was going to ask if you regarded a bash script as a "combination of commands" but crts' sed fulfils your "a command" criterion. Incidentally I find awk a lot easier than sed because it's more of a programming language -- especially if you do everything in the BEGIN section and use getline to read all the lines instead of using awk's pattern matching! :D @crts: that's great :) |
Two years ago I installed Ubuntu at the recommendation of a friend. I was enchanted by the similarity of Linux commands to CMS Pipelines. I've made a choice to write code using Linux commands (those few which I have learned) in a style which is frankly imitative of CMS Pipelines. This includes an abhorrence of explicit loops. Someday I may depart from this style, but for the time being I am not using Bash or Perl or awk. |
Although the OP is not interested in awk solutions, I would personally use a combination of awk and sort in Linux:
Code:
awk '{ for (i = 2; i <= NF; i++) list[$1] = list[$1] " " $i } END { for (i in list) printf("%s%s\n", i, list[i]) }' file | sort

On an embedded Linux system there might not be any awk available, so I would first sort the input, then combine consecutive lines using a simple POSIX shell loop:
Code:
sort file | sh -c ' Code:
#!/bin/sh |
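The sort-then-merge idea described in the post above can be sketched roughly as follows. This is an illustration of the approach under stated assumptions, not the poster's original script: the function name combine_sorted is hypothetical, fields are assumed to be separated by whitespace, and only POSIX sh features are used.

```shell
#!/bin/sh
# Sketch only: merge lines that share the same first field, POSIX sh.
# combine_sorted is a hypothetical name chosen for this example.

combine_sorted() {
    LC_ALL=C sort | {
        prev=''     # key of the group currently being collected
        line=''     # output line built up so far for that group
        while read -r key rest; do
            [ -n "$key" ] || continue        # skip blank lines
            if [ "$key" = "$prev" ]; then
                line="$line $rest"           # same key: append second field
            else
                # new key: flush the previous group, then start a new one
                if [ -n "$prev" ]; then printf '%s\n' "$line"; fi
                prev=$key
                line="$key $rest"
            fi
        done
        # flush the last group
        if [ -n "$prev" ]; then printf '%s\n' "$line"; fi
    }
}

# Demo with the sample data from the first post.
printf '%s\n' 'Doris Fletcher' 'Jane Baker' 'Jane Simmons' 'Janice Taylor' \
    'Linda Archer' 'Linda Brown' 'Linda Green' 'Mary Carter' | combine_sorted
```

Sorting first guarantees that equal keys are adjacent, so the loop only ever has to remember one group at a time.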
Well you have to use some shell to run sed, so do you mean you won't use a bash shell?
Just curious which shell meets your requirements. Can you post your Rexx code to perform this task? |
I won't dispute the OP's wish to preclude Perl and AWK solutions (although the problem clearly wants a solution that uses associative arrays), however I am curious how Perl solutions are seen as something other than Linux, while a sed solution is not. Hard to interpret the requirements based on any logic I can derive from that.
--- rod. |
Here's a linux command to do the same (perl one-liners count as linux commands, but you don't have to use them):

perl -lane '$n{$F[0]} = $n{$F[0]} . " $F[1]";if(eof){print "$_$n{$_}" for keys %n}' names
Linda Archer Brown Green
Jane Baker Simmons
Mary Carter
Janice Taylor
Doris Fletcher

If there's a good rexx compiler/interpreter on linux then rexx counts too. That's the gnu/linux way and that's a huge part of the gnu/linux attraction, for many. Lots of choices. For some, bash shell programming is their favorite system programming language. Others like Python, others Perl, etc. Shell programming implies awk, sed, tail, head, cut, etc. Perl and Python (and others) can do those things natively. Limit yourself or don't limit yourself, gnu/linux lets you have it any way you want. |
The shell is the software that gives you a gnu/linux command line. There are several shells around. Bash is probably the most common. There is also tcsh, csh, zsh, ksh and others. Sed, awk, etc. are separate and distinct from the shell but are run from a shell. The Regina interpreter might be a shell but I don't know. It probably is not but instead is run from a shell. Most likely you are using bash inside your terminal program. Main thing is to have some fun exploring gnu/linux and use whatever tools you like. |
A shell is a command-line interpreter, an interface into your system. Shells also generally have their own scripting language and the ability to act as interpreters for executing scripts written in it. I'm not too familiar with REXX, but I believe it's more of a stand-alone interpreted language that can be easily used for scripting tasks. I don't know if it offers a shell interface per se, but in the end much of the functionality is probably very similar.
@theNbomr: sed and awk are core programs found in all *nix systems, as (I believe) specified by POSIX. perl, OTOH, is an optional, multi-platform language, and can't be guaranteed to exist on any given system. So unlike the first two I can understand eliminating it as not a specifically "linux" solution. |
I had a self-written REXX program which operates on the voter registration list from the county where I live. (This file is a public record, readily downloadable by anyone.) The program sifts the data, slices and dices it, sorts, reformats, etc. This program worked, i.e. it generated the desired result. Then I discovered that I could replace large chunks of REXX code with smaller chunks of Linux commands. Now we get to the punch line, and that is execution time. Same input file, 500,000+ records. Same output file, 220,000+ records. Execution time for the original REXX-only version: 9+ hours (an overnight run). Execution time for the new mixed REXX+Linux version: 1 minute. A breathtaking improvement! As a consequence, execution time for this program is now of small concern. It is still a REXX program but now the Linux commands do all the heavy lifting. This dramatic reduction in execution time provides the motivation to learn more Linux commands and rework more of my REXX programs. Daniel B. Martin |
Oh, REXX - I used it at the time my employer changed from EXEC2 to it. Besides Regina there is also ooREXX which was open sourced by IBM several years ago and it includes a compiler. Maybe the original REXX script can execute faster in precompiled form too.
|
Full agreement here on the second point. :cool: |
That is a factor of ~500 in speed. The older hardware supporting REXX (I assume) could easily account for all of that difference. But without knowing anything about the hardware, it is hard to make any realistic comparison. However, just having an OS that will run on commodity hardware must be a good thing.
I think it would be a rare distro that doesn't include Perl out of the box. I'm not sure the OP meant to limit the discussion to POSIX-only distros, and as I stated earlier I can respect his wish for 'Linux-only' solutions. I just don't know how he defines that, either conceptually or by some defined standard. --- rod. |
The long execution time of the REXX-only version may be attributed to: 1) Regina is interpreted, not compiled, and sorting large files takes a long time. 2) Regina I/O is painfully slow compared to Linux. |
Seventeen years ago I retired from a mainframe engineer/programmer job. During my working years I became proficient with REXX and CMS Pipelines. Two years ago I installed Ubuntu on my home PC. (Good-bye Microsoft! Good-bye forever!!) I was enchanted by the similarity of Linux commands to CMS Pipelines. I've made a choice to write code using Linux commands (those few which I have learned) in a style which is frankly imitative of CMS Pipelines. This includes an abhorrence of explicit loops. Someday I may depart from this style, but for the time being I am not using Bash or Perl or awk. |
So, would a clean definition of your requirement be 'no branching/looping constructs allowed'? That would make things quite a bit more challenging for most problems. I haven't inherited your background, so I won't pretend to understand how you see that as helpful. I do wonder if it isn't just a bit severe; it certainly limits one of your stated goals, being 'learn Linux'.
Now I'm going to have to actually figure out what the posted sed solution does. 8-( --- rod. |
Hi,
I have been very busy and did not find any spare time to deal with explanations. I finally have some time to go into some details of the 'sed' solution. I will split it up first and then rebuild the most important part step by step. I have marked the main part in bold: Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^ \n]*) *(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/g;tb;s/\n+/\n/gp'
Code:
:a N;$! ba
Code:
s/\n+/\n/gp

Let us now try to understand how the bold part works. We will build it up step by step. To do so, we will use the following simplified data set:
Code:
$ cat simple-file
Code:
sed -nr ':a N;$! ba;:b s/([^ ]+ +[^ \n]+)/|\1|\1/p' simple-file

So we see that the RegEx
Code:
([^ ]+ +[^ \n]+)

The first character-class
Code:
[^ ]+

So now we have matched 'Janice Flavor'. Our next objective is to somehow identify the *other* Janices and retrieve their second name. Remember what I said about backreferences? Any pattern that is matched inside () is stored in a *special* buffer. You have 9 of those buffers. You can access them with \n where n is a number from 1 to 9, e.g. \1 refers to the content inside the first pair of parentheses, \2 to the content of the second pair. Let us capture 'Janice' in a *special* buffer:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)/|\1|\2|/p'

Let us refine our RegEx a bit more:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)\n\2/|\1|\2|/p'

When we substitute we do not need the back-reference \2 since 'Janice' is already in \1. It would be nice if we could obtain 'Wafer'. Well, once again we use another group () that we can back-reference:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)\n\2 +([^ \n]+)/|\1|\3|/p'

Now let us see if we can work around the interfering 'Linda'. We want 'Janice Flavor' as our first match. 'Flavor' can be followed by any characters, which includes 'Linda Brown' and some newlines, until we meet 'Janice' again in the third line. So let us add '.*\n' after our first pair of parentheses to account for that:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+).*\n\2 +([^ \n]+)/|\1|\3|/p' simple-file
Code:
Linda Brown\nJanice Taylor\nJanice

Our next step is to preserve 'Linda' and basically everything that has been matched by '.*\n'. Yes, once more we use another group that we can backreference:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+)(.*\n)\2 +([^ \n]+)/\1 \4 \3/p'

We still need to get 'Taylor' between 'Flavor' and 'Wafer'. Therefore we will extend our RegEx to match 'Janice Flavor' and anything else that follows on that same line:
Code:
sed -nr ':a N;$! ba;:b s/(([^ ]+) +[^ \n]+) *([^\n]*)(.*\n)\2 +([^ \n]+)/\1 \5 \3 \4/;tb;p'

In order to force the 's' command to execute again we use the conditional jump 't' command. This will jump back to point ':b' only if the previous 's' command has made any changes to the pattern space. If our RegEx does not find any more matches then we are finished: the 't' command does not jump, the print command ('p') executes, and sed finally exits. That's basically it. As I said at the beginning of the post, our RegEx produces some empty lines. This can be taken care of by using
Code:
s/\n+/\n/g

One final note. My main point in my previous post was: Don't do it this way. Use awk instead. The right tool for the right job can spare you some headache :) Since I do like a good brain teaser every now and then I thought of this cumbersome sed solution. But normally I would not have posted it. I hope this clears things up a bit.

PS: Earlier you said that you are doing a sed tutorial but you did not say which one. To be sure that you are doing the right one, this is the tutorial to start with: http://www.grymoire.com/Unix/Sed.html |
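The walkthrough above leans on three sed building blocks: reading the whole file into the pattern space (':a N;$! ba'), backreferences to groups, and the conditional jump 't'. Each can be tried in isolation with the small sketch below. It assumes GNU sed; the function names and the tiny inputs are invented for this illustration, the script is split into separate -e expressions for readability, and -E is used where the thread uses the equivalent GNU option -r.

```shell
#!/bin/sh
# Sketch only: illustrative helpers with hypothetical names; assumes GNU sed.

# 1. Slurp: N appends the next input line to the pattern space; '$!ba' jumps
#    back to label 'a' unless the last line has been reached. Once everything
#    is in the pattern space, the embedded newlines are replaced with commas.
slurp_join() {
    sed -n -e ':a' -e 'N' -e '$!ba' -e 's/\n/,/gp'
}

# 2. Backreferences: \1 is the outer group (both names), \2 is the inner
#    group (the first name only).
show_groups() {
    sed -E 's/(([^ ]+) +[^ ]+)/|\1|\2|/'
}

# 3. Conditional jump: 'tb' jumps back to label 'b' only if the preceding
#    's' command changed the pattern space, so the substitution repeats
#    until nothing is left to replace.
squeeze_commas() {
    sed -e ':b' -e 's/,,/,/' -e 'tb'
}

printf 'a\nb\nc\n' | slurp_join        # prints: a,b,c
echo 'Janice Flavor' | show_groups     # prints: |Janice Flavor|Janice|
echo 'a,,,b' | squeeze_commas          # prints: a,b
```

Slurping the whole input is what lets a single regular expression see matching names on different lines, at the cost of holding the entire file in memory.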
Your line of thinking used to apply a long time ago but no longer. |