merge data from two lines within a "group" onto one line

bop-a-nator · 01-23-2013, 02:33 PM

I have a text file with several kinds of data, something like the sample below. I am trying to figure out how to parse through it and,
in a sense, merge data from two lines within a "group" onto one line.

prompt> cat sample.txt
ID1 NAME FIRST TOM
ID1 NAME LAST SMITH
ID1 ADDRESS MYTOWN USA
ID2 NAME FIRST DAVE
ID2 NAME LAST BROWN
ID2 ADDRESS ANYTOWN USA
ID3 NAME LAST JONES
ID3 ADDRESS SOMETOWN USA

I want to turn this into a new file like the one below, with the first and last name together on one line and the address line left alone.

ID1 TOM SMITH
ID1 ADDRESS MYTOWN USA
ID2 DAVE BROWN
ID2 ADDRESS ANYTOWN USA
ID3 JONES
ID3 ADDRESS SOMETOWN USA

I thought I had figured out how to parse through the IDs, but I am not so sure:

prompt> cat my.awk
BEGIN{OFS=FS=" "}
{if($1 in a)
{a[$1]=a[$1]} else {a[$1]=$0}}
END {asort(a); for(i in a) print a[i]}

What I am getting:

prompt> /bin/gawk -f my.awk sample.txt
ID1 NAME FIRST TOM
ID2 NAME FIRST DAVE
ID3 NAME LAST JONES

Then I thought what about this:

prompt> cat my2.awk
BEGIN{OFS=FS=" "}
{if($1 in a) {a[$1]=a[$1] " " $NF} else {a[$1]=$0}}
END {asort(a); for(i in a) print a[i]}

This produced the output below. It put the first and last name together, but it also picked up the USA from the address,
there is still no address on its own line, and the last name on the second record did not pick up the "BROWN".
So I think I need to specify the fields I want in the print, but I wasn't sure how to do that either.

ID1 NAME FIRST TOM SMITH USA
ID2 NAME FIRST DAVE JR USA
ID3 NAME LAST JONES USA

Thanks for helping a newbie!
bop-a-nator

shivaa · 01-23-2013, 09:44 PM

You can try this:

Code:

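#!/bin/bash
INFILE=/home/username/sample.txt  # This file is your sample.txt input file
TEMP=/tmp/ids.txt
# Collect each ID once, in order of first appearance
awk '!seen[$1]++ {print $1}' "$INFILE" > "$TEMP"
while read -r id
do
    # Name line: exact field comparisons, so ID1 does not also match ID10,
    # and the ID is still printed when there is no FIRST name
    awk -v id="$id" '
        $1 == id && $2 == "NAME" && $3 == "FIRST" { first = $4 }
        $1 == id && $2 == "NAME" && $3 == "LAST"  { last = $4 }
        END { print id, (first ? first " " last : last) }' "$INFILE"
    # Address line: printed unchanged
    awk -v id="$id" '$1 == id && $2 == "ADDRESS"' "$INFILE"
done < "$TEMP"
rm "$TEMP"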

bop-a-nator · 01-24-2013, 12:51 PM

Yes, that is a solution, though I realize I should have been clearer that I was trying to do this within awk specifically. I can certainly close this, give you credit for solving it, and rephrase my question if you feel that is best.

Thank you.
bop-a-nator

I was looking to do it within the awk script itself, as I am already parsing through the file, which contains other data too. The data I showed is just a subset of the file; I need to merge data from two lines together, and I was trying to find a simple way to illustrate the problem I am trying to solve within an awk script. Basically, as the bigger awk script goes through the file and finds the records that begin with ID, it needs to loop over them to find the NAME identifiers FIRST and LAST, then put their values on the same line.

shivaa · 01-24-2013, 01:18 PM

To be honest, I am also a beginner in awk. But whenever awk is combined with the shell, it creates magic. So I prefer using both, instead of awk or shell alone.

In your case, I will try to write the whole script in awk itself.

David the H. · 01-27-2013, 10:03 AM

To do this entirely in awk, I think we need to be a bit more exacting in our matching logic. It also helps to write it out as a stand-alone script, rather than try to cram it all onto the command line.

Code:

#!/usr/bin/awk -f

{
if ( $2 == "NAME" )
  {
    if ( $3 == "FIRST" ) { fn[$1]=$4 }
    if ( $3 == "LAST"  ) { ln[$1]=$4 }
    next
  }

if ( $2 == "ADDRESS" )
  {
    name = fn[$1] ? fn[$1] OFS ln[$1] : ln[$1]
    print $1 , name
    print $0
  }
}

The above assumes that there's always a "LAST" name, but that "FIRST" is optional. You'll have to redo the name variable assignment if it can be otherwise. It also assumes that the "ADDRESS" line always follows the name fields. If not, you'll have to save the address too and print everything out in an END section after the main processing is complete.
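For the case where the "ADDRESS" line might come before the name lines, here is a rough sketch of the END-section approach. It is only one way to do it, and it assumes one address line per ID and single-word names. The wrapper function name merge_any_order and the arrays seen, order and addr are made up for this illustration; they are not from the script above.

```shell
#!/bin/sh
# merge_any_order is a hypothetical wrapper name used only for this sketch.
merge_any_order() {
    awk '
    # remember each ID once, in the order it first appears
    !($1 in seen) { seen[$1] = 1; order[++n] = $1 }

    $2 == "NAME" && $3 == "FIRST" { fn[$1] = $4 }
    $2 == "NAME" && $3 == "LAST"  { ln[$1] = $4 }
    $2 == "ADDRESS"               { addr[$1] = $0 }   # keep the whole line

    END {
        for (i = 1; i <= n; i++) {
            id = order[i]
            # join only the name parts that exist, so no stray space appears
            name = fn[id]
            if (ln[id] != "") name = (name == "" ? ln[id] : name OFS ln[id])
            if (name != "") print id, name
            if (id in addr) print addr[id]
        }
    }' "$@"
}

# Usage: the address comes first here, yet the name line is still printed first.
printf '%s\n' 'ID1 ADDRESS MYTOWN USA' 'ID1 NAME LAST SMITH' 'ID1 NAME FIRST TOM' | merge_any_order
```

Because nothing is printed until END, the whole input has to fit in memory, which should be fine for files of this kind but is worth keeping in mind for very large ones.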

There's also a final assumption that the names are all single words. The code would have to get more complex if there could be a $5 field on the "NAME" lines.
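If names can run past $4, one possible approach is to join everything from field 4 to the end of the line instead of taking $4 alone. The sketch below assumes the extra words are separated by single spaces and that "ADDRESS" still follows the name lines. The helper function rest() and the wrapper name merge_multiword are invented for this example.

```shell
#!/bin/sh
# merge_multiword is a hypothetical wrapper name used only for this sketch.
merge_multiword() {
    awk '
    # rest(k): fields k..NF of the current line joined by OFS (hypothetical helper)
    function rest(k,    s, i) {
        s = $k
        for (i = k + 1; i <= NF; i++) s = s OFS $i
        return s
    }

    $2 == "NAME" {
        if ($3 == "FIRST") fn[$1] = rest(4)
        if ($3 == "LAST")  ln[$1] = rest(4)
        next
    }

    $2 == "ADDRESS" {
        # only put a space between the names when a first name exists
        name = fn[$1] ? fn[$1] OFS ln[$1] : ln[$1]
        print $1, name
        print $0
    }' "$@"
}

# Usage with a two-word last name:
printf '%s\n' 'ID2 NAME FIRST DAVE' 'ID2 NAME LAST BROWN JR' 'ID2 ADDRESS ANYTOWN USA' | merge_multiword
```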

PS: Please use [code][/code] tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.

shivaa · 01-27-2013, 12:20 PM

@David:
Indeed, you've given a stricter (and perfect) solution. Could you explain the following line in your code, i.e. what do ? and : do here, and how does it store all this in 'name'?

Code:

name = fn[$1] ? fn[$1] OFS ln[$1] : ln[$1]

David the H. · 01-27-2013, 04:23 PM

It's called the ternary (conditional) operator, a kind of compact if/else expression available in many programming languages.

http://www.gnu.org/software/gawk/man...ional-Exp.html

In this case I used it to ensure that the space between the two names only appears when both are present. It's kind of hard to handle optional spaces without something like it.
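To make that concrete, here is a small stand-alone sketch that builds the name both ways, once with the conditional expression and once with a plain if/else, and prints both results so they can be compared. The function name join_name and the variables first, last and name2 are just for this illustration.

```shell
#!/bin/sh
# join_name is a hypothetical helper: $1 = first name (may be empty), $2 = last name
join_name() {
    awk -v first="$1" -v last="$2" 'BEGIN {
        # conditional expression: condition ? value-if-true : value-if-false
        name = first ? first OFS last : last

        # the same logic written as an ordinary if/else
        if (first) { name2 = first OFS last }
        else       { name2 = last }

        print name
        print name2
    }'
}

join_name TOM SMITH    # both lines read "TOM SMITH"
join_name "" JONES     # both lines read "JONES", with no leading space
```

An empty string counts as false in awk, which is why a missing first name falls through to the part after the colon.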