LinuxQuestions.org

-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   Match codes in first column in 2 files and return data in other columns (http://www.linuxquestions.org/questions/programming-9/match-codes-in-first-column-in-2-files-and-return-data-in-other-columns-905313/)

cgcamal 09-27-2011 07:26 PM

Match codes in first column in 2 files and return data in other columns
 
Hi everyone,

Maybe somebody could help me.

I have this infile1 (11 columns: CODE plus DATA1-DATA10):
Code:

CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10
FRT7,,,,XVRT,,EET,,,,
PLO-1,,,,,UT,,,,,
UUY,,,,,,,5,,,
TYUDA,,,,MM,,MIP,3,,,
XXRT-5,,,,,133,LK,,,,

And this infile2 (11 columns):
Code:

ZZZ11-1,TextZ,33.01,Data_Z,MAK_Z,YUU,Value_Z,Serial_Z,BEJ_Z,ASSOC_Z,10
PLO-1,TextP,1.01,Data_P,MAK_P,GSF,Value_P,Serial_P,BEJ_P,ASSOC_P,99
LLOQ-43,TextL,-12,Data_L,MAK_L,YRD,Value_L,Serial_L,BEJ_L,ASSOC_L,24
TYUDA,TextT,-23.01,Data_T,MAK_T,IGF,Value_T,Serial_T,BEJ_T,ASSOC_T,11
MMMJ1Y,TextM,0.02,Data_M,MAK_M,WR,Value_M,Serial_M,BEJ_M,ASSOC_M,342
FRT7,TextF,7,Data_F,MAK_F,NNM,Value_F,Serial_F,BEJ_F,ASSOC_F,56
MAYUP33,TextM,4.5,Data_M,MAK_M,QWW,Value_M,Serial_M,BEJ_M,ASSOC_M,86
AA13B,TextA,2.7,Data_A,MAK_A,PPO,Value_A,Serial_A,BEJ_A,ASSOC_A,123
XXRT-5,TextX,2.56,Data_X,MAK_X,LIP,Value_X,Serial_X,BEJ_X,ASSOC_X,67
UUY,TextU,8.9,Data_U,MAK_U,ZZA,Value_U,Serial_U,BEJ_U,ASSOC_U,43

The desired output would be:
Code:

CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10
FRT7,MAK_F,TextF,Data_F,XVRT,,EET,,7,,
PLO-1,MAK_P,TextP,Data_P,,UT,,,1.01,,
UUY,MAK_U,TextU,Data_U,,,,5,8.9,,
TYUDA,MAK_T,TextT,Data_T,MM,,MIP,3,-23.01,,
XXRT-5,MAK_X,TextX,Data_X,,133,LK,,2.56,,

For every line whose code in col1 matches in both files, the columns to copy from infile2 into infile1 are as follows:

Col infile1 <- Col infile2
into col2 of file1 - copy col5 of file2
into col3 of file1 - copy col2 of file2
into col4 of file1 - copy col4 of file2
into col9 of file1 - copy col3 of file2

I hope somebody can help with an awk script; I don't know how to do this.

Thanks in advance.

druuna 09-28-2011 02:56 AM

Hi,

Have a look at this:
Code:

#!/bin/bash

awk '
BEGIN {
  # set field separator
  FS = ","
  # read infile2, store relevant parts into an array
  while ( ( getline < "infile2" ) > 0 )
    { infile2Array[$1,0] = $1
      infile2Array[$1,1] = $5
      infile2Array[$1,2] = $2
      infile2Array[$1,3] = $4
      infile2Array[$1,4] = $3 }
  # print header
  print "CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10"
}
{
  # fill in missing pieces from array and print line
  if ( $1 == infile2Array[$1,0] )
    { print $1"," infile2Array[$1,1]"," infile2Array[$1,2]"," infile2Array[$1,3]"," $5"," $6"," $7"," $8"," infile2Array[$1,4]"," $10"," }
}' infile1

exit 0
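
Assuming you save the script as, say, match.sh (the name is just an example) in the directory that holds infile1 and infile2, it can be run like this:
Code:

chmod +x match.sh
./match.sh > outfile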

Hope this helps.

cgcamal 09-28-2011 03:58 AM

Hi druuna,

Many thanks for reply.

I can't try it right now, but I will as soon as I can, hoping it will work.

Many thanks for your kind help.

Best regards.

druuna 09-28-2011 04:08 AM

Hi,

I did test the script with the input you provided in post #1, and the output is as described.

BTW: You're welcome :)

grail 09-28-2011 09:21 AM

Just for some variety:

Awk:
Code:

#!/usr/bin/awk -f

# comma is both the input and the output field separator
BEGIN{ OFS = FS = "," }

# first file on the command line (infile2): store the wanted fields per code
FNR == NR{
    arr[$1] = sprintf("%s,%s,%s,%s",$5,$2,$4,$3)
    next
}

# second file (infile1): for data lines whose code was stored above,
# fill in columns 2, 3, 4 and 9
FNR != 1 && $1 in arr{

    split(arr[$1],tmp)

    $2 = tmp[1]
    $3 = tmp[2]
    $4 = tmp[3]
    $9 = tmp[4]
}

# print every line of infile1 (a bare 1 is awk shorthand for { print })
1

Ruby (1.9+):
Code:

#!/usr/bin/ruby

# use a comma as both the output ($,) and the input ($;) field separator
BEGIN{ $, = $; = ","
    store = {}
}

while line = ARGF.gets
    arr = line.split
    # while the first file given (infile2) is being read, ARGV still holds
    # the name of the second file, so ARGV[0] is not nil yet
    unless ARGV[0].nil?
        store[arr[0]] = [ arr[4], arr[1], arr[3], arr[2]]
        next
    else
        # infile1: replace columns 2, 3, 4 and 9 if this code was seen in infile2
        arr[1], arr[2], arr[3], arr[8] = store[arr[0]] unless store[arr[0]].nil?
    end
    puts arr.join
end


cgcamal 09-28-2011 05:50 PM

Hi druuna,

Many thanks, I've tried it and it works as I expected.

Only 2 issues:

1- I was trying to print the complete first row instead of
Code:

print "CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10"

I tried with
Code:

NR=1 {print $0}

but I receive errors. How can I solve it? The original file has 32 headers and the header names change.

2- Sometimes there are commas inside some fields, but such a field is surrounded by double quotes to show that it is still a single field, as below (for example in lines 1 and 2).
Code:

ZZZ11-1,TextZ,33.01,"Data_Z, Data_Z1","MAK_Z,MAK_1,JT_Z",YUU,Value_Z,Serial_Z,BEJ_Z,ASSOC_Z,10
PLO-1,TextP,1.01,"Data_P,DataP1","MAK_P,MAK_P1",GSF,Value_P,Serial_P,BEJ_P,ASSOC_P,99
LLOQ-43,TextL,-12,Data_L,MAK_L,YRD,Value_L,Serial_L,BEJ_L,ASSOC_L,24
TYUDA,TextT,-23.01,Data_T,MAK_T,IGF,Value_T,Serial_T,BEJ_T,ASSOC_T,11
MMMJ1Y,TextM,0.02,Data_M,MAK_M,WR,Value_M,Serial_M,BEJ_M,ASSOC_M,342
FRT7,TextF,7,Data_F,MAK_F,NNM,Value_F,Serial_F,BEJ_F,ASSOC_F,56
MAYUP33,TextM,4.5,Data_M,MAK_M,QWW,Value_M,Serial_M,BEJ_M,ASSOC_M,86
AA13B,TextA,2.7,Data_A,MAK_A,PPO,Value_A,Serial_A,BEJ_A,ASSOC_A,123
XXRT-5,TextX,2.56,Data_X,MAK_X,LIP,Value_X,Serial_X,BEJ_X,ASSOC_X,67
UUY,TextU,8.9,Data_U,MAK_U,ZZA,Value_U,Serial_U,BEJ_U,ASSOC_U,43

How can I fix this?

Hi grail, thanks for your help,

Sorry, I wasn't able to run your code successfully, simply because I don't know how to run it.

I've tried running it like this:
Code:

awk -f 'BEGIN{}...' file2 file1

and

awk 'BEGIN{}...' file2 file1

but both ways did not work for me.

Could you explain how to run it, please?

Many thanks for help so far.

Regards

grail 09-28-2011 08:36 PM

Run the code as you would a normal script and pass the file names as arguments:
Code:

./script.awk file2 file1
./script.rb file2 file1
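# or, equivalently, call the interpreters directly (then no chmod +x is needed):
awk -f script.awk file2 file1
ruby script.rb file2 file1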

Although, looking at your latest reply, I would imagine my code will have issues with the quoted parts too.
If you have version 4+ of gawk it is a relatively easy fix; if not, it will be a bit more complicated.
As for Ruby, I am still learning, but I will give it a crack when I get back to my Linux machine.

druuna 09-29-2011 04:16 AM

Hi,
Quote:

Originally Posted by cgcamal (Post 4484866)
1- I was trying to print the completely first row

because the original file has 32 headers and the header names change.

Without specifics I can only guess. Assuming that each header line has a unique (first?) field that identifies it as a header, you can search for it and print it if found:
Code:

/^CODE/ { print }

Quote:

2- Sometimes there are commas inside some fields, but such a field is surrounded by double quotes to show that it is still a single field

How can I fix this?
Having a separator that is not unique makes things a lot harder. The first thing that comes to mind: can you change the way the in-files are generated so that they use a unique separator (a %, perhaps)?

Grail mentions that this can be done in (g)awk 4+ with relative ease, but to be honest, I wouldn't know how; I can only come up with a dirty solution:
Code:

#!/bin/bash

# Pass 1: rewrite both in-files so that | becomes the field separator.
# Every plain field is wrapped in |...|, the double quotes around quoted
# fields are turned into |, and the commas left inside those fields are
# turned into spaces by the sed at the end of the pipe.
for THISFILE in infile1 infile2
do
  awk 'BEGIN { FS="," }
  { i=0
    c=","
    while(i++<NF) {
          if (i==NF) { c="" }
          if ($i~/^"/) {              # field opens a quoted section
              sub("\"","|",$i)
              a=!a }
          if ($i~/"$/) {              # field closes a quoted section
              sub("\"","|",$i)
              a=!a
              printf "%s%s",$i,c
              continue }
          if (!a) {                   # plain field, outside any quotes
              printf "|%s|%s",$i,c
              continue }
          printf "%s%s",$i,c }        # field fragment inside a quoted section
    printf "\n" }' ${THISFILE} | sed -e 's/|,|/|/g' -e 's/^|//' -e 's/|$//' -e 's/,/ /g' > ${THISFILE}.tmp
done

# Pass 2: the same join as before, but on the |-separated temp files
awk 'BEGIN { FS = "|"
  while ( ( getline < "infile2.tmp" ) > 0 )
    { infile2Array[$1,0] = $1
      infile2Array[$1,1] = $5
      infile2Array[$1,2] = $2
      infile2Array[$1,3] = $4
      infile2Array[$1,4] = $3 } }

  /^CODE/ { print }
{
  if ( $1 == infile2Array[$1,0] )
    { print $1"," infile2Array[$1,1]"," infile2Array[$1,2]"," infile2Array[$1,3]"," $5"," $6"," $7"," $8"," infile2Array[$1,4]"," $10"," }
}' infile1.tmp | sed 's/|/,/g'

# clean up the temp files
rm infile{1,2}.tmp

exit 0

A test run (with one "exception" in the input):
Code:

$ cat infile1
CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10
PLO-1,,,,,UT,,,,,
FRT7,,,,XVRT,,EET,,,,
CODE,HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6,HEAD7,HEAD8,HEAD9,HEAD10
TYUDA,,,,MM,,MIP,3,,,
XXRT-5,,,,,133,LK,,,,

$ cat infile2
PLO-1,TextP,1.01,"Data_P,DataP1","MAK_P,MAK_P1",GSF,Value_P,Serial_P,BEJ_P,ASSOC_P,99
TYUDA,TextT,-23.01,Data_T,MAK_T,IGF,Value_T,Serial_T,BEJ_T,ASSOC_T,11
FRT7,TextF,7,Data_F,MAK_F,NNM,Value_F,Serial_F,BEJ_F,ASSOC_F,56
XXRT-5,TextX,2.56,"Data_X, ABCD",MAK_X,LIP,Value_X,Serial_X,BEJ_X,ASSOC_X,67

$ ./tester.sh
CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10
PLO-1,MAK_P MAK_P1,TextP,Data_P DataP1,,UT,,,1.01,,
FRT7,MAK_F,TextF,Data_F,XVRT,,EET,,7,,
CODE,HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6,HEAD7,HEAD8,HEAD9,HEAD10
TYUDA,MAK_T,TextT,Data_T,MM,,MIP,3,-23.01,,
XXRT-5,MAK_X,TextX,Data_X  ABCD,,133,LK,,2.56,,

Maybe someone has time to come up with a more elegant solution.....

Anyway, hope this helps.

grail 09-29-2011 02:08 PM

So, just my luck, the machine that has version 4 on it is not this one :( I will try that idea later.
Basically, the gold in version 4 that makes this simple is:
Code:

FPAT
This is a regular expression (as a string) that tells gawk to create the fields based on text that matches the regular expression. Assigning a value to FPAT overrides the use of FS and FIELDWIDTHS for field splitting. See Splitting By Content, for more information.
If gawk is in compatibility mode (see Options), then FPAT has no special meaning, and field-splitting operations occur based exclusively on the value of FS.

So like the Ruby solution below, this allows you to set what each field will look like as opposed to setting the field separator.
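
As a quick illustration of the idea (assuming gawk 4.0+; the sample line is taken from the data earlier in the thread, and the field pattern is the same one used in the full script further down):
Code:

echo 'PLO-1,TextP,1.01,"Data_P,DataP1","MAK_P,MAK_P1"' |
gawk 'BEGIN{ FPAT = "(\"[^\"]*\")|([^,]*)" } { print NF; print $4 }'
# prints 5, then "Data_P,DataP1" -- the quoted field stays whole, commas and all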

On the plus side, the Ruby solution only needed one line changed:
Code:

#!/usr/bin/ruby

BEGIN{ $, = $; = ","
    store = {}
}

while line = ARGF.gets
    #arr = line.split
    # each field is either a double-quoted string or a run of non-commas,
    # in both cases followed by a comma
    arr = line.scan(/("[^"]*"|[^,]*),/)
    unless ARGV[0].nil?
        store[arr[0]] = [ arr[4], arr[1], arr[3], arr[2]]
        next
    else
        arr[1], arr[2], arr[3], arr[8] = store[arr[0]] unless store[arr[0]].nil?
    end
    puts arr.join
end


cgcamal 09-29-2011 05:48 PM

Thanks both,

I'll test the code tonight. I think this will work for the cases where text is surrounded by double quotes too.

Best regards

grail 09-29-2011 06:15 PM

So the change for version 4 of gawk looks like:
Code:

#!/usr/bin/awk -f

BEGIN{ OFS = ","
    # build fields by content: either a double-quoted string or a run of non-commas
    FPAT = "(\"[^\"]*\")|([^,]*)"
}

FNR == NR{
    # join the stored fields on SUBSEP rather than on a comma, so that the
    # commas inside quoted fields survive being split apart again below
    arr[$1] = $5 SUBSEP $2 SUBSEP $4 SUBSEP $3
    next
}

FNR != 1 && $1 in arr{

    split(arr[$1],tmp,SUBSEP)

    $2 = tmp[1]
    $3 = tmp[2]
    $4 = tmp[3]
    $9 = tmp[4]
}

1
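
As before, the lookup file goes first on the command line; if you would rather not chmod the script, an equivalent call (script4.awk is just an example file name) is:
Code:

gawk -f script4.awk infile2 infile1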


cgcamal 09-30-2011 12:55 AM

Quote:

Originally Posted by druuna (Post 4485336)
Maybe someone has time to come up with a more elegant solution.....

Hi druuna,

It's very elegant to me, because it works. Thanks a lot for your help, really.

grail,

Thanks again for your solutions; the first one works for me as well. As for the second, I don't think I have version 4 of gawk.

Both solutions work very well. I first changed the files to use a pipe separator to avoid complications, then applied your solutions, and it worked great!

Many thanks for your help and time both.

Regards

druuna 09-30-2011 01:03 AM

Hi,
Quote:

Originally Posted by cgcamal (Post 4486190)
It's very elegant to me, because it works.

It is a working solution, but I wouldn't call it elegant :D

Quote:

Thanks a lot for your help, really.
You're welcome :)

@grail:
Quote:

Originally Posted by grail
So the change for version 4 of gawk looks like:

That is indeed a nice gawk addition!

