LinuxQuestions.org
Visit Jeremy's Blog.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices



Reply
 
Search this Thread
Old 09-27-2011, 08:26 PM   #1
cgcamal
Member
 
Registered: Nov 2008
Location: Tegucigalpa
Posts: 72

Rep: Reputation: 16
Match codes in first column in 2 files and return data in other columns


Hi everyone,

Maybe somebody could help me.

I have this infile1 (10 columns):
Code:
CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10
FRT7,,,,XVRT,,EET,,,,
PLO-1,,,,,UT,,,,,
UUY,,,,,,,5,,,
TYUDA,,,,MM,,MIP,3,,,
XXRT-5,,,,,133,LK,,,,
And this infile2 (11 columns):
Code:
ZZZ11-1,TextZ,33.01,Data_Z,MAK_Z,YUU,Value_Z,Serial_Z,BEJ_Z,ASSOC_Z,10
PLO-1,TextP,1.01,Data_P,MAK_P,GSF,Value_P,Serial_P,BEJ_P,ASSOC_P,99
LLOQ-43,TextL,-12,Data_L,MAK_L,YRD,Value_L,Serial_L,BEJ_L,ASSOC_L,24
TYUDA,TextT,-23.01,Data_T,MAK_T,IGF,Value_T,Serial_T,BEJ_T,ASSOC_T,11
MMMJ1Y,TextM,0.02,Data_M,MAK_M,WR,Value_M,Serial_M,BEJ_M,ASSOC_M,342
FRT7,TextF,7,Data_F,MAK_F,NNM,Value_F,Serial_F,BEJ_F,ASSOC_F,56
MAYUP33,TextM,4.5,Data_M,MAK_M,QWW,Value_M,Serial_M,BEJ_M,ASSOC_M,86
AA13B,TextA,2.7,Data_A,MAK_A,PPO,Value_A,Serial_A,BEJ_A,ASSOC_A,123
XXRT-5,TextX,2.56,Data_X,MAK_X,LIP,Value_X,Serial_X,BEJ_X,ASSOC_X,67
UUY,TextU,8.9,Data_U,MAK_U,ZZA,Value_U,Serial_U,BEJ_U,ASSOC_U,43
The output desired would be:
Code:
CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10
FRT7,MAKF,TextF,DataF,XVRT,,EET,,7,,
PLO-1,MAKP,TextP,DataP,,UT,,,1.01,,
UUY,MAKU,TextU,DataU,,,,,8.9,,
TYUDA,MAKT,TextT,DataT,MM,,MIP,,-23.01,,
XXRT-5,MAKX,TextX,DataX,,133,LK,,2.56,,
The columns to copy from infile2 into infile1 for every match line in col1 in both files are in next order.

Col infile1-Col infile2
in col2 file1-copy col5 file2
in col3 file1-copy col2 file2
in col4 file1-copy col4 file2
in col9 file1-copy col3 file2

I hope somebody could help with a swk script. I don't know how to do this.

Thanks in advance.
 
Click here to see the post LQ members have rated as the most helpful post in this thread.
Old 09-28-2011, 03:56 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374
Hi,

Have a look at this:
Code:
#!/bin/bash

awk '
BEGIN { 
  # set field separator
  FS = ","
  # read infile2, store relevant parts into an array
  while ( ( getline < "infile2" ) > 0 )
    { infile2Array[$1,0] = $1
      infile2Array[$1,1] = $5
      infile2Array[$1,2] = $2
      infile2Array[$1,3] = $4
      infile2Array[$1,4] = $3 }
  # print header
  print "CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10"
}
{
  # fill in missing pieces from array and print line
  if ( $1 == infile2Array[$1,0] )
    { print $1"," infile2Array[$1,1]"," infile2Array[$1,2]"," infile2Array[$1,3]"," $5"," $6"," $7"," $8"," infile2Array[$1,4]"," $10"," }
}' infile1

exit 0
Hope this helps.

Last edited by druuna; 09-28-2011 at 04:00 AM.
 
1 members found this post helpful.
Old 09-28-2011, 04:58 AM   #3
cgcamal
Member
 
Registered: Nov 2008
Location: Tegucigalpa
Posts: 72

Original Poster
Rep: Reputation: 16
Hi druuna,

Many thanks for reply.

I'll try it, I can't now. But I'll try hoping it will work.

Many thanks for your kindly help.

Best regards.
 
Old 09-28-2011, 05:08 AM   #4
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374
Hi,

I did test the script with the input provided by you in post #1 and the output is as described.

BTW: You're welcome
 
Old 09-28-2011, 10:21 AM   #5
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,698

Rep: Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988
Just for some variety:

Awk:
Code:
#!/usr/bin/awk -f

BEGIN{ OFS = FS = "," }

FNR == NR{
    arr[$1] = sprintf("%s,%s,%s,%s",$5,$2,$4,$3)
    next
}

FNR != 1 && $1 in arr{

    split(arr[$1],tmp)
    
    $2 = tmp[1]
    $3 = tmp[2]
    $4 = tmp[3]
    $9 = tmp[4]
}

1
Ruby (1.9+):
Code:
#!/usr/bin/ruby

BEGIN{ $, = $; = ","
    store = {}
}

while line = ARGF.gets
    arr = line.split
    unless ARGV[0].nil?
        store[arr[0]] = [ arr[4], arr[1], arr[3], arr[2]]
        next
    else
        arr[1], arr[2], arr[3], arr[8] = store[arr[0]] unless store[arr[0]].nil?
    end
    puts arr.join
end

Last edited by grail; 09-28-2011 at 10:22 AM.
 
1 members found this post helpful.
Old 09-28-2011, 06:50 PM   #6
cgcamal
Member
 
Registered: Nov 2008
Location: Tegucigalpa
Posts: 72

Original Poster
Rep: Reputation: 16
Hi druuna,

Many thanks, I've tried and it works as I thought.

Only 2 issues.

1- I was trying to print the completely first row
instead of
Code:
print "CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10"
I tried with
Code:
NR=1 {print $0}
but receive errors. How to solve it?

because the original file has 32 headers and the header names change.

2- Sometimes inside some fields there are commas, but the field is surounded by double quotes to differentiate that is the same field as below (for example in line 1 and 2 in red).
Code:
ZZZ11-1,TextZ,33.01,"Data_Z, Data_Z1","MAK_Z,MAK_1,JT_Z",YUU,Value_Z,Serial_Z,BEJ_Z,ASSOC_Z,10
PLO-1,TextP,1.01,"Data_P,DataP1","MAK_P,MAK_P1",GSF,Value_P,Serial_P,BEJ_P,ASSOC_P,99
LLOQ-43,TextL,-12,Data_L,MAK_L,YRD,Value_L,Serial_L,BEJ_L,ASSOC_L,24
TYUDA,TextT,-23.01,Data_T,MAK_T,IGF,Value_T,Serial_T,BEJ_T,ASSOC_T,11
MMMJ1Y,TextM,0.02,Data_M,MAK_M,WR,Value_M,Serial_M,BEJ_M,ASSOC_M,342
FRT7,TextF,7,Data_F,MAK_F,NNM,Value_F,Serial_F,BEJ_F,ASSOC_F,56
MAYUP33,TextM,4.5,Data_M,MAK_M,QWW,Value_M,Serial_M,BEJ_M,ASSOC_M,86
AA13B,TextA,2.7,Data_A,MAK_A,PPO,Value_A,Serial_A,BEJ_A,ASSOC_A,123
XXRT-5,TextX,2.56,Data_X,MAK_X,LIP,Value_X,Serial_X,BEJ_X,ASSOC_X,67
UUY,TextU,8.9,Data_U,MAK_U,ZZA,Value_U,Serial_U,BEJ_U,ASSOC_U,43
How to fix this?

Hi grail, thanks for your help,

Sorry, I wasn't able to run succesfully your code, because simply I don't know how to run it.

I've tested doing:
Code:
awk -f 'BEGIN{}...' file2 file1

and

awk 'BEGIN{}...' file2 file1
but bith ways not working for me.

May you explain me how to do it please?

Many thanks for help so far.

Regards
 
Old 09-28-2011, 09:36 PM   #7
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,698

Rep: Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988
Run the code as you would a normal script and pass the file names as arguments:
Code:
./script.awk file2 file1
./script.rb file2 file1
Although looking at your latest reply I would imagine my code will have issues too with the quoted parts.
If you have version 4+ of gawk it is a relatively easy fix, if not it will be a bit more complicated.
As for Ruby i am still learning but will give it a crack when I get back to my linux machine.
 
1 members found this post helpful.
Old 09-29-2011, 05:16 AM   #8
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374
Hi,
Quote:
Originally Posted by cgcamal View Post
1- I was trying to print the completely first row

because the original file has 32 headers and the header names change.
Without specifics I can only guess. Assuming that each header line has a unique (first?) field that identifies it as header you can search for it and print if found:
Code:
/^CODE/ { print }

Quote:
2- Sometimes inside some fields there are commas, but the field is surounded by double quotes to differentiate that is the same field

How to fix this?
Having a separator that is not unique makes things a lot harder. First thing that comes to mind: Can you change the way the in-files are generated so they have a unique separator (a % perhaps)?

Grail mentions that this can be done in (g)awk 4+ with relative ease, but to be honest: I wouldn't know how, I can only come up with a dirty solution:
Code:
#!/bin/bash

for THISFILE in infile1 infile2
do
  awk 'BEGIN { FS="," }
  {  i=0
     c=","
     while(i++<NF) {
          if (i==NF) { c="" }
          if ($i~/^"/) {
               sub("\"","|",$i)
               a=!a }
          if ($i~/"$/) {
               sub("\"","|",$i)
               a=!a
               printf "%s%s",$i,c
               continue }
          if (!a) {
               printf "|%s|%s",$i,c
               continue }               
          printf "%s%s",$i,c }
     printf "\n" }' ${THISFILE} | sed -e 's/|,|/|/g' -e 's/^|//' -e 's/|$//' -e 's/,/ /g' > ${THISFILE}.tmp
done

awk 'BEGIN { FS = "|"
  while ( ( getline < "infile2.tmp" ) > 0 )
    { infile2Array[$1,0] = $1
      infile2Array[$1,1] = $5
      infile2Array[$1,2] = $2
      infile2Array[$1,3] = $4
      infile2Array[$1,4] = $3 } }

  /^CODE/ { print }
{
  if ( $1 == infile2Array[$1,0] )
    { print $1"," infile2Array[$1,1]"," infile2Array[$1,2]"," infile2Array[$1,3]"," $5"," $6"," $7"," $8"," infile2Array[$1,4]"," $10"," }
}' infile1.tmp | sed 's/|/,/g'

rm infile{1,2}.tmp

exit 0
A test run (with one "exception" highlighted):
Code:
$ cat infile1
CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10
PLO-1,,,,,UT,,,,,
FRT7,,,,XVRT,,EET,,,,
CODE,HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6,HEAD7,HEAD8,HEAD9,HEAD10
TYUDA,,,,MM,,MIP,3,,,
XXRT-5,,,,,133,LK,,,,

$ cat infile2
PLO-1,TextP,1.01,"Data_P,DataP1","MAK_P,MAK_P1",GSF,Value_P,Serial_P,BEJ_P,ASSOC_P,99
TYUDA,TextT,-23.01,Data_T,MAK_T,IGF,Value_T,Serial_T,BEJ_T,ASSOC_T,11
FRT7,TextF,7,Data_F,MAK_F,NNM,Value_F,Serial_F,BEJ_F,ASSOC_F,56
XXRT-5,TextX,2.56,"Data_X, ABCD",MAK_X,LIP,Value_X,Serial_X,BEJ_X,ASSOC_X,67

$ ./tester.sh
CODE,DATA1,DATA2,DATA3,DATA4,DATA5,DATA6,DATA7,DATA8,DATA9,DATA10
PLO-1,MAK_P MAK_P1,TextP,Data_P DataP1,,UT,,,1.01,,
FRT7,MAK_F,TextF,Data_F,XVRT,,EET,,7,,
CODE,HEAD1,HEAD2,HEAD3,HEAD4,HEAD5,HEAD6,HEAD7,HEAD8,HEAD9,HEAD10
TYUDA,MAK_T,TextT,Data_T,MM,,MIP,3,-23.01,,
XXRT-5,MAK_X,TextX,Data_X  ABCD,,133,LK,,2.56,,
Maybe someone has time to come up with a more elegant solution.....

Anyway, hope this helps.
 
2 members found this post helpful.
Old 09-29-2011, 03:08 PM   #9
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,698

Rep: Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988
So just my luck, the machine that has version 4 on it is not this one So will try that idea later.
Basically the gold of version 4 which makes it simple is:
Code:
FPAT #
This is a regular expression (as a string) that tells gawk to create the fields based on text that matches the regular expression. Assigning a value to FPAT overrides the use of FS and FIELDWIDTHS for field splitting. See Splitting By Content, for more information.
If gawk is in compatibility mode (see Options), then FPAT has no special meaning, and field-splitting operations occur based exclusively on the value of FS.
So like the Ruby solution below, this allows you to set what each field will look like as opposed to setting the field separator.

On the plus side, the Ruby solution changed one line:
Code:
#!/usr/bin/ruby

BEGIN{ $, = $; = ","
    store = {}
}

while line = ARGF.gets
    #arr = line.split
    arr = line.scan(/("[^"]*"|[^,]*),/)
    unless ARGV[0].nil?
        store[arr[0]] = [ arr[4], arr[1], arr[3], arr[2]]
        next
    else
        arr[1], arr[2], arr[3], arr[8] = store[arr[0]] unless store[arr[0]].nil?
    end
    puts arr.join
end
 
1 members found this post helpful.
Old 09-29-2011, 06:48 PM   #10
cgcamal
Member
 
Registered: Nov 2008
Location: Tegucigalpa
Posts: 72

Original Poster
Rep: Reputation: 16
Thanks both,

I'll test the codes at night over here. I think this will work for cases where there are text surounded with double quotes too.

Best regards
 
Old 09-29-2011, 07:15 PM   #11
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,698

Rep: Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988Reputation: 1988
So the change for version 4 of gawk looks like:
Code:
#!/usr/bin/awk -f

BEGIN{ OFS = "," 
    FPAT = "(\"[^\"]*\")|([^,]*)"
}

FNR == NR{
    arr[$1] = sprintf("%s,%s,%s,%s",$5,$2,$4,$3)
    next
}

FNR != 1 && $1 in arr{

    split(arr[$1],tmp,",")
    
    $2 = tmp[1]
    $3 = tmp[2]
    $4 = tmp[3]
    $9 = tmp[4]
}

1
 
1 members found this post helpful.
Old 09-30-2011, 01:55 AM   #12
cgcamal
Member
 
Registered: Nov 2008
Location: Tegucigalpa
Posts: 72

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by druuna View Post
Maybe someone has time to come up with a more elegant solution.....
Hi druuna,

It's very elegant for me, because it works, Thanks a lot for your help really.

grail,

Thanks again for your solution, the first one works for me either, the second I think I don't have version 4 of gawk.

Solution of both works very well. I changed first to pipe separator to avoid complications and apply your solutions and worked great!.

Many thanks for your help and time both.

Regards
 
Old 09-30-2011, 02:03 AM   #13
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374Reputation: 2374
Hi,
Quote:
Originally Posted by cgcamal View Post
It's very elegant for me, because it works,
It is a working solution, but I wouldn't call it elegant

Quote:
Thanks a lot for your help really.
You're welcome

@grail:
Quote:
Originally Posted by grail
So the change for version 4 of gawk looks like:
That is indeed a nice gawk addition!
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Insert column with awk or sed between two columns captainentropy Linux - Newbie 9 11-27-2014 12:49 PM
sort multiple columns + replace another column cedance Linux - Newbie 1 03-29-2011 06:59 AM
How to get parts of one column and prints all columns ? nasra2002 Linux - Software 2 12-13-2010 10:36 PM
how to Reading And Writing SAC (Seismic Analysis Code) Data Files in Fortran codes lengyue Linux - Software 2 12-03-2008 01:39 PM
return codes in c exvor Programming 4 01-21-2005 09:45 PM


All times are GMT -5. The time now is 02:45 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration