merge columns from multiple files

vijay_babu1981 · 12-20-2010, 05:19 AM

Hi all

I want to merge columns (selectively) from several files and create a new file with the merge output.

I saw some suggestions to use pr/paste to join the columns and then awk to pick-up the columns.

Code:

pr -m -t -s\  file1 file2 | gawk '{print $4,$5,$6,$1}'

But I have hundreds of files and I cannot manually pick up columns using awk as given in above solution. I want to pick up every 2nd column of each of the source file.

Can someone suggest a automated way?

Regards

catkin · 12-20-2010, 05:33 AM

Depending on what you mean by column, colrm might be useful.

vijay_babu1981 · 12-20-2010, 05:44 AM

Hi

I can extract the required column from each file using awk in a for loop (or colrm). But how can I join the result of each of the for loop iteration columnwise and write to a file?

Regards
Vijay

grail · 12-20-2010, 05:49 AM

Well as usual an example would always help as we are shooting in the dark here, but how about just:

Code:

awk '{print $2}' file1 file2 file3 ...

catkin · 12-20-2010, 05:55 AM

Got it! The code in your OP isn't an attempt to solve the problem, it's colucix's illustration of a general technique, copied from his post.

Do I now rightly understand that you want to copy the "second space-separated column"s from a set of hundreds of files into a the output file? Do all the files have the same number of lines?

vijay_babu1981 · 12-20-2010, 06:05 AM

Quote:

Originally Posted by grail

Well as usual an example would always help as we are shooting in the dark here, but how about just:

Code:

awk '{print $2}' file1 file2 file3 ...

Hello grail

Code:

awk '{print $2}' file1 file2 file3 ...

does not concatenate the output columns columnwise. One column is printed "below" the others so it is just one vertical line of columns.

I want the final output to be
A1 B1 C1...
A2 B2 C2...
A3 B3 C3...

If the original files had
x A1 y z
x A2 y z
x A3 y z

x B1 y z
x B2 y z
x B3 y z

x C1 y z
x C2 y z
x C3 y z

and so on...

vijay_babu1981 · 12-20-2010, 06:07 AM

Quote:

Originally Posted by catkin

Got it! The code in your OP isn't an attempt to solve the problem, it's colucix's illustration of a general technique, copied from his post.

Do I now rightly understand that you want to copy the "second space-separated column"s from a set of hundreds of files into a the output file? Do all the files have the same number of lines?

yes the number of lines are same right now.

But I may have to deal with variable number of lines at a later stage.

Thanks.

catkin · 12-20-2010, 06:09 AM

How many lines in the hundreds of files?

vijay_babu1981 · 12-20-2010, 06:10 AM

I am considering this

This may work...

vijay_babu1981 · 12-20-2010, 06:12 AM

I am also considering writing rows instead of columns and then transposing the whole thing later. writing rows is easy

colucix · 12-20-2010, 06:26 AM

If it does not require too much memory (I cannot test with hundreds of files, right now), what about something like this?

Code:

awk '{_[FNR]=(_[FNR] OFS $2)}END{for (i=1; i<=FNR; i++) {sub(/^ /,"",_[i]); print _[i]}}' file*

The sub function removes the extra OFS (space) at the beginning of each row.

vijay_babu1981 · 12-21-2010, 12:08 AM

Hi colucix

This works fine, thanks. It is easy to understand also.

Now when I try to apply this to files with unequal number of rows I am facing problem.

If original files were:
x A1 y z
x A2 y z

x B1 y z
x B2 y z
x B3 y z

Output becomes
A1 B1
A2 B2
B3

I would prefer
A1 B1
A2 B2
[space]B3

I think this will be difficult as 'awk print' will not know how many blanks spaces to put before actually printing the variable. Anyway please let me know if you great guys have a solution for this also.

Regards

colucix · 12-21-2010, 03:34 AM

This is more complex. The basic idea is to count how many fields have been previously added for each row number, and while the count of fields does not match the count of files (minus one) add spaces accordingly.

However, in order to have a correctly formatted table, I assumed the maximum length of the string in the second field to be 8 characters and every blank space added will be 8 characters length (see the format in the sprintf statements).

Here we go:

Code:

BEGIN {
  count = 1
}

FNR == 1 && FNR < NR {
  count++
}

{
  if ( _[FNR] ) {
    while ( c[FNR] != count-1 ) {
      _[FNR] = (_[FNR] " " sprintf("%8s",""))
      c[FNR]++
    }
    _[FNR] = (_[FNR] " " sprintf("%-8s",$2))
    c[FNR]++
  }
  else if ( count == 1 ) {
    _[FNR] = sprintf("%-8s",$2)
    c[FNR]++
  }
  else {
    while ( c[FNR] != count-1 ) {
      if ( _[FNR] )
        _[FNR] = (_[FNR] " " sprintf("%8s",""))
      else
        _[FNR] = sprintf("%8s","")
      c[FNR]++
    }
    _[FNR] = (_[FNR] " " sprintf("%-8s",$2))
    c[FNR]++
  }
}

END {
  for (i=1; i<=length(_); i++)
    print _[i]
}

If I have some files with a different number of rows, the output will be something like this:

Code:

$ awk -f test.awk file*
A1       B1       C1       D1       E1       F1       G1      
A2       B2       C2       D2       E2       F2       G2      
         B3                D3       E3       F3      
                           D4                F4      
                                             F5      
                                             F6

Feel free to ask for any clarification.

grail · 12-21-2010, 05:14 AM

Hey colucix, you inspired me to have a look at this one

Code:

{
    while( length(_[FNR]) < (ARGIND * 8))
        _[FNR]=_[FNR] sprintf("%8s","")

    _[FNR]=_[FNR] sprintf("%-8s",$2)
}

END{
    for(x=1; x < ARGC; x++)
        print _[x]
}

colucix · 12-21-2010, 05:35 AM

Wonderful, grail!

It is the more compact version I was looking for, but I lost myself in the if/else condition approach. Just a little add-on to your code, to remove the extra spaces at the beginning of each line:

Code:

{
    while( length(_[FNR]) < (ARGIND * 8))
        _[FNR]=_[FNR] sprintf("%8s","")

    _[FNR]=_[FNR] sprintf("%-8s",$2)
}

END{
    for(x=1; x < ARGC; x++) {
        sub(/^ {8}/,"",_[x])
        print _[x]
    }
}

To enable the use of interval expressions we have to run awk with --re-interval, otherwise we can simply explicit the eight-spaces string in the regular expression:

Code:

sub(/^        /,"",_[x])

In alternative we might add a condition to check if _[FNR] already exist, but it will complicate the code making it similar to mine.

Well done, grail! You're awk skills have been improving a lot since I know you. +10 from me.