LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   merge columns from multiple files (https://www.linuxquestions.org/questions/linux-newbie-8/merge-columns-from-multiple-files-851336/)

vijay_babu1981 12-20-2010 05:19 AM

merge columns from multiple files
 
Hi all

I want to merge columns (selectively) from several files and create a new file with the merge output.

I saw some suggestions to use pr/paste to join the columns and then awk to pick-up the columns.
Code:

pr -m -t -s\  file1 file2 | gawk '{print $4,$5,$6,$1}'
But I have hundreds of files and I cannot manually pick up columns using awk as given in above solution. I want to pick up every 2nd column of each of the source file.

Can someone suggest a automated way?

Regards

catkin 12-20-2010 05:33 AM

Depending on what you mean by column, colrm might be useful.

vijay_babu1981 12-20-2010 05:44 AM

Hi

I can extract the required column from each file using awk in a for loop (or colrm). But how can I join the result of each of the for loop iteration columnwise and write to a file?

Regards
Vijay

grail 12-20-2010 05:49 AM

Well as usual an example would always help as we are shooting in the dark here, but how about just:
Code:

awk '{print $2}' file1 file2 file3 ...

catkin 12-20-2010 05:55 AM

Got it! The code in your OP isn't an attempt to solve the problem, it's colucix's illustration of a general technique, copied from his post.

Do I now rightly understand that you want to copy the "second space-separated column"s from a set of hundreds of files into a the output file? Do all the files have the same number of lines?

vijay_babu1981 12-20-2010 06:05 AM

Quote:

Originally Posted by grail (Post 4197550)
Well as usual an example would always help as we are shooting in the dark here, but how about just:
Code:

awk '{print $2}' file1 file2 file3 ...

Hello grail

Code:

awk '{print $2}' file1 file2 file3 ...
does not concatenate the output columns columnwise. One column is printed "below" the others so it is just one vertical line of columns.

I want the final output to be
A1 B1 C1...
A2 B2 C2...
A3 B3 C3...

If the original files had
x A1 y z
x A2 y z
x A3 y z

x B1 y z
x B2 y z
x B3 y z

x C1 y z
x C2 y z
x C3 y z

and so on...

vijay_babu1981 12-20-2010 06:07 AM

Quote:

Originally Posted by catkin (Post 4197556)
Got it! The code in your OP isn't an attempt to solve the problem, it's colucix's illustration of a general technique, copied from his post.

Do I now rightly understand that you want to copy the "second space-separated column"s from a set of hundreds of files into a the output file? Do all the files have the same number of lines?

yes the number of lines are same right now.

But I may have to deal with variable number of lines at a later stage.

Thanks.

catkin 12-20-2010 06:09 AM

How many lines in the hundreds of files?

vijay_babu1981 12-20-2010 06:10 AM

I am considering this

This may work...

vijay_babu1981 12-20-2010 06:12 AM

I am also considering writing rows instead of columns and then transposing the whole thing later. writing rows is easy :)

colucix 12-20-2010 06:26 AM

If it does not require too much memory (I cannot test with hundreds of files, right now), what about something like this?
Code:

awk '{_[FNR]=(_[FNR] OFS $2)}END{for (i=1; i<=FNR; i++) {sub(/^ /,"",_[i]); print _[i]}}' file*
The sub function removes the extra OFS (space) at the beginning of each row.

vijay_babu1981 12-21-2010 12:08 AM

Hi colucix

This works fine, thanks. It is easy to understand also.

Now when I try to apply this to files with unequal number of rows I am facing problem.

If original files were:
x A1 y z
x A2 y z

x B1 y z
x B2 y z
x B3 y z

Output becomes
A1 B1
A2 B2
B3

I would prefer
A1 B1
A2 B2
[space]B3

I think this will be difficult as 'awk print' will not know how many blanks spaces to put before actually printing the variable. Anyway please let me know if you great guys have a solution for this also.

Regards

colucix 12-21-2010 03:34 AM

This is more complex. The basic idea is to count how many fields have been previously added for each row number, and while the count of fields does not match the count of files (minus one) add spaces accordingly.

However, in order to have a correctly formatted table, I assumed the maximum length of the string in the second field to be 8 characters and every blank space added will be 8 characters length (see the format in the sprintf statements).

Here we go:
Code:

BEGIN {
  count = 1
}

FNR == 1 && FNR < NR {
  count++
}

{
  if ( _[FNR] ) {
    while ( c[FNR] != count-1 ) {
      _[FNR] = (_[FNR] " " sprintf("%8s",""))
      c[FNR]++
    }
    _[FNR] = (_[FNR] " " sprintf("%-8s",$2))
    c[FNR]++
  }
  else if ( count == 1 ) {
    _[FNR] = sprintf("%-8s",$2)
    c[FNR]++
  }
  else {
    while ( c[FNR] != count-1 ) {
      if ( _[FNR] )
        _[FNR] = (_[FNR] " " sprintf("%8s",""))
      else
        _[FNR] = sprintf("%8s","")
      c[FNR]++
    }
    _[FNR] = (_[FNR] " " sprintf("%-8s",$2))
    c[FNR]++
  }
}

END {
  for (i=1; i<=length(_); i++)
    print _[i]
}

If I have some files with a different number of rows, the output will be something like this:
Code:

$ awk -f test.awk file*
A1      B1      C1      D1      E1      F1      G1     
A2      B2      C2      D2      E2      F2      G2     
        B3                D3      E3      F3     
                          D4                F4     
                                            F5     
                                            F6

Feel free to ask for any clarification.

grail 12-21-2010 05:14 AM

Hey colucix, you inspired me to have a look at this one :)
Code:

{
    while( length(_[FNR]) < (ARGIND * 8))
        _[FNR]=_[FNR] sprintf("%8s","")

    _[FNR]=_[FNR] sprintf("%-8s",$2)
}

END{
    for(x=1; x < ARGC; x++)
        print _[x]
}


colucix 12-21-2010 05:35 AM

Wonderful, grail! :) It is the more compact version I was looking for, but I lost myself in the if/else condition approach. Just a little add-on to your code, to remove the extra spaces at the beginning of each line:
Code:

{
    while( length(_[FNR]) < (ARGIND * 8))
        _[FNR]=_[FNR] sprintf("%8s","")

    _[FNR]=_[FNR] sprintf("%-8s",$2)
}

END{
    for(x=1; x < ARGC; x++) {
        sub(/^ {8}/,"",_[x])
        print _[x]
    }
}

To enable the use of interval expressions we have to run awk with --re-interval, otherwise we can simply explicit the eight-spaces string in the regular expression:
Code:

sub(/^        /,"",_[x])
In alternative we might add a condition to check if _[FNR] already exist, but it will complicate the code making it similar to mine.

Well done, grail! You're awk skills have been improving a lot since I know you. +10 from me.


All times are GMT -5. The time now is 06:02 AM.