LinuxQuestions.org - [SOLVED] Merging columns from different files and renaming columns

- Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)

- - Merging columns from different files and renaming columns (https://www.linuxquestions.org/questions/linux-newbie-8/merging-columns-from-different-files-and-renaming-columns-4175579715/)

Merging columns from different files and renaming columns

Hi all,

I have ~ 10 files containing ~ 5000 rows of values (these are under tracking_id) for 2 samples (q1_0 and q2_0) i.e. data for 2 samples in each file - a total of 20 samples. If I use the head command for one of these files you can get an idea of what they look like:

tracking_id q1_0 q2_0
PF3D7_0100100 20.0472 12.8313
PF3D7_0100200 31.0172 56.0184
PF3D7_0100300 0.598181 174.531
PF3D7_0100400 1.47331 84.0354
PF3D7_0100500 0 0
PF3D7_0100600 0 0.549631
PF3D7_0100700 0 0
PF3D7_0100800 0 0.145775
PF3D7_0100900 0.582414 2.27694

Now I have 10 files each file containing these 5000 values for 2 samples so I have 20 samples (columns) in total. What I want to be able to do is merge all these 10 files together into one file so that the file looks like this:

tracking_id q1_0(from file1) q2_0(from file1) q1_0(from file2) q2_0(from file2) ......
PF3D7_0100100
PF3D7_0100200
PF3D7_0100300
PF3D7_0100400
PF3D7_0100500
PF3D7_0100600
PF3D7_0100700
PF3D7_0100800
PF3D7_0100900

iI want the tracking_id for the 5000 values to only appear once in the first column then I want the corresponding values for each of the 20 samples (which are currently split across 10 files). At the moment every file says q1 and q2 at the top of the columns and I have them stored in different subdirectories which list what the actual samples are. What I would like to do is rename the column headers as well so that the file looks like this:

tracking_id Sample1 Sample2 Sample3 Sample4.....Sample 20
PF3D7_0100100
PF3D7_0100200
PF3D7_0100300
PF3D7_0100400
PF3D7_0100500
PF3D7_0100600
PF3D7_0100700
PF3D7_0100800
PF3D7_0100900

Could someone please tell me how to achieve this? I can use the command paste to append all the columns together but then I end up with the tracking_id column repeated many times in the file.

Many thanks for any help!

Lynne

What you are trying to achieve sounds like a task for a SQL database or the R language.

Quote:

Originally Posted by lharrisl (Post 5545057)

... merge all these 10 files together into one file so that the file looks like this...

The examples look incomplete. Do you mean that each line should start with the tracking id (e.g. PF3D7_0100100) and then be followed with the corresponding 20 additional columns each containing data?

Code:

tracking_id            q1_0(from file1)        q2_0(from file1)  q1_0(from file2)        q2_0(from file2) ......

PF3D7_0100100  20.0472 12.8313  30.0533 13.888 ... and so on

If so, then that is very easy in perl with a hash of arrays. To do that, I'd read the files in order and use the tracking id as the hash key then the push() the q1_0 and q2_0 onto it, for each file, then print the whole thing out.

Quote:

Originally Posted by lharrisl (Post 5545057)

I can use the command paste to append all the columns together but then I end up with the tracking_id column repeated many times in the file.

On more close reading, if 'paste' does most of what you need, then perl would be too much and you can use 'awk' to skip the unnecessary columns.

Another option is to read the result of 'paste' into LibreOffice Calc, delete the unnecessary columns and then re-export it to Text CSV. Just be sure to check all the boxes for possible separator options, including space, when importing.

You could simply cut the unwanted columns from the result of paste.

Code:

paste -d " " <input files> | cut -d " " --complement -f $(seq -s "," 4 3 28) > <output file>

This assumes that the input files are delimited by a space character.
You would still need to change your column headings.

If it is using columns I would probably just use awk.

Try join instead of paste - in your case won't even need any flags. It's even possible to pipe the output to another join

Code:

join file1 file2 | join - file3

KISS

The join command worked very well and then I used R to rename the columns. Thank you everyone for your help!