Linux merge huge gzip files and keep only intersection

francy_casa · 01-15-2014, 06:19 AM

I have a very big tab-separated file called "main.txt" and I am trying to add information to it from many gzipped files called "chr1.info.gz", "chr2.info.gz", "chr3.info.gz" and so on, which contain many more rows than the main file. Note that these files are compressed with gzip, and I cannot decompress them to disk first because they are huge and I don't have the space.

I would like to match the column called "name_id" (6th field) in the main file against the column called "rs_id" (3rd field) in each of the other files, and add the additional information from those files while keeping only the rows that are in the main file.

The main.txt file looks like this:

number maf effect se pval name_id
34 0.7844 0.2197 0.0848 0.009585 snp1
78 0.6655 -0.1577 0.0796 0.04772 snp2

The chr1.info.gz file looks like this:

use pos rs_id a1 a2 a3 a4
f 10584 snp34 0 0 0 0
g 10687 snp35 0 0 0 0
t 13303 snp1 0 0 0 0

The chr2.info.gz file looks like this:

use pos rs_id a1 a2 a3 a4
s 13328 snp67 0 0 0 0
g 10612 snp2 0 0 0 0
t 13303 snp10 0 0 0 0

…and so on

I would like to get the file main.all.gz with added info from the other files:

number maf effect se pval name_id use pos rs_id a1 a2 a3 a4
34 0.7844 0.2197 0.0848 0.009585 snp1 t 13303 snp1 0 0 0 0
78 0.6655 -0.1577 0.0796 0.04772 snp2 g 10612 snp2 0 0 0 0

I have tried with "join", but it looks like it requires unzipping the files, sorting them and saving them, and I get the message that there is not enough space left on the device for this (I don't think I have the correct code anyway):

join -1 6 -2 3 <(sort -k6,6 main.txt) <(zcat chr1.info.gz | sort -k3,3) > try.txt

I have tried with awk but I am definitely doing several things wrong since it gives me an empty file, and I get stuck when using multiple files.
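For reference, the kind of awk approach I was aiming for might look roughly like the sketch below. It is only a sketch under assumptions: the function name merge_info is made up for illustration, all files are assumed to be tab-separated with one header line each, and main.txt (the smaller file) is assumed to fit in memory as an awk array. The gzipped files are streamed through a pipe, so nothing is decompressed to disk and nothing needs sorting.

```shell
# Hypothetical helper: merge_info MAIN_FILE INFO_GZ...
# Assumes tab-separated input and that MAIN_FILE fits in memory.
merge_info() {
    main_file=$1
    shift
    # Decompress all info files to a single stream; awk reads it as "-".
    gzip -dc "$@" | awk -F'\t' -v OFS='\t' '
        # First input (main file): keep the header, index rows by name_id (field 6).
        NR == FNR {
            if (FNR == 1) header = $0
            else rows[$6] = $0
            next
        }
        # First line of the stream is an info header: print the combined header once.
        FNR == 1 { print header, $0; next }
        # Info rows whose rs_id (field 3) is in the main file are joined and printed.
        # Later info headers have "rs_id" in field 3 and are skipped like any non-match.
        $3 in rows { print rows[$3], $0 }
    ' "$main_file" -
}
```

It would be called as: merge_info main.txt chr*.info.gz | gzip > main.all.gz. Rows come out in the order they appear in the info files rather than in the order of main.txt, and main rows with no match in any info file are dropped.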

I've spent a day on this and can't find a good solution. Can you please help me solve this?

Thank you very much! -f

acid_kewpie · 01-15-2014, 06:51 AM

Please post your thread in only one forum. Posting a single thread in the most relevant forum will make it easier for members to help you and will keep the discussion in one place. This thread is being closed because it is a duplicate.