Linux merge huge gzip files and keep only intersection
I have one tab-separated file called "main.txt" and I am trying to add information to it from many gzipped files called "chr1.info.gz", "chr2.info.gz", "chr3.info.gz" and so on, which contain many more rows than the main file. Note that these files are compressed with gzip and I cannot decompress and save them first, because they are huge and I don't have the disk space to do this.
I would like to match the column called "name_id" (6th field) in the main file against the column called "rs_id" (3rd field) in the various gzipped files, appending the additional information from those files while keeping only the rows present in the main file:
The main.txt file looks like this:
number maf effect se pval name_id
34 0.7844 0.2197 0.0848 0.009585 snp1
78 0.6655 -0.1577 0.0796 0.04772 snp2
The chr1.info.gz like this:
use pos rs_id a1 a2 a3 a4
f 10584 snp34 0 0 0 0
g 10687 snp35 0 0 0 0
t 13303 snp1 0 0 0 0
The chr2.info.gz like this:
use pos rs_id a1 a2 a3 a4
s 13328 snp67 0 0 0 0
g 10612 snp2 0 0 0 0
t 13303 snp10 0 0 0 0
…and so on
I would like to get the file main.all.gz with added info from the other files:
number maf effect se pval name_id use pos rs_id a1 a2 a3 a4
34 0.7844 0.2197 0.0848 0.009585 snp1 t 13303 snp1 0 0 0 0
78 0.6655 -0.1577 0.0796 0.04772 snp2 g 10612 snp2 0 0 0 0
I have tried with "join", but it looks like it requires decompressing the files, sorting them and saving them, and when I try that I get a message that there is no space left on the device (I don't think I have the correct code anyway):
Please try to understand: sorting a file is a 'global operation'; it cannot be performed 'on-the-fly'. (Yes, sort(1) can be invoked as a filter, but it will create a temporary file to store the complete input, and only after EOF does it start the actual sorting.)
Thank you. What I am trying to ask is whether there is a way to do this other than with sort; for example, awk works without sorting, from my understanding. Or is that step always needed? Thanks again.
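For what it's worth, awk can indeed do this join without sorting, by indexing main.txt in memory and streaming the compressed files. This is only a sketch, and it assumes main.txt fits in RAM and that the fields are whitespace-separated as in the samples above:

```sh
# Sketch: sort-free join, assuming main.txt fits in memory.
# Pass 1 (NR == FNR): index each main.txt row by its 6th field (name_id).
# Pass 2: stream every chr*.info.gz row; when its 3rd field (rs_id)
# matches an indexed name_id, print the main row followed by the chr row.
zcat chr*.info.gz | awk '
    NR == FNR { main[$6] = $0; next }
    $3 in main { print main[$3], $0 }
' main.txt - | gzip > main.all.gz
```

Memory use is proportional to main.txt only, never to the compressed files. Note the header line is dropped, since "name_id" never equals "rs_id"; it can be prepended separately if needed.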
--compress-program=PROG
compress temporaries with PROG; decompress them with PROG -d
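With that option, the sort-and-join route can run without ever storing a decompressed copy on disk. A sketch, assuming GNU sort and join (the key field comes first in join's default output, so the column order differs slightly from the example above; -T can point the temporaries at whatever filesystem has room):

```sh
# Sketch: sort both inputs on their key fields, keeping sort's
# temporary files gzip-compressed, then join on name_id = rs_id.
join -1 6 -2 3 \
    <(sort -k6,6 --compress-program=gzip main.txt) \
    <(zcat chr*.info.gz | sort -k3,3 --compress-program=gzip) \
    | gzip > main.all.gz
```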
Otherwise, is main.txt a small file? In that case you can scan chr1 and chr2 for every line in main.txt without sorting. Joining without sorting first is usually very slow because of the rescanning, but if one of the tables is tiny (main.txt in this case) it's okay.
Thank you ntubski for your reply. Unfortunately all the files (including main.txt) are huge, but maybe what you suggested would at least not take up so much space, although I guess it will still take a long time. From what I understand, you are suggesting something like this?
for i in `awk '{print $6}' main.txt`
do
zgrep -w ${i} chr1.info.gz
done
No, there is no other way around this without sorting.
You could use something like what you had in post #7 (except that using for `...` that way would likely use up all your RAM; use while read instead), but it would be so much slower that it's probably not worth it.
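The while-read variant might look like the sketch below. It uses constant memory, but it rescans every chr file once per main.txt row, which is why it is far slower than a real join:

```sh
# Sketch: constant-memory lookup loop (very slow on big inputs, since
# each chr*.info.gz is decompressed and rescanned for every main row).
tail -n +2 main.txt | while read -r line; do
    id=$(printf '%s\n' "$line" | awk '{print $6}')   # name_id field
    for f in chr*.info.gz; do
        match=$(zcat "$f" | awk -v id="$id" '$3 == id { print; exit }')
        if [ -n "$match" ]; then
            printf '%s %s\n' "$line" "$match"
            break
        fi
    done
done | gzip > main.all.gz
```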