Linux merge huge gzip files and keep only intersection
I have one tab-separated file called "main.txt" and I am trying to add information to it from many gzipped files called "chr1.info.gz", "chr2.info.gz", "chr3.info.gz" and so on, which contain many more rows than the main file. Note that these files are compressed with gzip and I cannot decompress and save them first, because they are huge and I don't have the disk space to do this.
I would like to match the column called "name_id" (6th field) in the main file against the column called "rs_id" (3rd field) in the various gzipped files, appending the additional information from those files while keeping only the rows present in the main file:
The main.txt file looks like this:
number maf effect se pval name_id
34 0.7844 0.2197 0.0848 0.009585 snp1
78 0.6655 -0.1577 0.0796 0.04772 snp2
The chr1.info.gz like this:
use pos rs_id a1 a2 a3 a4
f 10584 snp34 0 0 0 0
g 10687 snp35 0 0 0 0
t 13303 snp1 0 0 0 0
The chr2.info.gz like this:
use pos rs_id a1 a2 a3 a4
s 13328 snp67 0 0 0 0
g 10612 snp2 0 0 0 0
t 13303 snp10 0 0 0 0
…and so on
I would like to get the file main.all.gz with added info from the other files:
number maf effect se pval name_id use pos rs_id a1 a2 a3 a4
34 0.7844 0.2197 0.0848 0.009585 snp1 t 13303 snp1 0 0 0 0
78 0.6655 -0.1577 0.0796 0.04772 snp2 g 10612 snp2 0 0 0 0
I have tried with "join", but it looks like it requires decompressing the files, sorting them and saving them, and when I try that I get a message that there is no space left on the device (I don't think I have the correct code anyway):
Please try to understand: sorting a file is a 'global operation'; it cannot be performed 'on-the-fly'. (Yes, sort(1) can be invoked as a filter, but it will create a temporary file to store the complete input, and only after EOF does it start the actual sorting.)
Thank you. What I am trying to ask is whether there is a way to do this other than with sort; for example, awk works without sorting, from my understanding. Or is that step always needed? Thanks again.
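For what it's worth, awk can indeed do this join without sorting, by indexing main.txt in memory and streaming the compressed files. This is only a sketch, and it assumes main.txt fits in RAM and that the fields are whitespace-separated as in the samples above:

```sh
# Sketch: sort-free join, assuming main.txt fits in memory.
# Pass 1 (NR == FNR): index each main.txt row by its 6th field (name_id).
# Pass 2: stream every chr*.info.gz row; when its 3rd field (rs_id)
# matches an indexed name_id, print the main row followed by the chr row.
zcat chr*.info.gz | awk '
    NR == FNR { main[$6] = $0; next }
    $3 in main { print main[$3], $0 }
' main.txt - | gzip > main.all.gz
```

Memory use is proportional to main.txt only, never to the compressed files. Note the header line is dropped, since "name_id" never equals "rs_id"; it can be prepended separately if needed.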
--compress-program=PROG
compress temporaries with PROG; decompress them with PROG -d
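With that option, the sort-and-join route can run without ever storing a decompressed copy on disk. A sketch, assuming GNU sort and join (the key field comes first in join's default output, so the column order differs slightly from the example above; -T can point the temporaries at whatever filesystem has room):

```sh
# Sketch: sort both inputs on their key fields, keeping sort's
# temporary files gzip-compressed, then join on name_id = rs_id.
join -1 6 -2 3 \
    <(sort -k6,6 --compress-program=gzip main.txt) \
    <(zcat chr*.info.gz | sort -k3,3 --compress-program=gzip) \
    | gzip > main.all.gz
```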
Otherwise, is main.txt a small file? In that case you can scan chr1 and chr2 for every line in main.txt without sorting. Joining without sorting first is usually very slow because of the rescanning, but if one of the tables is tiny (main.txt in this case) it's okay.
Thank you ntubski for your reply. Unfortunately all the files (including main.txt) are huge, but maybe what you suggested would at least not take up so much space, although I guess it will still take a long time. From what I understand, you are suggesting something like this?
for i in `awk '{print $6}' main.txt`
do
zgrep -w ${i} chr1.info.gz
done
No, there is no other way around this without sorting.
You could use something like what you had in post #7 (except that using for `...` that way would likely use up all your RAM; use while read instead), but it would be so much slower that it's probably not worth it.
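The while-read variant might look like the sketch below. It uses constant memory, but it rescans every chr file once per main.txt row, which is why it is far slower than a real join:

```sh
# Sketch: constant-memory lookup loop (very slow on big inputs, since
# each chr*.info.gz is decompressed and rescanned for every main row).
tail -n +2 main.txt | while read -r line; do
    id=$(printf '%s\n' "$line" | awk '{print $6}')   # name_id field
    for f in chr*.info.gz; do
        match=$(zcat "$f" | awk -v id="$id" '$3 == id { print; exit }')
        if [ -n "$match" ]; then
            printf '%s %s\n' "$line" "$match"
            break
        fi
    done
done | gzip > main.all.gz
```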