LinuxQuestions.org
01-15-2014, 06:19 AM   #1
francy_casa
Linux merge huge gzip files and keep only intersection


I have a very big tab-separated file called "main.txt", and I am trying to add information to it from many gzip-compressed files called "chr1.info.gz", "chr2.info.gz", "chr3.info.gz" and so on, each of which contains many more rows than the main file. I cannot decompress these files and save them first, because they are huge and I don't have the disk space to do this.

I would like to match the column called "name_id" (6th field) in the main file against the column called "rs_id" (3rd field) in each of the other files, and add the additional information from those files while keeping only the rows that are in the main file.

The main.txt file looks like this:

number maf effect se pval name_id
34 0.7844 0.2197 0.0848 0.009585 snp1
78 0.6655 -0.1577 0.0796 0.04772 snp2

The chr1.info.gz file looks like this:

use pos rs_id a1 a2 a3 a4
f 10584 snp34 0 0 0 0
g 10687 snp35 0 0 0 0
t 13303 snp1 0 0 0 0

The chr2.info.gz file looks like this:

use pos rs_id a1 a2 a3 a4
s 13328 snp67 0 0 0 0
g 10612 snp2 0 0 0 0
t 13303 snp10 0 0 0 0

…and so on

I would like to get the file main.all.gz with added info from the other files:

number maf effect se pval name_id use pos rs_id a1 a2 a3 a4
34 0.7844 0.2197 0.0848 0.009585 snp1 t 13303 snp1 0 0 0 0
78 0.6655 -0.1577 0.0796 0.04772 snp2 g 10612 snp2 0 0 0 0

I have tried with "join", but it looks like it requires decompressing the files, sorting them and saving them, and I get a message that there is not enough space on the device for this (I don't think my command is correct anyway):

join -1 6 -2 3 <(zcat main.txt | sort -k6,6) <(zcat chr1.info.gz | sort -k3,3 ) > try.txt
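
For comparison, here is a rough sketch of how the join attempt above might be adjusted. It is not a confirmed fix, and it rests on several assumptions: main.txt is plain text (so it is read directly rather than through zcat), both files are tab-separated with one header line each, and bash is the shell (process substitution). The sample data and the mktemp directory are made up from the examples in this post. As far as I know, sort writes temporary files when its input does not fit in memory, which may be where the "no space" message comes from; -T points those files at a directory of your choice, and GNU sort also has a --compress-program option to compress them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample data copied from the post, written as tab-separated files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'number\tmaf\teffect\tse\tpval\tname_id\n34\t0.7844\t0.2197\t0.0848\t0.009585\tsnp1\n78\t0.6655\t-0.1577\t0.0796\t0.04772\tsnp2\n' > main.txt
printf 'use\tpos\trs_id\ta1\ta2\ta3\ta4\nf\t10584\tsnp34\t0\t0\t0\t0\ng\t10687\tsnp35\t0\t0\t0\t0\nt\t13303\tsnp1\t0\t0\t0\t0\n' | gzip > chr1.info.gz

tab=$'\t'
# join and sort must agree on collation, so fix the locale for both.
export LC_ALL=C

# tail -n +2 drops the header line of each input before sorting.
# -T tells sort where to put its temporary files; "$workdir" is only a
# stand-in here for a directory that really has free space.
# -o lists the output columns: all 6 of main.txt, then all 7 of the info file.
join -t "$tab" -1 6 -2 3 \
    -o 1.1,1.2,1.3,1.4,1.5,1.6,2.1,2.2,2.3,2.4,2.5,2.6,2.7 \
    <(tail -n +2 main.txt | sort -t "$tab" -k6,6 -T "$workdir") \
    <(gzip -dc chr1.info.gz | tail -n +2 | sort -t "$tab" -k3,3 -T "$workdir") \
    > try.txt

cat try.txt
```

This handles one info file only, writes no header, and still has to sort every row of the large file, so it may remain too heavy for the real data.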

I have tried with awk but I am definitely doing several things wrong since it gives me an empty file, and I get stuck when using multiple files.
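
One possible awk approach, sketched under assumptions rather than tested on the real data: main.txt is small enough that its rows fit in memory, every file is tab-separated, and each info file starts with a header line whose 3rd field is literally "rs_id". The sample files below are made up from the examples in this post. The idea is to load main.txt into an array keyed on name_id, then stream all the compressed files through gzip -dc so nothing is ever decompressed to disk.

```shell
#!/bin/sh
set -eu

# Hypothetical sample data copied from the post, written as tab-separated files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'number\tmaf\teffect\tse\tpval\tname_id\n34\t0.7844\t0.2197\t0.0848\t0.009585\tsnp1\n78\t0.6655\t-0.1577\t0.0796\t0.04772\tsnp2\n' > main.txt
printf 'use\tpos\trs_id\ta1\ta2\ta3\ta4\nf\t10584\tsnp34\t0\t0\t0\t0\ng\t10687\tsnp35\t0\t0\t0\t0\nt\t13303\tsnp1\t0\t0\t0\t0\n' | gzip > chr1.info.gz
printf 'use\tpos\trs_id\ta1\ta2\ta3\ta4\ns\t13328\tsnp67\t0\t0\t0\t0\ng\t10612\tsnp2\t0\t0\t0\t0\nt\t13303\tsnp10\t0\t0\t0\t0\n' | gzip > chr2.info.gz

# gzip -dc writes the decompressed rows of all files to the pipe, one file
# after another. awk reads main.txt first, then "-" (the pipe).
gzip -dc chr*.info.gz |
awk -F'\t' -v OFS='\t' '
    # main.txt: keep the header, store every other row under its name_id (field 6).
    FILENAME == "main.txt" {
        if (FNR == 1) main_header = $0
        else row[$6] = $0
        next
    }
    # Header line of an info file: print the combined header once, skip repeats.
    $3 == "rs_id" {
        if (!header_done++) print main_header, $0
        next
    }
    # Data line: print it only if its rs_id (field 3) is present in main.txt.
    $3 in row { print row[$3], $0 }
' main.txt - | gzip > main.all.gz

gzip -dc main.all.gz
```

Rows come out in the order they appear in the chr files, not in the order of main.txt, and rows of main.txt with no match in any info file are dropped. If main.txt does not fit in memory, this sketch will not work as written.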

I have spent a day on this and can't find a good solution. Can you please help me solve this?

Thank you very much! -f
 
01-15-2014, 06:51 AM   #2
acid_kewpie
Moderator
 
Registered: Jun 2001
Location: UK
Distribution: Gentoo, RHEL, Fedora, Centos
Posts: 43,417

Rep: Reputation: 1985Reputation: 1985Reputation: 1985Reputation: 1985Reputation: 1985Reputation: 1985Reputation: 1985Reputation: 1985Reputation: 1985Reputation: 1985Reputation: 1985
Please post your thread in only one forum. Posting a single thread in the most relevant forum will make it easier for members to help you and will keep the discussion in one place. This thread is being closed because it is a duplicate.