LinuxQuestions.org
Old 01-14-2014, 01:04 PM   #1
francy_casa
LQ Newbie
 
Registered: Sep 2011
Posts: 12

Rep: Reputation: Disabled
Linux merge huge gzip files and keep only intersection


I have one tab-separated file called "main.txt" and I am trying to add information to it from multiple gz files called "chr1.info.gz", "chr2.info.gz", "chr3.info.gz" and so on, which contain many more rows than the main file. Note that these files are compressed with gzip, and I cannot decompress and save them first because they are huge (and I don't have the disk space to do this).

I would like to match the column called "name_id" (6th field) in the main file against the column called "rs_id" (3rd field) in each of the other files, and add the additional information from those files while keeping only the rows present in the main file:

The main.txt file looks like this:

number maf effect se pval name_id
34 0.7844 0.2197 0.0848 0.009585 snp1
78 0.6655 -0.1577 0.0796 0.04772 snp2

The chr1.info.gz like this:

use pos rs_id a1 a2 a3 a4
f 10584 snp34 0 0 0 0
g 10687 snp35 0 0 0 0
t 13303 snp1 0 0 0 0

The chr2.info.gz like this:

use pos rs_id a1 a2 a3 a4
s 13328 snp67 0 0 0 0
g 10612 snp2 0 0 0 0
t 13303 snp10 0 0 0 0

…and so on

I would like to get the file main.all.gz with added info from the other files:

number maf effect se pval name_id use pos rs_id a1 a2 a3 a4
34 0.7844 0.2197 0.0848 0.009585 snp1 t 13303 snp1 0 0 0 0
78 0.6655 -0.1577 0.0796 0.04772 snp2 g 10612 snp2 0 0 0 0

I have tried with "join", but it looks like it requires unzipping the files, sorting them and saving them, and I get the message that I don't have enough space on the device for this (I don't think I have the correct code anyway):

join -1 6 -2 3 <(zcat main.txt | sort -k6,6) <(zcat chr1.info.gz | sort -k3,3 ) > try.txt

I have tried with awk but I am definitely doing several things wrong since it gives me an empty file, and I get stuck when using multiple files.

I've spent a day on this and can't find a good solution. Can you please help me solve this?

Thank you very much! -f
 
Old 01-15-2014, 09:06 AM   #2
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,863
Blog Entries: 1

Rep: Reputation: 1869
Yes, that's how sort works. Perhaps you should do a gunzip+sort+gzip step on a different computer that has more disk-space.
 
Old 01-15-2014, 09:16 AM   #3
francy_casa
LQ Newbie
 
Registered: Sep 2011
Posts: 12

Original Poster
Rep: Reputation: Disabled
It's too big and takes a long time. Do you have any suggestions that avoid unzipping and rewriting the files?
 
Old 01-15-2014, 09:27 AM   #4
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,863
Blog Entries: 1

Rep: Reputation: 1869
Please try to understand: sorting a file is a 'global operation', it cannot be performed 'on-the-fly'. (Yes, sort(1) can be invoked as a filter, but it has to read the complete input first, spilling to temporary files when the data does not fit in memory, and only after the EOF can it produce any output.)
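Because sort(1) spills to temporary files, a "no space left on device" message may come from the temporary directory rather than from the place the output is written. A minimal sketch, assuming GNU sort, tab-separated input, and a scratch directory on a filesystem with more room; `sort_gz` and the scratch path are illustrative names, not existing tools:

```shell
# sort_gz FIELD SCRATCH_DIR
# Reads gzip data on stdin, writes gzip data sorted on FIELD to stdout.
# Assumes tab-separated input; sort_gz and SCRATCH_DIR are illustrative names.
sort_gz() {
    tab=$(printf '\t')
    gzip -dc |
        LC_ALL=C sort -t "$tab" -k "$1,$1" -T "$2" --compress-program=gzip |
        gzip -c
}
```

It could be called as `sort_gz 3 /path/to/scratch < chr1.info.gz > chr1.sorted.info.gz`, where `/path/to/scratch` is a placeholder. `-T` only moves the temporaries; the full data still has to pass through sort once.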
 
Old 01-15-2014, 09:31 AM   #5
francy_casa
LQ Newbie
 
Registered: Sep 2011
Posts: 12

Original Poster
Rep: Reputation: Disabled
Hi NevemTeve,

Thank you. What I am trying to ask is whether there is a way to do this other than sort. For example, awk works without sorting, from my understanding. Or is this step always needed? Thanks again.
 
Old 01-15-2014, 09:40 AM   #6
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,781

Rep: Reputation: 2081
Maybe passing --compress-program=gzip to sort would help?

sort(1):
Code:
--compress-program=PROG
    compress temporaries with PROG; decompress them with PROG -d
Otherwise, is main.txt a small file? In that case you can scan chr1 and chr2 for every line in main.txt without sorting. Joining without sorting first is usually very slow because of the rescanning, but if one of the tables is tiny (main.txt in this case) it's okay.
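If main.txt does fit in memory, the scan-without-sorting idea can be done in a single pass over the info files by keeping main.txt in an awk array. A rough sketch, assuming tab-separated files and unique name_id values; `join_small_main` is a made-up helper name:

```shell
# join_small_main MAIN.txt INFO.gz...
# Writes the joined rows to stdout. Assumes tab-separated files and that
# MAIN.txt fits in memory (illustrative helper, not an existing tool).
join_small_main() {
    main=$1; shift
    gzip -dc "$@" |
        awk -F '\t' -v OFS='\t' '
            NR == FNR { row[$6] = $0; next }   # first file: index main rows by name_id
            $3 in row { print row[$3], $0 }    # info rows: keep only keys seen in main
        ' "$main" -
}
```

For example, `join_small_main main.txt chr1.info.gz chr2.info.gz | gzip -c > main.all.gz`. The header lines drop out because "name_id" and "rs_id" never match, so a header would have to be written separately. Rows come out in the order of the info files, not of main.txt.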
 
Old 01-15-2014, 10:17 AM   #7
francy_casa
LQ Newbie
 
Registered: Sep 2011
Posts: 12

Original Poster
Rep: Reputation: Disabled
Thank you ntubski for your reply. Unfortunately all the files (including main.txt) are huge files, but maybe what you suggested would at least not take up so much space, although I guess it will still take a long time.. From what I understand you are suggesting something like this?

for i in `awk 'NR > 1 {print $6}' main.txt`
do
zgrep -w "${i}" chr1.info.gz
done
 
Old 01-15-2014, 10:39 AM   #8
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,863
Blog Entries: 1

Rep: Reputation: 1869
> What I am trying to ask is whether there is a way other than sort to do this

You should decide whether sorting is necessary or not. You can create unsorted test-files, and see what happens.
 
Old 01-15-2014, 11:23 AM   #9
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,781

Rep: Reputation: 2081
Quote:
Originally Posted by francy_casa View Post
all the files (including main.txt) are huge files
In that case, I would suggest sorting every file and deleting the unsorted version first:
Code:
Warning: untested
gzip main.txt # assuming it's uncompressed
zcat main.txt.gz | sort -k6,6 --compress-program=gzip | gzip -c > main.sorted.txt.gz
rm main.txt.gz
for file in chr1.info.gz chr2.info.gz ; do
    zcat "$file" | sort -k3,3 --compress-program=gzip | gzip -c > "${file%.info.gz}.sorted.info.gz"
    rm "$file"
done
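Once every file is sorted on its key, the join itself can stream straight from the compressed files without writing anything decompressed to disk. A sketch only, assuming tab-separated data, 6 columns in main and 7 in the info files as in post #1, and that sort and join run under the same locale (LC_ALL=C here); `join_sorted` is an illustrative name:

```shell
# join_sorted MAIN.sorted.gz INFO.sorted.gz...
# Assumes tab-separated input, each file already sorted on its key
# (field 6 of MAIN, field 3 of INFO) under LC_ALL=C. Illustrative helper.
join_sorted() {
    main=$1; shift
    tab=$(printf '\t')
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/main.fifo" || return 1
    for info in "$@"; do
        # Stream main through a named pipe so it is never stored decompressed.
        gzip -dc "$main" > "$dir/main.fifo" &
        gzip -dc "$info" |
            LC_ALL=C join -t "$tab" -1 6 -2 3 \
                -o 1.1,1.2,1.3,1.4,1.5,1.6,2.1,2.2,2.3,2.4,2.5,2.6,2.7 \
                "$dir/main.fifo" -
        wait
    done
    rm -rf "$dir"
}
```

Something like `join_sorted main.sorted.txt.gz chr1.sorted.info.gz chr2.sorted.info.gz | gzip -c > main.all.gz` would then produce the merged rows. The header lines were sorted in with the data and do not match each other, so they are dropped and a header has to be added separately. main is decompressed once per info file, which costs CPU time but no disk space.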
 
1 members found this post helpful.
Old 01-15-2014, 11:41 AM   #10
francy_casa
LQ Newbie
 
Registered: Sep 2011
Posts: 12

Original Poster
Rep: Reputation: Disabled
Ok thank you, so I guess that means that there is no other way around this without sorting.
Thank you for your help!
 
Old 01-15-2014, 03:56 PM   #11
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,781

Rep: Reputation: 2081
Quote:
Originally Posted by francy_casa View Post
there is no other way around this without sorting.
You could use something like what you had in post #7 (except using for `...` that way would likely use up all your RAM, use while read instead), but it would be so much slower it's probably not worth it.
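For reference, a while read version of that loop might look roughly like this. It is a sketch under the assumption of tab-separated files with a header line in main.txt; `lookup_each` is a made-up name. It decompresses and rescans the info file once per row of main.txt, which is where the slowness comes from:

```shell
# lookup_each MAIN.txt INFO.gz
# For every name_id in MAIN.txt (field 6, header skipped), rescan INFO.gz
# and print the rows whose rs_id (field 3) is equal to it. Illustrative only.
lookup_each() {
    tail -n +2 "$1" | cut -f6 |
        while IFS= read -r id; do
            gzip -dc "$2" | awk -F '\t' -v id="$id" '$3 == id'
        done
}
```

Reading the ids one at a time keeps memory use flat, unlike the backtick form, which expands the whole list first.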
 
  


Tags
big, linux command, merge


