[SOLVED] How to extract a subset from a huge dataset

cliffyao · 03-13-2010, 09:17 AM

Hi, All

I have a huge file, about 450 GB in size. Its format is as follows:

x1 50020 A 1
x1 50021 B 8
x1 50022 C 9
x1 50023 A 10
x2 50024 D 5
x2 50025 C 7
x2 50026 F 8
x2 50027 M 1
:
:

Now I want to extract a subset of this file in which column 1 is x10 and column 2 is between 600000 and 30000000. I wrote the following Perl script, but it doesn't work:

#!/usr/bin/perl

$file1 = $ARGV[0]; # Input file
$file2 = $ARGV[1]; # Output file

open (IN, $file1);
while ($line = <IN>)
{
    chomp($line);
    @array = split(/\t/,$line);

    if ($array[0] eq 'x10')
    {
        if (($array[1] >= 600000) && ($array[1] <= 26279795))
        {
            open (OUT, ">>$file2");
            print OUT "$line\n";
            close OUT;
        }
    }
}
close IN;
exit;

I guess the input and output files are both so big that my script can't handle them.

Does anyone know a good way to do this? Perl or shell scripts are preferred.

All your help will be appreciated!

raju.mopidevi · 03-13-2010, 09:20 AM

A good way of doing this is to use awk or sed.

cliffyao · 03-13-2010, 09:37 AM

Thanks. I know only a little about awk and am not familiar with sed at all. Do you know how to use awk or sed to do that? The dataset is tab-delimited, by the way.

crts · 03-13-2010, 09:47 AM

Code:

sed -n '/x10 600000/,/x10 30000000/p' inputFile > newFile

This should do it. I only used spaces in my regexp, so you will have to edit it. To type a tab in the shell, press Ctrl-v and then Tab. Alternatively, you could type this line in a text editor, where Tab will work as expected.

P.S.: You might want to create a smaller file with the same format and test it. Going through 450G of data might take a bit too long for testing purposes.
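A minimal sketch of both suggestions above, tried on a small made-up sample first. The file names sample.txt and subset_sed.txt and all sample values are hypothetical. Keep in mind that a sed range only starts and stops on lines that literally match the two patterns, so this assumes x10 rows with exactly 600000 and 30000000 in column 2 exist and that the rows are sorted. The trailing tab in each pattern keeps 600000 from also matching 6000001.

```shell
#!/bin/sh
# Build a tiny tab-delimited sample in the same four-column format (values are made up).
printf 'x9\t599999\tA\t1\n'    >  sample.txt
printf 'x10\t599999\tB\t2\n'   >> sample.txt
printf 'x10\t600000\tC\t3\n'   >> sample.txt
printf 'x10\t700000\tD\t4\n'   >> sample.txt
printf 'x10\t30000000\tE\t5\n' >> sample.txt
printf 'x10\t30000001\tF\t6\n' >> sample.txt

# Store a literal tab so it can be embedded in the patterns without Ctrl-v.
TAB=$(printf '\t')

# Print from the first line starting with "x10<TAB>600000<TAB>" through the
# first later line starting with "x10<TAB>30000000<TAB>".
sed -n "/^x10${TAB}600000${TAB}/,/^x10${TAB}30000000${TAB}/p" sample.txt > subset_sed.txt

cat subset_sed.txt
```

For the real data, something like `head -n 1000000 inputFile > sample.txt` would give a test file in the right format, although its first rows may not contain any x10 lines.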

grail · 03-14-2010, 01:20 AM

Code:

awk '$1 == "x10" && $2 >= 600000 && $2 <= 30000000 { print $0 }' in.txt > out.txt
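The same filter can be wrapped so that the key and the bounds are passed in rather than hard-coded. This is only a sketch: the function name extract_range, the file names, and the sample rows are made up for illustration. It sets the field separator to a tab explicitly and treats both bounds as inclusive.

```shell
#!/bin/sh
# extract_range KEY LO HI FILE (hypothetical helper):
# print rows whose column 1 equals KEY and whose column 2 lies in [LO, HI].
extract_range() {
    # "+ 0" forces numeric comparison of column 2 and the bounds.
    awk -F '\t' -v key="$1" -v lo="$2" -v hi="$3" \
        '$1 == key && $2 + 0 >= lo + 0 && $2 + 0 <= hi + 0' "$4"
}

# Tiny made-up sample in the thread's four-column, tab-delimited format.
printf 'x1\t700000\tA\t1\n'     >  in_sample.txt
printf 'x10\t599999\tB\t8\n'    >> in_sample.txt
printf 'x10\t600000\tC\t9\n'    >> in_sample.txt
printf 'x10\t30000000\tA\t10\n' >> in_sample.txt
printf 'x10\t30000001\tD\t5\n'  >> in_sample.txt

extract_range x10 600000 30000000 in_sample.txt > out_sample.txt
cat out_sample.txt
```

awk reads one line at a time, so memory use should not grow with the size of the input file.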

syg00 · 03-14-2010, 03:26 AM

And of course perl will do likewise just fine - using similar logic.

cliffyao · 03-16-2010, 03:11 PM

Thanks all. I just tried grail's awk command and it works!

I really appreciate everyone's help on my problem although I didn't get chance to try all of them.

Thanks all!

raju.mopidevi · 03-16-2010, 06:16 PM

Finally, you got it. That's good.
Mark this thread as SOLVED. You can do this from the top menu: Thread tools -> SOLVED

ghostdog74 · 03-16-2010, 10:02 PM

Quote:

Originally Posted by cliffyao

Thanks all. I just tried grail's awk command and it works!

I really appreciate everyone's help on my problem although I didn't get chance to try all of them.

Thanks all!

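You might want to stop processing as soon as column 2 of an x10 row goes past 30000000, instead of reading the rest of the file. This assumes the x10 rows are sorted on column 2.

Code:

awk '$1 == "x10" && $2 > 30000000 { exit } $1 == "x10" && $2 >= 600000 { print $0 }' in.txt > out.txt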
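Stopping early is only safe if the x10 rows appear in ascending order of column 2. Otherwise a matching row could still follow the point where awk exits. The sketch below checks that assumption. The function name check_sorted and the sample files are hypothetical, and it is meant to be run on a sample, since checking the full file costs a complete pass of its own.

```shell
#!/bin/sh
# check_sorted KEY FILE (hypothetical helper):
# succeed (exit 0) if column 2 never decreases among rows whose column 1 is KEY.
check_sorted() {
    awk -F '\t' -v key="$1" '
        $1 == key {
            if (seen && $2 + 0 < prev) { bad = 1; exit }  # out-of-order row found
            prev = $2 + 0
            seen = 1
        }
        END { exit bad }   # exit status 0 if sorted, 1 otherwise
    ' "$2"
}

# Made-up samples: one sorted, one with an out-of-order x10 row.
printf 'x10\t600000\tA\t1\nx10\t600001\tB\t2\nx11\t5\tC\t3\n' > sorted.txt
printf 'x10\t600001\tA\t1\nx10\t600000\tB\t2\n'               > unsorted.txt

for f in sorted.txt unsorted.txt; do
    if check_sorted x10 "$f"; then
        echo "$f: x10 rows are sorted, early exit is safe"
    else
        echo "$f: x10 rows are not sorted, scan the whole file"
    fi
done
```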

sundialsvcs · 03-16-2010, 10:14 PM

awk is, as you have seen, a very powerful tool that is expressly designed for tasks such as these. Learn it. Use it well.

The Perl programming language was heavily influenced by awk, along with sed and the shell. It has since taken on a life of its own, and most of that "life" these days comes out of a vast library of tested software that you can use with it.

There are many power-tools in the Unix/Linux environment, and, as the Perl folks would say:

TMTOWTDI = There's More Than One Way To Do It.