LinuxQuestions.org
Old 03-13-2010, 09:17 AM   #1
cliffyao
LQ Newbie
 
Registered: Oct 2009
Posts: 27

Rep: Reputation: 15
how to extract a subset from a huge dataset


Hi, All

I have a huge file, about 450 GB in size. Its format is shown below:

x1 50020 A 1
x1 50021 B 8
x1 50022 C 9
x1 50023 A 10
x2 50024 D 5
x2 50025 C 7
x2 50026 F 8
x2 50027 M 1
:
:

Now I want to extract a subset from this file. In the subset, column 1 is x10 and column 2 ranges from 600000 to 30000000. I wrote the following Perl script, but it doesn't work:

#!/usr/bin/perl

$file1 = $ARGV[0]; # Input file
$file2 = $ARGV[1]; # Output file

open (IN, $file1);
while ($line = <IN>)
{
    chomp($line);
    @array = split(/\t/,$line);

    if ($array[0] eq 'x10')
    {
        if (($array[1] >= 600000) && ($array[1] <= 26279795))
        {
            open (OUT, ">>$file2");
            print OUT "$line\n";
            close OUT;
        }
    }
}
close IN;
exit;

I guess the input and output files are both too big for my script to handle.

Does anyone know a good way to do this? Perl or shell scripts are preferred.

All your help will be appreciated!
 
Old 03-13-2010, 09:20 AM   #2
raju.mopidevi
Senior Member
 
Registered: Jan 2009
Location: vijayawada, India
Distribution: openSUSE 11.2, Ubuntu 9.0.4
Posts: 1,155
Blog Entries: 12

Rep: Reputation: 92
A good way to do this is to use awk or sed.
 
Old 03-13-2010, 09:37 AM   #3
cliffyao
LQ Newbie
 
Registered: Oct 2009
Posts: 27

Original Poster
Rep: Reputation: 15
Thanks. I only know a little about awk, and I'm not familiar with sed at all. Do you know how to use awk/sed to do that? The dataset is tab-delimited, btw.
 
Old 03-13-2010, 09:47 AM   #4
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Code:
sed -n '/x10 600000/,/x10 30000000/p' inputFile > newFile
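This should do it. I only used spaces in my regexp, so you will have to edit it to use tabs. To type a tab in the shell, press Ctrl-v and then Tab. Alternatively, you could write this line in a text editor, where Tab works as expected. Note that sed prints from the first line matching the first pattern through the first line matching the second, so it relies on lines with exactly those two values being present in the file.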

P.S.: You might want to create a smaller file with the same format and test it. Going through 450G of data might take a bit too long for testing purposes.

Last edited by crts; 03-13-2010 at 09:57 AM.
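As a rough sketch of the two suggestions above (tabs in the pattern, and testing on a small file first), the following builds a tiny tab-delimited sample and runs the range match on it. The file names `sample.txt` and `sample_out.txt` and the sample rows are made up for illustration. Holding the tab in a shell variable is one way to avoid the Ctrl-v trick, and anchoring the pattern is an added precaution, not part of the original command.

```shell
# Build a small tab-delimited sample (hypothetical data and file names).
printf 'x9\t599999\tA\t1\nx10\t600000\tB\t8\nx10\t600001\tC\t9\nx10\t30000000\tA\t10\nx10\t30000001\tD\t5\nx11\t50024\tD\t5\n' > sample.txt

# Keep a literal tab in a variable so it can be used inside the pattern.
tab=$(printf '\t')

# ^ and the trailing tab stop "x10<TAB>600000" from also matching
# "x10<TAB>6000001" or a first column that merely ends in "x10".
sed -n "/^x10${tab}600000${tab}/,/^x10${tab}30000000${tab}/p" sample.txt > sample_out.txt

cat sample_out.txt
```

On the real data, something like `head -n 100000 inputFile > sample.txt` would give a quick test file, provided the first lines happen to contain the rows of interest.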
 
Old 03-14-2010, 01:20 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
Code:
awk '$1 == "x10" && $2 >= 600000 && $2 <= 30000000 { print $0 }' in.txt > out.txt
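The one-liner above relies on awk's default field splitting, which treats tabs and spaces alike. A variant sketch, assuming the file is strictly tab-delimited and that both ends of the range are meant to be included: the separator is set explicitly and the bounds are passed in as variables. The names `key`, `lo`, `hi` and the file names are hypothetical.

```shell
# Small tab-delimited sample (made-up rows) to try the filter on.
printf 'x9\t599999\tA\t1\nx10\t600000\tB\t8\nx10\t600001\tC\t9\nx10\t30000000\tA\t10\nx10\t30000001\tD\t5\nx11\t50024\tD\t5\n' > sample.txt

# -F'\t' splits on tabs only; -v passes the key and bounds into awk.
# A pattern with no action prints the matching line.
awk -F'\t' -v key="x10" -v lo=600000 -v hi=30000000 \
    '$1 == key && $2 >= lo && $2 <= hi' sample.txt > awk_out.txt

cat awk_out.txt
```

awk reads one line at a time, so memory use should stay flat regardless of the input size.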
 
Old 03-14-2010, 03:26 AM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
And of course perl will do likewise just fine - using similar logic.
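A minimal sketch of what that might look like in Perl, using the same logic as the awk filter. The script in the first post reopens and closes the output file for every matching line, which is likely a large part of why it struggles; here the output is opened once. The script name `extract.pl`, the sample data, and the inclusive upper bound of 30000000 are assumptions for illustration.

```shell
# Small tab-delimited sample (made-up rows).
printf 'x9\t599999\tA\t1\nx10\t600000\tB\t8\nx10\t600001\tC\t9\nx10\t30000000\tA\t10\nx10\t30000001\tD\t5\nx11\t50024\tD\t5\n' > sample.txt

# Write the Perl script; the quoted 'EOF' keeps the shell from expanding $ signs.
cat > extract.pl <<'EOF'
#!/usr/bin/perl
use strict;
use warnings;

my ($in_file, $out_file) = @ARGV;
open(my $in,  '<', $in_file)  or die "Cannot read $in_file: $!";
open(my $out, '>', $out_file) or die "Cannot write $out_file: $!";  # opened once

while (my $line = <$in>) {
    my ($name, $pos) = split /\t/, $line;   # only the first two fields are needed
    next unless $name eq 'x10';
    print {$out} $line if $pos >= 600000 && $pos <= 30000000;
}
close $out or die "Cannot close $out_file: $!";
EOF

perl extract.pl sample.txt perl_out.txt
cat perl_out.txt
```

Like the awk version, this reads the input line by line, so the 450 GB file never has to fit in memory.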
 
Old 03-16-2010, 03:11 PM   #7
cliffyao
LQ Newbie
 
Registered: Oct 2009
Posts: 27

Original Poster
Rep: Reputation: 15
Thanks all. I just tried grail's awk command and it works!

I really appreciate everyone's help with my problem, although I didn't get a chance to try all of the suggestions.

Thanks all!
 
Old 03-16-2010, 06:16 PM   #8
raju.mopidevi
Senior Member
 
Registered: Jan 2009
Location: vijayawada, India
Distribution: openSUSE 11.2, Ubuntu 9.0.4
Posts: 1,155
Blog Entries: 12

Rep: Reputation: 92
Glad you finally got it working.
Please mark this thread as SOLVED. You can do this from the top menu: Thread Tools -> SOLVED
 
Old 03-16-2010, 10:02 PM   #9
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by cliffyao View Post
Thanks all. I just tried grail's awk command and it works!

I really appreciate everyone's help with my problem, although I didn't get a chance to try all of the suggestions.

Thanks all!
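You might want to stop processing once column 2 of the x10 rows goes past 30000000, rather than reading the rest of the file. This assumes the x10 rows are sorted by column 2.

Code:
awk '$1 == "x10" { if ($2 > 30000000) exit; if ($2 >= 600000) print }' in.txt > out.txt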
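One caveat with exiting early: a test on column 2 alone would also fire on rows from other groups (an x1 row with a large position, say) and could stop awk before it ever reaches x10. The sketch below limits the exit to x10 rows. It assumes the x10 rows are sorted by column 2 in ascending order; the sample data and file names are made up.

```shell
# Sample with an x1 row beyond the upper bound, and one x10 row placed
# after the cut-off purely to show that awk has already stopped by then.
printf 'x1\t40000000\tA\t1\nx10\t500000\tB\t2\nx10\t600000\tC\t3\nx10\t30000000\tD\t4\nx10\t30000001\tE\t5\nx10\t700000\tF\t6\n' > sample_sorted.txt

# Only x10 rows are examined: exit once past the upper bound,
# otherwise print rows at or above the lower bound.
awk -F'\t' '$1 == "x10" { if ($2 > 30000000) exit; if ($2 >= 600000) print }' \
    sample_sorted.txt > early_out.txt

cat early_out.txt
```

If the sort order is not guaranteed, the plain range filter without `exit` is the safer choice, at the cost of reading the whole file.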
 
Old 03-16-2010, 10:14 PM   #10
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
awk is, as you have seen, a very powerful tool that is expressly designed for tasks such as these. Learn it. Use it well.

The Perl programming language drew heavily on awk, along with sed and the shell. It has since taken on a life of its own, and most of that "life" these days comes from the vast library of tested software that you can use with it.

There are many power-tools in the Unix/Linux environment, and, as the Perl folks would say:

TMTOWTDI = There's More Than One Way To Do It.
 
  

