LinuxQuestions.org
Old 08-27-2010, 06:16 PM   #1
sunrain66
LQ Newbie
 
Registered: Aug 2010
Posts: 3

Rep: Reputation: 0
How to subset a large dataset by specifying the start and end line?


Dear Everybody:

This is my first time on this forum. I am a statistician.

I am trying to subset a large dataset by specifying the start and end line. The dataset is large (more than 300 million lines), with around 1.2 million consecutive lines per person, so I would like to split it into one piece per person. I tried wrapping this in R code, but R seems to read from the top of the file down to the lines I want, even though I told it to skip the lines that other tasks have already read. As a result, memory use grows with the task ID, and in the end I got kicked off by the administrator.

I guess the shell could do this much more simply and elegantly. I first thought of the "split" command, but the file has a 10-line header, so I can't split it into even-sized chunks.

Thank you in advance.

Sunrain66
 
Old 08-27-2010, 09:58 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,935

Rep: Reputation: 2209
Most simple utilities will open the file and read (all the way) through it. If this is a plain text file and you know the line numbers of interest in advance, something like this will work
Code:
sed -n '105,20000 p ; 20000q' input.file > output.file
Adjust line numbers accordingly - the "20000q" stops it reading any further; saves all the unnecessary I/O reading to end of file.
If you wanted to do (say) three at once without re-reading the file, try this
Code:
sed -n -e '105,20000 w file1.out' -e '50000,60000 w file2.out' -e '80000,90000 w file3.out' -e 90000q input.file
Note the adjustment to the finishing line number.

These can also be passed as shell variables (use double quotes, not single) - any more than this probably needs some coding "smarts"; awk or perl.
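
A minimal sketch of the shell-variable form mentioned above, run against a small generated file so it can be tried safely. The names start, end, sample.txt and subset.txt are made up for illustration and do not come from the thread; the "q" on the last address stops sed as soon as the final wanted line has been printed.

```shell
# Build a 100-line sample file (a stand-in for the real data).
seq 1 100 > sample.txt

start=11
end=15

# Double quotes let the shell expand $start and $end before sed sees the script.
# "${end}q" quits right after the last wanted line, so the rest is never read.
sed -n "${start},${end}p;${end}q" sample.txt > subset.txt

cat subset.txt
```

With single quotes sed would receive the text $start and $end literally and reject the script, which is why the quoting matters here.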
 
Old 08-29-2010, 01:42 PM   #3
sunrain66
LQ Newbie
 
Registered: Aug 2010
Posts: 3

Original Poster
Rep: Reputation: 0
Hi, SYG00:

Thank you for your help.
I can run the sed and awk commands and write the part that I extract to a file.

My script runs without errors, but the new file tmpt.txt is empty.

i=$SGE_TASK_ID
nline=1199187
a=$i-1
st=$((a * nline + 11))
ed=$((i * nline + 10))
echo $st
echo $ed
export i;
awk 'NR=="$st", NR=="$ed" ' origin.txt > tmpt.txt

I also tried passing the variables "st" and "ed" to the sed command you suggested:
sed -n '"$st","$ed" p' origin.txt > tmpt.txt
It did not work.
Please help me.

Thanks.
 
Old 08-29-2010, 02:18 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Not knowing much about the underlying data it's hard to
make an educated suggestion ... for the splitting, assuming
that some field in each line indicates the person the line
belongs to, awk would be useful for the split. Consider
the following pseudo-awk-code. Something like that will
give you a file per person to work with.
Code:
{
  person=$X
  print $0 >> person
}

Cheers,
Tink

Last edited by Tinkster; 08-29-2010 at 04:49 PM. Reason: Simplified code
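
A runnable sketch of the pseudo-awk-code above. It assumes (this is not stated in the thread) that the person ID is the first whitespace-separated field, that each person's rows are mostly contiguous, and that the header lines should be skipped. The file names people.txt and person_<id>.txt are illustrative only.

```shell
# Tiny stand-in for the real data: 2 header lines, then "<person-id> <value>" rows.
printf '%s\n' 'header 1' 'header 2' \
  'p1 0.5' 'p1 0.7' 'p2 1.1' 'p2 1.3' 'p2 1.9' > people.txt

# Start clean, because the awk program below appends to its output files.
rm -f person_p1.txt person_p2.txt

# skip = number of header lines (10 in the original question, 2 in this sample).
awk -v skip=2 '
  NR <= skip { next }                 # ignore the header
  {
    out = "person_" $1 ".txt"         # one output file per person ID
    if (out != prev) {                # when the ID changes, close the previous
      if (prev != "") close(prev)     # file so few files are open at once
      prev = out
    }
    print >> out                      # append, in case an ID shows up again later
  }
' people.txt

wc -l person_p1.txt person_p2.txt
```

Closing the previous file whenever the ID changes keeps the number of open files small, which matters once there are a few hundred persons.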
 
Old 08-29-2010, 08:40 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,935

Rep: Reputation: 2209
Try the sed as
Code:
sed -n "$st,$ed p" origin.txt > tmpt.txt
Edit: one of the reasons I suggested stopping after finding the data is that you are more likely to find the start of the file in cache on a re-run. A later run that looks for data further into the file may well find it doesn't need to issue physical I/O for data already read by a previous run. Much faster.
If you read the whole file, that earlier data has a higher likelihood of (needlessly) being flushed (LRU algorithm), and will have to be re-fetched from disk. All of it.

Last edited by syg00; 08-29-2010 at 08:54 PM.
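
The awk attempt in post #3 returns nothing for the same quoting reason: inside single quotes awk receives "$st" as a literal string, so the comparison with NR never matches. Below is a minimal sketch of one way to hand the numbers over with -v, assuming the layout described in the thread (a fixed header followed by equal-sized blocks per person). The small values (header=2, nline=3, task=2) and the file names are stand-ins so the example runs on a generated file; in the original job they would be 10, 1199187 and $SGE_TASK_ID.

```shell
# Sample file: 2 header lines followed by 3 blocks of 3 lines each.
{ echo 'h1'; echo 'h2'; seq 1 9; } > origin_sample.txt

header=2        # 10 in the thread
nline=3         # 1199187 in the thread
task=2          # $SGE_TASK_ID in the thread

st=$(( (task - 1) * nline + header + 1 ))   # first line of this block
ed=$(( task * nline + header ))             # last line of this block

# -v copies the shell values into awk variables; exit stops reading after ed.
awk -v st="$st" -v ed="$ed" 'NR >= st { print } NR == ed { exit }' \
  origin_sample.txt > block.txt

cat block.txt
```

The exit plays the same role as the "q" in the sed version: nothing past the last wanted line is read.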
 
Old 09-01-2010, 12:26 PM   #6
sunrain66
LQ Newbie
 
Registered: Aug 2010
Posts: 3

Original Poster
Rep: Reputation: 0
Hi, SYG00:

How are you?
I tried sed -n "$st,$ed p" origin.txt > tmpt.txt
It works pretty well: it reads just the lines I specified, at a similar speed across the series of tasks, so the memory used by each task is stable. The admin of our cluster system did not complain.

So I would like to thank you for your help and time.
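
For reference, the "split" idea from the first post can also be made to work by stripping the header first. This is a rough sketch, assuming every person really has the same number of lines and that GNU coreutils split is available (for the -d numeric-suffix option); the file name and the chunk_ prefix are illustrative.

```shell
# Sample file: 2 header lines, then 3 "persons" of 3 lines each.
{ echo 'h1'; echo 'h2'; seq 1 9; } > origin_sample2.txt

header=2     # 10 in the thread
nline=3      # 1199187 in the thread

# tail -n +K starts output at line K, so this drops the header;
# split -l then cuts the remaining stream into nline-sized pieces
# named chunk_00, chunk_01, ...
tail -n +"$((header + 1))" origin_sample2.txt | split -l "$nline" -d - chunk_

ls chunk_0*
```

This reads the file once in a single job rather than once per task. With a few hundred persons, adding -a 3 to split would keep the numeric suffixes a uniform width.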
 
  

