How to subset a large dataset by specifying the starting & end line?
Dear Everybody:
This is my first time on this forum. I am a statistician.
I am trying to subset a large dataset by specifying the starting and ending lines. The dataset is pretty large (more than 300 million lines), with around 1.2 million lines per person, and the lines for each person are consecutive, so I would like to split the dataset into one file per person. I tried wrapping R code around this, but R seems to have to read from the top of the file down to the lines I want, even though I told each task to skip the lines that other tasks had already read. So memory use grows with the task ID, and eventually I got kicked off by the administrator.
I guess that the shell may do this much more simply and elegantly. First I thought of the "split" command, but the file has a 10-line header, so I can't split it into even-sized chunks.
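For what it's worth, the header problem can be sidestepped by stripping it before splitting — a minimal sketch using a tiny stand-in file (the real run would use -l 1200000 for the per-person line count):

```shell
# Demo stand-in for the real file: a 10-line header plus 6 data lines.
{ printf 'header %s\n' 1 2 3 4 5 6 7 8 9 10
  printf 'data %s\n' 1 2 3 4 5 6; } > input.file

# Drop the header, then split the remainder into fixed-size chunks
# (chunk_aa, chunk_ab, ...). Use -l 1200000 for the real data.
tail -n +11 input.file | split -l 3 - chunk_
```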
Most simple utilities will open the file and read (all the way) through it. If this is a plain text file and you know the line numbers of interest in advance, something like this will work
Code:
sed -n '105,20000 p ; 20000q' input.file > output.file
Adjust line numbers accordingly - the "20000q" stops it reading any further; saves all the unnecessary I/O reading to end of file.
If you wanted to do (say) three at once without re-reading the file, try this
Code:
sed -n -e '105,20000 w file1.out' -e '50000,60000 w file2.out' -e '80000,90000 w file3.out' -e 90000q input.file
Note the adjustment to the finishing line number.
These can also be passed as shell variables (use double quotes, not single) - any more than this probably needs some coding "smarts"; awk or perl.
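To illustrate the double-quote point — a sketch with a generated demo input, since the shell only expands $st and $ed inside double quotes:

```shell
# Demo input: 30000 numbered lines standing in for the real data file.
seq 1 30000 > input.file

st=105
ed=20000
# Double quotes let the shell expand $st/$ed before sed sees the script;
# the trailing "q" stops sed from reading past line $ed.
sed -n "${st},${ed} p ; ${ed}q" input.file > output.file
```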
I also tried to pass the variables "st" and "ed" to the sed command you suggested:
sed -n '"$st","$ed" p' origin.txt > tmpt.txt
It did not work.
Please help me.
Not knowing much about the underlying data it's hard to make an educated suggestion ... but for the splitting, assuming that some field in each line indicates the person the line belongs to, awk would be useful. Consider the following pseudo-awk code. Something like that will give you a file per person to work with.
Code:
{
  # replace X with the number of the column that holds the person id
  person = $X
  # append each line to a file named after that person
  print $0 >> person
}
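Fleshing that out into something runnable — a sketch that assumes the person id sits in the first column and the header length is known (both assumptions; adjust to the real layout):

```shell
# Demo input: 2-line header, then consecutive per-person rows with the
# person id in column 1.
printf 'h1\nh2\nalice 1\nalice 2\nbob 1\n' > input.file

# Skip the header; write each line to a file named after the person.
# close() the previous file when the id changes, so only one output
# descriptor is ever held open (ids are consecutive in this data).
awk 'NR > 2 {
    file = "person_" $1 ".txt"
    if (file != last) { if (last != "") close(last); last = file }
    print >> file
}' input.file
```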
Cheers,
Tink
Edit: one of the reasons I suggested stopping after finding the data is that you are more likely to find the start of the file in cache on a re-run. A later run that looks for data further into the file may well find it doesn't need to issue physical I/O for data already read by a previous run. Much faster.
If you read the whole file, that earlier data has a higher likelihood of (needlessly) being flushed (LRU algorithm), and will have to be re-fetched from disk. All of it.
How are you?
I tried sed -n "$st,$ed p" origin.txt > tmpt.txt
It works pretty well: it reads just the lines I specified, at a similar speed across the series of tasks, so the memory each task uses is stable. The admin of our cluster did not complain.
So I would like to thank you for your help and time.