How to subset a large dataset by specifying the starting & end line?
Dear Everybody:
This is my first time on this forum. I am a statistician.
I am trying to subset a large dataset by specifying the starting and ending lines. The dataset is pretty large (more than 300 million lines), with around 1.2 million lines per person, and the lines for each person are consecutive, so I would like to split the dataset into one file per person. I tried wrapping R code around this, but R seems to have to read from the top of the file down to the lines I want, even though I told each task to skip the lines that other tasks had already read. So memory use grows with the task ID, and eventually I got kicked off by the administrator.
I guess that the shell may do this much more simply and elegantly. First I thought of the "split" command, but the file has a 10-line header, so I can't split it into even-sized chunks.
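For what it's worth, the header problem can be sidestepped by stripping it before splitting — a minimal sketch using a tiny stand-in file (the real run would use -l 1200000 for the per-person line count):

```shell
# Demo stand-in for the real file: a 10-line header plus 6 data lines.
{ printf 'header %s\n' 1 2 3 4 5 6 7 8 9 10
  printf 'data %s\n' 1 2 3 4 5 6; } > input.file

# Drop the header, then split the remainder into fixed-size chunks
# (chunk_aa, chunk_ab, ...). Use -l 1200000 for the real data.
tail -n +11 input.file | split -l 3 - chunk_
```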
Most simple utilities will open the file and read (all the way) through it. If this is a plain text file and you know the line numbers of interest in advance, something like this will work
Code:
sed -n '105,20000 p ; 20000q' input.file > output.file
Adjust line numbers accordingly - the "20000q" stops it reading any further; saves all the unnecessary I/O reading to end of file.
If you wanted to do (say) three at once without re-reading the file, try this
Code:
sed -n -e '105,20000 w file1.out' -e '50000,60000 w file2.out' -e '80000,90000 w file3.out' -e 90000q input.file
Note the adjustment to the finishing line number.
These can also be passed as shell variables (use double quotes, not single) - any more than this probably needs some coding "smarts"; awk or perl.
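To illustrate the double-quote point — a sketch with a generated demo input, since the shell only expands $st and $ed inside double quotes:

```shell
# Demo input: 30000 numbered lines standing in for the real data file.
seq 1 30000 > input.file

st=105
ed=20000
# Double quotes let the shell expand $st/$ed before sed sees the script;
# the trailing "q" stops sed from reading past line $ed.
sed -n "${st},${ed} p ; ${ed}q" input.file > output.file
```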
I also tried to pass the variables "st" and "ed" to the sed command you suggested:
sed -n '"$st","$ed" p' origin.txt > tmpt.txt
It did not work.
Please help me.
Not knowing much about the underlying data it's hard to make an educated suggestion ... but for the splitting, assuming that some field in each line indicates the person the line belongs to, awk would be useful. Consider the following pseudo-awk code. Something like that will give you a file per person to work with.
Code:
{
  # replace X with the number of the column that holds the person id
  person = $X
  # append each line to a file named after that person
  print $0 >> person
}
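Fleshing that out into something runnable — a sketch that assumes the person id sits in the first column and the header length is known (both assumptions; adjust to the real layout):

```shell
# Demo input: 2-line header, then consecutive per-person rows with the
# person id in column 1.
printf 'h1\nh2\nalice 1\nalice 2\nbob 1\n' > input.file

# Skip the header; write each line to a file named after the person.
# close() the previous file when the id changes, so only one output
# descriptor is ever held open (ids are consecutive in this data).
awk 'NR > 2 {
    file = "person_" $1 ".txt"
    if (file != last) { if (last != "") close(last); last = file }
    print >> file
}' input.file
```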
Cheers,
Tink
Edit: one of the reasons I suggested stopping after finding the data is that you are more likely to find the start of the file in cache on a re-run. A later run that looks for data further into the file may well find it doesn't need to issue physical I/O for data already read by a previous run. Much faster.
If you read the whole file, that earlier data has a higher likelihood of (needlessly) being flushed (LRU algorithm), and will have to be re-fetched from disk. All of it.
How are you?
I tried sed -n "$st,$ed p" origin.txt > tmpt.txt
It works pretty well: it reads just the lines I specified, at a similar speed across the series of tasks, so the memory each task uses is stable. The admin of our cluster did not complain.
So I would like to thank you for your help and time.