How to subset a large dataset by specifying the start & end line?
Dear Everybody:
This is my first time on this forum. I am a statistician trying to subset a large dataset by specifying the start and end lines. The dataset is pretty large (more than 300 million lines), with around 1.2 million lines per person, so I would like to split it into one consecutive chunk per person. I tried wrapping this in R code, but R seems to have to read from the top of the file down to the section I want, even though I told each task to skip the lines that earlier tasks had already read. So memory use grows with the task ID, and eventually I got kicked off by the administrator. I suspect the shell could do this much more simply and elegantly. My first thought was the "split" command, but the file has a 10-line header, so I can't split it into even-sized chunks. Thank you in advance. Sunrain66 |
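For what it's worth, a minimal sketch of the header-then-split idea: strip the 10-line header first, then let split cut the rest into fixed-size per-person chunks. The file name and the tiny sizes below are placeholders for the demo; in the real case the chunk size would be the per-person line count (about 1.2 million).

```shell
# Toy stand-in for the real file: 10 header lines plus data lines.
printf 'h%s\n' 1 2 3 4 5 6 7 8 9 10 > big_dataset.txt
printf '%s\n' a b c d e f >> big_dataset.txt

# Strip the 10-line header, then split the remainder into fixed-size
# chunks (3 lines here; ~1.2 million in the real case), one file per
# person: person_00, person_01, ...
tail -n +11 big_dataset.txt | split -l 3 -d - person_
```

Note that tail only streams the file once; nothing is held in memory.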
Most simple utilities will open the file and read (all the way) through it. If this is a plain text file and you know the line numbers of interest in advance, something like this will work
Code:
sed -n '105,20000 p ; 20000q' input.file > output.file

If you wanted to do (say) three at once without re-reading the file, try this
Code:
sed -n -e '105,20000 w file1.out' -e '50000,60000 w file2.out' -e '80000,90000 w file3.out' -e 90000q input.file

The ranges can also be passed as shell variables (use double quotes, not single) - any more than this probably needs some coding "smarts": awk or perl. |
Hi, SYG00:
Thank you for your help. I can run the sed and awk commands and write the part that I extract to a file. But although my script runs, I find nothing in my new file tmpt.txt:
Code:
i=$SGE_TASK_ID
nline=1199187
a=$((i - 1))
st=$((a * nline + 11))
ed=$((i * nline + 10))
echo $st
echo $ed
export i
awk 'NR=="$st", NR=="$ed"' origin.txt > tmpt.txt

I also tried to pass the variables "st" and "ed" to the sed command you suggested:
Code:
sed -n '"$st","$ed" p' origin.txt > tmpt.txt

It did not work either. Please help me. Thanks. |
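An aside on why the awk above prints nothing: inside single quotes the shell does not expand $st and $ed, so awk compares NR against the literal strings "$st" and "$ed". One hedged alternative is awk's -v option, which passes shell values in explicitly (the tiny input and range values below are illustrative):

```shell
# Toy input standing in for origin.txt.
printf 'line%s\n' 1 2 3 4 5 6 7 8 > origin.txt

st=3
ed=5
# Pass shell variables into awk with -v instead of quoting tricks;
# NR is awk's current line number, so this prints lines st..ed.
awk -v st="$st" -v ed="$ed" 'NR >= st && NR <= ed' origin.txt > tmpt.txt
```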
Not knowing much about the underlying data it's hard to make an educated suggestion ... for the splitting, assuming that some field in each line indicates which person the line belongs to, awk would be useful for the split. Consider the following pseudo-awk-code (which field holds the person ID is a guess; $1 is assumed here):
Code:
{ print > ("person_" $1 ".txt") }

Something like that will give you a file per person to work with.
Cheers, Tink |
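That per-person idea can be sketched as a runnable example, assuming (hypothetically) that the person ID is the first whitespace-separated field; the data and file names below are made up for the demo:

```shell
# Toy data: person ID in field 1 (an assumption; adjust $1 to
# whichever field holds the ID in the real file).
printf '%s\n' 'p1 x' 'p1 y' 'p2 z' > data.txt

# One output file per person; close() keeps the number of open
# file handles bounded when there are many people, and >> keeps
# appending across the repeated open/close cycles.
awk '{ f = "person_" $1 ".txt"; print >> f; close(f) }' data.txt
```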
Try the sed as
Code:
sed -n "$st,$ed p" origin.txt > tmpt.txt

If you read the whole file, the earlier data has a higher likelihood of (needlessly) being flushed from the page cache (LRU algorithm), and will have to be re-fetched from disk. All of it. |
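One small refinement, in the spirit of the earlier 20000q example: quitting at the end line stops sed from reading the remainder of the (large) file at all. A sketch, with illustrative st/ed values:

```shell
# Toy input; in the real case st/ed would come from the task ID.
printf 'row%s\n' 1 2 3 4 5 6 7 > origin.txt

st=2
ed=4
# Print the range, then quit at line $ed so sed never reads
# past the end of the requested window.
sed -n "$st,$ed p; ${ed}q" origin.txt > tmpt.txt
```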
Hi, SYG00:
How are you? I tried
Code:
sed -n "$st,$ed p" origin.txt > tmpt.txt

It works pretty well: it reads just the lines I specified, at a similar speed across the series of tasks, so the memory each task uses is stable. The admin of our cluster system did not complain. Thank you for your help and time. |