LinuxQuestions.org


kmkocot 11-12-2009 12:06 AM

sed script to parse a file into smaller files with set # of lines
 
Hey all,

I have a text file with 80 million lines that look like this:
@SRR016565.56469 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:511 length=76
GAGGACTTTCAAAGATAGGGATTAATTTGATCGCTGTTGGAATATTTTCAAATTATGAGGATATTATGCTAACCAC
+SRR016565.56469 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:511 length=76
>GI=C;BCI55/7I54;DCI6=D/?I?I.00%H65F0=C1-2,58;*@).+36018<'@..-1..+0-+0+/%&++
@SRR016565.56470 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:649 length=76
ATATACCTCCATTTATCCCTGCAACACAACACGAGTGTGTCACCCTATCTATCCAGATTCCCAAACATTTTAGATT
+SRR016565.56470 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:649 length=76
:3271,8I&1;:CF5+0:065+.4-.+-524*,/9):(.()+3''-&3))%,+%((%(*++*+%&$&*$'&)+%$4

I want to split these huge files into smaller files of about 4 million lines each. I wrote a sed script to do this, but it stops after creating the first output file and yet acts as if it is still working (it never returns to the command prompt). Any ideas what my problem is?

Code:

for FileName in *.fastq
do
sed -n '1,4000000 s/./&/w $FileName.01' $FileName
sed -n '4000001,8000000 s/./&/w $FileName.02' $FileName
sed -n '8000001,12000000 s/./&/w $FileName.03' $FileName
sed -n '12000001,16000000 s/./&/w $FileName.04' $FileName
sed -n '16000001,20000000 s/./&/w $FileName.05' $FileName
sed -n '20000001,24000000 s/./&/w $FileName.06' $FileName
sed -n '24000001,28000000 s/./&/w $FileName.07' $FileName
sed -n '28000001,32000000 s/./&/w $FileName.08' $FileName
sed -n '32000001,36000000 s/./&/w $FileName.09' $FileName
sed -n '36000001,40000000 s/./&/w $FileName.10' $FileName
sed -n '40000001,44000000 s/./&/w $FileName.11' $FileName
sed -n '44000001,48000000 s/./&/w $FileName.12' $FileName
sed -n '48000001,52000000 s/./&/w $FileName.13' $FileName
sed -n '52000001,56000000 s/./&/w $FileName.14' $FileName
sed -n '56000001,60000000 s/./&/w $FileName.15' $FileName
sed -n '60000001,64000000 s/./&/w $FileName.16' $FileName
sed -n '64000001,68000000 s/./&/w $FileName.17' $FileName
sed -n '68000001,72000000 s/./&/w $FileName.18' $FileName
sed -n '72000001,76000000 s/./&/w $FileName.19' $FileName
sed -n '76000001,$ s/./&/w $FileName.20' $FileName
done


ghostdog74 11-12-2009 12:17 AM

Don't do unnecessary work; you can use csplit or split. Check their man pages.
Otherwise you can use awk. This example writes each group of 4 lines to its own file: file-1.txt, file-2.txt, and so on.
Code:

awk 'NR%4==1{close(f); f="file-" (++c) ".txt"} {print > f}' file
I leave it to you to adapt it to your needs.
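
For the split route mentioned above, a minimal sketch might look like the following. It builds a tiny 12-line stand-in file so the behaviour is easy to check; the file name demo.fastq and the chunk size of 4 are assumptions for illustration only, and the -d option (numeric suffixes) is a GNU coreutils extension.

```shell
# Build a small stand-in for a FASTQ file: 3 records of 4 lines each (12 lines).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
FileName=demo.fastq   # hypothetical file name, for illustration only
for i in 1 2 3; do
  printf '@read%s\nACGT\n+read%s\nIIII\n' "$i" "$i"
done > "$FileName"

# -l sets the number of lines per chunk; use 4000000 for the real data.
# Keep it a multiple of 4 so that no FASTQ record is cut in half.
# -a 2 asks for two-character suffixes; -d (GNU coreutils) makes them numeric.
split -l 4 -d -a 2 "$FileName" "$FileName."

# List the pieces: demo.fastq.00, demo.fastq.01, demo.fastq.02
ls "$FileName".*
```

split reads the input once and stops at end of file, so there is no need to know the total line count in advance.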

David the H. 11-12-2009 12:20 AM

80 million lines? I can only imagine that that's just overwhelming sed's processing ability. Maybe it's a buffer issue or something?

I don't know if that's really the problem, but in any case I think you can probably make it more efficient by telling sed to ignore blocks of lines before the ones you want to print. Try using a pattern like this instead:
Code:

sed -n '1,100! { 101,200p }' file.txt
The ! negates the address range, so the commands in the braces that follow run only on lines outside 1-100 (in this case, printing lines 101-200).

Edit: By the way, there's no need to use all those separate sed commands. Just use a single instance of sed with multiple "-e" expressions.
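
As a rough sketch of the single-instance idea, each "-e" expression can carry its own line range and its own w (write) target, so the input is read only once. The file name demo.txt and the tiny ranges are assumptions for illustration; the real ranges would be 1,4000000 and so on.

```shell
# Build a 12-line sample file to stand in for the real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
FileName=demo.txt   # hypothetical file name, for illustration only
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$FileName"

# One sed process, one pass over the input.
# Double quotes let the shell expand $FileName before sed sees the script;
# \$ keeps a literal $ (sed's "last line" address) in the final range.
sed -n \
  -e "1,4w $FileName.01" \
  -e "5,8w $FileName.02" \
  -e "9,\$w $FileName.03" \
  "$FileName"

ls "$FileName".*
```

The w file name runs to the end of its expression, which is why each range gets a separate "-e". POSIX only guarantees a small number of w files per script, so for many chunks split is probably the safer choice.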

ta0kira 11-12-2009 11:51 AM

The problem with sed in this case is that even if you were to do e.g. sed '1!d', it would still read the rest of the file. You might pipe the file into a while loop, and have a nested while loop that reads a preset number of lines:
Code:

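count=1

# Each pass of the outer loop writes one chunk; the loop ends at end of input.
cat "$FileName" | while true; do
  number=0
  out=$(printf '%s.%02d' "$FileName" "$count")

  # IFS= and -r keep leading whitespace and backslashes intact.
  while [ "$number" -lt 4000000 ] && IFS= read -r line; do
    printf '%s\n' "$line"
    number=$((number+1))
  done > "$out"

  # An empty chunk means the input is exhausted, so stop.
  [ -s "$out" ] || { rm -f "$out"; break; }
  count=$((count+1))
done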

Sorry, this wasn't tested because I'm on M$ right now. Hopefully you get the idea, though.
Kevin Barry

PS You use $FileName inside single quotes, so the shell never expands it and sed writes to a file literally named $FileName.01.
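
To illustrate the quoting point, here is a small sketch; the file name reads.fastq is just an assumed example.

```shell
FileName=reads.fastq   # hypothetical file name, for illustration only

# Single quotes: the shell passes the text through untouched.
single='$FileName.01'
# Double quotes: the shell substitutes the variable's value.
double="$FileName.01"

printf 'single quotes: %s\n' "$single"   # single quotes: $FileName.01
printf 'double quotes: %s\n' "$double"   # double quotes: reads.fastq.01
```

In the original script, switching the sed expressions to double quotes would make each w command write to the intended per-file name.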

