sed script to parse a file into smaller files with set # of lines
Hey all,
I have a text file with 80 million lines that look like this:
@SRR016565.56469 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:511 length=76
GAGGACTTTCAAAGATAGGGATTAATTTGATCGCTGTTGGAATATTTTCAAATTATGAGGATATTATGCTAACCAC
+SRR016565.56469 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:511 length=76
>GI=C;BCI55/7I54;DCI6=D/?I?I.00%H65F0=C1-2,58;*@).+36018<'@..-1..+0-+0+/%&++
@SRR016565.56470 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:649 length=76
ATATACCTCCATTTATCCCTGCAACACAACACGAGTGTGTCACCCTATCTATCCAGATTCCCAAACATTTTAGATT
+SRR016565.56470 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:649 length=76
:3271,8I&1;:CF5+0:065+.4-.+-524*,/9):(.()+3''-&3))%,+%((%(*++*+%&$&*$'&)+%$4
I want to parse these huge files into smaller files with around 4 million lines each. I wrote a sed script to do this, but it stops after creating the first output file and acts as if it is still doing something (it doesn't return the command prompt). Any ideas what my problem is? Code:
for FileName in *.fastq
Don't do the unnecessary; you can use csplit or split. Check their man pages.
Otherwise you can use awk. This example writes each group of 4 lines to its own file: file-1.txt, file-2.txt, and so on. Code:
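# start a new output file every 4 lines; close the previous one so awk doesn't run out of open files
awk 'NR%4==1{close(f); f="file-" (++c) ".txt"} {print > f}' file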
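For reference, the split suggestion above might look roughly like the sketch below. The function name `split_by_lines` and the `reads.part.` prefix are made up for illustration; only the standard `split -l` option is used. Since 4,000,000 is a multiple of 4, each 4-line record should stay whole within one output file.

```shell
# split_by_lines FILE LINES PREFIX   (hypothetical helper name)
# Cuts FILE into pieces of LINES lines each, named PREFIXaa, PREFIXab, ...
split_by_lines() {
    split -l "$2" "$1" "$3"
}
```

Calling `split_by_lines reads.fastq 4000000 reads.part.` on an 80 million line file would be expected to give 20 pieces, reads.part.aa through reads.part.at.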
80 million lines? I can only imagine that that's just overwhelming sed's processing ability. Maybe it's a buffer issue or something?
I don't know if that's really the problem, but in any case I think you can probably make it more efficient by telling sed to ignore blocks of lines before the ones you want to print. Try using a pattern like this instead: Code:
sed -n '1,100! { 101,200p }' file.txt
Edit: By the way, there's no need to use all those separate sed commands. Just use a single instance of sed with multiple "-e" expressions.
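As a rough illustration of the single-instance idea, here is a sketch that uses sed's `w` command to send different line ranges to different files in one pass. The function name `write_ranges`, the chunk-N.txt names, and the tiny ranges are placeholders, not taken from the original script. The final `8q` makes sed quit after the last wanted line instead of reading on to the end of the file.

```shell
# write_ranges FILE   (hypothetical helper; ranges and file names are examples)
# One sed process, several -e expressions:
#   lines 1-4 are written to chunk-1.txt, lines 5-8 to chunk-2.txt,
#   then sed quits at line 8 so the rest of FILE is never read.
write_ranges() {
    sed -n \
        -e '1,4w chunk-1.txt' \
        -e '5,8w chunk-2.txt' \
        -e '8q' \
        "$1"
}
```

For 4 million line chunks the ranges would become 1,4000000 and 4000001,8000000 and so on, which gets unwieldy for 20 files; split or awk is probably the simpler tool at that size.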
The problem with sed in this case is that even if you were to do e.g. sed '1!d', it would still read the rest of the file. You might pipe the file into a while loop and have a nested while loop that reads a preset number of lines:
Code:
count=1
Kevin Barry
PS You use $FileName in single quotes, so the shell won't expand it.
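The rest of that loop isn't shown, so what follows is only a guess at the general shape: an outer while loop that opens a new output file, and an inner loop that reads a preset number of lines into it. The function name `split_with_loop`, its arguments, and the output naming are all assumptions made for this sketch.

```shell
# split_with_loop FILE LINES PREFIX   (hypothetical helper name)
# Writes PREFIX1.txt, PREFIX2.txt, ... with at most LINES lines each.
# The body runs in a subshell ( ... ) so its variables don't leak out.
split_with_loop() (
    lines_per_file=$2
    prefix=$3
    count=1
    # Outer loop: each successful read starts a new output file.
    while IFS= read -r line; do
        n=1
        {
            printf '%s\n' "$line"
            # Inner loop: keep reading from the same input until the
            # file holds lines_per_file lines or the input runs out.
            while [ "$n" -lt "$lines_per_file" ] && IFS= read -r line; do
                printf '%s\n' "$line"
                n=$((n + 1))
            done
        } > "${prefix}${count}.txt"
        count=$((count + 1))
    done < "$1"
)
```

A pure shell loop like this reads one line at a time, so on 80 million lines it will likely be far slower than split or awk; it is mainly useful when each chunk needs extra per-file handling.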