11-12-2009, 12:06 AM   #1
Registered: Dec 2007
Location: Queensland, Australia
Posts: 117

sed script to parse a file into smaller files with set # of lines

Hey all,

I have a text file with 80 million lines that look like this:
@SRR016565.56469 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:511 length=76
+SRR016565.56469 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:511 length=76
@SRR016565.56470 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:649 length=76
+SRR016565.56470 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:649 length=76

I want to split these huge files into smaller files of around 4 million lines each. I wrote a sed script to do this, but it stops after creating the first output file and then appears to keep running (it never returns to the command prompt). Any ideas what my problem is?

for FileName in *.fastq
sed -n '1,4000000 s/./&/w $FileName.01' $FileName
sed -n '4000001,8000000 s/./&/w $FileName.02' $FileName
sed -n '8000001,12000000 s/./&/w $FileName.03' $FileName
sed -n '12000001,16000000 s/./&/w $FileName.04' $FileName
sed -n '16000001,20000000 s/./&/w $FileName.05' $FileName
sed -n '20000001,24000000 s/./&/w $FileName.06' $FileName
sed -n '24000001,28000000 s/./&/w $FileName.07' $FileName
sed -n '28000001,32000000 s/./&/w $FileName.08' $FileName
sed -n '32000001,36000000 s/./&/w $FileName.09' $FileName
sed -n '36000001,40000000 s/./&/w $FileName.10' $FileName
sed -n '40000001,44000000 s/./&/w $FileName.11' $FileName
sed -n '44000001,48000000 s/./&/w $FileName.12' $FileName
sed -n '48000001,52000000 s/./&/w $FileName.13' $FileName
sed -n '52000001,56000000 s/./&/w $FileName.14' $FileName
sed -n '56000001,60000000 s/./&/w $FileName.15' $FileName
sed -n '60000001,64000000 s/./&/w $FileName.16' $FileName
sed -n '64000001,68000000 s/./&/w $FileName.17' $FileName
sed -n '68000001,72000000 s/./&/w $FileName.18' $FileName
sed -n '72000001,76000000 s/./&/w $FileName.19' $FileName
sed -n '76000001,$ s/./&/w $FileName.20' $FileName
11-12-2009, 12:17 AM   #2
Senior Member
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Don't do unnecessary work; you can use csplit or split. Check their man pages.
Otherwise you can use awk. This example starts a new output file every 4 lines, writing to file-1.txt, file-2.txt, and so on.
awk 'NR%4==1{++c}{print $0 > ("file-" c ".txt")}' file
I leave it to you to change it to suit your needs.
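
For the split route mentioned above, a minimal sketch might look like the following. The helper name split_lines and the sample file are made up for illustration, and -d (numeric suffixes) is a GNU split option, so this assumes GNU coreutils. Note the suffixes start at 00 rather than 01.

```shell
# Hypothetical helper: split_lines FILE LINES_PER_CHUNK
split_lines() {
  # -l: lines per output file
  # -d: numeric suffixes instead of aa, ab, ... (GNU extension)
  # -a 2: two-digit suffixes, giving FILE.00, FILE.01, ...
  split -l "$2" -d -a 2 -- "$1" "$1."
}

# Small demonstration on a throwaway 10-line file (assumed sample data).
demo_dir=$(mktemp -d)
seq 1 10 > "$demo_dir/sample.fastq"
split_lines "$demo_dir/sample.fastq" 4
ls "$demo_dir"
```

For the real files the call would be something like: for f in *.fastq; do split_lines "$f" 4000000; done. If the files are standard four-line FASTQ records, 4000000 is a multiple of 4, so no record gets cut in half.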
11-12-2009, 12:20 AM   #3
David the H.
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

80 million lines? I can only imagine that that's just overwhelming sed's processing ability. Maybe it's a buffer issue or something?

I don't know if that's really the problem, but in any case I think you can probably make it more efficient by telling sed to ignore blocks of lines before the ones you want to print. Try using a pattern like this instead:
sed -n '1,100! { 101,200p }' file.txt
The ! negates the address range, so the commands in the braces run only on lines outside 1-100 (in this case printing lines 101-200).

Edit: By the way, there's no need to use all those separate sed commands. Just use a single instance of sed with multiple "-e" expressions.
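
As a rough sketch of the single-instance idea, here is one sed call with three "-e" expressions, each writing its own line range to its own file in a single pass over the input. The file name and the tiny ranges are invented for the demonstration; the expressions are in double quotes so the shell expands $f, and the $ that means "last line" is escaped.

```shell
# Throwaway 10-line sample file (assumed data, not from the thread).
demo_dir=$(mktemp -d)
f="$demo_dir/sample.txt"
seq 1 10 > "$f"

# One pass over the file: each -e sends a range to a different output file.
# -n suppresses normal output; the w command does the writing.
sed -n -e "1,4w $f.01" -e "5,8w $f.02" -e "9,\$w $f.03" "$f"

wc -l "$f".0?
```

Scaled up, the ranges would be 1,4000000 and so on, one "-e" per output file.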

Last edited by David the H.; 11-12-2009 at 12:32 AM.
11-12-2009, 11:51 AM   #4
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

The problem with sed in this case is that even if you were to do e.g. sed '1!d', it would still read the rest of the file. You might pipe the file into a while loop and have a nested while loop that reads a preset number of lines:

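count=1
cat "$FileName" | while true; do
  number=0   # lines copied into the current chunk
  while [ "$number" -lt 4000000 ] && IFS= read -r line; do
    printf '%s\n' "$line"
    number=$((number + 1))
  done > "$FileName.$(printf '%02d' "$count")"
  [ "$number" -lt 4000000 ] && break   # a short chunk means end of input
  count=$((count + 1))
done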

Sorry, this wasn't tested because I'm on M$ right now. Hopefully you get the idea, though.
Kevin Barry

PS You use $FileName in single quotes, so the shell won't expand it.
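
Both points from this post can be checked on a small scale. The sketch below uses an invented file name and a tiny line range: the first part shows what the shell hands to sed under each kind of quoting, and the second shows the q command making sed stop at a given line instead of reading to the end of its input.

```shell
FileName=sample.fastq   # assumed name, for illustration only

# Single quotes: the text is passed through literally.
single=$(printf '%s\n' 'w $FileName.01')
# Double quotes: the shell substitutes the variable first.
double=$(printf '%s\n' "w $FileName.01")
echo "$single"   # prints: w $FileName.01
echo "$double"   # prints: w sample.fastq.01

# Print lines 3-5, then quit at line 5 rather than reading all 1000 lines.
early=$(seq 1 1000 | sed -n '3,5p;5q')
echo "$early"
```

So in the original script, the w command inside single quotes was given the literal name $FileName.01 rather than a name built from each .fastq file.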

Last edited by ta0kira; 11-12-2009 at 11:54 AM.

