LinuxQuestions.org
Old 11-12-2009, 12:06 AM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 98

Rep: Reputation: 15
sed script to parse a file into smaller files with set # of lines


Hey all,

I have a text file with 80 million lines that look like this:
@SRR016565.56469 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:511 length=76
GAGGACTTTCAAAGATAGGGATTAATTTGATCGCTGTTGGAATATTTTCAAATTATGAGGATATTATGCTAACCAC
+SRR016565.56469 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:511 length=76
>GI=C;BCI55/7I54;DCI6=D/?I?I.00%H65F0=C1-2,58;*@).+36018<'@..-1..+0-+0+/%&++
@SRR016565.56470 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:649 length=76
ATATACCTCCATTTATCCCTGCAACACAACACGAGTGTGTCACCCTATCTATCCAGATTCCCAAACATTTTAGATT
+SRR016565.56470 BI:081230_SL-XAQ_0001_FC30M3RAAXX:2:1:498:649 length=76
:3271,8I&1;:CF5+0:065+.4-.+-524*,/9).()+3''-&3))%,+%((%(*++*+%&$&*$'&)+%$4

I want to split these huge files into smaller files of around 4 million lines each. I wrote a sed script to do this, but it stops after creating the first output file and never returns to the command prompt, as if it were still doing something. Any ideas what my problem is?

Code:
for FileName in *.fastq
do
sed -n '1,4000000 s/./&/w $FileName.01' $FileName
sed -n '4000001,8000000 s/./&/w $FileName.02' $FileName
sed -n '8000001,12000000 s/./&/w $FileName.03' $FileName
sed -n '12000001,16000000 s/./&/w $FileName.04' $FileName
sed -n '16000001,20000000 s/./&/w $FileName.05' $FileName
sed -n '20000001,24000000 s/./&/w $FileName.06' $FileName
sed -n '24000001,28000000 s/./&/w $FileName.07' $FileName
sed -n '28000001,32000000 s/./&/w $FileName.08' $FileName
sed -n '32000001,36000000 s/./&/w $FileName.09' $FileName
sed -n '36000001,40000000 s/./&/w $FileName.10' $FileName
sed -n '40000001,44000000 s/./&/w $FileName.11' $FileName
sed -n '44000001,48000000 s/./&/w $FileName.12' $FileName
sed -n '48000001,52000000 s/./&/w $FileName.13' $FileName
sed -n '52000001,56000000 s/./&/w $FileName.14' $FileName
sed -n '56000001,60000000 s/./&/w $FileName.15' $FileName
sed -n '60000001,64000000 s/./&/w $FileName.16' $FileName
sed -n '64000001,68000000 s/./&/w $FileName.17' $FileName
sed -n '68000001,72000000 s/./&/w $FileName.18' $FileName
sed -n '72000001,76000000 s/./&/w $FileName.19' $FileName
sed -n '76000001,$ s/./&/w $FileName.20' $FileName
done
 
Old 11-12-2009, 12:17 AM   #2
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 240
Don't do more work than necessary: you can use csplit or split. Check their man pages.
Otherwise you can use awk. This example writes each group of 4 lines to its own file: file-1.txt, file-2.txt, and so on.
Code:
awk 'NR%4==1{if(out)close(out); out="file-"(++c)".txt"} {print > out}' file
I leave it to you to change it to suit your needs.
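
For anyone who wants to try the split route before touching the real data, here is a minimal sketch on a throwaway 12-line file. The file name sample.fastq, the temporary directory, and the 4-line chunk size are placeholders of mine, not anything from the original post; the real files would use 4000000 instead. The -d option (numeric suffixes) is a GNU coreutils extension, so this assumes GNU split. The awk command at the end is one possible adaptation of the one-liner above to a configurable chunk size.

```shell
#!/bin/sh
# Demo on a throwaway 12-line file; real data would use lines_per_chunk=4000000.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 12 > sample.fastq            # stand-in for real FASTQ data

lines_per_chunk=4                  # keep this a multiple of 4 so FASTQ records stay whole

for f in *.fastq; do
  # -l: lines per output file; -d: numeric suffixes (GNU); -a 2: two-digit suffix
  split -l "$lines_per_chunk" -d -a 2 "$f" "$f."
done

# awk variant of the same idea: start a new output file every n lines,
# closing the previous one so only one output file is open at a time.
awk -v n="$lines_per_chunk" '
  (NR - 1) % n == 0 { if (out) close(out); out = sprintf("%s.awk.%02d", FILENAME, ++c) }
  { print > out }
' sample.fastq

ls
```

Note that split numbers its pieces from 00, while the awk variant here starts at 01. Either way each input file is read only once.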
 
Old 11-12-2009, 12:20 AM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946
80 million lines? I can only imagine that that's just overwhelming sed's processing ability. Maybe it's a buffer issue or something?

I don't know if that's really the problem, but in any case I think you can probably make it more efficient by telling sed to ignore blocks of lines before the ones you want to print. Try using a pattern like this instead:
Code:
sed -n '1,100! { 101,200p }' file.txt
The ! negates the address, so the commands in the braces that follow run only on lines outside the range 1-100 (in this case, printing lines 101-200).

Edit: By the way, there's no need to use all those separate sed commands. Just use a single instance of sed with multiple "-e" expressions.
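
A rough sketch of the single-sed idea, run on a throwaway 12-line file rather than the real data. The file name sample.txt and the 4-line ranges are placeholders; with the real files the ranges would be 1,4000000 and so on. It uses sed's w command to write each range to its own file, and double quotes so that the shell expands $f before sed sees the expression.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 12 > sample.txt              # stand-in input, one number per line

f=sample.txt

# One pass over the input, three output files.
# Each -e holds one "address range + w filename" command; the file name
# runs to the end of its expression. \$ keeps the shell from treating
# sed's "last line" address as the start of a variable.
sed -n \
  -e "1,4w $f.01" \
  -e "5,8w $f.02" \
  -e "9,\$w $f.03" \
  "$f"

ls
```

Because everything happens in one invocation, the input is read once instead of once per output file.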

Last edited by David the H.; 11-12-2009 at 12:32 AM.
 
Old 11-12-2009, 11:51 AM   #4
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
The problem with sed in this case is that even if you were to do e.g. sed '1!d', it would still read the rest of the file. You might pipe the file into a while loop, and have a nested while loop that reads a preset number of lines:
Code:
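count=1

# The outer read fetches the first line of each chunk and ends the loop at EOF.
# IFS= and -r keep leading whitespace and backslashes in the data intact.
while IFS= read -r line; do
  {
    printf '%s\n' "$line"
    number=1
    # Copy up to 3,999,999 more lines into the current chunk.
    while [ "$number" -lt 4000000 ] && IFS= read -r line; do
      printf '%s\n' "$line"
      number=$((number + 1))
    done
  } > "$FileName.$(printf '%02d' "$count")"

  count=$((count + 1))
done < "$FileName"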
Sorry, this wasn't tested because I'm on M$ right now. Hopefully you get the idea, though.
Kevin Barry

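PS: You use $FileName inside single quotes, so the shell never expands it. Each sed command is told to write to a file literally named $FileName.01, $FileName.02, and so on.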
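
To make the quoting point concrete, here is a small sketch, again on a made-up 12-line sample.fastq rather than the real data. The variable names single and double are only for illustration. The last command also shows one way around the "still reads the rest of the file" problem for the first chunk, by having sed quit with q once the last wanted line has been written.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 12 > sample.fastq            # stand-in for real FASTQ data
FileName=sample.fastq

# Single quotes: the shell passes the text through untouched.
single='$FileName.01'
# Double quotes: the shell substitutes the variable first.
double="$FileName.01"
printf '%s\n%s\n' "$single" "$double"
# prints:
#   $FileName.01
#   sample.fastq.01

# First chunk only: write lines 1-4 to sample.fastq.01,
# then quit at line 4 instead of reading on to the end of the input.
sed -n -e "1,4w $FileName.01" -e '4q' "$FileName"
```

The early quit only helps for ranges near the start of the file; a range that begins at line 76000001 still has to read everything before it.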

Last edited by ta0kira; 11-12-2009 at 11:54 AM.
 
  

