LinuxQuestions.org
Old 12-22-2009, 01:01 PM   #1
mcbenus
LQ Newbie
 
Registered: Feb 2007
Posts: 24

Rep: Reputation: 15
Split file based on number of string appearances


I am trying to split a large file into fragments. Within the text file there is a string that recurs thousands of times, and I wish to split the file at every 300th appearance. I have used csplit before, but I don't know how (or whether it is possible) to tell csplit to skip certain appearances of the string.

Alternatively, I thought of reading the file line by line, echoing each line into a new file, and counting each appearance of the string. When the count reaches 300, I would start echoing the lines into a new file (and restart the count). My problem is that I only know how to count the total appearances of the string in the file, using grep -c.

Can I count the appearances of the string "line by line" (with awk, maybe)? Alternatively, can I count the string using grep, but only within the first x lines of the file?

I've been using csh for this script.

This seems like a very inefficient method, so more elegant ways are welcome. Thanks!
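
Both counts asked about above can be sketched with standard tools. This is only an illustration: the function names `count_in_head` and `running_count` are made up for the example, and the search text is assumed to be a fixed string rather than a regular expression.

```shell
# Hypothetical helpers; the names are placeholders, not from the thread.

# Count the lines containing STRING among the first N lines of FILE.
# usage: count_in_head N STRING FILE
count_in_head() {
    head -n "$1" "$3" | grep -c -F -- "$2"
}

# Print "line_number running_count" for every input line, where
# running_count is the number of lines seen so far that contain STRING.
# usage: running_count STRING FILE
running_count() {
    awk -v s="$1" 'index($0, s) { c++ } { print NR, c + 0 }' "$2"
}
```

For example, `count_in_head 1000 string in.file` would print how many of the first 1000 lines of in.file contain `string`.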
 
Old 12-22-2009, 01:41 PM   #2
gnashley
Amigo developer
 
Registered: Dec 2003
Location: Germany
Distribution: Slackware
Posts: 4,928

Rep: Reputation: 612
Bash or (maybe) sh:
Code:
COUNT=0
OUT=1
while read LINE ; do
case $LINE in
 *"string*) echo $LINE >> out.file ; ((COUNT++)) ;;
 *) [ $COUNT -lt 300 ] && echo $LINE >> $OUT.file
esac
if [ $COUNT -eq 300 ] ; then
 COUNT=0
 ((OUT++))
fi
done< in.file
 
Old 12-22-2009, 03:40 PM   #3
mcbenus
LQ Newbie
 
Registered: Feb 2007
Posts: 24

Original Poster
Rep: Reputation: 15
Thanks for the reply and the code. I am not familiar with bash scripts, but I kind of get it. I replaced in.file with my input file and the string with my string (and added another " after the string so the script would run). However, something goes wrong with COUNT, because it stays at 0. The script creates only one file (1.file), which contains many more than 300 appearances of the string. Though I understand what you wrote, I can't find why COUNT doesn't increase by 1.

Any ideas?

 
Old 12-22-2009, 06:33 PM   #4
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Unless you only have the shell to work with, use awk (or another language) that is good at parsing big files.
Code:
# counts lines that contain the pattern; a line counts once, no matter how many times the pattern appears on it
awk '/pattern/{++c}c==300{p++;c=0}{print $0 > ("file_"p".txt") }' file
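
The comment in the one-liner matters if the string can occur more than once on a line. A rough sketch of how the two counts differ, using a hypothetical helper `count_lines_and_occurrences` (not from the thread): awk's `gsub` returns the number of replacements it made, and replacing each match with `&` leaves the line unchanged.

```shell
# Hypothetical helper: prints "<matching lines> <total occurrences>".
# PATTERN is treated as an awk regular expression.
# usage: count_lines_and_occurrences PATTERN FILE
count_lines_and_occurrences() {
    awk -v pat="$1" '
        $0 ~ pat { lines++ }            # the line matches at least once
        { total += gsub(pat, "&") }     # add the number of matches on this line
        END { print lines + 0, total + 0 }
    ' "$2"
}
```

If the two numbers differ for your file, splitting on every 300th matching line is not the same as splitting on every 300th occurrence.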

Last edited by ghostdog74; 12-22-2009 at 06:34 PM.
 
Old 12-22-2009, 07:37 PM   #5
smeezekitty
Senior Member
 
Registered: Sep 2009
Location: Washington U.S.
Distribution: M$ Windows / Debian / Ubuntu / DSL / many others
Posts: 2,339

Rep: Reputation: 231
Code:
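COUNT=0
OUT=1
# IFS= and -r stop read from trimming whitespace or eating backslashes
while IFS= read -r LINE ; do
case $LINE in
 # line contains the string: write it to the current chunk and count it
 *"string"*) printf '%s\n' "$LINE" >> "$OUT.file" ; ((COUNT++)) ;;
 # any other line just goes to the current chunk
 *) printf '%s\n' "$LINE" >> "$OUT.file" ;;
esac
# after 300 matches, start the next numbered file
if [ "$COUNT" -eq 300 ] ; then
 COUNT=0
 ((OUT++))
fi
done < in.file
You did replace "string" with the proper string, right?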
 
Old 12-24-2009, 11:35 AM   #6
mcbenus
LQ Newbie
 
Registered: Feb 2007
Posts: 24

Original Poster
Rep: Reputation: 15
Thanks for the reply. I should be able to put this line in my csh shell, right?

I tried to do that, but I am getting an error saying:
Missing }.
Missing }.
awk: file
awk: ^ syntax error

where file is my input file (the last word in your code). Any ideas where the error is?

Thanks for the help.

 
Old 12-24-2009, 11:45 AM   #7
mcbenus
LQ Newbie
 
Registered: Feb 2007
Posts: 24

Original Poster
Rep: Reputation: 15
Sorry about the previous post. It is working perfectly! (I had mixed up my ` ' " quotes.)

Thanks a lot!
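
Quoting trouble like this can be sidestepped by keeping the awk program in its own file and running it with `awk -f`, so the command line contains no quotes for csh to interpret. A minimal sketch, assuming the hypothetical file name `split.awk`; the program is the one-liner from post #4 with the output file name parenthesized.

```shell
# Create the program file (any text editor works just as well).
cat > split.awk <<'EOF'
/pattern/ { ++c }                    # count lines that contain the pattern
c == 300  { p++; c = 0 }             # every 300 matches, move to the next file
{ print $0 > ("file_" p ".txt") }    # p is empty at first, so the first file is file_.txt
EOF
```

It is then run as `awk -f split.awk file`, which is typed the same way in csh, sh, or bash.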


 
Old 12-24-2009, 11:47 AM   #8
mcbenus
LQ Newbie
 
Registered: Feb 2007
Posts: 24

Original Poster
Rep: Reputation: 15
Yes, string was replaced with my string. I am not sure why it doesn't work. Anyway, ghostdog74's awk line works perfectly. Thanks.

 
Old 12-24-2009, 01:28 PM   #9
mcbenus
LQ Newbie
 
Registered: Feb 2007
Posts: 24

Original Poster
Rep: Reputation: 15
I do have one question about your awk code:

It works perfectly, but the first file generated by the script is called file_.txt (without a number). All the following files are numbered from 1 (file_1.txt) upward. I tried entering p=1 in a few places (so that the numbering starts from p=1 rather than from an empty p), but couldn't make it work. Any advice?


 
Old 12-24-2009, 04:50 PM   #10
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Boundary conditions are the bane of programming - you might also find the first file is one line short. Try this:
Code:
awk 'BEGIN{p=1;c=-1}/pattern/{++c}c==300{p++;c=0}{print $0 > ("file_"p".txt") }' file
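
The same logic, with the threshold and pattern passed in as variables, can be sketched as below. `split_every`, its argument order, and the default prefix handling are hypothetical choices for illustration. `close()` is added so that only one output file is open at a time, which may matter with awk implementations that limit the number of open files.

```shell
# Hypothetical wrapper around the one-liner above.
# usage: split_every N PATTERN FILE [PREFIX]
# Writes PREFIX1.txt, PREFIX2.txt, ... with N matching lines per file.
split_every() {
    awk -v n="$1" -v pat="$2" -v prefix="${4:-file_}" '
        BEGIN { p = 1; c = -1; out = prefix p ".txt" }
        $0 ~ pat { ++c }              # c counts matches beyond the first in this chunk
        c == n {                      # match number n+1: it opens the next chunk
            close(out)
            p++; c = 0
            out = prefix p ".txt"
        }
        { print > out }               # every line goes to the current chunk
    ' "$3"
}
```

With these assumptions, `split_every 300 pattern file` should behave like the command above, and each new file starts on a matching line.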
 
Old 12-24-2009, 06:44 PM   #11
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by mcbenus View Post
I tried entering p=1 in a few places (so that the numbering starts from p=1 rather than from an empty p), but couldn't make it work. Any advice?
Code:
awk 'BEGIN{p=1} ..... '
Now, please head over to the gawk manual (see my sig) and study it.
 
  


All times are GMT -5. The time now is 09:21 AM.
