Split file based on number of string appearances

mcbenus · 12-22-2009, 01:01 PM

I am trying to split a large text file into fragments. The file contains a string that recurs thousands of times, and I wish to split the file after every 300 appearances. I have used csplit before, but I don't know how (or whether it is possible) to tell csplit to skip certain appearances of the string.

Alternatively, I thought of reading the file line by line, echoing each line into a new file, and counting each appearance of the string. When the count reaches 300, start echoing the lines into the next file (and restart the count). My problem is that I only know how to count the total appearances of the string in the file, using grep -c.

Can I count the appearances of the string "line by line" (with awk, maybe)? Alternatively, can I count the string using grep, but only within the first x lines of the file?

I've been using csh for this script.

This seems like a very inefficient method, so more elegant ways are welcome. Thanks!
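
Both counting questions above can be answered with standard tools. A minimal sketch, assuming a made-up input file sample.txt and the placeholder string MARK (neither comes from the original post): head -n restricts grep to the first x lines, and awk can keep a running count line by line. Note that grep -c counts matching lines, not individual occurrences within a line.

```shell
# Build a small hypothetical input: 10 lines, every second line contains MARK
for i in 1 2 3 4 5 6 7 8 9 10; do
  if [ $((i % 2)) -eq 0 ]; then echo "MARK record $i"; else echo "data $i"; fi
done > sample.txt

# Count matching lines within the first 6 lines only
head -n 6 sample.txt | grep -c 'MARK'

# Running count, line by line: line number, then matches seen so far
# (c + 0) prints 0 instead of an empty string before the first match
awk '/MARK/ { c++ } { print NR ": " (c + 0) }' sample.txt > running.txt
cat running.txt
```

The awk form is the one that extends naturally to splitting, because the running count is available at the moment each line is written.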

gnashley · 12-22-2009, 01:41 PM

Bash or (maybe) sh:

Code:

COUNT=0
OUT=1
while read LINE ; do
case $LINE in
 *"string*) echo $LINE >> out.file ; ((COUNT++)) ;;
 *) [ $COUNT -lt 300 ] && echo $LINE >> $OUT.file
esac
if [ $COUNT -eq 300 ] ; then
 COUNT=0
 ((OUT++))
fi
done< in.file

mcbenus · 12-22-2009, 03:40 PM

Thanks for the reply and the code. I am not familiar with bash scripts, but I kind of get it. I replaced in.file with my input file and the string with my string (and added another " after the string so that the script would run). However, something goes wrong with COUNT because it stays at 0. The script creates only one file (1.file), which contains far more than 300 appearances of the string. Though I understand what you wrote, I can't see why COUNT would not increase by 1.

Any ideas?
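
One possible reason for COUNT staying at 0, offered as an assumption rather than a diagnosis: ((COUNT++)) is a bash (and ksh) extension, so it does nothing useful when the script is run by a shell without it, such as a strictly POSIX sh, or csh, whose syntax is different altogether. A minimal sketch of the portable form, using a made-up three-line input (demo_in.txt and demo_count.txt are hypothetical names):

```shell
# Hypothetical input: three lines, two of which contain "string"
printf '%s\n' 'a string here' 'nothing' 'another string' > demo_in.txt

COUNT=0
while IFS= read -r LINE; do
  case $LINE in
    # POSIX arithmetic expansion: works in sh, dash, bash and ksh
    *"string"*) COUNT=$((COUNT + 1)) ;;
  esac
done < demo_in.txt

# Save and show the number of matching lines
echo "$COUNT" > demo_count.txt
cat demo_count.txt
```

Saving the loop to a file and running it with sh or bash, rather than pasting it at a csh prompt, avoids the csh syntax differences.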

ghostdog74 · 12-22-2009, 06:33 PM

Unless you only have the shell to work with, use awk (or another language) that is good at parsing big files.

Code:

# counts matching lines: a line is counted once, no matter how many times the pattern appears on it
awk '/pattern/{++c}c==300{p++;c=0}{print $0 > ("file_" p ".txt") }' file

smeezekitty · 12-22-2009, 07:37 PM

Code:

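COUNT=0
OUT=1
# read the input line by line; -r keeps backslashes literal
while IFS= read -r LINE ; do
case $LINE in
 # matching line: write it to the current fragment and count it
 *"string"*) echo "$LINE" >> "$OUT.file" ; ((COUNT++)) ;;
 # any other line goes to the current fragment as well
 *) echo "$LINE" >> "$OUT.file" ;;
esac
# after 300 matching lines, reset the count and move to the next fragment
if [ "$COUNT" -eq 300 ] ; then
 COUNT=0
 ((OUT++))
fi
done < in.file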

You did replace "string" with the proper string, right?

mcbenus · 12-24-2009, 11:35 AM

Thanks for the reply, ghostdog74. I should be able to run your awk line from my csh shell, right?

I tried to do that, but I am getting an error saying:
Missing }.
Missing }.
awk: file
awk: ^ syntax error

where file is my input file (the last word in your code). Any ideas where the error is?

Thanks for the help.

mcbenus · 12-24-2009, 11:45 AM

Sorry for the previous post. The awk line is working perfectly! (I had mixed up my quote characters: ` ' ".)

Thanks a lot!
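
Quoting problems like the one mentioned above can be sidestepped by keeping the awk program in its own file and running it with awk -f, so the interactive shell never has to parse the program text. A minimal sketch; split.awk, input.txt, the pattern MARK and the threshold of 3 are placeholders, not names from the thread. The demo is written for sh, but the final awk command is plain enough to be typed at a csh prompt as well.

```shell
# Write the awk program to a file (split.awk is a hypothetical name)
cat > split.awk <<'EOF'
# Count matching lines; switch to a new output file at every 3rd match
/MARK/  { ++c }
c == 3  { p++; c = 0 }
# p is empty until the first increment, so the first file is file_.txt
        { print > ("file_" p ".txt") }
EOF

# Hypothetical input: 8 lines, each containing MARK
for i in 1 2 3 4 5 6 7 8; do echo "MARK $i"; done > input.txt

# No awk program text on the command line, so nothing to mis-quote
awk -f split.awk input.txt
ls file_*.txt
```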

mcbenus · 12-24-2009, 11:47 AM

Yes, "string" was replaced with my string. I am not sure why it doesn't work. Anyway, ghostdog74's awk line works perfectly. Thanks.

mcbenus · 12-24-2009, 01:28 PM

ghostdog74, I do have one question about your awk code:

It works perfectly, but the first file generated by the script is called file_.txt (without a number). All the following files are numbered from 1 (file_1.txt) upwards. I tried to enter p=1 in a few places (so that the counting starts from p=1 and not from an empty p), but couldn't make it work. Any advice?

syg00 · 12-24-2009, 04:50 PM

Boundary conditions are the bane of programming: you might also find that the first file is one matching line short. Try this:

Code:

awk 'BEGIN{p=1;c=-1}/pattern/{++c}c==300{p++;c=0}{print $0 > ("file_" p ".txt") }' file
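
The boundary behaviour is easier to see with a threshold of 3 instead of 300. An uninitialised awk variable is an empty string when concatenated, which is why the first file came out as file_.txt until p was set in BEGIN. A sketch under assumptions: the variable n, the part_ prefix, the sample file, the pattern MARK and the close() call are additions for illustration, not part of the posts above.

```shell
# Hypothetical input: 7 matching lines, each followed by a data line
for i in 1 2 3 4 5 6 7; do echo "MARK $i"; echo "data $i"; done > sample_in.txt

# n is the number of matches per fragment (300 in the thread, 3 here).
# c starts at -1 so that the first fragment also receives n matching lines.
awk -v n=3 '
  BEGIN   { p = 1; c = -1 }
  /MARK/  { ++c }
  c == n  { close("part_" p ".txt"); p++; c = 0 }   # fragment finished: close it, move on
          { print > ("part_" p ".txt") }
' sample_in.txt

# Matching lines per fragment
grep -c 'MARK' part_1.txt part_2.txt part_3.txt
```

Closing each finished fragment keeps only one output file open at a time, which may matter for awk implementations with a low limit on open files.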

ghostdog74 · 12-24-2009, 06:44 PM

Quote:

Originally Posted by mcbenus

I tried to enter p=1 in a few places (so that the counting starts from p=1 and not from an empty p), but couldn't make it work. Any advice?

Code:

awk 'BEGIN{p=1} ..... '

Now, please head over to the gawk manual (link in my sig) and study it.