LinuxQuestions.org

bartonski 11-17-2009 12:49 PM

Splitting files by pattern match
 
I have a series of files which have been concatenated together. Each of the files has a header, something like this:

metasyntactic_variables.txt


Code:

header----- # primary
xyzzy foo
header----- # secondary
bar
baz
header----- # quux family
quux
quuux
quuuux
quuuuux

Does anyone know if there is a tool, similar to split, which can separate these into files by header, something like this?

Code:

$ wondersplit --pattern="^header-----" metasyntactic_variables.txt
which produces

xa
Code:

header----- # primary
xyzzy foo

xb
Code:

header----- # secondary
bar
baz

xc
Code:

header----- # quux family
quux
quuux
quuuux
quuuuux

I know that I could hack this together in perl in about 15 minutes, but it would be nice to know if this exists as a standalone tool.

sarum1990 11-17-2009 01:01 PM

This can be solved with a fairly simple gawk command:

gawk 'BEGIN{fnum=0; out="outf";} /^header----/ {fnum++;} {print $0 >> out""fnum}' <INPUT-FILE>

Essentially, it scans through the input, appending every line to the file "outf#" and incrementing # every time a line matches the regexp ^header----.

Hope This Helps

bartonski 11-17-2009 01:20 PM

I knew that someone would jump in with an awk script...

very nice.

Thanks.

bartonski 11-17-2009 01:25 PM

Wow. That ran so fast I thought that it had failed, but I got output.

note to self: learn some awk.

sarum1990 11-19-2009 06:08 PM

Just wanted to update this: I stumbled across a much better option today, the command csplit ("context split").

Running man csplit will show you how to use it. I feel kind of silly for running to gawk when this option was available.
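
For anyone reading later, here is a rough sketch of what the csplit call might look like for the sample file from the first post. It assumes GNU coreutils csplit: the -z option and the '{*}' repeat count are GNU extensions and may not exist in other implementations (the BSD csplit shipped with Mac OS X, for instance). The x prefix is only chosen to resemble the xa/xb/xc names asked for above; csplit numbers its pieces (x00, x01, ...) rather than lettering them.

```shell
# Work in a scratch directory so the output pieces are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample input from the first post.
cat > metasyntactic_variables.txt <<'EOF'
header----- # primary
xyzzy foo
header----- # secondary
bar
baz
header----- # quux family
quux
quuux
quuuux
quuuuux
EOF

# -s     suppress the byte counts csplit normally prints
# -z     drop the empty piece that would precede the first header (GNU option)
# -f x   prefix for the output names; pieces are numbered x00, x01, ...
# '{*}'  repeat the pattern as many times as possible (GNU extension)
csplit -s -z -f x metasyntactic_variables.txt '/^header-----/' '{*}'

# List the pieces that were written.
ls x[0-9]*
```

Each piece starts at a header line and runs up to the line before the next header. Without -z, the first piece would be an empty file, because the first header sits on line 1 and nothing precedes it.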

ghostdog74 11-19-2009 06:27 PM

the awk code is simply
Code:

awk '/header/{++d}{print $0>"file_"d}' file

sarum1990 11-19-2009 06:46 PM

Quote:

Originally Posted by ghostdog74 (Post 3763274)
the awk code is simply
Code:

awk '/header/{++d}{print $0>"file_"d}' file

Running that on my Mac OS X version of awk, I get an error due to too many files being open for writing when the input combines more than 18 files. I have the same error with the version I posted earlier.

If I'm not mistaken, closing each file once it has been written fixes this, though.

Code:

awk '/header/{close("file_"d);++d}{print $0>"file_"d}' file
Still, I think I'd use csplit for anything like this in the future, since that is the entire purpose of that program.
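
A slightly more defensive variant of the close() idea, as a sketch only: it keeps the current output name in a variable so that exactly the file that was opened gets closed, and it zero-pads the counter so the pieces sort in order. The names input.txt and part_NNN are made up for illustration. Unlike the one-liners above, this version skips any lines that appear before the first header.

```shell
# Work in a scratch directory with a small two-section sample.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'header----- # primary' 'xyzzy foo' \
              'header----- # secondary' 'bar' 'baz' > input.txt

awk '
  /^header-----/ {
      if (out != "") close(out)          # release the previous piece
      out = sprintf("part_%03d", ++n)    # part_001, part_002, ...
  }
  out != "" { print > out }              # lines before the first header are skipped
' input.txt

# List the pieces that were written.
ls part_*
```

Since only one output file is open at any time, the number of sections should no longer run into the open-file limit.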

bartonski 11-19-2009 08:30 PM

Well... I'll be damned. I think that I used csplit about 10 years ago, and I totally forgot about it. How did you run across it?

sarum1990 11-20-2009 10:05 AM

I was in a situation without internet and was navigating around info pages looking for the pr -m or paste command to merge two files line by line for awk processing. When I got to the text-processing commands I noticed csplit and decided to check out what it did.


All times are GMT -5.