LinuxQuestions.org - [SOLVED] Splitting files by pattern match


bartonski

11-17-2009 12:49 PM

Splitting files by pattern match

I have a series of files which have been concatenated together. Each of the files has a header, something like this:

metasyntactic_variables.txt

Code:

header----- # primary

xyzzy foo

header----- # secondary

bar

baz

header----- # quux family

quux

quuux

quuuux

quuuuux

Does anyone know if there is a tool, similar to split, which can separate these into files by header, something like this?

Code:

$ wondersplit --pattern="^header-----" metasyntactic_variables.txt

which produces

xa

Code:

header----- # primary

xyzzy foo

xb

Code:

header----- # secondary

bar

baz

xc

Code:

header----- # quux family

quux

quuux

quuuux

quuuuux

I know that I could hack this together in perl in about 15 minutes, but it would be nice to know if this exists as a standalone tool.

sarum1990

11-17-2009 01:01 PM

This can be solved with a fairly simple gawk command:

gawk 'BEGIN{fnum=0; out="outf";} /^header----/ {fnum++;} {print $0 >> out""fnum}' <INPUT-FILE>

Essentially, it scans through the input, writing every line to the file "outf#", and increments # every time it finds the regexp ^header----.
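For anyone who wants to see the moving parts, here is the same logic spread over several lines and run against a small sample. This is only a sketch: the scratch directory and sample.txt are made up for illustration, it assumes plain awk behaves like gawk here, and the output name is wrapped in parentheses so the concatenation is unambiguous.

```shell
# Work in a scratch directory: ">>" appends, so stale outf* files
# left over from an earlier run would otherwise grow.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample input modeled on the file in the first post.
printf '%s\n' \
  'header----- # primary' 'xyzzy foo' \
  'header----- # secondary' 'bar' 'baz' \
  'header----- # quux family' 'quux' 'quuux' > sample.txt

awk '
  BEGIN { fnum = 0; out = "outf" }   # piece counter and output-name prefix
  /^header----/ { fnum++ }           # a header line starts the next piece
  { print $0 >> (out fnum) }         # append the line to outf1, outf2, ...
' sample.txt

ls outf*
```

Because the header rule runs before the print rule, each header line lands at the top of its own piece.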

Hope This Helps

bartonski

11-17-2009 01:20 PM

I knew that someone would jump in with an awk script...

very nice.

Thanks.

bartonski

11-17-2009 01:25 PM

Wow. That ran so fast I thought that it had failed, but I got output.

note to self: learn some awk.

sarum1990

11-19-2009 06:08 PM

Just wanted to update this: I stumbled across a much better option today, the csplit command ("context split").

Running man csplit will show you how to use it. I feel kind of silly for running to gawk when this option was available.
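As a rough sketch of what that could look like for the file in the first post, assuming GNU csplit: the -z option and the '{*}' repeat count are GNU extensions, so other csplit implementations may need an explicit repeat count instead. The scratch directory and shortened sample contents are made up for illustration.

```shell
# Work in a scratch directory so the xx* pieces are easy to find.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample input modeled on the file in the first post.
printf '%s\n' \
  'header----- # primary' 'xyzzy foo' \
  'header----- # secondary' 'bar' 'baz' \
  'header----- # quux family' 'quux' 'quuux' > metasyntactic_variables.txt

# -s     do not print the byte count of each piece
# -z     do not keep the empty piece before the first header (GNU)
# '{*}'  repeat the pattern for as long as it keeps matching (GNU)
csplit -s -z metasyntactic_variables.txt '/^header-----/' '{*}'

ls xx*
```

The pieces get the default xx prefix; -f sets a different prefix and -n changes the number of digits in the suffix.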

ghostdog74

11-19-2009 06:27 PM

the awk code is simply

Code:

awk '/header/{++d}{print $0>"file_"d}' file

sarum1990

11-19-2009 06:46 PM

Quote:

Originally Posted by ghostdog74 (Post 3763274)

the awk code is simply

Code:

awk '/header/{++d}{print $0>"file_"d}' file

Running that with the awk on my Mac OS X machine, I get an error about too many files being open for writing when the input combines more than 18 files. I get the same error with the version I posted earlier.

If I'm not mistaken, closing each file once it has been written fixes this, though.

Code:

awk '/header/{close("file_"d);++d}{print $0>"file_"d}' file

Still, I think I'd use csplit for anything like this in the future, since splitting on a pattern is that program's entire purpose.
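If you do stay with awk, a slightly more defensive variant might look like the sketch below. It makes a few assumptions of its own: the pattern is anchored so that a data line that merely contains the word "header" does not start a new piece, lines before the first header are dropped, and the file_001-style names and sample.txt are made up for illustration.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical input: a stray first line, plus a data line containing "header".
printf '%s\n' \
  'preamble before any header' \
  'header----- # primary' 'xyzzy foo' \
  'header----- # secondary' 'a data line mentioning header' 'baz' > sample.txt

awk '
  /^header-----/ {                        # anchored: only real header lines match
    if (outfile) close(outfile)           # release the previous piece first
    outfile = sprintf("file_%03d", ++n)   # file_001, file_002, ...
  }
  outfile { print > outfile }             # skip anything before the first header
' sample.txt

ls file_*
```

Keeping the current name in a variable means only one output file is open at a time, so the number of pieces is no longer tied to the open-file limit.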

bartonski

11-19-2009 08:30 PM

Well... I'll be damned. I think that I used csplit about 10 years ago, and I totally forgot about it. How did you run across it?

sarum1990

11-20-2009 10:05 AM

I was in a situation without internet and was navigating around info pages looking for the pr -m or paste command to merge two files line by line for awk processing. When I got to the text-processing commands I noticed csplit and decided to check out what it did.

All times are GMT -5.