how can I split a file into many files using a string in awk or sed

atjurhs · 06-05-2013, 11:21 AM

Hi guys,

I have a file called DataDictionary.txt that has a bunch of "groups" all in a single column, like this:

SRIG_NAME: FILENAME_TIG
data.sap_1_ecp
data.sap_2_ecp
data.sap_3_ecp
data.sap_4_ecp
etc...
SRIG_END: FILENAME_TIG
SRIG_NAME: BSG_BSG
info.bsg_1.csv
info.bsg_2.csv
info.bsg_3.csv
info.bsg_4.csv
etc...
SRIG_END: BSG_BSG
SRIG_NAME: CMP34_ADY
cmp34_data.ady._1.dat
cmp34_data.ady._2.dat
cmp34_data.ady._3.dat
cmp34_data.ady._4.dat
etc...
SRIG_END: CMP34_ADY

and the file continues on this way for a very long length.

I would like to break up the one long file and get many smaller files. I think this could be done by using the beginning and ending strings of each "group" to parse on

SRIG_NAME:
SRIG_END:

and the name of each of the new text files would be the string that follows

SRIG_NAME:
SRIG_END:

so in my example I'd have

FILENAME_TIG.txt
BSG_BSG.txt
CMP34_ADY.txt

can anybody help me?

I know a little awk and sed so I can follow along

Thanks so much guys! Tabitha

oh, and here's what I've already been working with

#/bin/sh
awk -F "," '$1 == SRIG_NAME: {print FILENAME}' DataDictionary.txt | uniq > FILENAME_TIG.txt

shivaa · 06-05-2013, 01:15 PM

awk could probably do it better, but first you could try the split command (I'm assuming every group has 7 lines):

Code:

~$ split -l 7 file.txt newfile
~$ ls
newfileaa newfileab newfileac .....

atjurhs · 06-05-2013, 03:48 PM

Quote:

Originally Posted by shivaa

awk could probably do it better, but first you could try the split command (I'm assuming every group has 7 lines):

Code:

~$ split -l 7 file.txt newfile
~$ ls
newfileaa newfileab newfileac .....

thanks Shivaa for the thought! but the number of lines in each "group" varies

I'm pretty sure that the script will have to key off of the SRIG_NAME: and SRIG_END: strings

thanks again, Tabby

AnanthaP · 06-05-2013, 08:29 PM

awk can redirect output to multiple files based on a value within the file.

A hint.
If a line starts with SRIG_NAME:, then all subsequent data (including the current line) gets written to the file with name as the 2nd argument in the line starting with SRIG_NAME. (FILENAME_TIG, BSG_BSG etc).

OK

atjurhs · 06-05-2013, 08:41 PM

Quote:

Originally Posted by AnanthaP

awk can redirect output to multiple files based on a value within the file.

A hint.
If a line starts with SRIG_NAME:, then all subsequent data (including the current line) gets written to the file with name as the 2nd argument in the line starting with SRIG_NAME. (FILENAME_TIG, BSG_BSG etc).

OK

Thanks AnanthaP for your reply!

yep, that's exactly the idea, and when it reads SRIG_END it ends writing lines for that "group" and starts again looking for the next SRIG_NAME

the question is how to implement this?

Tabby

David the H. · 06-06-2013, 02:02 PM

Code:

csplit -z -f 'srigfile_' -b '%03d.txt' infile.txt '/^SRIG_NAME/' '{*}'

This creates individual files named srigfile_000.txt, etc.

See info csplit for details on how to use it properly.

PS: Please use [code][/code] tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques. Thanks.

David the H. · 06-06-2013, 02:15 PM

Oh, and here's a simple loop for renaming the files to the desired strings from the text.

Code:

for oname in srigfile_00*; do
   read -r _ nname <"$oname"
   mv "$oname" "$nname.txt"
done

This should work as long as the new name is the second space-delimited field on the first line of each file. But be sure that there aren't any duplicate names.

AnanthaP · 06-06-2013, 07:52 PM

Hi atjurhs,

The idea was to give you just a hint so that you can try it yourself. You seem to have made a start with awk (in post #1).

I refer you to the standard help on awk (below).
http://www.gnu.org/software/gawk/manual/gawk.html

OK

atjurhs · 06-07-2013, 10:03 AM

Quote:

Originally Posted by David the H.

Oh, and here's a simple loop for renaming the files to the desired strings from the text.

Code:

for oname in srigfile_00*; do
   read -r _ nname <"$oname"
   mv "$oname" "$nname.txt"
done

This should work as long as the new name is the second space-delimited field on the first line of each file. But be sure that there aren't any duplicate names.

The csplit command worked great, thanks so much!

Its output was 1765 files named srigfile_000.txt through srigfile_1765.txt

The oname loop hasn't been as successful.

If I run it as a bash script with #!/bin/bash, thinking that maybe the path to my bin is somehow messed up, the command line gives me back nothing and there is no change to the srigfile names.

If I run it without the #!/bin/bash, the command line gives me back ./script: line 2: srigfile_00*: No such file or directory, and the mv command of course says it cannot stat `srigfile_00*'

So I tried changing around the string of srigfile_00* but that had no effect either, it still can't find the srigfiles, and sometimes even deleted all the srigfiles, yikes!

I double-checked the fields on the first line of each of the newly created srigfiles from the csplit command, and they are space-delimited, but I don't think this part of the script is getting accessed yet?

can you tell where I'm going wrong?

David the H. · 06-09-2013, 04:50 AM

It's usually a good idea not to run a possibly-destructive command like mv until you've confirmed that it's configured correctly. The easiest way is to just stick an echo at the front of the command, then you'll see a printout of what would actually be executed after the variables are expanded.
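As a rough illustration of that echo trick, here is the rename loop in dry-run form. The rename_demo directory and its two sample files are made up for the example, and the throwaway variable is spelled "tag" here (a hypothetical name) instead of "_".

```shell
#!/bin/sh
# Scratch directory with two files shaped like csplit output (illustrative data).
rm -rf rename_demo && mkdir rename_demo
printf 'SRIG_NAME: FILENAME_TIG\ndata.sap_1_ecp\n' > rename_demo/srigfile_000.txt
printf 'SRIG_NAME: BSG_BSG\ninfo.bsg_1.csv\n'      > rename_demo/srigfile_001.txt

(
    cd rename_demo || exit 1
    for oname in srigfile_*; do
        # First word of line 1 goes to tag, the rest to nname.
        read -r tag nname < "$oname"
        # echo prints the mv command instead of running it.
        echo mv "$oname" "$nname.txt"
    done
)
```

This prints "mv srigfile_000.txt FILENAME_TIG.txt" and "mv srigfile_001.txt BSG_BSG.txt" and renames nothing. Once the printed commands look right, removing the echo makes the loop live.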

I don't really see what could be wrong with what I posted though. It's just a simple globbing pattern and for loop.

Since you have many more files than what I used for testing, you'll probably need to shorten the glob to something like "srig*". Just keep it long enough to match only the files you want. "printf '%s\n' <glob>" can be used to list out all the files matched by that pattern, one per line.
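To make the glob point concrete, here is a small sketch with made-up file names in a scratch directory called glob_demo. It compares what the two patterns match.

```shell
#!/bin/sh
# Five empty files with numbers chosen to show the difference (illustrative).
rm -rf glob_demo && mkdir glob_demo
for n in 000 009 010 123 1000; do
    : > "glob_demo/srigfile_$n.txt"
done

(
    cd glob_demo || exit 1
    echo "matched by srigfile_00*:"
    printf '%s\n' srigfile_00*    # only names that start with srigfile_00
    echo "matched by srig*:"
    printf '%s\n' srig*           # every srigfile, whatever the number
)
```

With these names the narrow pattern lists only srigfile_000.txt and srigfile_009.txt, while the shorter one lists all five.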

The read command inside the loop just takes the first line from each file and splits it into two variables; the first word on the line goes into the throw-away "_", and all the rest into the nname variable, for use as the new filename.

Check to see that you haven't made any syntax or spelling errors. And of course the loop needs to be run in the same directory as the files, or else it would have to be made more complex. Make sure the new names don't have any illegal filename characters or other conflicts either, as I mentioned before.

I highly doubt there are any problems with your PATH or other low-level issue like that. If you haven't had any problems before, then they aren't likely to be a factor now. It's certainly either a syntax or matching error of some kind.

Also, another thought, could the files have dos-style line-endings in them? If so, you may need to run them through dos2unix or a similar converter first.
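If dos2unix isn't available, one possible workaround is to strip a trailing carriage return from the name inside the shell. This is only a sketch: the crlf_demo directory, the sample file, and the variable names tag, cr and clean are all invented for the example.

```shell
#!/bin/sh
# One sample file with DOS (CRLF) line endings (illustrative data).
rm -rf crlf_demo && mkdir crlf_demo
printf 'SRIG_NAME: BSG_BSG\r\ninfo.bsg_1.csv\r\n' > crlf_demo/srigfile_000.txt

cr=$(printf '\r')                 # a literal carriage return
read -r tag nname < crlf_demo/srigfile_000.txt

# read leaves the CR on the last field, so nname is one character too long.
echo "raw length:   ${#nname}"

clean=${nname%"$cr"}              # remove one trailing CR if present
echo "clean length: ${#clean}"
printf '%s\n' "$clean" > crlf_demo/newname.txt
```

Without the stripping step, the new filename would end in an invisible carriage return before ".txt", which is easy to miss in a directory listing.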

schneidz · 06-09-2013, 09:24 AM

David's is probably better. What I would've done is grep for a list of lines to feed into sed:

Code:

grep -n SRIG file.txt

and then parse them with sed.
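One way the grep-then-sed idea might be fleshed out is shown below. It is a sketch under assumptions, not necessarily what was intended: the grep_demo directory and sample data are invented, and the loop assumes every SRIG_NAME line is followed by a matching SRIG_END line.

```shell
#!/bin/sh
# Small sample standing in for the real file (illustrative data).
rm -rf grep_demo && mkdir grep_demo
cat > grep_demo/file.txt <<'EOF'
SRIG_NAME: FILENAME_TIG
data.sap_1_ecp
data.sap_2_ecp
SRIG_END: FILENAME_TIG
SRIG_NAME: BSG_BSG
info.bsg_1.csv
SRIG_END: BSG_BSG
EOF

(
    cd grep_demo || exit 1
    # grep -n prints lines such as "1:SRIG_NAME: FILENAME_TIG".
    grep -n '^SRIG_' file.txt |
    while IFS=: read -r lineno tag name; do
        name=${name# }    # drop the space that followed the colon
        case $tag in
            SRIG_NAME) start=$lineno ;;
            # At the end marker, print lines start..lineno into the group's file.
            SRIG_END)  sed -n "${start},${lineno}p" file.txt > "$name.txt" ;;
        esac
    done
)
```

Because sed rereads the whole input once per group, this will likely be slower than a single-pass tool on a file with well over a thousand groups.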

Firerat · 06-09-2013, 09:50 AM

here is how I would do it

Code:

for i in `awk '/^SRIG_NAME:/{print $NF}' DataDictionary.txt`;do
    sed '/^SRIG_NAME:.'$i'/,/^SRIG_END/!d' DataDictionary.txt > ${i}.txt
done

atjurhs · 06-10-2013, 02:20 PM

Both David's and Firerat's scripts work

David, I apologize. I mistyped "nname" as just "name" in the move command. The output of echo pointed me to my errors.

you know it's funny how you can see a script and follow along and know what it's doing at each step, but still can't write it yourself. I get some of it and then get stuck, or I start off down the wrong road

but thanks so much guys!

Tabby

Firerat · 06-10-2013, 04:18 PM

Hi, glad it worked for you.
I had a little think and rewrote it using just awk

Code:

awk '{if ( $1 == "SRIG_NAME:" ){FileName = $NF ;print $0 > FileName".txt";next};{print $0 >> FileName".txt"}}' DataDictionary.txt

Should be much faster than the awk/sed combo I posted

However there is a little catch.. It will overwrite duplicates, which you can of course avoid by using ">>" globally
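For comparison, here is a hedged variant of the same single-pass idea that also stops writing at SRIG_END, ignores any lines outside a group, and closes each file when its group ends, which may matter when there are more groups than the awk in use allows open files. The srig_demo directory, the sample data, and the variable name out are assumptions made for the example.

```shell
#!/bin/sh
# Small sample standing in for DataDictionary.txt (illustrative data).
rm -rf srig_demo && mkdir srig_demo
cat > srig_demo/DataDictionary.txt <<'EOF'
SRIG_NAME: FILENAME_TIG
data.sap_1_ecp
data.sap_2_ecp
SRIG_END: FILENAME_TIG
SRIG_NAME: BSG_BSG
info.bsg_1.csv
SRIG_END: BSG_BSG
EOF

# Run awk inside the scratch directory so the output files land there.
(
    cd srig_demo || exit 1
    awk '
        $1 == "SRIG_NAME:" { out = $2 ".txt" }        # group starts: choose the output file
        out != ""          { print >> out }           # write only while inside a group
        $1 == "SRIG_END:"  { close(out); out = "" }   # group ends: release the file handle
    ' DataDictionary.txt
)
ls srig_demo
```

Since it appends with ">>", a repeated group name adds to the existing file, and so does a second run, so it is worth starting in a directory with no leftover .txt files.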

David the H. · 06-11-2013, 12:04 PM

Those little typos get you every time. In hindsight I probably should've used variable names that were a bit clearer to read, like "oldname" and "newname", instead of the shorter ones. I usually use "fname" and "dname" myself for files and directories, so I was keeping with the same pattern.

@Firerat, Nicely done. Just a couple of quick suggestions. "print" on its own is the same as "print $0", and an "else" would probably be a better choice to connect the two commands, rather than "next".

I believe you could also reduce it down to just this (untested, 'cause I'm lazy):

Code:

awk '$1 == "SRIG_NAME:" { FileName = $NF } { print >> FileName".txt" }' DataDictionary.txt