[SOLVED] Split file upon increments of string value

captainentropy · 08-09-2013, 05:08 PM

I have a large file that contains thousands of records. Each record begins with a specific string. I'd like to split the file into many smaller files, but not one output file per record; maybe 5 or 10 records per file, or whatever.

I'm using this right now to split the file:

Code:

awk '/STRING/{n++}{print > ("out" n ".txt")}' input_file.txt

And it works fine...if I want a thousand or more files.

How can I have awk split the file at every 10th instance of "STRING"? I tried adding an NR variable, but that was a mess.

Note, the records aren't the same size, so I can't just split based on number of lines.

colucix · 08-09-2013, 05:44 PM

Try to compute the output file name based on the value of n, i.e.

Code:

awk '/STRING/{n++} n%5{file = sprintf("out%03d.txt",n/5+1)}{print > file }' input_file.txt

The d specifier in the sprintf format truncates the result of the division n/5 to an integer, so for the first 4 records the result is 0, for records 5 to 9 the result is 1, and so on. Adding one (as in my example) starts the file count from 1.
In addition, I used the condition

Code:

n % 5

to avoid changing the file name at the 5th, 10th, 15th record and so on, so that every file contains exactly 5 records (otherwise the 5th record would go to the new file). Hope this helps.
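For a different group size (the original question asked for 10), the same idea can be sketched with the size passed in as a variable. This is only a rough variant under a few assumptions, not a drop-in replacement: the names size, k and file are made up for illustration, the sample input is generated just so the example runs on its own, close() is called on each finished file in case your awk has a low limit on open files, and any lines before the first STRING are skipped rather than written anywhere.

Code:

```shell
# Build a small sample: 12 records of varying length, each starting with STRING.
# (Hypothetical data; use your own input_file.txt instead.)
awk 'BEGIN {
    for (r = 1; r <= 12; r++) {
        print "STRING " r
        for (i = 1; i <= r % 3 + 1; i++) print "data " r "." i
    }
}' > input_file.txt

# Start a new output file at every size-th STRING line.
awk -v size=5 '
    /STRING/ && n++ % size == 0 {      # records 1, 6, 11, ... open a new file
        if (file) close(file)          # release the previous file handle
        file = sprintf("out%03d.txt", ++k)
    }
    file { print > file }              # lines before the first STRING are skipped
' input_file.txt

# Count the records that ended up in each output file.
grep -c STRING out*.txt
```

With the 12-record sample this should give three files holding 5, 5 and 2 records. Change size=5 to size=10 for ten records per file.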

schneidz · 08-09-2013, 06:03 PM

Would the split command work?

captainentropy · 08-09-2013, 06:04 PM

Thanks colucix, it worked perfectly! I never would have figured that out.

captainentropy · 08-09-2013, 06:21 PM

schneidz, as I understand the man page for split, I can only split into files of equal size (bytes or lines). If my records were of equal length I would have used that. Split was my first thought too.

grail · 08-09-2013, 11:14 PM

Quote:

If my records were of equal length I would have used that.

I am not sure I follow?? If you use the awk you are splitting on each fifth consecutive line so could you not tell split to work on 5 lines at a time?

captainentropy · 08-12-2013, 07:20 PM

Quote:

Originally Posted by grail

I am not sure I follow?? If you use the awk you are splitting on each fifth consecutive line so could you not tell split to work on 5 lines at a time?

What I was saying is that split only works by splitting into discrete sizes (e.g. every 5, 10, 67 or whatever lines, or every 2 KB, etc.).

My file contains lots of records where each record is a different length. One record might be 5 lines but the next could be 17, or 85, etc. Using

Code:

split -l 5 file.txt prefix

results in each file having 5 lines, which cuts records in the middle (or wherever the 5-line boundary happens to land). Split can't work for this type of file. colucix's code worked perfectly.
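To see this on a small scale, here is a rough check, assuming made-up sample data and the hypothetical names sample.txt and chunk_. It splits the sample every 5 lines and reports whether each piece begins with a STRING line; any piece that doesn't starts in the middle of a record.

Code:

```shell
# Sample with records of 2 to 4 lines each (hypothetical data).
awk 'BEGIN {
    for (r = 1; r <= 12; r++) {
        print "STRING " r
        for (i = 1; i <= r % 3 + 1; i++) print "data " r "." i
    }
}' > sample.txt

# Fixed-size split: every piece gets 5 lines regardless of record boundaries.
split -l 5 sample.txt chunk_

# For each piece, report whether its first line starts a record.
for f in chunk_*; do
    if head -n 1 "$f" | grep -q '^STRING'; then
        echo "$f: starts at a record boundary"
    else
        echo "$f: starts mid-record"
    fi
done
```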