LinuxQuestions.org


captainentropy 08-09-2013 05:08 PM

Split file upon increments of string value
 
I have a large file that contains thousands of records. Each record begins with a specific string. I'd like to split the file into many smaller files, with several records per file (maybe 5 or 10, or whatever) rather than one output file per record.

I'm using this right now to split the file:

Code:

awk '/STRING/{n++}{print >"out" n ".txt" }' input_file.txt
And it works fine...if I want a thousand or more files.

How can I have awk split the file at every 10th instance of "STRING"? I tried adding an NR variable, but that was a mess.

Note, the records aren't the same size, so I can't just split based on number of lines.

colucix 08-09-2013 05:44 PM

Try to compute the output file name based on the value of n, i.e.
Code:

awk '/STRING/{n++} n%5{file = sprintf("out%03d.txt",n/5+1)}{print > file }' input_file.txt
The d specifier in the sprintf format truncates the result of the division n/5 to an integer, hence for the first 4 records the result is 0, for records 5 to 9 the result is 1, and so on. Add one (as in my example) to start the file count from 1.
In addition I used the condition
Code:

n % 5
to avoid the change of name at the 5th, 10th, 15th records and so on, so that every file contains exactly 5 records (otherwise the 5th record would go to the new file). Hope this helps.
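For anyone who wants a different group size, here is a minimal sketch of the same idea with the size passed in as a variable. It is a variation, not the exact command above: the names sample.txt, part001.txt and size are made up for the illustration, the file name is computed from int((n - 1) / size) + 1 instead of the n % 5 test, and the close() call is an extra that may help on awk implementations that limit the number of open files.

```shell
# Build a small sample: 7 records of different lengths, each starting with STRING.
# (sample.txt is a hypothetical file name used only for this illustration.)
for i in 1 2 3 4 5 6 7; do
    echo "STRING record $i"
    j=0
    while [ "$j" -lt "$i" ]; do
        echo "  data line"
        j=$((j + 1))
    done
done > sample.txt

# size = number of records per output file (assumed variable name).
awk -v size=5 '
    /STRING/ {
        n++
        # Records 1..size map to 1, the next group to 2, and so on.
        next_file = sprintf("part%03d.txt", int((n - 1) / size) + 1)
        if (next_file != file) {          # crossing into a new group of records
            if (file != "") close(file)   # release the finished output file
            file = next_file
        }
    }
    n { print > file }                    # ignore lines before the first STRING
' sample.txt

# Count the records that landed in each output file.
grep -c STRING part001.txt part002.txt
```

With the 7-record sample this should put 5 records in part001.txt and the remaining 2 in part002.txt. Change size=5 to size=10 to split at every 10th record.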

schneidz 08-09-2013 06:03 PM

Would the split command work?

captainentropy 08-09-2013 06:04 PM

Thanks colucix, it worked perfectly! I never would have figured that out.

captainentropy 08-09-2013 06:21 PM

schneidz, as I understand the man page for split, I can only split into files of equal size (bytes or lines). If my records were of equal length I would have used that. Split was my first thought too.

grail 08-09-2013 11:14 PM

Quote:

If my records were of equal length I would have used that.
I am not sure I follow?? If you use the awk you are splitting on each fifth consecutive line so could you not tell split to work on 5 lines at a time?

captainentropy 08-12-2013 07:20 PM

Quote:

Originally Posted by grail (Post 5006639)
I am not sure I follow?? If you use the awk you are splitting on each fifth consecutive line so could you not tell split to work on 5 lines at a time?

What I was saying is that split only works by splitting into discrete sizes (e.g. every 5, 10, 67 or whatever lines, or every 2kb, etc.).

My file contains lots of records where each record is a different length. One record might be 5 lines but the next could be 17, or 85, etc. Using
Code:

split -l 5 file.txt prefix
results in each file having 5 lines, which cuts records in the middle (or wherever the 5 lines happen to land). Split can't work for this type of file. colucix's code worked perfectly.
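To make the problem concrete, here is a small sketch on made-up data (demo.txt and the chunk_ prefix are hypothetical names): three records of 2, 4 and 7 lines, split every 5 lines.

```shell
# Hypothetical sample: 3 records of 2, 4 and 7 lines, each starting with STRING.
{
    printf 'STRING a\nx\n'
    printf 'STRING b\nx\nx\nx\n'
    printf 'STRING c\nx\nx\nx\nx\nx\nx\n'
} > demo.txt

# Fixed-size split: 13 lines become chunks of 5, 5 and 3 lines.
split -l 5 demo.txt chunk_

# Show the first line of each chunk. A chunk that does not start
# with STRING begins in the middle of a record.
for f in chunk_aa chunk_ab chunk_ac; do
    printf '%s: %s\n' "$f" "$(head -n 1 "$f")"
done
```

Only chunk_aa starts at a record boundary here. The other two chunks start on a data line, and record "b" ends up spread across chunk_aa and chunk_ab.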


All times are GMT -5.