How do I split a large file by a string?
The scenario is this:
I have a large text file (up to 100 MB) with no line terminators (CR/LF). The file contains a repeated section that starts with the keyword HDR1. How can I split the file at each HDR1 using bash, sed, or awk? Sample content of SAL003.dat (note that the whole file is a single line, since there are no CR/LF terminators):
Code:
HDR1G003SAL1 0000004048 55110045 00004906 55110046 00000000 55110047 00000000 55110048 00041354 55110049 00000000 55110050 00002784 55110051 EOF1G003SAL1 HDR1G003SAL2 0000004048 55110045 00004906 55110046 00000000 55110047 00000000 55110048 00041354 55110049 00000000 55110050 00002784 55110051 00010044 55110052 00060759 EOF1G003SAL2 HDR1G003SAL3 0000004048 55110045 00004906 55110046 00000000 55110047 00000000 55110048 00041354 55110049 00000000 55110050 00002784 55110051 00010044 EOF1G003SAL3
file #1: SAL003a.dat
Code:
HDR1G003SAL1 0000004048 55110045 00004906 55110046 00000000 55110047 00000000 55110048 00041354 55110049 00000000 55110050 00002784 55110051 EOF1G003SAL1
Code:
HDR1G003SAL2 0000004048 55110045 00004906 55110046 00000000 55110047 00000000 55110048 00041354 55110049 00000000 55110050 00002784 55110051 00010044 55110052 00060759 EOF1G003SAL2
Code:
HDR1G003SAL3 0000004048 55110045 00004906 55110046 00000000 55110047 00000000 55110048 00041354 55110049 00000000 55110050 00002784 55110051 00010044 EOF1G003SAL3
Code:
#!/bin/bash
An awk version:
Code:
{
Try the split or dd commands
I bet that file is a fixed-length record file, given that it has no CR+LF or LF line terminators. If so, try the Linux split command. An excerpt from the man page follows:
Code:
SPLIT(1) User Commands SPLIT(1)
bgoodr
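A quick way to test the fixed-length guess is to list the byte offset of every HDR1 and compare the gaps between them. This is only a sketch: it assumes GNU grep (for the -o and -b options), and marker_offsets is a made-up helper name, not something from the thread.

```shell
#!/bin/bash
# Hypothetical helper: print the byte offset of each HDR1 marker in a file.
# Assumes GNU grep: -o prints only the match, -b prefixes its byte offset.
marker_offsets() {
    grep -ob 'HDR1' "$1" | cut -d: -f1
}
```

Run it as `marker_offsets SAL003.dat`. If every gap between consecutive offsets were the same size N, `split -b N SAL003.dat SAL003-` would cut at the same places. If the gaps differ, a fixed byte count will not line up with the sections.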
I don't think split by itself will do the trick -- the 3 sample segments are 3 different lengths. Also, using sed or awk alone seems complicated. Perhaps a 2-step process:
Code:
sed -r 's, (HDR1),\n\1,g' "$FILE" > "$FILE.tmp"
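The second step is not shown above, so here is one possible way to finish the idea, as a sketch only. It assumes GNU sed (for -r and for \n in the replacement), and the function name split_two_step, the prefix argument, and the three-digit numbering are my own choices for illustration.

```shell
#!/bin/bash
# Hypothetical helper: split FILE at each HDR1 into PREFIX-001.dat, PREFIX-002.dat, ...
split_two_step() {
    local file=$1 prefix=$2

    # Step 1: put every HDR1 section on its own line (GNU sed assumed).
    sed -r 's, (HDR1),\n\1,g' "$file" > "$file.tmp"

    # Step 2: write each line that starts with HDR1 to its own numbered file.
    # printf "%s" keeps the output free of line terminators, like the input.
    awk -v prefix="$prefix" '
        /^HDR1/ {
            out = sprintf("%s-%03d.dat", prefix, ++n)
            printf "%s", $0 > out
            close(out)   # avoid running out of open file handles
        }' "$file.tmp"

    rm -f "$file.tmp"
}
```

For example, `split_two_step SAL003.dat SAL003` would write SAL003-001.dat, SAL003-002.dat, and so on. Note that sed holds the whole single-line file in memory during step 1.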
Code:
awk 'NR>1{print "HDR1"$0 > ("SAL003-"++i".dat")}' RS=" ?HDR1|\n" largefile
If you need alphabetic suffixes, how do you plan to handle files after the 26th ("z")?
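One way to sidestep the 26-letter limit is to use zero-padded numeric suffixes instead. The sketch below is a variation on the one-liner above, not a tested drop-in: it assumes an awk that accepts a regular expression in RS (gawk and mawk do; this is not required by POSIX), and split_numbered, the prefix argument, and the four-digit width are hypothetical choices.

```shell
#!/bin/bash
# Hypothetical helper: split FILE at each HDR1 into PREFIX-0001.dat, PREFIX-0002.dat, ...
split_numbered() {
    local file=$1 prefix=$2
    awk -v prefix="$prefix" '
        NR > 1 {                       # record 1 is whatever precedes the first HDR1
            out = sprintf("%s-%04d.dat", prefix, ++i)
            printf "HDR1%s", $0 > out  # put back the marker that RS consumed
            close(out)                 # keep the number of open files small
        }' RS=' ?HDR1|\n' "$file"
}
```

Usage would be `split_numbered SAL003.dat SAL003`. Closing each file after writing matters when the input has many sections, since some awk versions limit how many files can be open at once.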