LinuxQuestions.org - how do I split large file by string?


The scenario is this:

I have a large text file (up to 100 MB) with no line terminators (CR/LF).
The file contains a repeated section, and each occurrence starts with the keyword HDR1.

The question is: how can I split the file at each HDR1 using bash, sed, or awk?

Sample content of the file SAL003.dat (note that the whole file is a single line, since there are no CR/LF terminators):

Code:

HDR1G003SAL1 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 EOF1G003SAL1 HDR1G003SAL2 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 00010044    55110052 00060759    EOF1G003SAL2 HDR1G003SAL3 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 00010044    EOF1G003SAL3

For example, splitting this file should produce the following output:
file #1: SAL003a.dat

Code:

HDR1G003SAL1 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 EOF1G003SAL1

file #2: SAL003b.dat

Code:

HDR1G003SAL2 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 00010044    55110052 00060759    EOF1G003SAL2

file #3: SAL003c.dat

Code:

HDR1G003SAL3 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 00010044    EOF1G003SAL3

Thank you.

Code:

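#!/bin/bash

FILE=SAL003.dat
COUNT=1

# Start a new line before every HDR1 (\n in the replacement needs GNU sed).
# read (default IFS) trims the space that preceded the next HDR1.
{ sed 's/HDR1/\nHDR1/g' "$FILE" ; echo ; } | while read -r LINE ; do
    if [ "$LINE" ] ; then
        echo "$LINE" >"${FILE%.*}-${COUNT}.${FILE##*.}"
        COUNT=$((COUNT+1))
    fi
done

Note: the braces and the echo command after sed ensure that the last record is processed. The file has no trailing newline, so without the echo, read would return failure on the final record and the loop would skip it. The sed replacement puts HDR1 back at the start of each record.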

An awk version:

Code:

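{
  # The text before the first HDR1 lands in array[1], so start at 2
  num = split($0, array, "HDR1")
  for ( x = 2; x <= num; x++ ) {
      out = "SAL003." count++ ".dat"
      print "HDR1" array[x] > out
      close(out)   # avoid running out of open file descriptors
  }
}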

If you want letters instead of numbers in the filenames, it is a bit more complicated.
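One way the letter suffixes could be done is sketched below. It is only a sketch under some assumptions: the file name sample.dat and its three short records are made up for the demonstration, the base name is passed in through a variable I called base, and it only works for up to 26 records because it uses a single letter.

```shell
#!/bin/bash
# Hypothetical stand-in for the real data: three records on one line,
# no trailing newline.
printf 'HDR1AAA 1 EOF1AAA HDR1BBB 2 EOF1BBB HDR1CCC 3 EOF1CCC' > sample.dat

awk -v base=sample '
{
    num = split($0, array, "HDR1")
    for (x = 2; x <= num; x++) {
        # x = 2 maps to "a", x = 3 to "b", ... (valid for at most 26 records)
        letter = substr("abcdefghijklmnopqrstuvwxyz", x - 1, 1)
        out = base letter ".dat"
        sub(/ +$/, "", array[x])   # drop the space that preceded the next HDR1
        print "HDR1" array[x] > out
        close(out)
    }
}' sample.dat
```

This writes samplea.dat, sampleb.dat and samplec.dat. Each output file ends with a newline added by print, which the original data did not have.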

Try split or dd commands

I bet that file is a fixed-length record file given there are no CR+LF or LF line terminators. If so, then try the split Linux command. An excerpt of the man page is as follows:

Code:

SPLIT(1)                         User Commands                        SPLIT(1)

NAME
      split - split a file into pieces

SYNOPSIS
      split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
      Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
      size is 1000 lines, and default PREFIX is ‘x’.  With no INPUT, or when
      INPUT is -, read standard input.

      Mandatory arguments to long options are mandatory for short options too.

      -a, --suffix-length=N
              use suffixes of length N (default 2)

      -b, --bytes=SIZE
              put SIZE bytes per output file

      -C, --line-bytes=SIZE
              put at most SIZE bytes of lines per output file

      -d, --numeric-suffixes
              use numeric suffixes instead of alphabetic

      -l, --lines=NUMBER
              put NUMBER lines per output file

      --verbose
              print a diagnostic to standard error just before each output
              file is opened

      --help display this help and exit

      --version
              output version information and exit

      SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB 1000*1000,
      M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P,
      E, Z, Y.

I bet you want the --bytes option. You can also do some other useful things with the dd command.
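To illustrate what I mean, here is a small sketch. It assumes every record has exactly the same length, which may not hold for your file; the file name fixed.dat and the 8-byte records are invented for the demonstration.

```shell
#!/bin/bash
# Hypothetical fixed-length data: three 8-byte records, no line terminators.
printf 'HDR1AAA1HDR1BBB2HDR1CCC3' > fixed.dat

# -b 8: put 8 bytes in each output file
# -a 1: use one-letter suffixes, giving fixed.a, fixed.b, fixed.c
split -b 8 -a 1 fixed.dat fixed.

# dd can pull out a single record by position:
# skip 1 block of 8 bytes, then copy exactly 1 block (the second record).
dd if=fixed.dat of=second.dat bs=8 skip=1 count=1 2>/dev/null
```

If the records turn out to have different lengths, a byte count cannot find the boundaries and this approach will cut records in the wrong places.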

bgoodr

I don't think split by itself will do the trick -- the 3 sample segments are 3 different lengths. Also, using sed or awk alone seems complicated. Perhaps a 2-step process:

Code:

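# Insert a newline before every HDR1 that follows a space (GNU sed)
sed -r 's, (HDR1),\n\1,g' "$FILE" > "$FILE.tmp"
# One line per output file; -a1 allows at most 26 files (suffixes a to z)
split -a1 -l1 "$FILE.tmp" "$FILE"
# rm "$FILE.tmp"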

Warning: I did not test this code.

Code:

awk 'NR>1{print "HDR1"$0 > "SAL003-"++i".dat"}' RS=" ?HDR1|\n" largefile

This will split the file into "SAL003-1.dat", "SAL003-2.dat", and so on.
If you need alphabetic suffixes, how do you plan to handle the files after the 26th ("z")?
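One possible answer is to use two-letter suffixes the way split does (aa, ab, ..., zz), which covers up to 676 files. The sketch below is untested on real data and makes some assumptions: it needs an awk that accepts a regular expression as RS (gawk and mawk do), and the names sample2.dat and suffix() are made up for the example.

```shell
#!/bin/bash
# Hypothetical stand-in for the real data: three records on one line.
printf 'HDR1AAA 1 EOF1AAA HDR1BBB 2 EOF1BBB HDR1CCC 3 EOF1CCC' > sample2.dat

awk '
# Map 0 -> "aa", 1 -> "ab", ..., 25 -> "az", 26 -> "ba", ..., 675 -> "zz"
function suffix(n,    letters) {
    letters = "abcdefghijklmnopqrstuvwxyz"
    return substr(letters, int(n / 26) + 1, 1) substr(letters, n % 26 + 1, 1)
}
# Skip empty records, such as the one in front of the first HDR1
length($0) > 0 {
    out = "sample2-" suffix(i++) ".dat"
    print "HDR1" $0 > out
    close(out)
}' RS=' ?HDR1|\n' sample2.dat
```

This writes sample2-aa.dat, sample2-ab.dat and sample2-ac.dat. Beyond 676 records the function would need a third letter.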