How do I split a large file by a string?

khairil · 04-24-2008, 12:00 AM

The scenario is like this:

I have a large text file (up to 100 MB) with no line terminators (CR/LF).
Inside that file there is a repeated section, and each section starts with the keyword HDR1.

The question is: how can I split the file at each HDR1 using bash, sed, or awk?

Sample content of SAL003.dat (note that the file is a single line, since there is no CR/LF terminator):

Code:

HDR1G003SAL1 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 EOF1G003SAL1 HDR1G003SAL2 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 00010044     55110052 00060759     EOF1G003SAL2 HDR1G003SAL3 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 00010044     EOF1G003SAL3

For example, splitting this file would produce:
file #1: SAL003a.dat

Code:

HDR1G003SAL1 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 EOF1G003SAL1

file #2: SAL003b.dat

Code:

HDR1G003SAL2 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 00010044     55110052 00060759     EOF1G003SAL2

file #3: SAL003c.dat

Code:

HDR1G003SAL3 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 00010044     EOF1G003SAL3

Thanks.

Hko · 04-24-2008, 08:28 AM

Code:

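#!/bin/bash

FILE=SAL003.dat
COUNT=1
# Start a new line at every HDR1 (GNU sed understands \n in the replacement).
( sed 's/ *HDR1/\nHDR1/g' "$FILE" ; echo ) | while IFS= read -r LINE ; do
    # Skip the empty line in front of the first HDR1.
    if [ "$LINE" ] ; then
        echo "$LINE" >"${FILE%.*}-${COUNT}.${FILE##*.}"
        COUNT=$((COUNT+1))
    fi
done

Note: the parentheses and the echo command after sed ensure that the last line is processed. The file has no trailing newline, so without them read would hit end-of-file and the loop would skip the last section. Each output file ends with the newline that echo adds.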

colucix · 04-24-2008, 10:14 AM

An awk solution:

Code:

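awk '{
  # array[1] holds whatever precedes the first HDR1, so start at 2
  num = split($0,array,"HDR1")
  for ( x = 2; x <= num; x++ ) {
       out = "SAL003." (count++) ".dat"
       print "HDR1" array[x] > out
       close(out)   # keep the number of open files small
  }
}' SAL003.dat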

If you want letters instead of numbers in the filenames, it is a bit more complicated.
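
A rough sketch of the lettered variant, assuming ASCII and no more than 26 sections: the loop index is turned into a letter with sprintf("%c", ...). The shortened sample data and the trimming of the separator space are my own assumptions, not part of the original file; run it in a scratch directory.

```shell
# Hypothetical three-section sample, written without a trailing newline.
printf 'HDR1G003SAL1 0001 EOF1G003SAL1 HDR1G003SAL2 0002 EOF1G003SAL2 HDR1G003SAL3 0003 EOF1G003SAL3' > SAL003.dat

awk '{
  num = split($0, array, "HDR1")
  for (x = 2; x <= num; x++) {
    piece = array[x]
    sub(/ +$/, "", piece)                        # drop the space that preceded the next HDR1
    out = "SAL003" sprintf("%c", 95 + x) ".dat"  # x = 2 gives "a" (ASCII 97), x = 3 gives "b", ...
    printf "%s", "HDR1" piece > out              # printf appends no newline
    close(out)
  }
}' SAL003.dat
```

This writes SAL003a.dat, SAL003b.dat and SAL003c.dat. Past 26 sections the single letter runs out, so a longer suffix scheme would be needed.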

bgoodr · 04-25-2008, 10:10 PM

I bet that file is a fixed-length record file given there are no CR+LF or LF line terminators. If so, then try the split Linux command. An excerpt of the man page is as follows:

Code:

SPLIT(1)                                                             User Commands                                                             SPLIT(1)

NAME
       split - split a file into pieces

SYNOPSIS
       split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
       Output  fixed-size  pieces  of  INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is ‘x’.  With no INPUT, or when
       INPUT is -, read standard input.

       Mandatory arguments to long options are mandatory for short options too.

       -a, --suffix-length=N
              use suffixes of length N (default 2)

       -b, --bytes=SIZE
              put SIZE bytes per output file

       -C, --line-bytes=SIZE
              put at most SIZE bytes of lines per output file

       -d, --numeric-suffixes
              use numeric suffixes instead of alphabetic

       -l, --lines=NUMBER
              put NUMBER lines per output file

       --verbose
              print a diagnostic to standard error just before each output file is opened

       --help display this help and exit

       --version
              output version information and exit

       SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so on for  T,  P,
       E, Z, Y.

I bet you want the --bytes option. You can also do some other useful things with the dd command.
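
To illustrate what --bytes does, here is a small sketch on made-up data. The file name fixed.dat, the fixed_ prefix and the 8-byte record length are assumptions for the demo only; this approach applies only if every section really has the same byte length.

```shell
# Hypothetical fixed-length data: three 8-byte records, no line terminators.
printf 'REC00001REC00002REC00003' > fixed.dat

# -b 8 puts 8 bytes in each piece; -a 1 uses single-letter suffixes (a, b, c, ...)
split -a 1 -b 8 fixed.dat fixed_

ls fixed_*
```

The pieces come out as fixed_a, fixed_b and fixed_c, each holding exactly one record.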

bgoodr

archtoad6 · 04-28-2008, 01:07 PM

I don't think split by itself will do the trick -- the 3 sample segments are 3 different lengths. Also, using sed or awk alone seems complicated. Perhaps a 2-step process:

Code:

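FILE=SAL003.dat
# Put each section on its own line: the space before HDR1 becomes a newline (GNU sed)
sed -r 's, (HDR1),\n\1,g'  "$FILE"  > "$FILE.tmp"
# One line per piece; -a1 allows at most 26 pieces; --additional-suffix is a GNU split option
split -a1 -l1 --additional-suffix=.dat "$FILE.tmp" "${FILE%.*}"
# rm "$FILE.tmp"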

Warning: I did not test this code.

angrybanana · 04-28-2008, 10:37 PM

Code:

awk 'NR>1{out = "SAL003-" (++i) ".dat"; print "HDR1" $0 > out; close(out)}' RS=" ?HDR1|\n"  largefile

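This will split the file into "SAL003-1.dat", "SAL003-2.dat", and so on. A regular expression in RS is an extension found in gawk and mawk, not in every awk.
If you need letters, how do you plan to handle files after the 26th ("z")?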
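
One possible answer, as a sketch only: continue past "z" with "aa", "ab", and so on, the way spreadsheet columns are named. The suffix() helper, the many.dat sample and the many- prefix below are made up for illustration, and ASCII is assumed.

```shell
# Hypothetical input: 28 short sections on one line, enough to run past "z".
for i in $(seq 1 28); do printf 'HDR1REC%02d EOF1 ' "$i"; done > many.dat

awk '
# suffix(1) = "a", suffix(26) = "z", suffix(27) = "aa", suffix(28) = "ab", ...
function suffix(n,    s) {
  s = ""
  while (n > 0) {
    n--
    s = sprintf("%c", 97 + n % 26) s   # 97 is ASCII "a"
    n = int(n / 26)
  }
  return s
}
{
  num = split($0, array, "HDR1")
  for (x = 2; x <= num; x++) {
    piece = array[x]
    sub(/ +$/, "", piece)              # trim the space that separated sections
    out = "many-" suffix(x - 1) ".dat"
    printf "%s", "HDR1" piece > out    # printf appends no newline
    close(out)
  }
}' many.dat
```

With this naming a plain ls lists many-aa.dat before many-b.dat, so if ordering matters a fixed-width suffix (as split uses) may be the better choice.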