LinuxQuestions.org


khairil 04-24-2008 12:00 AM

how do I split large file by string?
 
The scenario is like this:

I have a large text file (up to 100 MB) with no line terminators (CR/LF).
Inside that file there is a repeated section that starts with the keyword HDR1.

The question is: how can I split that file at each HDR1 using bash/sed/awk?

Sample content of SAL003.dat (please note that the whole file is a single line, since there are no CR/LF terminators):
Code:

HDR1G003SAL1 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 EOF1G003SAL1 HDR1G003SAL2 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 00010044    55110052 00060759    EOF1G003SAL2 HDR1G003SAL3 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 00010044    EOF1G003SAL3
For example, splitting this file would produce:
file #1: SAL003a.dat
Code:

HDR1G003SAL1 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 EOF1G003SAL1
file #2: SAL003b.dat
Code:

HDR1G003SAL2 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 00010044    55110052 00060759    EOF1G003SAL2
file #3: SAL003c.dat
Code:

HDR1G003SAL3 0000004048 55110045 00004906    55110046 00000000    55110047 00000000    55110048 00041354    55110049 00000000    55110050 00002784    55110051 00010044    EOF1G003SAL3
tq.

Hko 04-24-2008 08:28 AM

Code:

#!/bin/bash

FILE=SAL003.dat
COUNT=1
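# Start a new line before every HDR1 (GNU sed treats \n in the replacement as a newline).
# The trailing echo terminates the last line; read also trims the space left before each HDR1.
( sed 's/HDR1/\nHDR1/g' "$FILE" ; echo ) | while read -r LINE ; do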
    if [ "$LINE" ] ; then
        echo "$LINE" >"${FILE%.*}-${COUNT}.${FILE##*.}"
        COUNT=$((COUNT+1))
    fi
done

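Note: the parentheses and the echo command after sed ensure that the last line is processed. The file has no trailing newline, so without them read would return failure on the last line and the loop would skip it.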
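
The effect the note describes can be reproduced with a tiny experiment. This is only a sketch: the two-line sample string and the variable names plain and fixed are made up for illustration and are not part of the script above.

```shell
#!/bin/bash
# Two "lines", but the second one has no trailing newline.
data='first
second'

# Plain loop: read returns non-zero at end of input for the unterminated
# last line, so the loop body never runs for it.
plain=$(printf '%s' "$data" | while read -r LINE ; do printf '[%s]' "$LINE" ; done)

# Appending a newline (which is what the extra echo does) completes the last line.
fixed=$( ( printf '%s' "$data" ; echo ) | while read -r LINE ; do printf '[%s]' "$LINE" ; done)

echo "plain: $plain"
echo "fixed: $fixed"
```

The first loop reports only [first]; the second reports [first][second].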

colucix 04-24-2008 10:14 AM

An awk solution:
Code:

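{
  # array[1] is whatever precedes the first HDR1 (empty here), so start at 2
  num = split($0,array,"HDR1")
  for ( x = 2; x <= num; x++ ) {
      out = "SAL003." count++ ".dat"   # count starts at 0: SAL003.0.dat, SAL003.1.dat, ...
      print "HDR1" array[x] > out
      close(out)                       # do not keep every output file open
  }
}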

If you want letters instead of numbers in the filenames, it is a bit more complicated.
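
One possible way to get letters, sketched under a few assumptions: there are at most 26 sections, the tiny sample data below stands in for the real file, and the scratch directory and the names letters, piece and out are invented for this example. It looks the letter up in a string with substr() rather than relying on any character-code function.

```shell
#!/bin/bash
# Work in a scratch directory with a tiny made-up sample (not the real data).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'HDR1A 1 EOF1A HDR1B 2 EOF1B HDR1C 3 EOF1C' > SAL003.dat

awk '
{
  letters = "abcdefghijklmnopqrstuvwxyz"
  num = split($0, array, "HDR1")
  for ( x = 2; x <= num; x++ ) {
      # Only 26 one-letter names exist; stop rather than reuse a name.
      if ( x - 1 > 26 ) { print "more than 26 sections" > "/dev/stderr"; exit 1 }
      piece = array[x]
      sub(/ +$/, "", piece)                       # drop the separator space
      out = "SAL003" substr(letters, x - 1, 1) ".dat"
      print "HDR1" piece > out
      close(out)
  }
}' SAL003.dat

ls
```

On the sample this writes SAL003a.dat, SAL003b.dat and SAL003c.dat next to the input file.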

bgoodr 04-25-2008 10:10 PM

Try split or dd commands
 
I bet that file is a fixed-length record file, given that there are no CR+LF or LF line terminators. If so, then try the Linux split command. An excerpt of the man page is as follows:

Code:

SPLIT(1)                                                            User Commands                                                            SPLIT(1)

NAME
      split - split a file into pieces

SYNOPSIS
      split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
      Output  fixed-size  pieces  of  INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is ‘x’.  With no INPUT, or when
      INPUT is -, read standard input.

      Mandatory arguments to long options are mandatory for short options too.

      -a, --suffix-length=N
              use suffixes of length N (default 2)

      -b, --bytes=SIZE
              put SIZE bytes per output file

      -C, --line-bytes=SIZE
              put at most SIZE bytes of lines per output file

      -d, --numeric-suffixes
              use numeric suffixes instead of alphabetic

      -l, --lines=NUMBER
              put NUMBER lines per output file

      --verbose
              print a diagnostic to standard error just before each output file is opened

      --help display this help and exit

      --version
              output version information and exit

      SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so on for  T,  P,
      E, Z, Y.

I bet you want the --bytes option. You can also do some other useful things with the dd command.
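
A rough sketch of what that could look like. It only applies if every record really has the same length; the 16-byte record size, the sample data and the names fixed.dat, piece_ and second.dat are all assumptions made up for this illustration.

```shell
#!/bin/bash
# Scratch directory and a made-up file of three 16-byte records.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'HDR1AAAAAAAEOF1 HDR1BBBBBBBEOF1 HDR1CCCCCCCEOF1 ' > fixed.dat

# -b 16 puts 16 bytes in each piece; -a 1 gives one-letter suffixes (piece_a, ...).
split -a 1 -b 16 fixed.dat piece_

# dd can pull out a single record: skip 1 block of 16 bytes, then copy 1 block.
dd if=fixed.dat of=second.dat bs=16 skip=1 count=1 2>/dev/null

ls
```

Here piece_b and second.dat should hold the same 16 bytes, the second record.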

bgoodr

archtoad6 04-28-2008 01:07 PM

I don't think split by itself will do the trick -- the 3 sample segments are 3 different lengths. Also, using sed or awk alone seems complicated. Perhaps a 2-step process:
Code:

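# Put a newline in place of the space before every HDR1 (needs GNU sed for -r and \n)
sed -r 's, (HDR1),\n\1,g'  "$FILE"  > "$FILE.tmp"
# One line per piece, one-letter suffixes: ${FILE}a, ${FILE}b, ... (at most 26 pieces)
split -a1 -l1 "$FILE.tmp" "$FILE"
# rm "$FILE.tmp"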

Warning: I did not test this code.
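
A quick way to try the two steps on something small before running them on the real file. This is a sketch assuming GNU sed and split; the three-section sample, the scratch directory and the bad flag are invented for the check.

```shell
#!/bin/bash
dir=$(mktemp -d) && cd "$dir" || exit 1
FILE=SAL003.dat
printf 'HDR1A 1 EOF1A HDR1B 2 EOF1B HDR1C 3 EOF1C' > "$FILE"

sed -r 's, (HDR1),\n\1,g' "$FILE" > "$FILE.tmp"
split -a1 -l1 "$FILE.tmp" "$FILE"
rm "$FILE.tmp"

# Every piece should contain exactly one HDR1, at the very start.
bad=0
for piece in "$FILE"? ; do
    [ "$(grep -o 'HDR1' "$piece" | wc -l)" -eq 1 ] || bad=1
    case "$(cat "$piece")" in HDR1*) ;; *) bad=1 ;; esac
done
echo "pieces: $(ls "$FILE"? | wc -l), problems: $bad"
```

The pieces come out as SAL003.data, SAL003.datb and SAL003.datc, because split appends its suffix to the prefix it is given.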

angrybanana 04-28-2008 10:37 PM

Code:

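# A regular expression in RS needs gawk or mawk; close() keeps the number of open files low
awk 'NR>1{f = "SAL003-" ++i ".dat"; print "HDR1" $0 > f; close(f)}' RS=" ?HDR1|\n"  largefile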
This will split them into "SAL003-1.dat", "SAL003-2.dat", etc.
If you need letters instead, how do you plan to handle files after the 26th ("z")?
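
One possible answer is to borrow split's naming scheme and use two letters (aa, ab, ..., zz), which covers 676 files. The following is a sketch under assumptions: it needs an awk that accepts a regular expression in RS (gawk or mawk), the 30-section sample is generated just for the demonstration, and the suffix() helper is hypothetical.

```shell
#!/bin/bash
dir=$(mktemp -d) && cd "$dir" || exit 1
# 30 tiny made-up sections on one line, no trailing newline.
sep=''
for n in $(seq 1 30) ; do
    printf '%sHDR1S%d EOF1S%d' "$sep" "$n" "$n"
    sep=' '
done > largefile

awk '
function suffix(n,    letters) {        # n = 0 -> "aa", 1 -> "ab", ..., 675 -> "zz"
    letters = "abcdefghijklmnopqrstuvwxyz"
    return substr(letters, int(n / 26) + 1, 1) substr(letters, n % 26 + 1, 1)
}
NR > 1 {
    # Two letters give 26 * 26 = 676 names; stop rather than reuse one.
    if ( i >= 676 ) { print "more than 676 sections" > "/dev/stderr"; exit 1 }
    out = "SAL003" suffix(i++) ".dat"
    print "HDR1" $0 > out
    close(out)
}' RS=" ?HDR1|\n" largefile
```

With 30 sections the names run from SAL003aa.dat to SAL003az.dat and then SAL003ba.dat to SAL003bd.dat.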

