LinuxQuestions.org
Old 04-24-2008, 12:00 AM   #1
khairil
LQ Newbie
 
Registered: May 2005
Distribution: gentoo
Posts: 23

how do I split large file by string?


The scenario is like this:

I have a large text file (up to 100 MB) with no line terminators (CR/LF).
Inside that file there is a repeated section, and each section starts with the keyword HDR1.

The question is: how can I split that file at each HDR1 using bash/sed/awk?

Sample content of SAL003.dat (please note that the whole file is a single line, since there is no CR/LF terminator):
Code:
HDR1G003SAL1 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 EOF1G003SAL1 HDR1G003SAL2 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 00010044     55110052 00060759     EOF1G003SAL2 HDR1G003SAL3 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 00010044     EOF1G003SAL3
For example, splitting this file would produce:
file #1: SAL003a.dat
Code:
HDR1G003SAL1 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 EOF1G003SAL1
file #2: SAL003b.dat
Code:
HDR1G003SAL2 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 00010044     55110052 00060759     EOF1G003SAL2
file #3: SAL003c.dat
Code:
HDR1G003SAL3 0000004048 55110045 00004906     55110046 00000000     55110047 00000000     55110048 00041354     55110049 00000000     55110050 00002784     55110051 00010044     EOF1G003SAL3
tq.

Last edited by khairil; 04-24-2008 at 02:09 AM.
 
Old 04-24-2008, 08:28 AM   #2
Hko
Senior Member
 
Registered: Aug 2002
Location: Groningen, The Netherlands
Distribution: ubuntu
Posts: 2,530

Code:
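#!/bin/bash

FILE=SAL003.dat
COUNT=1
# Start a new line at every HDR1 (keeping the keyword), then write each line to its own file.
( sed 's/HDR1/\nHDR1/g' "$FILE" ; echo ) | while read -r LINE ; do
    if [ "$LINE" ] ; then
        echo "$LINE" >"${FILE%.*}-${COUNT}.${FILE##*.}"
        COUNT=$((COUNT+1))
    fi
done
Note: the brackets + the echo command after 'sed' ensure that the last line is processed, because the file does not end with a newline.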
 
Old 04-24-2008, 10:14 AM   #3
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,503

An awk solution:
Code:
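{
  # The whole file is one record; array[1] is whatever precedes the first HDR1
  num = split($0,array,"HDR1")
  for ( x = 2; x <= num; x++ ) {
       out = "SAL003." count++ ".dat"
       print "HDR1" array[x] > out
       close(out)   # avoid running out of open files on large inputs
  }
}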
If you want letters instead of numbers in the filenames, it is a bit more complicated.
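
For what it's worth, one way the letter suffix could be done is with a lookup string and substr(). This is only a sketch, not a tested solution for the real file: the sample data is shortened, the names letters and out are my own, and it only covers the first 26 sections (anything after "z" is skipped).

```shell
# Work in a scratch directory so nothing existing is overwritten
cd "$(mktemp -d)" || exit 1

# Shortened stand-in for SAL003.dat: one line, no trailing newline
printf 'HDR1G003SAL1 0000004048 EOF1G003SAL1 HDR1G003SAL2 0000004048 EOF1G003SAL2 HDR1G003SAL3 0000004048 EOF1G003SAL3' > SAL003.dat

awk '
BEGIN { letters = "abcdefghijklmnopqrstuvwxyz" }
{
  num = split($0, array, "HDR1")
  # x = 2 maps to "a", x = 3 to "b", ...; stop after "z" (x = 27)
  for ( x = 2; x <= num && x <= 27; x++ ) {
       out = "SAL003" substr(letters, x - 1, 1) ".dat"
       print "HDR1" array[x] > out
       close(out)
  }
}' SAL003.dat
```

With the sample above this should leave SAL003a.dat, SAL003b.dat and SAL003c.dat in the scratch directory. Note that every piece except the last keeps the space that preceded the next HDR1.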

Last edited by colucix; 04-24-2008 at 10:15 AM.
 
Old 04-25-2008, 10:10 PM   #4
bgoodr
Member
 
Registered: Dec 2006
Location: Oregon
Distribution: RHEL[45] {x86,x86_64}, Debian "testing" {x86,x86_64}
Posts: 219

Try split or dd commands

I bet that file is a fixed-length record file, given that there are no CR+LF or LF line terminators. If so, then try the Linux split command. An excerpt of the man page follows:

Code:
SPLIT(1)                  User Commands                  SPLIT(1)

NAME
       split - split a file into pieces

SYNOPSIS
       split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
       Output  fixed-size  pieces  of  INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is x.  With no INPUT, or when
       INPUT is -, read standard input.

       Mandatory arguments to long options are mandatory for short options too.

       -a, --suffix-length=N
              use suffixes of length N (default 2)

       -b, --bytes=SIZE
              put SIZE bytes per output file

       -C, --line-bytes=SIZE
              put at most SIZE bytes of lines per output file

       -d, --numeric-suffixes
              use numeric suffixes instead of alphabetic

       -l, --lines=NUMBER
              put NUMBER lines per output file

       --verbose
              print a diagnostic to standard error just before each output file is opened

       --help display this help and exit

       --version
              output version information and exit

       SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so on for  T,  P,
       E, Z, Y.
I bet you want the --bytes option. You can also do some other useful things with the dd command.
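
To illustrate what the --bytes option does, here is a small sketch on made-up data. It assumes every record is exactly 16 bytes, which is an assumption for this demo only; the real record length of SAL003.dat is not known. The file names fixed.dat, rec_ and second.dat are hypothetical as well.

```shell
# Work in a scratch directory so nothing existing is overwritten
cd "$(mktemp -d)" || exit 1

# Three made-up 16-byte records with no line terminators (48 bytes total)
printf 'HDR1AAAAAAAAEOF1HDR1BBBBBBBBEOF1HDR1CCCCCCCCEOF1' > fixed.dat

# Cut the file every 16 bytes (-b is the short form of --bytes);
# the pieces are named rec_aa, rec_ab, rec_ac
split -b 16 fixed.dat rec_

# dd can pull out a single record: skip 1 block of 16 bytes, then copy 1 block
dd if=fixed.dat of=second.dat bs=16 skip=1 count=1 2>/dev/null
```

This only works when all records really have the same length. If the sections differ in size, the cuts will not line up with the HDR1 keywords.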

bgoodr
 
Old 04-28-2008, 01:07 PM   #5
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

I don't think split by itself will do the trick -- the 3 sample segments are 3 different lengths. Also, using sed or awk alone seems complicated. Perhaps a 2-step process:
Code:
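# Step 1: start a new line before every HDR1 that follows a space (GNU sed)
sed -r 's, (HDR1),\n\1,g'  "$FILE"  > "$FILE.tmp"
# Step 2: one line per output file; a single-letter suffix is appended to $FILE
split -a1 -l1 "$FILE.tmp" "$FILE"
# rm "$FILE.tmp"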
Warning: I did not test this code.
 
Old 04-28-2008, 10:37 PM   #6
angrybanana
Member
 
Registered: Oct 2003
Distribution: Archlinux
Posts: 147

Code:
awk 'NR>1{print "HDR1"$0 > ("SAL003-" ++i ".dat")}' RS=" ?HDR1|\n"  largefile
This will split them into "SAL003-1.dat", "SAL003-2.dat", etc.
If you need letters instead, how do you plan to handle files after the 26th ("z")?
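
One possible answer to the "after z" question is to do what split does and use two-letter suffixes (aa, ab, ..., zz), which covers 676 files. A rough sketch, assuming gawk or mawk (a regular expression in RS is not guaranteed by POSIX) and using a shortened sample; the names letters, n and out are my own.

```shell
# Work in a scratch directory so nothing existing is overwritten
cd "$(mktemp -d)" || exit 1

# Shortened stand-in for the real file: one line, no trailing newline
printf 'HDR1G003SAL1 0001 EOF1G003SAL1 HDR1G003SAL2 0002 EOF1G003SAL2 HDR1G003SAL3 0003 EOF1G003SAL3' > largefile

awk '
BEGIN { letters = "abcdefghijklmnopqrstuvwxyz" }
NR > 1 {
  n = NR - 2    # 0-based section index (record 1 is the empty text before the first HDR1)
  out = "SAL003-" substr(letters, int(n / 26) + 1, 1) substr(letters, n % 26 + 1, 1) ".dat"
  print "HDR1" $0 > out
  close(out)
}' RS=" ?HDR1|\n" largefile
```

With the sample above the output files should be SAL003-aa.dat, SAL003-ab.dat and SAL003-ac.dat. Past 676 sections the first letter runs out, so a longer suffix or plain numbers would be needed.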
 
  

