Old 05-28-2007, 01:00 PM   #1
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Rep: Reputation: 32
Splitting humongously huge text file


Hi,

I have a text file that consists of 20,000 x 6,000 = 120,000,000 lines.

I would like to read in 6,000 lines at a time and process them, but I am having trouble with sed and awk: they print the lines but never return to the prompt. What would be the best way to approach this? Thanks.
 
Old 05-28-2007, 01:17 PM   #2
stress_junkie
Senior Member
 
Registered: Dec 2005
Location: Massachusetts, USA
Distribution: Ubuntu 10.04 and CentOS 5.5
Posts: 3,873

Rep: Reputation: 335
You can use the split command as follows.
Code:
split -l 6000 file
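By default the pieces are written to files named xaa, xab, and so on; an optional prefix argument controls the names:
Code:
split -l 6000 file chunk_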
 
Old 05-28-2007, 02:29 PM   #3
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by stress_junkie
You can use the split command as follows.
Code:
split -l 6000 file
Split is not an option b/c the file is so huge, and I would make another bunch of files, essentially doubling the data. So I need to extract pieces on the go. Is there a way to fix sed or awk so they can do the job?
 
Old 05-28-2007, 02:41 PM   #4
gilead
Senior Member
 
Registered: Dec 2005
Location: Brisbane, Australia
Distribution: Slackware64 14.0
Posts: 4,141

Rep: Reputation: 168
You can specify addresses in sed. If you wanted to change all instances of linux to Linux in the first 6000 lines, you could use:
Code:
sed -e '1,6000s/linux/Linux/g' file
Check the man page though (I didn't before posting)...
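One thing worth adding for the extraction case: sed will keep reading to the end of the file even after the range has been printed, which is probably why it appears to hang. A q command makes it quit as soon as the last wanted line is done:
Code:
sed -n '1,6000p;6000q' file
For a later chunk, use that chunk's last line number in the q command.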
 
Old 05-28-2007, 05:21 PM   #5
stress_junkie
Senior Member
 
Registered: Dec 2005
Location: Massachusetts, USA
Distribution: Ubuntu 10.04 and CentOS 5.5
Posts: 3,873

Rep: Reputation: 335
Quote:
Originally Posted by frankie_DJ
Split is not an option b/c the file is so huge, and I would make another bunch of files, essentially doubling the data. So I need to extract pieces on the go. Is there a way to fix sed or awk so they can do the job?
Depending on what or how you are processing the file, you could feed each piece directly to your processing job instead of writing it out. GNU split has a --filter option that pipes every chunk to a command of your choice:
Code:
split -l 6000 --filter='processing-software' file
That way you wouldn't create any more files on disk.

If I understand your objection, you want to insert EOF marks into the original file, or something like that. I don't think that will work: the file system would still end up managing the shorter pieces as new, independent files, so you would have the entire original file plus the new segments.

Last edited by stress_junkie; 05-28-2007 at 05:28 PM.
 
Old 05-28-2007, 09:36 PM   #6
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

There are a few approaches to this problem. One is to extract 6000-line chunks from the file and feed them to your processing system.

Here's a script that will do that:
Code:
#!/bin/sh

# @(#) s1       Demonstrate chunk extraction.

set -o nounset
echo " sh version: $BASH_VERSION"

debug="echo"
debug=":"

FILE=${1-data1}
increment=${2-6}

# Place the fixed number here if wc takes too long.

LAST_LINE_IN_FILE=$( wc -l < "$FILE" )
$debug " final line number is $LAST_LINE_IN_FILE"

first=1
while :
do
        last=$(( first+increment-1 ))
        $debug " limits: $first $last"
        sed -n "${first},${last}p" $FILE |
        edges -n -l 1

        first=$(( last+1 ))
        if [ $first -gt $LAST_LINE_IN_FILE ]
        then
                echo " sequence ends, line $first beyond $LAST_LINE_IN_FILE"
                break
        fi
done

exit 0
The data in file "data1" is a sequence of 25 numbered lines from Moby Dick, and the chunk size is 6 lines. The command edges is a utility I use to look at the first and last lines of a file, so each display below is a "first line ... last line" chunk. That's where you'd place your program. Running this produces:
Code:
% ./s1
 sh version: 2.05b.0(1)-release
     1  # Moby Dick, Chapter 1 The Loomings.  Page numbers removed.
   ...
     6  to interest me on shore, I thought I would sail about a little
     7  and see the watery part of the world.  It is a way I have of
   ...
    12  the rear of every funeral I meet; and especially whenever my
    13  hypos get such an upper hand of me, that it requires a strong
   ...
    18  This is my substitute for pistol and ball.  With a philosophical
    19  flourish Cato throws himself upon his sword; I quietly take to
   ...
    24  round by wharves as Indian isles by coral reefs--commerce
    25  (end of excerpt)
   ...
    25  (end of excerpt)
 sequence ends, line 31 beyond 25
As for timing, nothing is going to be cheap, but for large chunks it isn't too bad. Here are the times for a 100K-line file with 6000-line chunks:
Code:
% time ./s1 /tmp/sentence.3 6000 > /dev/null
0.653u 0.202s 0:00.87 97.7%     0+0k 0+0io 0pf+0w
I was concerned about the size of the file and the cost of positioning within it, but even a perl program that keeps track of the file position and then does seeks (positioning without reading) wasn't that much faster ... cheers, makyo
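A caveat about the loop in s1: every sed call rescans the file from line 1 (and, without a q command, reads on to EOF as well), so the total work grows quadratically with the number of chunks. A single awk pass avoids the rescanning; a minimal sketch, with processing-software as a placeholder for the real command:
Code:
awk -v n=6000 '{
    print | "processing-software"                   # current line goes down the pipe
    if (NR % n == 0) close("processing-software")   # end of chunk; next print reopens
}' file
Closing the pipe at every 6000th line ends one run of the command; the next print starts a fresh run for the following chunk.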
 
Old 05-30-2007, 02:38 AM   #7
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
perl will handle a file that large
 
Old 05-30-2007, 07:23 AM   #8
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
I'm curious as to why you want to do 6000 recs at a time?
Also, bigearsbilly, although I agree Perl should handle it, it depends on rec len: eg 100 chars (bytes) per rec => 12GB (in decimal) ... that's a lot of RAM/swap for a PC if you slurp the whole file.
Normally I'd go rec-by-rec for that.
 
Old 05-30-2007, 04:38 PM   #9
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by chrism01
I'm curious as to why you want to do 6000 recs at a time?
Also, bigearsbilly, although I agree Perl should handle it, it depends on rec len: eg 100 chars (bytes) per rec => 12GB (in decimal) ... that's a lot of RAM/swap for a PC if you slurp the whole file.
Normally I'd go rec-by-rec for that.
Each 6000-line block is a particular configuration of the protein molecule (the positions of each atom). It's like a snapshot of a vibrating molecule in a molecular dynamics simulation. And there are 20,000 snapshots.
 
Old 05-30-2007, 07:29 PM   #10
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
What kind of transformations do you need to make?
It's always easier to make targeted suggestions if
there's a target ... in awk, for instance, you could
use something along the lines of NR%6000 to apply stuff
to every umpteenth record in the file.
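As a sketch of the idea, this fires once at every 6000th record:
Code:
awk 'NR % 6000 == 0 { printf "snapshot %d ends at line %d\n", NR/6000, NR }' file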


Cheers,
Tink
 
Old 05-30-2007, 10:31 PM   #11
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by makyo
Hi.


Code:
#!/bin/sh


        sed -n "${first},${last}p" $FILE |
       

exit 0
cheers, makyo
makyo,
thanks for the elaborate script, but the line from your code I quoted is exactly the reason for my post: it doesn't work for me. sed (same thing with awk) prints all 6000 lines but doesn't return to the prompt, so I can't put it in a loop. I was hoping someone knows a way to fix this.
 
Old 05-30-2007, 10:40 PM   #12
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by Tinkster
What kind of transformations do you need to make?
It's always easier to make targeted suggestions if
there's a target ... in awk, for instance, you could
use something along the lines of NR%6000 to apply stuff
to every umpteenth record in the file.


Cheers,
Tink
Hi Tink,

Haven't seen your name around here for some time. Glad to see original gurus are still around :^)

Well it's not a transformation, I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
 
Old 05-31-2007, 12:38 AM   #13
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,128

Rep: Reputation: 4120
Quote:
Originally Posted by frankie_DJ
I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
Nope - you are having trouble with the size ... Patience my friend.
I just had a play - after the nominated lines were printed I added a sleep, and the disk went ballistic.
I used sed - a stream editor. Obviously it goes on and reads the file till EOF.

If it were me I'd probably do it in Perl (C should be just as easy). That way you can maintain a count, and exit when happy.
As for @chrism01's concern, you merely need to do the read in scalar rather than list context (i.e. a line at a time), and send it off to STDOUT.
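The count-and-exit idea also works in awk; for example, printing the second 6000-line chunk and quitting immediately afterwards, so the rest of the file is never read:
Code:
awk 'NR > 6000 { print } NR == 12000 { exit }' file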
 
Old 05-31-2007, 02:18 AM   #14
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by frankie_DJ
Hi Tink,

Haven't seen your name around here for some time. Glad to see original gurus are still around :^)
Heh. I'm not a guru, just been here awhile ;}


Quote:
Originally Posted by frankie_DJ
Well it's not a transformation, I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
What is supposed to happen with the chunks after the
'processing by other program'? There may well be a
way to do it in awk :}

Only today did I use awk to sieve through a 130MB
log file ;} and re-format it for data warehousing.



Cheers,
Tink
 
Old 05-31-2007, 02:20 AM   #15
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
Yeah, rec-by-rec as I said.
You can use a piped open if you need to pipe it to the next prog, or just read+write 6000 lines to a new file if that's what it requires.
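A minimal Perl sketch of the piped-open version, with volume-calc standing in for whatever program does the real work:
Code:
#!/usr/bin/perl
use strict;
use warnings;

my $chunk = 6000;   # lines per snapshot
my $out;            # pipe to the current run of the downstream program

while (my $line = <>) {
    # open a fresh pipe at the start of each chunk
    open($out, '|-', 'volume-calc') or die "pipe: $!" unless $out;
    print $out $line;
    if ($. % $chunk == 0) {    # $. is the current input line number
        close($out);           # EOF tells volume-calc the snapshot is complete
        undef $out;
    }
}
close($out) if $out;           # flush a final partial chunk, if any
Reading one record at a time (scalar context) keeps memory flat no matter how big the file is.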
 
  

