Old 05-28-2007, 01:00 PM   #1
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Rep: Reputation: 32
Splitting humongously huge text file


Hi,

I have a text file that consists of 20,000 x 6,000 = 120,000,000 lines.

I would like to read in 6,000 lines at a time and process them, but I am having trouble with sed and awk: they print the lines but never return to the prompt. What would be the best way to approach this? Thanks.
 
Old 05-28-2007, 01:17 PM   #2
stress_junkie
Senior Member
 
Registered: Dec 2005
Location: Massachusetts, USA
Distribution: Ubuntu 10.04 and CentOS 5.5
Posts: 3,873

Rep: Reputation: 335
You can use the split command as follows.
Code:
split -l 6000 file
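By default the pieces are written to files named xaa, xab, and so on; an optional prefix argument controls the names:
Code:
split -l 6000 file chunk_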
 
Old 05-28-2007, 02:29 PM   #3
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by stress_junkie
You can use the split command as follows.
Code:
split -l 6000 file
Split is not an option b/c the file is so huge, and I would make another bunch of files, essentially doubling the data. So I need to extract pieces on the go. Is there a way to fix sed or awk so they can do the job?
 
Old 05-28-2007, 02:41 PM   #4
gilead
Senior Member
 
Registered: Dec 2005
Location: Brisbane, Australia
Distribution: Slackware64 14.0
Posts: 4,141

Rep: Reputation: 168
You can specify addresses in sed. If you wanted to change all instances of linux to Linux in the first 6000 lines, you could use:
Code:
sed -e '1,6000s/linux/Linux/g' file
Check the man page though (I didn't before posting)...
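One thing worth adding for the extraction case: sed will keep reading to the end of the file even after the range has been printed, which is probably why it appears to hang. A q command makes it quit as soon as the last wanted line is done:
Code:
sed -n '1,6000p;6000q' file
For a later chunk, use that chunk's last line number in the q command.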
 
Old 05-28-2007, 05:21 PM   #5
stress_junkie
Senior Member
 
Registered: Dec 2005
Location: Massachusetts, USA
Distribution: Ubuntu 10.04 and CentOS 5.5
Posts: 3,873

Rep: Reputation: 335
Quote:
Originally Posted by frankie_DJ
Split is not an option b/c the file is so huge, and I would make another bunch of files, essentially doubling the data. So I need to extract pieces on the go. Is there a way to fix sed or awk so they can do the job?
Depending on what or how you are processing the file, you could feed each piece directly to your processing job instead of writing it out. GNU split has a --filter option that pipes every chunk to a command of your choice:
Code:
split -l 6000 --filter='processing-software' file
That way you wouldn't create any more files on disk.

If I understand your objection, you want to insert EOF marks into the original file, or something like that. I don't think that will work: the file system would still end up managing the shorter pieces as new, independent files, so you would have the entire original file plus the new segments.

Last edited by stress_junkie; 05-28-2007 at 05:28 PM.
 
Old 05-28-2007, 09:36 PM   #6
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

There are a few approaches to this problem. One is to extract 6000-line chunks from the file and feed them to your processing system.

Here's a script that will do that:
Code:
#!/bin/sh

# @(#) s1       Demonstrate chunk extraction.

set -o nounset
echo " sh version: $BASH_VERSION"

debug="echo"
debug=":"

FILE=${1-data1}
increment=${2-6}

# Place the fixed number here if wc takes too long.

LAST_LINE_IN_FILE=$( wc -l < "$FILE" )
$debug " final line number is $LAST_LINE_IN_FILE"

first=1
while :
do
        last=$(( first+increment-1 ))
        $debug " limits: $first $last"
        sed -n "${first},${last}p" $FILE |
        edges -n -l 1

        first=$(( last+1 ))
        if [ $first -gt $LAST_LINE_IN_FILE ]
        then
                echo " sequence ends, line $first beyond $LAST_LINE_IN_FILE"
                break
        fi
done

exit 0
The data in file "data1" is a sequence of 25 numbered lines from Moby Dick, and the chunk size is 6 lines. The command edges is a utility I use to look at the first and last lines of a file, so each display below is a "first line ... last line" chunk. That's where you'd place your program. Running this produces:
Code:
% ./s1
 sh version: 2.05b.0(1)-release
     1  # Moby Dick, Chapter 1 The Loomings.  Page numbers removed.
   ...
     6  to interest me on shore, I thought I would sail about a little
     7  and see the watery part of the world.  It is a way I have of
   ...
    12  the rear of every funeral I meet; and especially whenever my
    13  hypos get such an upper hand of me, that it requires a strong
   ...
    18  This is my substitute for pistol and ball.  With a philosophical
    19  flourish Cato throws himself upon his sword; I quietly take to
   ...
    24  round by wharves as Indian isles by coral reefs--commerce
    25  (end of excerpt)
   ...
    25  (end of excerpt)
 sequence ends, line 31 beyond 25
As for timing, nothing is going to be cheap, but for large chunks it isn't too bad. Here are the times for a 100K-line file with 6000-line chunks:
Code:
% time ./s1 /tmp/sentence.3 6000 > /dev/null
0.653u 0.202s 0:00.87 97.7%     0+0k 0+0io 0pf+0w
I was concerned about the size of the file and the cost of positioning within it, but even a perl program that keeps track of the file position and then does seeks (positioning without reading) wasn't that much faster ... cheers, makyo
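A caveat about the loop in s1: every sed call rescans the file from line 1 (and, without a q command, reads on to EOF as well), so the total work grows quadratically with the number of chunks. A single awk pass avoids the rescanning; a minimal sketch, with processing-software as a placeholder for the real command:
Code:
awk -v n=6000 '{
    print | "processing-software"                   # current line goes down the pipe
    if (NR % n == 0) close("processing-software")   # end of chunk; next print reopens
}' file
Closing the pipe at every 6000th line ends one run of the command; the next print starts a fresh run for the following chunk.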
 
Old 05-30-2007, 02:38 AM   #7
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
perl will handle a file that large
 
Old 05-30-2007, 07:23 AM   #8
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
I'm curious as to why you want to do 6000 recs at a time?
Also, bigearsbilly, although I agree Perl should handle it, it depends on rec len: eg 100 chars (bytes) per rec => 12GB (in decimal) ... that's a lot of RAM/swap for a PC if you slurp the whole file.
Normally I'd go rec-by-rec for that.
 
Old 05-30-2007, 04:38 PM   #9
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by chrism01
I'm curious as to why you want to do 6000 recs at a time?
Also, bigearsbilly, although I agree Perl should handle it, it depends on rec len: eg 100 chars (bytes) per rec => 12GB (in decimal) ... that's a lot of RAM/swap for a PC if you slurp the whole file.
Normally I'd go rec-by-rec for that.
Each 6000-line block is a particular configuration of the protein molecule (the positions of each atom). It's like a snapshot of a vibrating molecule in a molecular dynamics simulation. And there are 20,000 snapshots.
 
Old 05-30-2007, 07:29 PM   #10
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
What kind of transformations do you need to make?
It's always easier to make targeted suggestions if
there's a target ... in awk, for instance, you could
use something along the lines of NR%6000 to apply stuff
to every umpteenth record in the file.
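As a sketch of the idea, this fires once at every 6000th record:
Code:
awk 'NR % 6000 == 0 { printf "snapshot %d ends at line %d\n", NR/6000, NR }' file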


Cheers,
Tink
 
Old 05-30-2007, 10:31 PM   #11
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by makyo
Hi.


Code:
#!/bin/sh


        sed -n "${first},${last}p" $FILE |
       

exit 0
cheers, makyo
makyo,
thanks for the elaborate script, but the line from your code I quoted is exactly the reason for my post: it doesn't work for me. sed (same thing with awk) prints all 6000 lines but doesn't return to the prompt, so I can't put it in a loop. I was hoping someone knows a way to fix this.
 
Old 05-30-2007, 10:40 PM   #12
frankie_DJ
Member
 
Registered: Sep 2004
Location: NorCal
Distribution: slackware 10.1 comfy, Solaris10 learning
Posts: 232

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by Tinkster
What kind of transformations do you need to make?
It's always easier to make targeted suggestions if
there's a target ... in awk, for instance, you could
use something along the lines of NR%6000 to apply stuff
to every umpteenth record in the file.


Cheers,
Tink
Hi Tink,

Haven't seen your name around here for some time. Glad to see original gurus are still around :^)

Well it's not a transformation, I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
 
Old 05-31-2007, 12:38 AM   #13
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,128

Rep: Reputation: 4120
Quote:
Originally Posted by frankie_DJ
I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
Nope - you are having trouble with the size ... Patience my friend.
I just had a play - after the nominated lines were printed I added a sleep, and the disk went ballistic.
I used sed - a stream editor. Obviously it goes on and reads the file till EOF.

If it were me I'd probably do it in Perl (C should be just as easy). That way you can maintain a count, and exit when happy.
As for @chrism01's concern, you merely need to do the read in scalar rather than list context (i.e. a line at a time), and send it off to STDOUT.
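The count-and-exit idea also works in awk; for example, printing the second 6000-line chunk and quitting immediately afterwards, so the rest of the file is never read:
Code:
awk 'NR > 6000 { print } NR == 12000 { exit }' file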
 
Old 05-31-2007, 02:18 AM   #14
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by frankie_DJ
Hi Tink,

Haven't seen your name around here for some time. Glad to see original gurus are still around :^)
Heh. I'm not a guru, just been here awhile ;}


Quote:
Originally Posted by frankie_DJ
Well it's not a transformation, I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
What is supposed to happen with the chunks after the
'processing by other program'? There may well be a
way to do it in awk :}

Only today did I use awk to sieve through a 130MB
log file ;} and re-format it for data warehousing.



Cheers,
Tink
 
Old 05-31-2007, 02:20 AM   #15
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
Yeah, rec-by-rec as I said.
You can use a piped open if you need to pipe it to the next prog, or just read+write 6000 lines to a new file if that's what it requires.
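A minimal Perl sketch of the piped-open version, with volume-calc standing in for whatever program does the real work:
Code:
#!/usr/bin/perl
use strict;
use warnings;

my $chunk = 6000;   # lines per snapshot
my $out;            # pipe to the current run of the downstream program

while (my $line = <>) {
    # open a fresh pipe at the start of each chunk
    open($out, '|-', 'volume-calc') or die "pipe: $!" unless $out;
    print $out $line;
    if ($. % $chunk == 0) {    # $. is the current input line number
        close($out);           # EOF tells volume-calc the snapshot is complete
        undef $out;
    }
}
close($out) if $out;           # flush a final partial chunk, if any
Reading one record at a time (scalar context) keeps memory flat no matter how big the file is.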
 
  

