Programming: This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
I have a text file that consists of 20,000 x 6,000 = 120,000,000 lines.
I would like to read in 6,000 lines at a time and process them, but I am having trouble with sed and awk: they don't return the prompt. What would be the best way to approach this? Thanks.
Split is not an option b/c the file is so huge, and it would make another bunch of files, essentially doubling the data. So I need to extract pieces on the go. Is there a way to fix sed or awk so they can do the job?
Depending on what or how you are processing the file, you could feed each chunk directly to your processing job, like this.
Code:
# GNU split (coreutils 8.11 or later) can pipe each 6000-line
# chunk to a command instead of writing it out as a file:
split -l 6000 --filter='processing-software' file
That way you wouldn't create any more files on disk.
If I understand your objection, you want to insert EOF marks into the original file, or something like that. I don't think that is going to work, because you still have to deal with the file system managing the shorter file bits as new independent files. So you would always end up with the entire original file plus the new file segments.
There are a few approaches to this problem. One is to extract 6000-line chunks from the file and feed them to your processing system.
Here's a script that will do that:
Code:
#!/bin/sh
# @(#) s1 Demonstrate chunk extraction.
set -o nounset
echo " sh version: $BASH_VERSION"

debug="echo"   # set debug to "echo" to trace the loop ...
debug=":"      # ... or to ":" (a no-op) to run quietly

FILE=${1-data1}
increment=${2-6}

# Place the fixed number here if wc takes too long.
LAST_LINE_IN_FILE=$( wc -l < "$FILE" )
$debug " final line number is $LAST_LINE_IN_FILE"

first=1
while :
do
  last=$(( first + increment - 1 ))
  $debug " limits: $first $last"
  sed -n "${first},${last}p" "$FILE" |
    edges -n -l 1        # stand-in: replace with your processing program
  first=$(( last + 1 ))
  if [ $first -gt $LAST_LINE_IN_FILE ]
  then
    echo " sequence ends, line $first beyond $LAST_LINE_IN_FILE"
    break
  fi
done

exit 0
The data in file "data1" is a set of 25 sequenced lines from Moby Dick. The chunk size is 6 lines. The edges command is a utility I use to look at the first and last lines of a file, so each display below is a "first line ... last line" chunk. That's where you'd place your program. Running this produces:
Code:
% ./s1
sh version: 2.05b.0(1)-release
1 # Moby Dick, Chapter 1 The Loomings. Page numbers removed.
...
6 to interest me on shore, I thought I would sail about a little
7 and see the watery part of the world. It is a way I have of
...
12 the rear of every funeral I meet; and especially whenever my
13 hypos get such an upper hand of me, that it requires a strong
...
18 This is my substitute for pistol and ball. With a philosophical
19 flourish Cato throws himself upon his sword; I quietly take to
...
24 round by wharves as Indian isles by coral reefs--commerce
25 (end of excerpt)
...
25 (end of excerpt)
sequence ends, line 31 beyond 25
As for timing, nothing is going to be cheap, but for large chunks it isn't too bad. Here are the times for a 100K-line file with 6000-line chunks:
I was concerned about the size and the repeated repositioning in the file, but even a perl program that keeps track of the file position and then seeks (repositions without reading) wasn't that much faster ... cheers, makyo
I'm curious as to why you want to do 6000 recs at a time?
Also, bigearsbilly, although I agree Perl should handle it, it depends on record length: at, say, 100 chars (bytes) per rec the file is 12GB (in decimal) ... that's a lot of RAM/swap for a PC if you read it all in at once.
Normally I'd go rec-by-rec for that.
Each block of 6000 lines is a particular configuration of the protein molecule (the positions of each atom). It's like a snapshot of a vibrating molecule in a molecular dynamics simulation. And there are 20,000 snapshots.
What kind of transformations do you need to make? It's always easier to make targeted suggestions if there's a target ... in awk, for instance, you could use something along the lines of NR%6000 to apply stuff to every umpteenth record (a sketch of that idea follows).
Cheers,
Tink
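A minimal sketch of that NR%6000 idea, where the hypothetical volume-prog stands in for whatever consumes each chunk: awk pipes every line to the command and closes the pipe after each 6000th record, so the command runs once per chunk while the big file is read only once.
Code:
awk -v n=6000 -v cmd="volume-prog" '
    { print | cmd }                 # stream the current line to the command
    NR % n == 0 { close(cmd) }      # every n-th line: end this run of cmd;
                                    # the next print starts a fresh one
' snapshots.txt
Because close() ends the pipe, the following print reopens it as a new process, which is what gives you one invocation per 6000-line snapshot.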
Quote:
Originally Posted by makyo
sed -n "${first},${last}p" "$FILE" |
makyo,
thanks for the elaborate script, but the line from your code I quoted is exactly the reason for my post: it doesn't work for me. sed (same with awk) gives all 6000 lines but doesn't return the prompt, so I can't put it in a loop. I was hoping someone knows a way to fix this.
Hi Tink,
Haven't seen your name around here for some time. Glad to see original gurus are still around :^)
Well it's not a transformation, I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
Quote:
Originally Posted by frankie_DJ
I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
Nope - you are having trouble with the size ... Patience my friend.
I just had a play - after the nominated lines are printed, I had a sleep. Then the disk went ballistic.
I used sed - a stream editor. Obviously it goes on and reads the file till EOF.
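That points at the fix the OP asked for, assuming a POSIX sed: add a q (quit) command addressed to the last wanted line, so sed stops reading there instead of scanning on to EOF. For, say, the third 6000-line chunk of a file here called bigfile:
Code:
# print lines 12001-18000, then quit at line 18000 instead of
# reading the remaining ~120 million lines
sed -n '12001,18000p; 18000q' bigfile
On line 18000 the p runs before the q, so the last line is still printed, and the prompt comes back as soon as that line is reached.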
If it were me I'd probably do it in Perl (C should be just as easy). That way you can maintain a count, and exit when happy.
As for @chrism01's concern, you merely need to do the read in scalar rather than list context (i.e. a line at a time) and send it off to STDOUT.
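The same maintain-a-count-and-exit idea, sketched in awk rather than Perl to stay with the tools already in the thread (bigfile and the chunk bounds are placeholders):
Code:
# print one chunk, then exit the moment its last line has been seen
awk -v first=12001 -v last=18000 '
    NR > last { exit }      # past the chunk: stop reading the file
    NR >= first             # inside the chunk: default action prints
' bigfile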
Quote:
Originally Posted by frankie_DJ
Haven't seen your name around here for some time. Glad to see original gurus are still around :^)
Heh. I'm not a guru, just been here awhile ;}
Quote:
Originally Posted by frankie_DJ
Well it's not a transformation, I just need to input those 6000 lines to another program (which calculates volume of the whole molecule). I'm trying to write something in C to do this input; it shouldn't be complicated, I just wanted to understand if sed and awk could be 'forced into coercion', b/c they obviously have trouble with size.
What is supposed to happen with the chunks after the 'processing by other program'? There may well be a way to do it in awk :}
Only today did I use awk to sieve through a 130MB log file ;} and re-format it for data warehousing.
Yeah, rec-by-rec as I said.
You can use a piped open if you need to pipe it to the next prog, or just read+write 6000 lines to a new file if that's what it requires.
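A pure-shell sketch of that second option, assuming GNU head and a regular (seekable) file: when its input is a regular file, head repositions the file offset to just past the lines it consumed, so successive calls continue where the previous one stopped. volume-prog and bigfile are stand-ins here:
Code:
#!/bin/sh
# read the 120M-line file once, handing 6000 lines at a time
# to the processing program
while chunk=$(head -n 6000) && [ -n "$chunk" ]
do
    printf '%s\n' "$chunk" | volume-prog
done < bigfile
Each pass holds only one chunk in memory; when head hits end of file it returns an empty chunk and the loop ends.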