[SOLVED] Read large text files (~10GB), parse for columns, output
Stick to the subject, please... "Cheap beer and forums do not mix."
No, it probably won't be "better than awk."
"awk" is a very well-written program that is specialized for doing what you are doing.
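For reference, the kind of "parse for columns" job in the thread title is a one-liner in awk. A minimal sketch, assuming whitespace-separated fields; big.txt and cols.txt are hypothetical names, and the sample data here just stands in for the real 10GB file:

```shell
# Small stand-in for the real 10GB input file.
printf 'alpha 1 x\nbeta 2 y\ngamma 3 z\n' > big.txt

# Extract columns 1 and 3, tab-separated -- this is the whole program.
awk '{ print $1 "\t" $3 }' big.txt > cols.txt

cat cols.txt
```

On a file that size, this streams through the data exactly once, so there is very little for a hand-written parser to save.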
All of the delays associated with this task will be mechanical ones: disk I/O time and network time. And a plain start-to-finish sequential read is exactly what the operating system's readahead machinery is built for: the kernel will line up lots of file buffers ahead of the program and use other tricks to streamline the operation as much as the hardware will allow.
If the time required to do this task is problematic to the business, then there are various things that you can do:
Invest in fast storage hardware: SATA, FireWire.
Instead of using the disk controllers built into the motherboard, buy a controller card. An inexpensive unit can make a dramatic difference.
Put the input file and the output file on different disk volumes.
Do not follow the siren that says, "put it all in memory..." Abandon all hope, ye who enter there!
Face it: when you're dealing with 10 gigabytes of data, "some things take time." If you're doing the task in "awk," and doing it well, then you are using a robust tool that was specifically designed for the task. You have not erred in the approach that you are using right now. "Diddling with it" will not improve it.
By the way, if I understood the OP correctly, the lines are independent, i.e. line-by-line parsing should be OK.
If that's the case, then the very first legitimate question is: "Why is it a single 10GB file and not a number of much smaller files?"
The point is that a number of files can be stored on separate hard drives, and better yet, the drives can be connected to different CPUs, so the whole processing can be done in parallel and the results merged at the end.
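That split-and-merge idea can be sketched in shell. This assumes GNU coreutils (whose `split -n l/N` splits on line boundaries) and hypothetical file names; on real hardware each chunk would live on a separate drive to get the parallel I/O being described:

```shell
# Stand-in for the real 10GB input; the structure is what matters.
seq 1 100 | awk '{ print "row"$1, $1*2, $1*3 }' > big.txt

# Split into 4 line-aligned chunks (GNU coreutils: -n l/4 never
# breaks a line across chunks; -d gives numeric suffixes).
split -n l/4 -d big.txt part-

# Parse each chunk's columns in a background job, then merge the
# per-chunk outputs in order.
for chunk in part-??; do
    awk '{ print $1 "\t" $3 }' "$chunk" > "$chunk.out" &
done
wait
cat part-??.out > merged.txt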
That's right. He asked here instead of someplace that will help him.
Just because some of the members on here (Telemachos, Sergei) can't read or aren't smart enough to solve problems before posting doesn't mean that the entire community is worthless. LQ is representative of the internet with people of varying levels of intelligence. Some are stars (sundialsvcs), and others have no light on upstairs (jglands).
Charming. Sergei and I said essentially the same thing as sundialsvcs, though I admit he said it more fully. What we all said was that the OP's C code was unlikely to beat a pre-existing tool (awk, Perl, Python, whatever), because the big issue was the simple math of the file size.
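One way to check that claim on your own machine is to time the real parse against a bare sequential read of the same file. A rough sketch, using a hypothetical big.txt filled here with throwaway sample data:

```shell
# Stand-in sample data; substitute the real 10GB file to get
# meaningful numbers.
seq 1 1000 > big.txt

# If these two wall-clock times are close, the job is I/O-bound:
# the parser is not the bottleneck, the disk is.
time awk '{ print $1 }' big.txt > /dev/null
time cat big.txt > /dev/null
```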
I once happened to look into the Perl regular-expression code, which is a derived work of a standard RE library.
The most frequent comment in it was along the lines of "we are doing / have changed this and that for efficiency reasons".