Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
I'm reading a CSV file into my script (this CSV may well contain up to 3 million records with up to 53 fields per record), and I do a number of things with the data, like mathematical calculations and insertions into a database.
The database I'm using is Postgres.
I'm wondering if it would be better for me to:
A. read the entire CSV into an array (via fgetcsv), INSERT it into a temporary table in the database, then SELECT with a LIMIT clause and process around 500 records at a time; or
B. read each CSV record line by line (via fgets or fscanf) and process each line as I come across it.
What I'm looking for is the fastest method of processing these records. Any suggestions?
The answer basically depends on what calculations you are doing. If you are doing calculations within a single line (e.g., adding the first 3 fields together), or in a fashion that relies on reading data in order (e.g., summing the first field across all lines), I definitely recommend doing it a line at a time (B); it will be a lot faster.
If you have a lot of calculations between lines, you might be better off loading the data into a temporary table before you start (A); however, you will have to load it a line at a time anyway. Loading a 3-million-line CSV file into memory would likely exceed PHP's memory limit, or at best be terribly slow with heavy swap usage.
Basically, if you can implement it without loading the data into a temporary table and without using fseek(), do that; if you can't, use the temporary table.
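A minimal line-at-a-time sketch of the "sum the first field" example (awk used here for brevity; the sample file name is made up):

```shell
# Sum the first field of every line, one line at a time -- nothing but the
# current record is ever held in memory.
# The PHP equivalent loop is: while (($row = fgetcsv($fh)) !== false) { ... }
printf '10,foo\n20,bar\n30,baz\n' > sample.csv    # stand-in data
awk -F, '{ total += $1 } END { print total }' sample.csv    # -> 60
```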
Use transactions in SQL. Insert (say) 100 lines then commit.
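A sketch of what that batching could look like if the INSERTs are generated as a script for psql. The table and column names are invented for illustration:

```shell
# Wrap generated INSERT statements in a transaction every 100 rows, so
# Postgres commits in batches rather than once per statement.
seq 1 250 | awk '{ print $1 "," $1 }' > input.csv   # stand-in data
awk -F, '
    NR % 100 == 1 { print "BEGIN;" }
    { printf "INSERT INTO records (a, b) VALUES (%s, %s);\n", $1, $2 }
    NR % 100 == 0 { print "COMMIT;" }
    END { if (NR % 100 != 0) print "COMMIT;" }   # close the last partial batch
' input.csv > batched.sql
grep -c '^BEGIN;' batched.sql        # -> 3 transactions for 250 rows
# psql -d mydb -f batched.sql        # then run it against Postgres
```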
If you are dealing with 3 million records, it's time to "think COBOL." In other words, take the CSV data, verify it, write it out to a temporary file with fixed-length records, and then sort that file. This will not only put the keys in ascending order, it will also place all occurrences of the same key adjacent to one another. Now you can design your program to take advantage of that... just as they did with all those millions of punched cards (pre-computers!) in the 1930s.
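On Linux, the sort step itself is one command. A small sketch with made-up data, keying on the first comma-separated field:

```shell
# Sorting a keyed flat file makes all occurrences of a key adjacent,
# which is the property the unit-record techniques below rely on.
printf 'b,2\na,1\nb,3\na,4\n' > unsorted.txt
sort -t, -k1,1 unsorted.txt > sorted.txt
cat sorted.txt    # a,1 / a,4 / b,2 / b,3 -- each key's records are grouped
```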
It is "unexpectedly efficient" to sort a large file; it is very inefficient to do random reads and writes. Not only does this facilitate so-called "unit-record processing" as described above, but it also has the potential of improving SQL's throughput as well. If updates are made to a key that is "close to" the last one processed (especially in the same transaction), the SQL server might find the page still in its in-memory cache and a separate read/write might be avoided.
When processing very large amounts of data, "random access" is not your friend. Even if a seek requires only a millisecond or two, you very quickly "run out of milliseconds."
Another advantage to this approach is that it will scale-up well. When the number of records grows from 3 million to 30 million, as it may well do, then the completion time can be expected to grow "less than linearly," whereas any process that relies upon random seeks can be expected to grow in "somewhat exponential time."
Last edited by sundialsvcs; 07-25-2006 at 08:04 AM.
urzumph,
Using fgetcsv was my initial plan. I don't know why I thought it read the entire file in at once; looking at the manual again, I see that it reads one line at a time. Thanks for pointing that out to me.
In between each CSV record iteration, I'm doing many calculations. The logic is extremely complicated, and about 50 tables are involved (SELECTs, UPDATEs and INSERTs).
Regarding PHP's memory issues, I could always increase the memory via php.ini, because this runs on a dedicated server and I have root access. I believe the default memory allocation is set to 8 and 2MB. I'm guessing something like 20 and 12MB would do the job?
sundialsvcs,
Could you please elaborate on how sorting the records would improve efficiency? You did explain it, but I'm not quite sure I understood it.
I'm considering using the shell because I'm guaranteed that this application will only be run on a Linux/UNIX, Apache, Postgres, PHP environment.
I could bypass PHP entirely: use awk to parse the CSV and write the records out to a text file as SQL INSERTs, then use the Postgres CLI to run the queries from that file.
From there on, I could access a few hundred records at a time from the database and process them.
That would be more efficient, I imagine, since it's the OS's shell that will be handling everything.
I just thought about this... sounds like a good idea, right? Except that I know very little awk (though its syntax seems relatively simple), and I have no idea how to work Postgres via the CLI.
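A sketch of that awk-to-Postgres pipeline, with placeholder table and column names. (Note that for straight bulk loads, Postgres's COPY command is generally faster than a file full of individual INSERTs.)

```shell
# Turn CSV rows into INSERT statements, then feed them to psql.
printf 'x,y\nu,v\n' > data.csv                      # stand-in data
awk -F, -v q="'" '{
    printf "INSERT INTO staging (c1, c2) VALUES (%s%s%s, %s%s%s);\n",
           q, $1, q, q, $2, q
}' data.csv > load.sql
# psql -d mydb -f load.sql                                    # run the INSERTs
# psql -d mydb -c "\copy staging FROM 'data.csv' WITH CSV"    # or just use COPY
```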
As urzumph said, if the calculations only depend on one record at a time (which you seem to imply), then a line at a time will be faster, since sorting first is not needed. It's also light on memory usage (apart from DB usage, if it's on the same box).
I will be doing calculations on one record at a time.
The problem is these calculations are very long and complex.
Each record goes through just over 1,000 lines of code.
Regarding the shell option, I don't think I can do this, because part of the processing involves selecting some values from the database, and awk doesn't have Postgres connectivity capabilities, does it?
I was thinking... what do you think of this idea:
use awk to split the CSV into multiple files of 10,000 records each, and then use Perl to process the CSVs.
Do you think it would be faster to split the CSVs and work with them one by one?
I'm not too sure; it seems like it might be slower, but I'm not aware of the internals of PHP, so I was hoping you could advise me if you knew. Thanks.
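For the splitting step, split(1) already does this directly, so awk isn't strictly needed. A sketch with stand-in data and a hypothetical per-chunk processor:

```shell
# Chop the CSV into fixed-size line chunks named chunk_aa, chunk_ab, ...
seq 1 25 > big.csv                      # stand-in for the real CSV
split -l 10 big.csv chunk_              # use -l 10000 for the real thing
ls chunk_*                              # chunk_aa chunk_ab chunk_ac
# for f in chunk_*; do perl process_chunk.pl "$f"; done   # hypothetical processor
```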
Now, if we take a worst-case scenario of 3 million rows and 52 columns, and make some (generous) assumptions:
- PHP stores data read in from the CSV as C strings.
- PHP needs to know where each C string is stored, so it needs a pointer.
- Every item in the CSV file is 3 characters long (+1 for the null terminator added when reading).
- No malloc overhead (this will never happen, but it's far easier than calculating it).
That works out to 3,000,000 * 52 * 8 bytes (4 bytes of string + 4 bytes of pointer) = 1,248,000,000 bytes. Divide by 1024^2 to convert bytes to megabytes: 1,190.18... MB.
So no, 12MB definitely isn't going to cut it.
Even if you could do this in a language that would let you (e.g., C), what would end up happening is that you would either run out of available memory, or large sections of the data you just loaded off the disk would be swapped back onto the disk by the operating system, entirely defeating the purpose.
You are definitely much better off processing it one line at a time if at all possible.
As for performance, reading the file once through and doing all the processing a line at a time is going to be more efficient than almost any alternative. The only thing I can think of that would be more efficient is a multithreaded program: one thread per CPU to do the processing, plus one thread to wait on disk I/O and maintain a buffer of converted data (e.g., C string -> int) for the worker threads. That, however, is very complicated and introduces many new issues for what amounts to very little gain on a single-CPU system.
All occurrences of a key value will occur at the same time, or not at all.
Keys will occur in a non-random sequence, thereby greatly improving the odds that when you ask SQL for "the next" key, it might already have the right database-page in a memory cache somewhere. It also improves the odds that, if the desired key isn't in the SQL server's cache, it'll be "nearby."
When you're processing large amounts of data, sorting can improve efficiency very considerably.
Another factor to think about is what we old mainframers called checkpoint/restart. In other words, "you don't want to have to 'start over.'" You want to be able to process a block of records, commit those changes, and know that if the ka-ka hits the air-movement device, you can pick up where you left off.
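A minimal checkpoint/restart sketch in shell, with made-up file names: record how many lines have been committed, and skip past them on restart.

```shell
# Persist a count of processed lines; on restart, resume from that point.
seq 1 5 > big.csv                       # stand-in data
echo 2 > checkpoint.txt                 # pretend 2 lines were done before a crash
done_count=$(cat checkpoint.txt 2>/dev/null || echo 0)
tail -n +"$((done_count + 1))" big.csv | while IFS= read -r line; do
    # ... process and commit "$line" here ...
    done_count=$((done_count + 1))
    echo "$done_count" > checkpoint.txt # record progress after each commit
done
cat checkpoint.txt                      # -> 5: all lines accounted for
```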
If you need to "select values from the database" for each record, that implies "n index-lookups" and you'd like to avoid that if you can. For instance, if you are processing "most of" the records in your table, you could do a "SELECT * ORDER BY" (no "WHERE") to retrieve all of (or a lot of) the records, asking SQL to sort them in the same sequence that you have sorted your input data. Now you are processing two identically-sorted streams, and you can process those very efficiently, literally using the same techniques that used to be done with punched cards.
When you're comparing two sorted streams, you've got seven possibilities at any point in time:
Initial state. (Read from both streams.)
Final state. (Both streams are empty; stop.)
End of stream-A but not stream-B.
End of stream-B but not stream-A.
Records match.
Stream-A is less.
Stream-B is less.
Your program can zip sequentially through the datasets, just like in the old days of sci-fi movies where you saw two tape-drives spinning away. That's how they did it, and it still works.
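As it happens, comm(1) performs exactly this classification on two sorted streams. A sketch with made-up data:

```shell
# comm compares two sorted files: column 1 = only in A, column 2 = only
# in B, column 3 = in both (the "records match" case above).
printf 'a\nb\nd\n' > A.txt
printf 'b\nc\nd\n' > B.txt
comm -12 A.txt B.txt    # records present in both streams: b, d
comm -23 A.txt B.txt    # present only in stream A: a
comm -13 A.txt B.txt    # present only in stream B: c
```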
The "1,000 lines of code" won't be what slows you down: it will be I/O. Specifically, it will be "millions of index searches." To the extent that you can avoid those, the processing will be, quite literally, orders of magnitude faster.
I once used these techniques to turn a process that used to take twenty or thirty hours and run it in about twenty minutes on the same hardware.
sundialsvcs: I wouldn't argue with your analysis in general, but unless I've misread koobi, I don't see any ref to a requirement to do things in any particular order.
If I'm right, his code will be much simpler without bothering with sorting etc, although I would agree that if processing is liable to break & it's difficult to make a clean restart, then making a note of your progress as you go is good.
No, koobi's strategy doesn't require records to appear in any particular order. And I'm certainly not saying that koobi's "wrong." All I'm suggesting is that there may be a situation where an alternate design, albeit slightly more difficult to conceptualize, may be justified.
So... what do we have, doing things "the conventional" way? For each record, we're going to build and execute an SQL query (or several), go out to the server (individually) to get the results, perform the 1000-or-so lines of calculations in about one microsecond, and repeat the whole process all over again. Our potential sources of (mechanical) delay include: network traverse (for a small singleton record), and the time required for one or more physical seeks. CPU time is probably insignificant by comparison.
And ooh, ick! We've got an SQL update, too! Hope that's not hitting a key field because if it is we're "stuck in the dark forest of B-trees." If we have to go there, three million times, we don't want to go there randomly. If it's just non-key fields, and maybe if the gods smile upon us and we're hitting the database when it's not too busy, then all we're stuck paying for is those initial seeks and our absolutely worst case for that (which the caches ought to substantially reduce) is about 3x7=21 million seeks. (At so-many fractions of a millisecond apiece, though, random seek-time adds up in a big fat hurry...)
Mind you, it certainly will work. Computer hardware is very fast, and therefore very forgiving. There are all sorts of caches and buffers and clever programming tricks being used behind the scenes to "soften the blow," but high-volume data processing sometimes calls for an alternate design. Mainframers are quite accustomed to shoehorning yet another job into "the batch window," and the same general principles may apply here. Purely mechanical sources of delay become significant to the designer, and they are mostly eliminated by reducing, or removing entirely, the need for random access.
Last edited by sundialsvcs; 07-28-2006 at 12:18 PM.
I skimmed through the replies, and the sorting issue did make some sense to me.
I will read that post carefully and analyze it soon.
For the moment, for a demo, we have decided to use a small CSV file until we can gather enough information to optimize the entire process for the large CSV. Thanks for your replies; it's highly appreciated.