Old 01-06-2016, 06:02 AM   #1
carcamal
LQ Newbie
 
Best way to retrieve a big file into a server's database


Hi!

I'm trying to make a data visualization app. The app uses a PHP backend to retrieve a big CSV file and store it in a MySQL database, then performs some statistical queries and returns the info with a nice processing.js chart.

My main concern is that the CSV is *HUGE*, so I should take some precautions to avoid stalling the webserver I grab the data from.

The first precaution is making sure the data has changed before retrieving it again; I'm using libcurl for that.

But as a second precaution, I think I should check whether the file was read entirely or not (so I can resume reading it). I don't know which is better:
a) retrieving the file with libcurl into a local file on my server, storing the reported byte count of the file in the database, then comparing the size recorded in the db with the actual file size before assuming the file was read entirely. In case of error, use libcurl to resume the download.

b) using open() to read the CSV from the URL, then storing the last seek offset in the database each time a row is read. If the offset is less than the file size, assume the file has to be read again.

Which is better, (a) or (b)? I'm currently using libcurl to check the file for modification, and open() to get the CSV data.
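
To be concrete, here's roughly what I have in mind for (a), using PHP's curl extension (the URL, paths and bookkeeping are made up, and error checking is trimmed):
Code:
<?php
// Resume-aware download: append to the partial file if one exists,
// otherwise start from scratch.
$url       = 'http://server2.example/file.csv'; // placeholder URL
$localFile = '/var/tmp/data.csv';               // placeholder path

$haveBytes = file_exists($localFile) ? filesize($localFile) : 0;

$fp = fopen($localFile, $haveBytes > 0 ? 'ab' : 'wb');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
if ($haveBytes > 0) {
    // Ask the server for the remaining bytes only.
    curl_setopt($ch, CURLOPT_RESUME_FROM, $haveBytes);
}
curl_exec($ch);

$len = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if unknown
curl_close($ch);
fclose($fp);

// Expected total = what this transfer reported plus what we already had.
// Store it in the db; the next run compares it against the actual file
// size to decide whether to resume the download or start over.
$expected = ($len >= 0) ? $len + $haveBytes : -1;
clearstatcache();
$complete = ($expected >= 0 && filesize($localFile) >= $expected);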
 
Old 01-06-2016, 06:56 AM   #2
Ramurd
Member
 
There are a few unanswered questions here:

Is the entire CSV read into a big TEXT field? (If so, forget about MySQL for 'huge' files, as TEXT is actually pretty limited in MySQL: 65k characters, iirc.)
I think there are some nice libraries and examples around to import a CSV into a database; I even believe MySQL has an option for it built in...

I assume that each column in the CSV is a column in the database (that would make more sense anyway).
One thing to handle is the amount of time the script may run (a webserver and PHP setting), and the amount of time MySQL allows a query to run (or you get a lock timeout).

Then a question regarding the data in the table: is the entire table equal to the file? (In other words, prior to your inserts, do you truncate the table or something like it?)
If so, you can count the number of rows in the table and compare that to the number of rows in the CSV.

And yet another is using transactions; in that case you have to make sure you use a MySQL storage engine that allows transactions, i.e. InnoDB (BEGIN plus ROLLBACK or COMMIT).

And... what is 'huge'? Thousands of rows? Millions? 10 columns wide, or 30? What type of data? How big is the file itself? Does PHP actually allow you to upload the file? (In other words: do you have to change PHP's default settings to allow uploading a file this large?)

As you can see, with this kind of thing one needs to know a few details to plan correctly.
 
Old 01-06-2016, 07:12 AM   #3
NevemTeve
Senior Member
 
@OP: Try to forget irrelevant components such as curl, Apache, PHP. The actual question is: how to load a CSV file into a MySQL database?
I'd start here: http://dev.mysql.com/doc/refman/5.7/en/load-data.html
 
Old 01-06-2016, 09:19 AM   #4
carcamal
LQ Newbie
 
Original Poster
Thank you, NevemTeve. I didn't know I could import the file with a single MySQL command.

But I want to load onto server1 a file from server2 (not a file from the client computer). And LOAD DATA assumes the file is either on server1 (the same host MySQL is running on) or on the client computer. The file is on server2, an external server from which I have to download it to my own server.

From http://dev.mysql.com/doc/refman/5.7/en/load-data.html:
If LOCAL is specified, the file is read by the client program on the client host and sent to the server. The file can be given as a full path name to specify its exact location. If given as a relative path name, the name is interpreted relative to the directory in which the client program was started.
I understand this to mean that a LOCAL file is always on the client host, i.e. its URL is not http://server2.com/file.csv.

Also, I'm not sure whether this command would check if a new import has to be done or not. As I said, the CSV file is huge (100 megs), so fetching it again and again every time a client requests a new statistic would make server2 stall. And since it is a third-party server operated by the government, I don't want them to get angry with me.


Maybe I'll download the file with libcurl, then import it with MySQL. I have to make sure MySQL understands the date format used by this CSV!
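
Sketched out, the two steps could look like this (table, columns and the date format are invented, and LOAD DATA LOCAL INFILE has to be enabled on both the client and the server side):
Code:
<?php
// Step 1: download the CSV from server2 with curl
// (the modification check / resume handling goes here too).
$fp = fopen('/var/tmp/data.csv', 'wb');
$ch = curl_init('http://server2.example/file.csv'); // placeholder URL
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);

// Step 2: bulk-load it into MySQL in one statement.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass',
               [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);

// STR_TO_DATE() converts the CSV's date column (assumed dd/mm/yyyy here)
// into a proper DATE, which takes care of the locale worry.
$pdo->exec("LOAD DATA LOCAL INFILE '/var/tmp/data.csv'
            INTO TABLE measurements
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            IGNORE 1 LINES
            (col_a, col_b, @raw_date)
            SET measured_at = STR_TO_DATE(@raw_date, '%d/%m/%Y')");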
 
Old 01-07-2016, 12:08 AM   #5
Ramurd
Member
 
A few things then; you've said new things, which lead to new insights and new decisions :-)

100 megs is big, not huge, actually... it also makes it a bit impractical for PHP to handle on a webserver. E.g. you'd have to be able to accept a file that size (and any file that size, whether it's correct or not); that may actually be more of a concern than stalling the server.

So, loading the file is one thing that has to be taken care of; but you should consider that running a statistic is something entirely different from loading the data. So you could consider a few things:
- When do you know you have to read a new file? Can you 'see' it by the file name? Or is it known when the content will change? Try to determine the ways you can discover the file has changed, so you know when it has to be loaded into the database.
- Loading the file and running the statistics should be asynchronous jobs. Otherwise the file could just as well be downloaded, imported into Excel (or a proper spreadsheet) and the statistics run there...
- Try to figure out how to avoid stalling the server and having no data available, or how to handle that situation, since it may occur. (Hint: database transactions are ideal for that:

some pseudo-code, in MySQL terms:
Code:
START TRANSACTION;

-- on any error: ROLLBACK;

-- note: TRUNCATE TABLE does an implicit commit in MySQL,
-- so use DELETE to empty the table inside a transaction
DELETE FROM the_table;

-- bulk-import the data here

COMMIT;
This way the data stays as close to 'always available' as possible; another way could be to load the data into a 'new' table and then rename the tables, which takes a really short time as well... the options are endless.
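
The rename trick could look roughly like this (PDO, with invented table names); MySQL's RENAME TABLE swaps several tables in one atomic statement, so readers never see an empty table:
Code:
<?php
// Build the fresh copy next to the live table, then swap.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$pdo->exec('CREATE TABLE data_new LIKE data_live');
// ... bulk-load the new CSV into data_new here ...
$pdo->exec('RENAME TABLE data_live TO data_old,
                         data_new  TO data_live');
$pdo->exec('DROP TABLE data_old');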
 
Old 01-13-2016, 02:49 PM   #6
carcamal
LQ Newbie
 
Original Poster
Thanks, Ramurd.
I can check the last-update date using libcurl, or use the actual filename (I need libcurl to get the actual filename too, since the page is redirected and the request URL stays the same). The "last update" reported by the server seems to be good (at least it is consistent).

My first idea was to trigger a database update when a user landed on the web page and the data was found to be obsolete, but I'm thinking about using a cron script to asynchronously retrieve updates on a daily or weekly basis, as you suggested (the data is generated monthly, but I can't be sure on which day of the month it is generated). I could add a filename field and, when the transaction completes, update the statistics for that filename.

I'm not used to transactions, but I understand it is good database practice to use them.
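
For reference, the check at the top of the cron script would be something like this (the URL is made up):
Code:
<?php
// HEAD request: grab Last-Modified and the post-redirect URL without
// downloading the file body.
$ch = curl_init('http://server2.example/latest-data'); // placeholder URL
curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD, no body
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow the redirect
curl_setopt($ch, CURLOPT_FILETIME, true);        // request Last-Modified
curl_exec($ch);

$lastModified = curl_getinfo($ch, CURLINFO_FILETIME);      // unix time, -1 if unknown
$realUrl      = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // URL after redirects
curl_close($ch);

$filename = basename(parse_url($realUrl, PHP_URL_PATH));
// Compare $lastModified / $filename with the values stored on the
// previous run to decide whether to download and re-import.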
 
Old 01-13-2016, 04:48 PM   #7
teapottwo
Member
 
If you have access to both servers...

- Use a hashsum to check if it's changed (rough sketch below).

- You could split the file into smaller chunks using functions such as fseek, then send those and either rebuild or insert on the other side (like torrent apps, etc. do).

- You could use some form of "diff"; here's a PHP example implementation: http://web.archive.org/web/200805061...t/inline-diff/

- As NevemTeve showed, it's easy to load a whole CSV file into MySQL.
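
For the hashsum idea, a minimal sketch (the path and where the previous hash is stored are placeholders):
Code:
<?php
// Checksum the freshly downloaded file; if it matches the hash stored
// after the previous import, the content hasn't changed and the whole
// truncate-and-reload step can be skipped.
$hash       = hash_file('sha256', '/var/tmp/data.csv');
$storedHash = '...'; // hypothetical: read this from the database
if ($hash === $storedHash) {
    exit("unchanged, nothing to do\n");
}
// ...otherwise import, then store $hash for next time.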
 
  

