Old 01-06-2016, 06:02 AM   #1
carcamal
LQ Newbie
 
Best way to retrieve a big file into a server's database


Hi!

I'm trying to make a data visualization app. The app uses a PHP backend to retrieve a big CSV file and store it in a MySQL database, then performs some statistical queries and returns the info with a nice processing.js chart.

My main concern is that the CSV is *HUGE*, so I should take some precautions to avoid stalling the webserver I grab the data from.

The first precaution is making sure the data has changed before retrieving it again; I'm using libcurl for that.

But as a second precaution, I think I should check whether the file was read entirely or not (so I can resume reading it). I don't know which is better:
a) retrieving the file with libcurl into a local file on my server, storing the reported byte count of the file in the database, then comparing the size recorded in the db with the actual file size before assuming the file was read entirely. In case of error, use libcurl to resume the download.

b) using open() to read the CSV from the URL, then storing the last seek offset in the database each time a row is read. If the offset is less than the file size, assume the file has to be read again.

Which is better, (a) or (b)? I'm currently using libcurl to check the file for modification, and open() to get the CSV data.
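
To be concrete, here's roughly what I have in mind for (a), using PHP's curl extension (the URL, paths and bookkeeping are made up, and error checking is trimmed):
Code:
<?php
// Resume-aware download: append to the partial file if one exists,
// otherwise start from scratch.
$url       = 'http://server2.example/file.csv'; // placeholder URL
$localFile = '/var/tmp/data.csv';               // placeholder path

$haveBytes = file_exists($localFile) ? filesize($localFile) : 0;

$fp = fopen($localFile, $haveBytes > 0 ? 'ab' : 'wb');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
if ($haveBytes > 0) {
    // Ask the server for the remaining bytes only.
    curl_setopt($ch, CURLOPT_RESUME_FROM, $haveBytes);
}
curl_exec($ch);

$len = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if unknown
curl_close($ch);
fclose($fp);

// Expected total = what this transfer reported plus what we already had.
// Store it in the db; the next run compares it against the actual file
// size to decide whether to resume the download or start over.
$expected = ($len >= 0) ? $len + $haveBytes : -1;
clearstatcache();
$complete = ($expected >= 0 && filesize($localFile) >= $expected);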
 
Old 01-06-2016, 06:56 AM   #2
Ramurd
Member
 
There are a few unanswered questions here:

Is the entire CSV read into a big TEXT field? (If so, forget about MySQL for 'huge' files, as TEXT is actually pretty limited in MySQL: 65k characters, iirc.)
I think there are some nice libraries and examples around to import a CSV into a database; I even believe MySQL has an option for it built in...

I assume that each column in the CSV is a column in the database (that would make more sense anyway).
One thing to handle is the amount of time the script may run (a webserver and PHP setting), and the amount of time MySQL allows a query to run (or you get a lock timeout).

Then a question regarding the data in the table: is the entire table equal to the file? (In other words, prior to your inserts, do you truncate the table or something like it?)
If so, you can count the number of rows in the table and compare that to the number of rows in the CSV.

And yet another is using transactions; in that case you have to make sure you use a MySQL storage engine that allows transactions, i.e. InnoDB (BEGIN plus ROLLBACK or COMMIT).

And... what is 'huge'? Thousands of rows? Millions? 10 columns wide, or 30? What type of data? How big is the file itself? Does PHP actually allow you to upload the file? (In other words: do you have to change PHP's default settings to allow uploading a file this large?)

As you can see, with this kind of thing one needs to know a few details to plan correctly.
 
Old 01-06-2016, 07:12 AM   #3
NevemTeve
Senior Member
 
@OP: Try to forget irrelevant components such as curl, Apache, PHP. The actual question is: how to load a CSV file into a MySQL database?
I'd start here: http://dev.mysql.com/doc/refman/5.7/en/load-data.html
 
Old 01-06-2016, 09:19 AM   #4
carcamal
LQ Newbie
 
Original Poster
Thank you, NevemTeve. I didn't know I could import the file with a single MySQL command.

But I want to load onto server1 a file from server2 (not a file from the client computer). And LOAD DATA assumes the file is either on server1 (the same host MySQL is running on) or on the client computer. The file is on server2, an external server from which I have to download it to my own server.

From http://dev.mysql.com/doc/refman/5.7/en/load-data.html:
If LOCAL is specified, the file is read by the client program on the client host and sent to the server. The file can be given as a full path name to specify its exact location. If given as a relative path name, the name is interpreted relative to the directory in which the client program was started.
I understand this to mean that a LOCAL file is always on the client host, i.e. its URL is not http://server2.com/file.csv.

Also, I'm not sure whether this command would check if a new import has to be done or not. As I said, the CSV file is huge (100 megs), so fetching it again and again every time a client requests a new statistic would make server2 stall. And since it is a third-party server operated by the government, I don't want them to get angry with me.


Maybe I'll download the file with libcurl, then import it with MySQL. I have to make sure MySQL understands the date format used by this CSV!
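
Sketched out, the two steps could look like this (table, columns and the date format are invented, and LOAD DATA LOCAL INFILE has to be enabled on both the client and the server side):
Code:
<?php
// Step 1: download the CSV from server2 with curl
// (the modification check / resume handling goes here too).
$fp = fopen('/var/tmp/data.csv', 'wb');
$ch = curl_init('http://server2.example/file.csv'); // placeholder URL
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);

// Step 2: bulk-load it into MySQL in one statement.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass',
               [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);

// STR_TO_DATE() converts the CSV's date column (assumed dd/mm/yyyy here)
// into a proper DATE, which takes care of the locale worry.
$pdo->exec("LOAD DATA LOCAL INFILE '/var/tmp/data.csv'
            INTO TABLE measurements
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            IGNORE 1 LINES
            (col_a, col_b, @raw_date)
            SET measured_at = STR_TO_DATE(@raw_date, '%d/%m/%Y')");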
 
Old 01-07-2016, 12:08 AM   #5
Ramurd
Member
 
A few things then; you've said new things, which lead to new insights and new decisions :-)

100 megs is big, not huge, actually... it also makes it a bit impractical for PHP to handle on a webserver. E.g. you'd have to be able to accept a file that size (and any file that size, whether it's correct or not); that may actually be more of a concern than stalling the server.

So, loading the file is one thing that has to be taken care of; but you should consider that running a statistic is something entirely different from loading the data. So you could consider a few things:
- When do you know you have to read a new file? Can you 'see' it by the file name? Or is it known when the content will change? Try to determine the ways you can discover the file has changed, so you know when it has to be loaded into the database.
- Loading the file and running the statistics should be asynchronous jobs. Otherwise the file could just as well be downloaded, imported into Excel (or a proper spreadsheet) and the statistics run there...
- Try to figure out how to avoid stalling the server and having no data available, or how to handle that situation, since it may occur. (Hint: database transactions are ideal for that:

some pseudo-code, in MySQL terms:
Code:
START TRANSACTION;

-- on any error: ROLLBACK;

-- note: TRUNCATE TABLE does an implicit commit in MySQL,
-- so use DELETE to empty the table inside a transaction
DELETE FROM the_table;

-- bulk-import the data here

COMMIT;
This way the data stays as close to 'always available' as possible; another way could be to load the data into a 'new' table and then rename the tables, which takes a really short time as well... the options are endless.
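
The rename trick could look roughly like this (PDO, with invented table names); MySQL's RENAME TABLE swaps several tables in one atomic statement, so readers never see an empty table:
Code:
<?php
// Build the fresh copy next to the live table, then swap.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$pdo->exec('CREATE TABLE data_new LIKE data_live');
// ... bulk-load the new CSV into data_new here ...
$pdo->exec('RENAME TABLE data_live TO data_old,
                         data_new  TO data_live');
$pdo->exec('DROP TABLE data_old');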
 
Old 01-13-2016, 02:49 PM   #6
carcamal
LQ Newbie
 
Original Poster
Thanks, Ramurd.
I can check the last-update date using libcurl, or use the actual filename (I need libcurl to get the actual filename too, since the page is redirected and the request URL stays the same). The "last update" reported by the server seems to be good (at least it is consistent).

My first idea was to trigger a database update when a user landed on the web page and the data was found to be obsolete, but I'm thinking about using a cron script to asynchronously retrieve updates on a daily or weekly basis, as you suggested (the data is generated monthly, but I can't be sure on which day of the month it is generated). I could add a filename field and, when the transaction completes, update the statistics for that filename.

I'm not used to transactions, but I understand it is good database practice to use them.
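
For reference, the check at the top of the cron script would be something like this (the URL is made up):
Code:
<?php
// HEAD request: grab Last-Modified and the post-redirect URL without
// downloading the file body.
$ch = curl_init('http://server2.example/latest-data'); // placeholder URL
curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD, no body
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow the redirect
curl_setopt($ch, CURLOPT_FILETIME, true);        // request Last-Modified
curl_exec($ch);

$lastModified = curl_getinfo($ch, CURLINFO_FILETIME);      // unix time, -1 if unknown
$realUrl      = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // URL after redirects
curl_close($ch);

$filename = basename(parse_url($realUrl, PHP_URL_PATH));
// Compare $lastModified / $filename with the values stored on the
// previous run to decide whether to download and re-import.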
 
Old 01-13-2016, 04:48 PM   #7
teapottwo
Member
 
If you have access to both servers...

- Use a hashsum to check if it's changed (rough sketch below).

- You could split the file into smaller chunks using functions such as fseek, then send those and either rebuild or insert on the other side (like torrent apps, etc. do).

- You could use some form of "diff"; here's a PHP example implementation: http://web.archive.org/web/200805061...t/inline-diff/

- As NevemTeve showed, it's easy to load a whole CSV file into MySQL.
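
For the hashsum idea, a minimal sketch (the path and where the previous hash is stored are placeholders):
Code:
<?php
// Checksum the freshly downloaded file; if it matches the hash stored
// after the previous import, the content hasn't changed and the whole
// truncate-and-reload step can be skipped.
$hash       = hash_file('sha256', '/var/tmp/data.csv');
$storedHash = '...'; // hypothetical: read this from the database
if ($hash === $storedHash) {
    exit("unchanged, nothing to do\n");
}
// ...otherwise import, then store $hash for next time.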
 
  

