Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
I've searched this forum and haven't found an answer to my question. I have a very large XML file (26 MB) that I need to read in PHP. I can read smaller XML files in PHP, but when I try to read the large one, nothing appears at all, whereas the same code works with smaller files. By "same code" I mean I've tried several methods, including ones that read in the whole file and one that reads it in 4096-byte chunks.
Can someone show me an example of code that will read large XML files? I'd like to see what coding methods might work.
You have two choices: raise PHP's memory footprint (in Apache and in php.ini), or use fopen() in a while loop and work through the file one chunk at a time instead of using something like SimpleXML, which pulls in the whole document.
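A minimal sketch of the first option; the 128M figure and the inline document are only illustrative:

```php
<?php
// Option 1 (sketch): raise PHP's memory ceiling before parsing.
// 128M is an illustrative figure; size it to your file and your server.
ini_set('memory_limit', '128M');

// With enough headroom, a whole-tree parser such as SimpleXML is viable.
$xml = simplexml_load_string('<items><item>sticker set</item></items>');
echo (string) $xml->item[0], "\n"; // prints "sticker set"
```

For a real file you would call simplexml_load_file() instead; the point is only that the whole-tree approach needs memory proportional to the document.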
As you probably know, there are two main APIs for parsing XML files:
* DOM: read the entire data tree into memory and manipulate it there
* SAX: read through the entire file, registering "event handlers" to process only those parts of the file you're actually interested in
For large files like this, SAX is the clear winner.
AFAIK, to use SAX from PHP you would have to write a C/C++ or Java program to parse the XML file, then call that program from PHP using exec() or popen().
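A sketch of that hand-off, assuming a hypothetical standalone SAX filter named filter-xml (written in C or Java) that prints only the extracted records:

```php
<?php
// Run the hypothetical external SAX filter and capture its (small) output.
// The big XML file itself never passes through PHP's memory.
$lines = array();
exec('./filter-xml large.xml', $lines, $status);

if ($status === 0) {
    foreach ($lines as $line) {
        echo $line, "\n"; // one extracted record per line
    }
}
```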
These articles suggest that maybe you can also do it directly from PHP:
You only need an external app to do SAX parsing in PHP 4. SimpleXML is built on SAX, but SAX still requires the whole file to be read into memory, and if the memory limits in Apache and php.ini are not increased, you will still hit the same issue.
I *do* know that the whole *point* of SAX is to *avoid* having to read the whole file into memory, unless you want to.
So please look at the links I sent you. I'm not sure if either one fits the bill, but they both look promising, and they're both easily tested.
Please, too, consider the other alternative I suggested: writing a standalone C or Java SAX program (neither of which will "read the whole file into memory").
I honestly think you've got several viable alternatives: simply choose the one that works best for you.
IMHO .. PSM
Actually, you are off by a bit. SAX does read the full document and processes it to generate stubs; it then releases the file handle that held the document as it was parsed. That is why you need to raise the memory amount in php.ini in order to work with very large XML docs.
You are correct that if he writes an external program to do this and calls it from PHP using exec() or system(), that program will not have this limit placed on it. But to work with the result in PHP, he still has to figure out a way to pass the data between the PHP script and the external app, and the variable stored in PHP will be roughly the same size no matter what, which in the case of a large XML file will most likely be close to the size of the file itself. So the only real ways to do this without negative impact or excess coding are the ones I described above.
I work on a daily basis with XML-based data in blobs of 100-300 MB with no issue in the PHP interpreter by upping memory usage, but on some of our systems that require a smaller memory footprint, doing read-aheads with fopen() is the only method that works successfully with a limited amount of coding and few external dependencies.
If you want to use the SAX parser inside PHP directly to prove my point that you still need a memory bump, be my guest. Here is an example.
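Along those lines, a minimal sketch using PHP's built-in expat push parser; the two-element array here stands in for successive fread() chunks:

```php
<?php
// Create an expat push parser and register element handlers.
// (expat uppercases tag names by default via case folding.)
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($p, $tag, $attrs) { echo "open: $tag\n"; },
    function ($p, $tag) { echo "close: $tag\n"; }
);

// Feed the document in pieces; expat copes with tags split across chunks.
$chunks = array('<items><it', 'em>sticker set</item></items>');
foreach ($chunks as $i => $chunk) {
    $is_final = ($i === count($chunks) - 1);
    xml_parse($parser, $chunk, $is_final)
        or die('parse error: ' . xml_error_string(xml_get_error_code($parser)));
}
xml_parser_free($parser);
```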
Look - different implementations do different things.
But in general, SAX-based programs do *not* cache the entire image in-memory. To do so would basically defeat the entire purpose of having SAX in the first place.
I don't want to get into a pissing contest. And I certainly don't want to pretend I know anything about PHP's built-in "SimpleXML" (I don't). I just want to point out that FuzzieDice has several good alternatives. *Besides* increasing memory, and *besides* reading the file a small physical chunk at a time.
I'm sure we can agree on that, correct :-)?
FuzzieDice - one of the beauties of Open Source is that there are usually many different ways to do things. You certainly have many different alternatives besides the built-in PHP XML processor (good for general purposes, maybe not so good for your particular application).
Please look at the links I sent, please consider the suggestions all of us have made - and please let us know how things go for you. OK?
'Hope that helps .. PSM
As I mentioned before, you are incorrect and passing bad information.
If you had taken the time to look at the links you provided, you would have seen this exact chunk of code in the example.
Code:
// Setup implied by the example: create an expat push parser and open the
// file ("books.xml" is just an illustrative name).
$book_parser = xml_parser_create();
$file_stream = fopen('books.xml', 'r');

// Feed the parser 4096 bytes at a time; the third argument tells it when
// the final chunk has arrived.
while ($data = fread($file_stream, 4096)) {
    $this_chunk_parsed = xml_parse($book_parser, $data, feof($file_stream));
    if (!$this_chunk_parsed) {
        $error_code = xml_get_error_code($book_parser);
        $error_text = xml_error_string($error_code);
        $error_line = xml_get_current_line_number($book_parser);
        $output_text = "Parsing problem at line $error_line: $error_text";
        die($output_text);
    }
}
That is in the first link, and yes, a similar loop is in the second. That alone shows it parsing in small chunks.
Please read more about SAX parsers before making comments about them. SAX does read the whole document in order to build its stubbings: it provides "stubs", memory structures that point to portions of the document as you call for them, and that still requires reading the full document. As you can see in the examples you provided, they cut the input into 4096-byte chunks to allow large-file support without increasing the memory footprint; only the parser's data structure plus the current 4096 bytes is in memory, instead of the data structure plus the whole file. The way people make it seem that only a small portion is in memory at a time is by restricting the parse to the small set of tags they need, which is fine for searching but not for displaying the document.
Don't accuse me of "bad information". If you want to misinterpret clearly written English, that's your problem. But you're simply being rude. And you're *not* helping the OP.
It sounds like the "SimpleXML" implementation that comes standard with PHP 5.x might have some issues - if so, that's too bad. But there are other implementations written in PHP (for example, PEAR might offer a reasonable alternative). And there are lots and lots of ways to call non-PHP programs and libraries from PHP in such a way that you bypass the problem altogether.
The point is to try to help with as many different, good alternatives as possible. There *are* other alternatives besides increasing PHP memory; there *are* other alternatives besides reading a large XML file in small physical chunks. I hope you can respect and appreciate that.
IMHO .. PSM
You were the first to jump to accusations. You say there are alternatives, like calling an external program, but that just offloads the script's memory usage to another part of its runtime: exec() and system() fork another process, and the data you pull back into the script is still limited by the PHP interpreter's memory limits.
SimpleXML uses libxml for a reason and is meant for direct node access, which is a separate use altogether.
You came to this thread to make suggestions to the OP on top of what I had already provided. The options I gave are documented as the only other alternatives. The two PEAR packages that implemented SAX and libxml have been built into PHP 5 for years now.
You keep insisting you are correct, and you should stop giving the OP bad information. You stated flatly that SAX was the solution, but your own comments show you do not understand SAX. The links you provided demonstrate exactly what I suggested, yet you refuse to accept that you are incorrect. You are not helping the OP of this thread at all with claims that are either wrong or exactly what I had already suggested!
Sorry about all this nonsense. Suffice it to say - you *can* do what you're trying to do. The only thing I can suggest is try an avenue that looks promising, and test it.
Please post back if you have any questions along the way. And certainly please post back with what you find!
Good luck .. PSM
PS:
It would probably be a good idea to open a new thread when you do ;-)
PPS:
Not meaning to beat a dead horse, but:
Quote:
SimpleXML != SAX
libxml != SAX
MSXML, Xerces <= these are examples of actual SAX implementations
In theory:
SAX != inherently caching
In practice:
Your mileage may vary.
My experience has been that SAX implementations very definitely do *not* need to cache the entire file (unless, of course, everything in the entire file depends on everything else).
For example, there was one <status> tag for each item. Now there is more than one such tag: <stock></stock> is one, and <warehouse></warehouse> would be another, each with the same <price></price> and <items></items> tags, and so on. So it gets BIG.
This is not good. It's dumped from a Perl script. This is just an example of the structure of what I'm doing, not the actual tags and items, but it gives you an idea of the XML structure of the large file.
What I think I need to do is adjust the perl script so that I have something dumped like this:
Or something like that. At least it would help keeping the data straight! But as for how long the data is, that's a different story. I could also do this:
Code:
<item data="sticker set|2.95|In Stock"></item>
Which might greatly reduce the file size as well as give me an easy way to parse the data into arrays, etc. for display.
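A sketch of unpacking that pipe-delimited form with explode(); the attribute name "data" is only illustrative:

```php
<?php
// Unpack one pipe-delimited item; "data" is an illustrative attribute name.
$xml = simplexml_load_string('<item data="sticker set|2.95|In Stock"></item>');
list($name, $price, $status) = explode('|', (string) $xml['data']);
echo "$name costs $price ($status)\n"; // prints "sticker set costs 2.95 (In Stock)"
```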
So I'm rethinking the actual XML file. Maybe with a bit of adjustment to that, I can get a common ground going where SAX would work.
Last edited by RavenLX; 02-18-2009 at 11:09 AM.
Reason: Fixed status tags in example code.