How do I work with large XML files in PHP?
I've searched this forum and haven't found an answer to my question. I have a very large XML file (26MB) that I need to read in PHP. I can read smaller XML files in PHP, but when I try to read the large one, nothing appears, even though the same code works with smaller files. By "same code" I mean I've tried several approaches: some that read in the whole file, and even one that reads it in 4096-byte chunks.
Can someone show me an example of code that will read large XML files? I'd like to see what coding methods might work. |
You have 2 choices. Raise PHP's memory limit in Apache and in php.ini, or use fopen() in a while loop, pulling in and going through the file one block at a time instead of using something like SimpleXML.
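For the second choice, a minimal sketch of block-wise reading might look like the following. The readBlocks() helper and the temporary sample file are hypothetical, made up here for illustration; only fopen(), fread(), feof() and fclose() are standard PHP.

```php
<?php
// Hypothetical helper (not from this thread): yields a file one block at a
// time, so only the current block is held in a PHP variable.
function readBlocks(string $path, int $blockSize = 4096): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (!feof($handle)) {
            $block = fread($handle, $blockSize);
            if ($block === false || $block === '') {
                break; // read error or nothing left
            }
            yield $block;
        }
    } finally {
        fclose($handle);
    }
}

// Small temporary file standing in for the large XML document.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, str_repeat('<item>x</item>', 1000)); // 14000 bytes

$blocks = 0;
$bytes  = 0;
foreach (readBlocks($path) as $block) {
    $blocks++;
    $bytes += strlen($block);
}
unlink($path);

echo "$blocks blocks, $bytes bytes\n";
```

Note that a block boundary can fall in the middle of a tag, so the blocks still need to go to something that understands partial input rather than being searched one by one.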
|
Thanks. Are there any other options? If anyone knows of other things, I'm also open to them.
I'm not wanting to boost memory though. I like the way it's set now for various reasons. I'll check into the fopen() option. |
Hi -
As you probably know, there are two main APIs for parsing XML files:
* DOM: read the entire data tree and manipulate it in-memory
* SAX: read the entire file, but register "event handlers" to process only those parts of the file you're actually interested in
For large files like this, SAX is the clear winner. AFAIK, to use SAX from PHP, you would have to write a C/C++ or Java program to parse the XML file, then call your program from PHP using "exec()" or "popen()". These articles suggest that maybe you can also do it directly from PHP:
http://www.ibm.com/developerworks/we...ry/wa-php4sax/
http://www.informit.com/guides/conte...=xml&seqNum=48
'Hope that helps .. PSM |
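For what it's worth, PHP's bundled XML Parser extension exposes this event-handler style directly through the xml_parser_create() family of functions, so an external program may not be needed. A minimal sketch, assuming that extension is enabled; the sample document and the variable names are made up for illustration:

```php
<?php
// Illustrative document; the real file's tags will differ.
$xml = '<stock><item>Journal Set</item><status>In stock</status>'
     . '<item>Sticker Set</item><status>Sold out</status></stock>';

$items  = [];    // collected <item> texts
$inItem = false; // true while the parser is inside an <item>
$buffer = '';    // text gathered for the current <item>

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Start-tag handler: begin collecting when an <item> opens.
    function ($p, string $name, array $attrs) use (&$inItem, &$buffer) {
        if ($name === 'item') {
            $inItem = true;
            $buffer = '';
        }
    },
    // End-tag handler: store the collected text when </item> closes.
    function ($p, string $name) use (&$inItem, &$buffer, &$items) {
        if ($name === 'item') {
            $items[] = $buffer;
            $inItem  = false;
        }
    }
);

// Character data may arrive in several pieces, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($p, string $data) use (&$inItem, &$buffer) {
        if ($inItem) {
            $buffer .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($items);
```

Only the <item> text is kept here; the <status> elements pass through the handlers and are ignored, which is the "only those parts you're interested in" idea.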
You only need an external app to parse with SAX in PHP 4. SimpleXML is built on SAX, but SAX still requires the whole file to be read into memory, and if the memory limits in Apache and php.ini are not increased it will still have the same issue.
|
FuzzieDice -
I don't know about "simplexml". I *do* know that the whole *point* of SAX is to *avoid* having to read the whole file into memory, unless you want to. So please look at the links I sent you. I'm not sure if either one fits the bill, but they both look promising, and they're both easily tested. Please, too, consider the other alternative I suggested: writing a standalone C or Java SAX program (neither of which will "read the whole file into memory"). I honestly think you've got several viable alternatives: simply choose the one that works best for you. IMHO .. PSM |
Quote:
You are correct that if he writes an external program and calls it from PHP using exec() or system(), that program will not have this limit placed on it. But he would still have to pass the data between the PHP script and the external app, and the variable stored in PHP will most likely be about the same size as the large XML file no matter what. For working with a large XML doc inside PHP itself, the only real ways to do this without negative impact or excess coding are the ones I described above. I work on a daily basis with XML-based data in blobs of 100-300 MB with no issue in the PHP interpreter by upping memory usage, but on some of our systems that require a smaller memory footprint, doing read-aheads with fopen() is the only method that works with a limited amount of coding and few external dependencies. If you want to use the SAX parser inside PHP directly to prove my point that it still needs a memory bump, be my guest. Here is an example: http://www.brainbell.com/tutorials/p...L_With_SAX.htm |
Look - different implementations do different things.
But in general, SAX-based programs do *not* cache the entire image in-memory. To do so would basically defeat the entire purpose of having SAX in the first place. I don't want to get into a pissing contest. And I certainly don't want to pretend I know anything about PHP's built-in "SimpleXML" (I don't). I just want to point out that FuzzieDice has several good alternatives. *Besides* increasing memory, and *besides* reading the file a small physical chunk at a time. I'm sure we can agree on that, correct :-)? FuzzieDice - one of the beauties of Open Source is that there are usually many different ways to do things. You certainly have many different alternatives besides the built-in PHP XML processor (good for general purposes, maybe not so good for your particular application). Please look at the links I sent, please consider the suggestions all of us have made - and please let us know how things go for you. OK? 'Hope that helps .. PSM |
Quote:
If you took the time to even look at the links you provided you would see this exact chunk of code in the example. Code:
while ($data = fread($file_stream, 4096)) {
Please read more info on SAX parsers before you make comments relating to them. SAX does read the whole doc in order to provide stubbings to it. It then provides "stubs", or memory structures that lead to portions of the doc, as you call for them. This still requires reading the full doc at once no matter what. As you can see in the examples you provided, they cut it down into 4096-byte chunks at a time in order to support large files without increasing the memory footprint: what is in memory is only the data structure plus the current 4096 bytes being read, instead of the data structure plus the whole file. The way people make it seem that only a small portion is in memory at a time is by restricting it to the small set of tags needed, but this is only good for searching needs, not for displaying the doc. |
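The loop quoted above is the usual way to feed PHP's built-in parser: each fread() block is handed to xml_parse(), which fires the registered handlers as it goes, so the PHP side holds only the current block plus whatever the handlers choose to keep. A sketch under those assumptions; countElements() and the sample file are hypothetical, and what the underlying parser library buffers internally is not something this code controls:

```php
<?php
// Hypothetical helper: counts start tags named $wanted while streaming the
// file through the parser in fixed-size chunks.
function countElements(string $path, string $wanted, int $chunkSize = 4096): int
{
    $count  = 0;
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$count, $wanted) {
            if ($name === $wanted) {
                $count++;
            }
        },
        function ($p, string $name) {
            // End tags are not needed for counting.
        }
    );

    $stream = fopen($path, 'rb');
    if ($stream === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $ok = true;
    while ($ok && !feof($stream)) {
        $data = fread($stream, $chunkSize);
        if ($data === false) {
            break;
        }
        if ($data !== '') {
            // false = more data may follow; a chunk may end mid-tag.
            $ok = (bool) xml_parse($parser, $data, false);
        }
    }
    fclose($stream);

    // Tell the parser the document is complete so it can report
    // problems such as unclosed tags.
    if ($ok) {
        $ok = (bool) xml_parse($parser, '', true);
    }
    if (!$ok) {
        throw new RuntimeException(sprintf(
            'XML error at line %d: %s',
            xml_get_current_line_number($parser),
            xml_error_string(xml_get_error_code($parser))
        ));
    }
    return $count;
}

// Temporary sample file standing in for the large document.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    '<stock>' . str_repeat("<item>Journal Set</item>\n", 5000) . '</stock>'
);
$found = countElements($path, 'item');
unlink($path);

echo "Found $found item elements\n";
```

Comparing memory_get_peak_usage() for this loop against a whole-file load on the real 26MB file would be one way to test the competing claims in this thread.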
xhypno -
Don't accuse me of "bad information". If you want to misinterpret clearly written English, that's your problem. But you're simply being rude. And you're *not* helping the OP. It sounds like the "SimpleXML" implementation that comes standard with PHP 5.x might have some issues - if so, that's too bad. But there are other implementations written in PHP (for example, PEAR might offer a reasonable alternative). And there are lots and lots of ways to call non-PHP programs and libraries from PHP in such a way that you bypass the problem altogether. The point is to try to help with as many different, good alternatives as possible. There *are* other alternatives besides increasing PHP memory; there *are* other alternatives besides reading a large XML file in small physical chunks. I hope you can respect and appreciate that. IMHO .. PSM |
Quote:
SimpleXML uses libxml for a reason and is for direct node access, which is a separate use altogether. You came to this thread to make suggestions to the OP based on what I provided. The options I provided are documented as the only other alternatives. The 2 PEAR packages that implemented SAX and libxml are built into PHP 5, and have been for years now. You constantly jump back as if you are correct, and you should stop giving the OP bad information. You stated completely that SAX was the solution, but you do not understand SAX, as shown by your own comments. The links you provided even show the exact thing that I suggested, but again you jump back, unable to accept that you are incorrect and not helping the OP of the thread at all with claims of solutions that are either wrong or EXACTLY WHAT I HAD SUGGESTED! |
Hey, FuzzieDice -
Sorry about all this nonsense. Suffice it to say - you *can* do what you're trying to do. The only thing I can suggest is try an avenue that looks promising, and test it. Please post back if you have any questions along the way. And certainly please post back with what you find! Good luck .. PSM PS: It would probably be a good idea to open a new thread when you do ;-) PPS: Not meaning to beat a dead horse, but: Quote:
|
Thanks to everyone for their input. I've tried code from http://www.ibm.com/developerworks/we...ry/wa-php4sax/ and adapted it to this:
Code:
$xml_parser = xml_parser_create();
Start Element: opt ---Attributes: Parsing problem at line 3: XML_ERR_NAME_REQUIRED
Looking at my XML file, I'm finding that the structure in the file is actually a mess, and not conducive to easy use. For example, here's what I mean:
Code:
<item>Journal Set</item>
Code:
<status>In stock</status>
This is not good. This is dumped from a Perl script. This is just an example of the structure of what I'm doing, not the actual tags and items. But it gives you an idea of the XML structure of the large file. What I think I need to do is adjust the Perl script so that I have something dumped like this:
Code:
<stock>
Code:
<item = "sticker set|2.95|In Stock"></item>
So I'm rethinking the actual XML file. Maybe with a bit of adjustment to that, I can get a common ground going where SAX would work. |
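One caution on the proposed layout: an XML attribute needs a name before the "=", so a tag written as <item = "..."> would likely fail to parse, quite possibly with the same kind of name-required error. A sketch of one way it could work, assuming a made-up attribute name "data" (any valid name would do) and the pipe-separated order name|price|status from the example above:

```php
<?php
// "data" is a hypothetical attribute name chosen for this sketch.
$xml = '<stock><item data="sticker set|2.95|In Stock"></item></stock>';

$rows = [];

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    // Attributes arrive in the start-tag handler as an associative array.
    function ($p, string $name, array $attrs) use (&$rows) {
        if ($name === 'item' && isset($attrs['data'])) {
            // Pad to three fields so a short value does not raise notices.
            [$title, $price, $status] =
                array_pad(explode('|', $attrs['data'], 3), 3, '');
            $rows[] = [
                'name'   => $title,
                'price'  => (float) $price,
                'status' => $status,
            ];
        }
    },
    function ($p, string $name) {
        // Nothing to do on end tags.
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(sprintf(
        'Parsing problem at line %d: %s',
        xml_get_current_line_number($parser),
        xml_error_string(xml_get_error_code($parser))
    ));
}

print_r($rows);
```

Separate attributes (for example name, price and status) would avoid the explode() step and any trouble with a "|" inside a value, at the cost of a slightly larger file.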