LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   How do I work with large XML files in PHP? (https://www.linuxquestions.org/questions/programming-9/how-do-i-work-with-large-xml-files-in-php-705047/)

RavenLX 02-16-2009 08:11 AM

How do I work with large XML files in PHP?
 
I've done a search in this forum and haven't found an answer to my question. I have a very large XML file (26MB) that I need to read in PHP. I can read smaller XML files in PHP, but when I try to read the large one, nothing appears, whereas the same code works with the smaller files. By "same code" I mean I've tried several approaches: ones that read in the whole file, and even one that reads it in 4096-byte chunks.

Can someone show me an example of code that will read large XML files? I'd like to see what coding methods might work.

xhypno 02-16-2009 10:12 AM

You have two choices: up the memory footprint of PHP in Apache and in php.ini, or use fopen() in a while loop, pulling in and processing the file in fixed-size chunks instead of using something like SimpleXML.
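A minimal sketch of both options, under stated assumptions: the memory_limit value is illustrative, and `large.xml` is a placeholder (a small dummy file is created here so the sketch is self-contained):

```php
// Option 1: raise PHP's memory ceiling for this script only.
// The value is illustrative; pick one that fits your server.
ini_set('memory_limit', '128M');

// Option 2: stream the file in fixed-size chunks instead of loading it
// whole with something like simplexml_load_file(). 'large.xml' is a
// placeholder; a small dummy file is written here so this runs as-is.
file_put_contents('large.xml', str_repeat('<x/>', 1000)); // 4000 bytes

$fh = fopen('large.xml', 'r');
$bytes = 0;
while (!feof($fh)) {
    $chunk = fread($fh, 4096);   // one 4 KB block per iteration
    $bytes += strlen($chunk);    // ...hand $chunk to an incremental parser here
}
fclose($fh);
echo "read $bytes bytes\n";
```

With chunked reads, only the current 4 KB block (plus whatever state your parser keeps) is in memory at any time, which is the point of this approach.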

RavenLX 02-16-2009 11:16 AM

Thanks. Are there any other options? If anyone knows of other things, I'm also open to them.

I'd rather not boost memory, though; I like the way it's set now for various reasons. I'll check into the fopen() option.

paulsm4 02-16-2009 11:25 AM

Hi -

As you probably know, there are two main APIs for parsing XML files:

* DOM: read the entire data tree and manipulate it in-memory
* SAX: stream through the entire file, registering "event handlers" to process only those parts of the file you're actually interested in

For large files like this, SAX is the clear winner.

AFAIK, to use SAX from PHP, you would have to write a C/C++ or Java program to parse the XML file, then call your program from PHP using "exec()" or "popen()".
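A hedged sketch of that external-program route using popen(): here plain `cat` stands in for the hypothetical C/C++ or Java parser (the tool and its output format are placeholders, so the example is runnable as written; substitute your real command line):

```php
// Create a tiny stand-in input file so the sketch is self-contained.
file_put_contents('sample.xml', "<opt><price>14.95</price></opt>\n");

// `cat` stands in for a hypothetical external SAX parser;
// replace it with your real parser's command line.
$pipe = popen('cat sample.xml', 'r');
$lines = array();
while (($line = fgets($pipe)) !== false) {
    $lines[] = rtrim($line);   // process one line of the parser's output
}
pclose($pipe);
echo $lines[0], "\n";
```

Because the heavy parsing happens in the child process, the PHP script only ever holds the (presumably much smaller) output it reads from the pipe.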

These articles suggest that maybe you can also do it directly from PHP:

http://www.ibm.com/developerworks/we...ry/wa-php4sax/
http://www.informit.com/guides/conte...=xml&seqNum=48

'Hope that helps .. PSM

xhypno 02-16-2009 01:51 PM

You only need an external app to do SAX parsing in PHP 4. SimpleXML is built on SAX, but it still requires the whole file to be read into memory, and if the memory limits in Apache and php.ini are not increased, it will still have the same issue.

paulsm4 02-16-2009 02:01 PM

FuzzieDice -

I don't know about "simplexml".

I *do* know that the whole *point* of SAX is to *avoid* having to read the whole file into memory, unless you want to.

So please look at the links I sent you. I'm not sure if either one fits the bill, but they both look promising, and they're both easily tested.

Please, too, consider the other alternative I suggested: writing a standalone C or Java SAX program (neither of which will "read the whole file into memory").

I honestly think you've got several viable alternatives: simply choose the one that works best for you.

IMHO .. PSM

xhypno 02-17-2009 07:12 AM

Quote:

Originally Posted by paulsm4 (Post 3445851)
FuzzieDice -

I don't know about "simplexml".

I *do* know that the whole *point* of SAX is to *avoid* having to read the whole file into memory, unless you want to.

So please look at the links I sent you. I'm not sure if either one fits the bill, but they both look promising, and they're both easily tested.

Please, too, consider the other alternative I suggested: writing a standalone C or Java SAX program (neither of which will "read the whole file into memory").

I honestly think you've got several viable alternatives: simply choose the one that works best for you.

IMHO .. PSM

Actually, you are off by a bit. SAX does read the full doc and processes it to generate stubs; it then releases the file handle that holds the doc as it is parsed. That is the reason you need to up the memory amount in php.ini in order to work with very large XML docs.

You are correct that if he writes an external program to do this and calls it from PHP using exec or system, that program will not have this limit placed on it. But if he wants to work with the large XML doc inside PHP itself, without figuring out a way to pass data between a PHP script and an external app (the stored variable in PHP will be roughly the same size no matter what, which in the case of a large XML file means the size of the file), then the only real ways to do it without negative impact or excess coding are the two I described above.

I work on a daily basis with XML-based data in blobs of 100-300 MB with no issue in the PHP interpreter by upping memory usage, but on some of our systems that require a smaller memory footprint, doing read-aheads with fopen is the only method that works successfully with a limited amount of coding and few external dependencies.

If you want to use the SAX parser inside PHP directly and test my point about still needing a memory bump, be my guest. Here is an example:

http://www.brainbell.com/tutorials/p...L_With_SAX.htm

paulsm4 02-17-2009 09:55 AM

Look - different implementations do different things.

But in general, SAX-based programs do *not* cache the entire image in-memory. To do so would basically defeat the entire purpose of having SAX in the first place.

I don't want to get into a pissing contest. And I certainly don't want to pretend I know anything about PHP's built-in "SimpleXML" (I don't). I just want to point out that FuzzieDice has several good alternatives. *Besides* increasing memory, and *besides* reading the file a small physical chunk at a time.

I'm sure we can agree on that, correct :-)?

FuzzieDice - one of the beauties of Open Source is that there are usually many different ways to do things. You certainly have many different alternatives besides the built-in PHP XML processor (good for general purposes, maybe not so good for your particular application).

Please look at the links I sent, please consider the suggestions all of us have made - and please let us know how things go for you. OK?

'Hope that helps .. PSM

xhypno 02-17-2009 10:02 AM

Quote:

Originally Posted by paulsm4 (Post 3447014)
Look - different implementations do different things.

But in general, SAX-based programs do *not* cache the entire image in-memory. To do so would basically defeat the entire purpose of having SAX in the first place.

I don't want to get into a pissing contest. And I certainly don't want to pretend I know anything about PHP's built-in "SimpleXML" (I don't). I just want to point out that FuzzieDice has several good alternatives. *Besides* increasing memory, and *besides* reading the file a small physical chunk at a time.

I'm sure we can agree on that, correct :-)?

FuzzieDice - please look at the links I sent, please consider the suggestions all of us have made - and please let us know how things go for you. OK?

'Hope that helps .. PSM

As I mentioned before, you are incorrect and passing bad information.

If you had taken the time to even look at the links you provided, you would have seen this exact chunk of code in the example.

Code:

while ($data = fread($file_stream, 4096)) {

      $this_chunk_parsed = xml_parse($book_parser, $data, feof($file_stream));
      if (!$this_chunk_parsed) {
          $error_code = xml_get_error_code($book_parser);
          $error_text = xml_error_string($error_code);
          $error_line = xml_get_current_line_number($book_parser);

          $output_text = "Parsing problem at line $error_line: $error_text";
          die($output_text);
      }

  }

That is from the first link, and yes, a similar loop is in the second. That alone shows the parsing being done in small bits.

Please read more about SAX parsers before you make comments relating to them. SAX does read the whole doc in order to provide stubs for it: it provides "stubs", memory structures that lead to portions of the doc as you call for them, and that still requires reading the full doc no matter what. As you can see in the examples you provided, they cut the input into 4096-byte chunks at a time in order to allow large-file support without increasing the memory footprint: only the data structure plus the current 4096 bytes is in memory, instead of the data structure plus the whole file. The way people make it seem like only a small portion is in memory at a time is by restricting the parse to only the small set of tags they need, but that is only good for searching, not for displaying the doc.

Wim Sturkenboom 02-17-2009 12:02 PM

Quote:

Originally Posted by xhypno (Post 3447021)
Code:

while ($data = fread($file_stream, 4096)) {

      $this_chunk_parsed = xml_parse($book_parser, $data, feof($file_stream));
      if (!$this_chunk_parsed) {
          $error_code = xml_get_error_code($book_parser);
          $error_text = xml_error_string($error_code);
          $error_line = xml_get_current_line_number($book_parser);

          $output_text = "Parsing problem at line $error_line: $error_text";
          die($output_text);
      }

  }


xhypno, is that not exactly what paulsm4 is saying :confused:

Quote:

Originally Posted by paulsm4 (Post 3447014)
But in general, SAX-based programs do *not* cache the entire image in-memory.


xhypno 02-17-2009 12:18 PM

Quote:

Originally Posted by Wim Sturkenboom (Post 3447177)
xhypno, is that not exactly what paulsm4 is saying :confused:

Actually it is, considering it comes straight from his own links. He is mistaken about how SAX parsing works, and I have pointed out the difference to him.

paulsm4 02-17-2009 12:54 PM

xhypno -

Don't accuse me of "bad information". If you want to misinterpret clearly written English, that's your problem. But you're simply being rude. And you're *not* helping the OP.

It sounds like the "SimpleXML" implementation that comes standard with PHP 5.x might have some issues - if so, that's too bad. But there are other implementations written in PHP (for example, PEAR might offer a reasonable alternative). And there are lots and lots of ways to call non-PHP programs and libraries from PHP in such a way that you bypass the problem altogether.

The point is to try to help with as many different, good alternatives as possible. There *are* other alternatives besides increasing PHP memory; there *are* other alternatives besides reading a large XML file in small physical chunks. I hope you can respect and appreciate that.

IMHO .. PSM

xhypno 02-17-2009 01:22 PM

Quote:

Originally Posted by paulsm4 (Post 3447236)
xhypno -

Don't accuse me of "bad information". If you want to misinterpret clearly written English, that's your problem. But you're simply being rude. And you're *not* helping the OP.

It sounds like the "SimpleXML" implementation that comes standard with PHP 5.x might have some issues - if so, that's too bad. But there are other implementations written in PHP (for example, PEAR might offer a reasonable alternative). And there are lots and lots of ways to call non-PHP programs and libraries from PHP in such a way that you bypass the problem altogether.

The point is to try to help with as many different, good alternatives as possible. There *are* other alternatives besides increasing PHP memory; there *are* other alternatives besides reading a large XML file in small physical chunks. I hope you can respect and appreciate that.

IMHO .. PSM

You were the first to jump to accusations. You state that there are alternatives, like calling an external program, but then you are offloading the script's memory usage to another part of its runtime: exec and system both fork another process that is still limited by the PHP interpreter's memory limitations.

SimpleXML uses libxml for a reason; it is for direct node access, which is a separate use altogether.

You came to this thread to make suggestions to the OP on top of what I had provided. The options I provided are documented as the only other alternatives. The two PEAR packages that implemented SAX and libxml are built into PHP 5, and have been for years now.

You constantly jump back as if you are correct, and you should stop giving the OP bad information. You stated flatly that SAX was the solution, but you do not understand SAX, as shown by your own comments. The links you provided show exactly what I suggested, yet you jump back, unable to accept that you are incorrect and are not helping the OP of this thread at all, with claims of solutions that are either wrong or EXACTLY WHAT I HAD SUGGESTED!

paulsm4 02-17-2009 02:02 PM

Hey, FuzzieDice -

Sorry about all this nonsense. Suffice it to say - you *can* do what you're trying to do. The only thing I can suggest is try an avenue that looks promising, and test it.

Please post back if you have any questions along the way. And certainly please post back with what you find!

Good luck .. PSM

PS:
It would probably be a good idea to open a new thread when you do ;-)

PPS:
Not meaning to beat a dead horse, but:
Quote:

SimpleXML != SAX
libxml != SAX
MSXML != SAX
Xerces != SAX
<= none of these *is* SAX itself; they are particular implementations

In theory:
SAX != inherently caching

In practice:
Your mileage may vary.

My experience has been that SAX implementations very definitely do *not* need to cache the entire file (unless, of course, everything in the entire file depends on everything else).

Xhypno has apparently had a different experience.

Fair enough...

RavenLX 02-18-2009 11:08 AM

Thanks to everyone for their input. I've tried code from http://www.ibm.com/developerworks/we...ry/wa-php4sax/ and adapted it to this:

Code:

$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($xml_parser, "start_element", "end_element");
xml_set_character_data_handler($xml_parser, "characters");

$file = "logs.xml";
if ($file_stream = fopen($file, "r")) {

  while ($data = fread($file_stream, 4096)) {

      $this_chunk_parsed = xml_parse($xml_parser, $data, feof($file_stream));
      if (!$this_chunk_parsed) {
          $error_code = xml_get_error_code($xml_parser);
          $error_text = xml_error_string($error_code);
          $error_line = xml_get_current_line_number($xml_parser);

          $output_text = "Parsing problem at line $error_line: $error_text";
          die($output_text);
      }

  }

} else {

    die("Can't open XML file.");

}
xml_parser_free($xml_parser);


// Functions

function start_element($parser, $name, $attrs) {
    print "<b>Start Element:</b> $name<br />";
    print "<b>---Attributes:</b> <br />";
    foreach ($attrs as $key => $value) {
        print "$key = $value<br />";
    }
    print "<br />";
}

function end_element($parser, $name) {
    print "<b>End Element:</b> $name<br /><br />";
}

function characters($parser, $chars) {
    print "<p><i>$chars</i></p>";
}

This works if I have smaller XML files. But with the big file it errors out with this:

Start Element: opt
---Attributes:

Parsing problem at line 3: XML_ERR_NAME_REQUIRED

In looking at my XML file, I'm finding that the structure in the file is actually a mess, and not conducive to easy use.

For example, here's what I mean:

Code:


<?xml version="1.0"?>
<opt>
  <stock>
    <price>14.95</price>
    <price>2.95</price>
    <price>3.50</price>
    <price>7.25</price>
    <price>29.85</price>
    <price>147.65</price>
    <price>1.49</price>
    <price>12.26</price>
    <price>15.00</price>

This goes on with tons of items, say maybe 50. Then after that you get:

Code:

    <item>Journal Set</item>
    <item>Sticker Set</item>
    <item>Magnet Set</item>
    <item>Markers</item>
    <item>Printer</item>
    <item>Custom Pen</item>
    <item>Mug</item>
    <item>T-Shirt</item>

And so on; after a matching number of entries (the same count as the prices), I have:

Code:

    <status>In stock</status>
    <status>out of stock</status>
    <status>In stock</status>
    <status>In stock</status>
    <status>In stock</status>
    <status>In stock</status>
    <status>In stock</status>
    <status>out of stock</status>
  </stock>
  <warehouse>
  ....
  </warehouse>
</opt>

That is, there is one <status> tag for each item. And there is more than one such section: <stock></stock> is one, and <warehouse></warehouse> would be another, with the same <price></price> and <item></item> tags inside, and so on. So it gets BIG.

This is not good. The file is dumped from a Perl script. The above is just an example of the structure of what I'm doing, not the actual tags and items, but it gives you an idea of the XML structure of the large file.

What I think I need to do is adjust the Perl script so that it dumps something like this:
Code:

<stock>
  <item name="Sticker Set">
      <price>2.95</price>
      <status>In Stock</status>
  </item>
</stock>

Or something like that. At least it would help keep the data straight! But as for how long the data is, that's a different story. I could also do this:

Code:

  <item data="sticker set|2.95|In Stock"></item>
That might greatly reduce the file size, as well as give me an easy way to parse the data and put it into arrays, etc., for display.
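Splitting such a pipe-delimited value back out in PHP is a one-liner with explode(); a minimal sketch using the sample value above:

```php
// Split one pipe-delimited record (sample value from the post above)
// into its three fields.
list($item, $price, $status) = explode('|', 'sticker set|2.95|In Stock');
echo "$item / $price / $status\n";
```

The trade-off is that the data is no longer self-describing XML, so anything else consuming the file would need to know the field order.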

So I'm rethinking the actual XML file. Maybe with a bit of adjustment to that, I can get a common ground going where SAX would work.

