LinuxQuestions.org
Old 02-16-2009, 08:11 AM   #1
RavenLX
Member
 
Registered: Oct 2004
Posts: 88

Rep: Reputation: 15
How do I work with large XML files in PHP?


I've searched this forum and haven't found an answer to my question. I have a very large XML file (26MB) that I need to read in PHP. I can read smaller XML files, but when I try the large one, nothing appears at all, even though the same code works with smaller files. By "same code" I mean I've tried several approaches: some that read in the whole file, and even one that reads it in 4096-byte chunks.

Can someone show me an example of code that will read large XML files? I'd like to see what coding methods might work.
 
Old 02-16-2009, 10:12 AM   #2
xhypno
Member
 
Registered: Sep 2004
Posts: 62

Rep: Reputation: 16
You have two choices. Raise PHP's memory limit (in Apache and in php.ini), or use fopen() in a while loop to pull in and work through the file one block at a time instead of using something like SimpleXML.
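For reference, a rough sketch of both options. The `256M` value, the `read_in_chunks()` helper and its callback are made up for illustration; `ini_set()`, `fopen()`, `fread()` and `feof()` are the standard PHP functions. It assumes PHP 7 or later for the type declarations.

```php
<?php
// Option 1: raise the limit for this script only (php.ini changes it globally).
// '256M' is an arbitrary illustrative value, so it is left commented out.
// ini_set('memory_limit', '256M');

/**
 * Option 2: walk a file in fixed-size blocks so only one block is held at a time.
 * $handle_chunk is a hypothetical callback that receives each block.
 * Returns the total number of bytes read.
 */
function read_in_chunks(string $path, int $chunk_size, callable $handle_chunk): int
{
    $stream = fopen($path, 'rb');
    if ($stream === false) {
        throw new RuntimeException("Can't open $path");
    }
    $total = 0;
    while (!feof($stream)) {
        $data = fread($stream, $chunk_size);
        if ($data === false || $data === '') {
            break; // nothing more to read
        }
        $total += strlen($data);
        $handle_chunk($data);
    }
    fclose($stream);
    return $total;
}
```

Keep in mind that a raw block can end in the middle of a tag, so the blocks are best handed to an incremental parser rather than inspected one by one.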
 
Old 02-16-2009, 11:16 AM   #3
RavenLX
Member
 
Registered: Oct 2004
Posts: 88

Original Poster
Rep: Reputation: 15
Thanks. Are there any other options? If anyone knows of other things, I'm also open to them.

I'm not wanting to boost memory though. I like the way it's set now for various reasons. I'll check into the fopen() option.
 
Old 02-16-2009, 11:25 AM   #4
paulsm4
Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
Hi -

As you probably know, there are two main APIs for parsing XML files:

* DOM: read the entire data tree and manipulate it in-memory
* SAX: read the entire file, but register "event handlers" to process only those parts of the file you're actually interested in

For large files like this, SAX is the clear winner.

AFAIK, to use SAX from PHP, you would have to write a C/C++ or Java program to parse the XML file, then call your program from PHP using "exec()" or "popen()".

These articles suggest that maybe you can also do it directly from PHP:

http://www.ibm.com/developerworks/we...ry/wa-php4sax/
http://www.informit.com/guides/conte...=xml&seqNum=48
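For what it's worth, PHP ships an event-based (SAX-style) parser in its XML Parser extension: `xml_parser_create()`, `xml_set_element_handler()` and `xml_parse()`. A minimal sketch, assuming that extension is enabled and PHP 7 or later; `count_elements()` and its 16-byte chunk size are invented here just to show that a document can be fed to the parser in pieces:

```php
<?php
/**
 * Count start tags per element name, feeding the parser $chunk_size bytes at a time.
 * Hypothetical helper for illustration only.
 */
function count_elements(string $xml, int $chunk_size = 16): array
{
    $counts = [];
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written
    xml_set_element_handler(
        $parser,
        // Start-tag event: bump the counter for this element name.
        function ($parser, $name, $attrs) use (&$counts) {
            $counts[$name] = ($counts[$name] ?? 0) + 1;
        },
        // End-tag event: nothing to do in this sketch.
        function ($parser, $name) {
        }
    );

    $length = strlen($xml);
    for ($offset = 0; $offset < $length; $offset += $chunk_size) {
        $chunk = substr($xml, $offset, $chunk_size);
        $is_final = ($offset + $chunk_size >= $length); // true only for the last piece
        if (!xml_parse($parser, $chunk, $is_final)) {
            $message = xml_error_string(xml_get_error_code($parser));
            throw new RuntimeException("Parse error: $message");
        }
    }
    return $counts;
}

print_r(count_elements('<opt><stock><price>14.95</price><price>2.95</price></stock></opt>'));
```

The handlers only ever see one event at a time; whatever the script chooses to store (here, a small array of counters) is what stays in PHP's memory.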

'Hope that helps .. PSM

Last edited by paulsm4; 02-16-2009 at 11:27 AM.
 
Old 02-16-2009, 01:51 PM   #5
xhypno
Member
 
Registered: Sep 2004
Posts: 62

Rep: Reputation: 16
You only need an external app to parse with SAX in PHP 4. SimpleXML is built on SAX, but SAX still requires the whole file to be read into memory, and if the memory limits in Apache and php.ini are not increased it will still have the same issue.
 
Old 02-16-2009, 02:01 PM   #6
paulsm4
Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
FuzzieDice -

I don't know about "simplexml".

I *do* know that the whole *point* of SAX is to *avoid* having to read the whole file into memory, unless you want to.

So please look at the links I sent you. I'm not sure if either one fits the bill, but they both look promising, and they're both easily tested.

Please, too, consider the other alternative I suggested: writing a standalone C or Java SAX program (neither of which will "read the whole file into memory").

I honestly think you've got several viable alternatives: simply choose the one that works best for you.

IMHO .. PSM

Last edited by paulsm4; 02-16-2009 at 02:03 PM.
 
Old 02-17-2009, 07:12 AM   #7
xhypno
Member
 
Registered: Sep 2004
Posts: 62

Rep: Reputation: 16
Quote:
Originally Posted by paulsm4
FuzzieDice -

I don't know about "simplexml".

I *do* know that the whole *point* of SAX is to *avoid* having to read the whole file into memory, unless you want to.

So please look at the links I sent you. I'm not sure if either one fits the bill, but they both look promising, and they're both easily tested.

Please, too, consider the other alternative I suggested: writing a standalone C or Java SAX program (neither of which will "read the whole file into memory").

I honestly think you've got several viable alternatives: simply choose the one that works best for you.

IMHO .. PSM
Actually you are off by a bit. SAX does read the full doc and processes it to generate stubs. It then releases the file handle that stores the doc as it is parsed. That is the reason why you need to raise the memory limit in php.ini in order to work with very large XML docs.

You are correct that if he writes an external program to do this and calls it from PHP using exec() or system(), it will not have this limit placed on it. But that means figuring out a way to pass data between a PHP script and an external app, and the variable stored in PHP will be the same size no matter what, which for a large XML file will most likely be the size of the file. So for working with a large XML doc within PHP, the only real ways to do this without negative impact or excess coding are the ones I described above.

I work on a daily basis with XML-based data in blobs of 100-300 MB with no issue in the PHP interpreter by raising the memory limit, but on some of our systems that require a smaller memory footprint, doing read-aheads with fopen() is the only method that works with a limited amount of coding and few external dependencies.

If you want to use the SAX parser inside PHP directly to prove my point that it still needs a memory bump, be my guest. Here is an example.

http://www.brainbell.com/tutorials/p...L_With_SAX.htm
 
Old 02-17-2009, 09:55 AM   #8
paulsm4
Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
Look - different implementations do different things.

But in general, SAX-based programs do *not* cache the entire image in-memory. To do so would basically defeat the entire purpose of having SAX in the first place.

I don't want to get into a pissing contest. And I certainly don't want to pretend I know anything about PHP's built-in "SimpleXML" (I don't). I just want to point out that FuzzieDice has several good alternatives. *Besides* increasing memory, and *besides* reading the file a small physical chunk at a time.

I'm sure we can agree on that, correct :-)?

FuzzieDice - one of the beauties of Open Source is that there are usually many different ways to do things. You certainly have many different alternatives besides the built-in PHP XML processor (good for general purposes, maybe not so good for your particular application).

Please look at the links I sent, please consider the suggestions all of us have made - and please let us know how things go for you. OK?

'Hope that helps .. PSM

Last edited by paulsm4; 02-17-2009 at 09:59 AM.
 
Old 02-17-2009, 10:02 AM   #9
xhypno
Member
 
Registered: Sep 2004
Posts: 62

Rep: Reputation: 16
Quote:
Originally Posted by paulsm4
Look - different implementations do different things.

But in general, SAX-based programs do *not* cache the entire image in-memory. To do so would basically defeat the entire purpose of having SAX in the first place.

I don't want to get into a pissing contest. And I certainly don't want to pretend I know anything about PHP's built-in "SimpleXML" (I don't). I just want to point out that FuzzieDice has several good alternatives. *Besides* increasing memory, and *besides* reading the file a small physical chunk at a time.

I'm sure we can agree on that, correct :-)?

FuzzieDice - please look at the links I sent, please consider the suggestions all of us have made - and please let us know how things go for you. OK?

'Hope that helps .. PSM
As I mentioned before, you are incorrect and passing bad information.

If you took the time to even look at the links you provided you would see this exact chunk of code in the example.

Code:
while ($data = fread($file_stream, 4096)) {

    $this_chunk_parsed = xml_parse($book_parser, $data, feof($file_stream));
    if (!$this_chunk_parsed) {
        $error_code = xml_get_error_code($book_parser);
        $error_text = xml_error_string($error_code);
        $error_line = xml_get_current_line_number($book_parser);

        $output_text = "Parsing problem at line $error_line: $error_text";
        die($output_text);
    }

}
And that is in the first link, and yes, a similar loop is in the second. This alone shows the file being parsed in small bits.

Please read more about SAX parsers before you make comments relating to them. SAX does read the whole doc in order to provide stubs for it. It then provides "stubs", or memory structures, that lead to portions of the doc as you call for them. This still requires reading the full doc at once no matter what. As you can see in the examples you provided, they cut it down into 4096-byte chunks in order to support large files without increasing the memory footprint: what sits in memory is only the data structure plus the current 4096 bytes being read, instead of the data structure plus the whole file. The way people make it seem that only a small portion is in memory at a time is by restricting the parse to the small set of tags needed, but this is only good for searching, not for displaying the doc.

Last edited by xhypno; 02-17-2009 at 10:04 AM.
 
Old 02-17-2009, 12:02 PM   #10
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Rep: Reputation: 282
Quote:
Originally Posted by xhypno
Code:
while ($data = fread($file_stream, 4096)) {

    $this_chunk_parsed = xml_parse($book_parser, $data, feof($file_stream));
    if (!$this_chunk_parsed) {
        $error_code = xml_get_error_code($book_parser);
        $error_text = xml_error_string($error_code);
        $error_line = xml_get_current_line_number($book_parser);

        $output_text = "Parsing problem at line $error_line: $error_text";
        die($output_text);
    }

}
xhypno, is that not exactly what paulsm4 is saying?

Quote:
Originally Posted by paulsm4
But in general, SAX-based programs do *not* cache the entire image in-memory.
 
Old 02-17-2009, 12:18 PM   #11
xhypno
Member
 
Registered: Sep 2004
Posts: 62

Rep: Reputation: 16
Quote:
Originally Posted by Wim Sturkenboom
xhypno, is that not exactly what paulsm4 is saying
Actually it is, considering it is from his direct links. He is mistaken about how SAX parsing works and I have pointed out the difference to him.
 
Old 02-17-2009, 12:54 PM   #12
paulsm4
Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
xhypno -

Don't accuse me of "bad information". If you want to misinterpret clearly written English, that's your problem. But you're simply being rude. And you're *not* helping the OP.

It sounds like the "SimpleXML" implementation that comes standard with PHP 5.x might have some issues - if so, that's too bad. But there are other implementations written in PHP (for example, PEAR might offer a reasonable alternative). And there are lots and lots of ways to call non-PHP programs and libraries from PHP in such a way that you bypass the problem altogether.

The point is to try to help with as many different, good alternatives as possible. There *are* other alternatives besides increasing PHP memory; there *are* other alternatives besides reading a large XML file in small physical chunks. I hope you can respect and appreciate that.

IMHO .. PSM
 
Old 02-17-2009, 01:22 PM   #13
xhypno
Member
 
Registered: Sep 2004
Posts: 62

Rep: Reputation: 16
Quote:
Originally Posted by paulsm4
xhypno -

Don't accuse me of "bad information". If you want to misinterpret clearly written English, that's your problem. But you're simply being rude. And you're *not* helping the OP.

It sounds like the "SimpleXML" implementation that comes standard with PHP 5.x might have some issues - if so, that's too bad. But there are other implementations written in PHP (for example, PEAR might offer a reasonable alternative). And there are lots and lots of ways to call non-PHP programs and libraries from PHP in such a way that you bypass the problem altogether.

The point is to try to help with as many different, good alternatives as possible. There *are* other alternatives besides increasing PHP memory; there *are* other alternatives besides reading a large XML file in small physical chunks. I hope you can respect and appreciate that.

IMHO .. PSM
You are the first to jump to accuse. You state that there are alternatives, like calling an external program, that can be used for this, but then you offload the script's memory usage to another aspect of its run time. exec() and system() both fork another process that is still limited by the PHP interpreter's memory limitations.

SimpleXML uses libxml for a reason and is for direct node access, which is a separate use altogether.

You came to this thread to make suggestions to the OP based on what I provided. The options I provided are documented as the only other alternatives. The two PEAR packages that implemented SAX and libxml are built into PHP 5, and have been for years now.

You constantly jump back as if you are correct and you should stop giving the OP bad information. You stated completely that SAX was the solution, but you do not understand SAX as shown by your own comments. The links you provided even show the exact thing that I suggested, but again you jump back unable to accept that you are incorrect and not helping the OP of the thread at all with claims of solutions that are either wrong or EXACTLY WHAT I HAD SUGGESTED!

Last edited by xhypno; 02-17-2009 at 01:23 PM.
 
Old 02-17-2009, 02:02 PM   #14
paulsm4
Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
Hey, FuzzieDice -

Sorry about all this nonsense. Suffice it to say - you *can* do what you're trying to do. The only thing I can suggest is try an avenue that looks promising, and test it.

Please post back if you have any questions along the way. And certainly please post back with what you find!

Good luck .. PSM

PS:
It would probably be a good idea to open a new thread when you do ;-)

PPS:
Not meaning to beat a dead horse, but:
Quote:
SimpleXML != SAX
libxml != SAX
MSXML != SAX
Xerces != SAX
<= all of these are examples of SAX implementations

In theory:
SAX != inherently caching

In practice:
Your mileage may vary.

My experience has been that SAX implementations very definitely do *not* need to cache the entire file (unless, of course, everything in the entire file depends on everything else).

Xhypno has apparently had a different experience.

Fair enough...

Last edited by paulsm4; 02-17-2009 at 02:17 PM.
 
Old 02-18-2009, 11:08 AM   #15
RavenLX
Member
 
Registered: Oct 2004
Posts: 88

Original Poster
Rep: Reputation: 15
Thanks to everyone for their input. I've tried code from http://www.ibm.com/developerworks/we...ry/wa-php4sax/ and adapted it to this:

Code:
// Create an event-based (SAX-style) parser and register the callbacks defined below.
$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written
xml_set_element_handler($xml_parser, "start_element", "end_element");
xml_set_character_data_handler($xml_parser, "characters");

$file = "logs.xml";
if ($file_stream = fopen($file, "r")) {

    // Feed the parser 4096 bytes at a time; the third argument flags the last chunk.
    while ($data = fread($file_stream, 4096)) {

        $this_chunk_parsed = xml_parse($xml_parser, $data, feof($file_stream));
        if (!$this_chunk_parsed) {
            $error_code = xml_get_error_code($xml_parser);
            $error_text = xml_error_string($error_code);
            $error_line = xml_get_current_line_number($xml_parser);

            $output_text = "Parsing problem at line $error_line: $error_text";
            die($output_text);
        }

    }
    fclose($file_stream);

} else {

    die("Can't open XML file.");

}
xml_parser_free($xml_parser);


// Functions

// Called for every opening tag; $attrs maps attribute names to values.
function start_element($parser, $name, $attrs) {
    print "<b>Start Element:</b> $name<br />";
    print "<b>---Attributes:</b> <br />";
    foreach ($attrs as $key => $value) {
        // Escape the value so markup characters in the data can't break the page.
        print "$key = " . htmlspecialchars($value) . "<br />";
    }
    print "<br />";
}

// Called for every closing tag.
function end_element($parser, $name) {
    print "<b>End Element:</b> $name<br /><br />";
}

// Called for text between tags; one text node may arrive in several pieces.
function characters($parser, $chars) {
    print "<p><i>" . htmlspecialchars($chars) . "</i></p>";
}
This works with smaller XML files. But with the big file it errors out with this:

Start Element: opt
---Attributes:

Parsing problem at line 3: XML_ERR_NAME_REQUIRED

Looking at my XML file, I'm finding that its structure is actually a mess, and not conducive to easy use.

For example, here's what I mean:

Code:
<?xml version="1.0"?>
<opt>
  <stock>
    <price>14.95</price>
    <price>2.95</price>
    <price>3.50</price>
    <price>7.25</price>
    <price>29.85</price>
    <price>147.65</price>
    <price>1.49</price>
    <price>12.26</price>
    <price>15.00</price>
This goes on for tons of items, say maybe 50. Then after that you've got:

Code:
    <item>Journal Set</item>
    <item>Sticker Set</item>
    <item>Magnet Set</item>
    <item>Markers</item>
    <item>Printer</item>
    <item>Custom Pen</item>
    <item>Mug</item>
    <item>T-Shirt</item>
And so on, and after a matching number of items (same amount as prices) I would have:

Code:
    <status>In stock</status>
    <status>out of stock</status>
    <status>In stock</status>
    <status>In stock</status>
    <status>In stock</status>
    <status>In stock</status>
    <status>In stock</status> 
    <status>out of stock</status>
  </stock>
  <warehouse>
   ....
  </warehouse>
</opt>
That is, one <status> tag for each item. And there is more than one section: <stock></stock> is one, then <warehouse></warehouse> is another with the same <price></price> and <item></item> tags, and so on. So it gets BIG.

This is not good. The file is dumped from a Perl script. This is just an example of the structure I'm dealing with, not the actual tags and items, but it gives you an idea of the XML structure of the large file.
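If the dump has to be consumed as it is, the parallel lists can be stitched back together while parsing. This is a sketch under one big assumption: within each section, the i-th <price>, i-th <item> and i-th <status> belong to the same product. `regroup_sections()` is a hypothetical helper, and it takes the XML as a string only to keep the example short; the same handlers could be driven by a chunked fread() loop. It assumes PHP 7 or later.

```php
<?php
/**
 * Regroup parallel <price>/<item>/<status> lists into one record per item.
 * Assumes the i-th <price>, <item> and <status> inside a section
 * (<stock>, <warehouse>, ...) describe the same product.
 */
function regroup_sections(string $xml): array
{
    $fields = ['price', 'item', 'status'];
    $sections = [];   // section name => list of records
    $section = null;  // name of the section currently open
    $columns = [];    // field name => values collected for the open section
    $depth = 0;       // 1 = root, 2 = section, 3 = field
    $text = '';       // text collected for the element currently open

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$depth, &$section, &$columns, &$text) {
            $depth++;
            if ($depth === 2) {   // direct child of the root: a new section starts
                $section = $name;
                $columns = [];
            }
            $text = '';
        },
        function ($p, $name) use (&$depth, &$section, &$columns, &$text, &$sections, $fields) {
            if ($depth === 3 && in_array($name, $fields, true)) {
                $columns[$name][] = trim($text);
            } elseif ($depth === 2) {
                // Section closed: zip the columns together by position.
                $rows = [];
                $n = count($columns['item'] ?? []);
                for ($i = 0; $i < $n; $i++) {
                    $rows[] = [
                        'item'   => $columns['item'][$i],
                        'price'  => $columns['price'][$i] ?? null,
                        'status' => $columns['status'][$i] ?? null,
                    ];
                }
                $sections[$section] = $rows;
            }
            $text = '';
            $depth--;
        }
    );
    xml_set_character_data_handler($parser, function ($p, $chars) use (&$text) {
        $text .= $chars;   // text may arrive in several pieces
    });

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $sections;
}
```

This holds the columns of one section at a time before zipping them, so memory use depends on the largest section, and the returned array grows with the whole file; records could instead be written out as each section closes.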

What I think I need to do is adjust the Perl script so that I have something dumped like this:
Code:
<stock>
   <item name="Sticker Set">
      <price value="2.95"></price>
      <status value="In Stock"></status>
   </item>
</stock>
Or something like that. At least it would help keep the data straight! But as for how long the data is, that's a different story. I could also do this:

Code:
   <item data="sticker set|2.95|In Stock"></item>
Which might greatly reduce the file size as well as provide me an easy way to parse the data and put it into arrays, etc. for displaying.

So I'm rethinking the actual XML file. Maybe with a bit of adjustment to that, I can get a common ground going where SAX would work.
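A sketch of how a per-item layout could be consumed with the same chunked loop, one record at a time. The layout `<item name="..." price="..." status="..."/>`, those attribute names, and the `each_item()` helper are all assumptions made for this example, not the actual format of the dump. It assumes PHP 7.1 or later.

```php
<?php
/**
 * Stream <item name="..." price="..." status="..."/> records to a callback.
 * $stream is an already opened file handle; the attribute names are assumed.
 */
function each_item($stream, callable $on_item, int $chunk_size = 4096): void
{
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use ($on_item) {
            if ($name === 'item') {
                $on_item($attrs);   // hand over one record; nothing is accumulated here
            }
        },
        function ($p, $name) {
        }
    );

    while (!feof($stream)) {
        $data = fread($stream, $chunk_size);
        if ($data === false) {
            break;
        }
        // feof() is true once the last bytes have been read, which marks the final chunk.
        if (!xml_parse($parser, $data, feof($stream))) {
            $line = xml_get_current_line_number($parser);
            $text = xml_error_string(xml_get_error_code($parser));
            throw new RuntimeException("Parsing problem at line $line: $text");
        }
    }
}

// Usage: print each record as it is parsed.
// $fh = fopen('logs.xml', 'rb');
// each_item($fh, function (array $item) { echo $item['name'], "\n"; });
// fclose($fh);
```

Because every record is complete in a single start-tag event, the callback can display or store it right away and nothing needs to be matched up afterwards.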

Last edited by RavenLX; 02-18-2009 at 11:09 AM. Reason: Fixed status tags in example code.
 
  

