Parse streaming XML with C++

grob115 · 12-14-2011, 11:25 AM

Hi, I have a log file that is recorded in XML format. In other words, instead of writing a line with the details as name-value pairs, it writes these name-value pairs as an XML element.

If each record were written on a single line, I could tail the file and extract the values of interest with 'awk'. I'm not sure this can be done as easily when the data is in XML format, because each XML record spans multiple lines. Basically, if each XML record is written out in 10 lines, I need to wait until all 10 lines are written into the log before I can parse the record. I'm interested in doing this in C++.

It looks like the following is intended for this purpose. However, I can't find any actual example code illustrating how this can be done.
http://xerces.apache.org/xerces-c/pparse-3.html

LlamaXML and Libxml2 also look like good candidates. Does anyone have experience with them and can comment?

Nominal Animal · 12-14-2011, 04:25 PM

Did you look at the PParse sources? Download the source distribution, and the source is in the samples/src/PParse/ subdirectory. To me, the example seems pretty straightforward. Most of the code is just there to handle command-line parameters and careful error checking. Basically, you'll just need to add your own handling into the loop that calls parseNext().

Personally, I've never understood XML logging myself. The XML standard states that XML files may contain only one root node, but either the XML log files contain multiple nodes, or they lack the final end tag -- I've yet to see an XML log file to be fully XML compliant. (I don't really mean this as criticism, just as a background for why I think a nonstandard solution may be correct or even required. If it does not follow the XML specs, is it really XML?)

I feel the best approach would be to somehow split each root node from the logging stream, and handle it separately. Any XML library should be able to handle such fragments with ease, since the split part should be perfectly XML compliant (at least, if a fixed header such as <?xml version="1.0"?> is prepended to the string). In practical terms, the problem is how to split the input stream into separate root nodes.

In the generic case, it is possible and not too difficult to write an XML-like stack-based buffer, which reads a new root XML node from the input, and when complete, emits it to some other handler. (You need a finite state machine that can detect start and end tags, pushing each new start tag to a stack, and popping the topmost tag when/if it is closed using the corresponding end tag. When the stack becomes empty, you send the buffered data to a real XML handler. It is not too difficult to write, but in my opinion, it may be much more robust than what you really need.)
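A minimal C++ sketch of such a buffer follows. It is a simplification of the idea above, not a full implementation: it keeps a nesting depth counter instead of a stack of tag names (enough when the input is properly nested), and it assumes the stream contains no comments, CDATA sections or DOCTYPE declarations. The class name RootSplitter and its feed() method are hypothetical, invented for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical helper (not from any library): buffers input until one
// complete top-level element has been seen, then hands it to a callback.
class RootSplitter {
public:
    using Handler = std::function<void(const std::string&)>;

    explicit RootSplitter(Handler handler) : emit_(std::move(handler)) {
        assert(static_cast<bool>(emit_));  // a handler is required
    }

    // Feed the next chunk of the stream. A chunk may end anywhere,
    // even in the middle of a tag.
    void feed(const std::string& chunk) {
        for (char c : chunk) step(c);
    }

private:
    enum class State { Text, TagOpen, InTag, InQuote };

    void step(char c) {
        // Ignore whitespace and stray text between top-level elements.
        if (depth_ == 0 && state_ == State::Text && c != '<') return;
        buf_.push_back(c);

        switch (state_) {
        case State::Text:
            if (c == '<') state_ = State::TagOpen;
            break;
        case State::TagOpen:  // first character after '<'
            closing_ = (c == '/');
            special_ = (c == '?' || c == '!');
            last_ = c;
            state_ = State::InTag;
            break;
        case State::InTag:
            if (c == '>') { endTag(); break; }
            last_ = c;
            if (c == '"' || c == '\'') { quote_ = c; state_ = State::InQuote; }
            break;
        case State::InQuote:  // '>' inside an attribute value is not a tag end
            if (c == quote_) state_ = State::InTag;
            break;
        }
    }

    void endTag() {
        state_ = State::Text;
        bool completed = false;
        if (special_) {
            // <?...?> or <!...>: does not change the nesting depth
        } else if (closing_) {
            if (depth_ > 0 && --depth_ == 0) completed = true;
        } else if (last_ == '/') {
            completed = (depth_ == 0);  // self-closing top-level element
        } else {
            ++depth_;
        }
        if (completed) emit_(buf_);
        if (depth_ == 0) buf_.clear();  // also drops a top-level <?xml ...?>
    }

    Handler emit_;
    std::string buf_;
    State state_ = State::Text;
    int depth_ = 0;
    bool closing_ = false;
    bool special_ = false;
    char last_ = 0;
    char quote_ = 0;
};
```

The handler would typically prepend an XML declaration to each emitted string and pass the result to a real XML parser. Because feed() accepts arbitrary chunks, it can be called with whatever data has been appended to the log so far; an incomplete record simply stays in the buffer until its end tag arrives.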

In your case, your log file most likely contains a fixed root node that wraps around each logged event, and does not appear elsewhere in the XML. If so, you could simply use either the start of the root node (say, <logEvent followed by whitespace, /, or >), or the end of the closing element of the root node (say, </logEvent>) as a separator. Read the log stream into a text buffer, and whenever you encounter the separator, flush the preceding string by sending it to an XML parser for further processing. You may wish to prepend a fixed header string like <?xml version="1.0"?> to each string to make sure the XML parser is happy. This way the XML parser sees each event as a separate XML file.

For example, if your log stream was

Code:

<logEvent id="foo">
 ... XML data related to event foo ...
</logEvent>
<logEvent id="bar">
 ... XML data related to event bar ...
</logEvent>
<logEvent id="baz">
 ... XML data related to event baz ...
</logEvent>

your C++ splitting code would need to act like this GNU awk snippet:

Code:

gawk -v cmd="cat" 'BEGIN { RS = "</logEvent>[\t\n\v\f\r ]*" }
    { printf("<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n%s%s\n", $0, RT) | cmd ;
      close(cmd) }' input-stream

That is, cat will be called once per completed root node. (Use cmd="date; cat" if you want to see that each root node really is split into a new document.)

To make it easier to adjust if the logging format changes, you might wish to write the splitting code so that the separator is read from a configuration file. You might also wish to allow more than one separator string.
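As a rough C++ counterpart to the gawk snippet, here is a minimal sketch of separator-based splitting. The class name SeparatorSplitter and its feed() method are hypothetical. The separator is a constructor argument, so it could come from a configuration file; the sketch assumes a single separator that is the closing tag of the root node and does not start with whitespace. Unlike the gawk version, it drops the whitespace between records instead of passing it along.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical helper: cuts the stream after every occurrence of the
// separator and passes each piece, with an XML declaration prepended,
// to the handler.
class SeparatorSplitter {
public:
    using Handler = std::function<void(const std::string&)>;

    SeparatorSplitter(std::string separator, Handler handler)
        : sep_(std::move(separator)), emit_(std::move(handler)) {
        assert(!sep_.empty() && static_cast<bool>(emit_));
    }

    // Feed the next chunk of the stream; a chunk may end mid-record
    // or even in the middle of the separator itself.
    void feed(const std::string& chunk) {
        // Resume searching slightly before the new data, in case the
        // separator straddles the previous chunk and this one.
        std::size_t from =
            buf_.size() < sep_.size() ? 0 : buf_.size() - sep_.size() + 1;
        buf_ += chunk;

        std::size_t pos;
        while ((pos = buf_.find(sep_, from)) != std::string::npos) {
            const std::size_t end = pos + sep_.size();
            // Skip the whitespace left over after the previous record.
            std::size_t start = buf_.find_first_not_of(" \t\n\v\f\r");
            if (start > pos) start = pos;
            emit_(header_ + buf_.substr(start, end - start));
            buf_.erase(0, end);
            from = 0;
        }
    }

private:
    const std::string header_ = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n";
    std::string sep_;
    Handler emit_;
    std::string buf_;
};
```

A program tailing the log would construct it as SeparatorSplitter("</logEvent>", handler), call feed() with each block of newly read data, and let the handler pass every complete document to the XML parser of its choice. Supporting more than one separator would mean searching for the earliest match among several strings, which this sketch leaves out.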

There are a number of ways you could do the splitting in C++, but I personally prefer to use C (C99 or GNU C) for this kind of processing (using low-level I/O, i.e. unistd.h and not stdio.h -- my data files are very large, and I really need the code to be as efficient as possible), so I'm not the best person to help you with the actual C++ implementation.

grob115 · 12-14-2011, 04:50 PM

Thanks for the link to the source. Can you compare it to the other two libs?
Also, you have provided a lot of good information on breaking down the log file based on nodes using non-C++ commands, which is good to know. Thanks.

Nominal Animal · 12-14-2011, 07:39 PM

Like I said, I prefer C over C++ for data processing. LlamaXML and Xerces are both C++ only libraries, AFAICT. Libxml2 is the only one of the three that can be used from C, and it is used by a lot of projects. I'm not certain if it is because libxml2 is that good, or just because there are tons more C projects than C++ projects.

(It is usually pretty simple to write C++ bindings for a C library. Writing C wrappers for a C++ library is usually impractical. This is why C++ libraries are rarely used for anything other than C++ projects, but C libraries are used across languages. Since I don't use C++ for this kind of stuff, I just don't know which library is best for you.)

I recommend you look at the documentation, especially the samples and examples that are included in the source packages, and make up your own mind.

theNbomr · 12-17-2011, 01:35 PM

Quote:

Originally Posted by Nominal Animal

Personally, I've never understood XML logging myself. The XML standard states that XML files may contain only one root node, but either the XML log files contain multiple nodes, or they lack the final end tag -- I've yet to see an XML log file to be fully XML compliant. (I don't really mean this as criticism, just as a background for why I think a nonstandard solution may be correct or even required. If it does not follow the XML specs, is it really XML?)

In practice, you are right. Log files generally tend to grow unbounded and without expectation that there will be some 'closing off' point. This is, however, not necessarily the case, and one could easily imagine a tool which creates log files delimited by time periods such as days, weeks, hours, minutes, etc., and which might formally close the log by completing the XML closing tags for each period, to conform to the DTD.
I mention this because I would like to draw attention to the contrasting nature of two popular classes of XML parsers. In one case, the parser reads the entire document into data structures that reflect the nature of the XML data. The application then accesses the data by navigating through it in DOM fashion. This is probably not appropriate for the OP's use case. In another common style of XML parser, the user code registers callbacks, which the parser invokes as it encounters the specified XML elements. The user code is then able to respond to the found elements as appropriate to the application. Using this model, the application will be able to 'get' the data before the XML structure is fully terminated, and the matter of full XML compliance becomes largely irrelevant. SAX is such an event-based interface; it is implemented in libxml2 and may suit the OP's use case for parsing log files.

--- rod.