Did you look at the PParse sources? Download the source distribution; the source is in the samples/src/PParse/ subdirectory. To me, the example seems pretty straightforward: most of the code is just there to handle command-line parameters and do careful error checking. Basically, you'll just need to add your own handling into the loop that calls parseNext().
Personally, I've never understood XML logging. The XML standard states that an XML file may contain only one root element, but XML log files either contain multiple root elements or lack the final end tag -- I've yet to see an XML log file that is fully XML compliant. (I don't really mean this as criticism, just as background for why I think a nonstandard solution may be correct or even required. If it does not follow the XML specs, is it really XML?)
I feel the best approach would be to somehow split each root node from the logging stream, and handle it separately. Any XML library should be able to handle such fragments with ease, since the split part should be perfectly XML compliant (at least, if a fixed header such as
<?xml version="1.0"?> is prepended to the string). In practical terms, the problem is how to split the input stream into separate root nodes.
In the generic case, it is possible and not too difficult to write an XML-aware, stack-based buffer, which reads a new root XML node from the input and, when it is complete, emits it to some other handler. (You need a finite state machine that detects start and end tags, pushing each new start tag onto a stack, and popping the topmost tag when/if it is closed by the corresponding end tag. When the stack becomes empty, you send the accumulated data to a real XML handler. It is not too difficult to write, but in my opinion, it may be more robust than what you really need.)
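To illustrate the idea, here's a minimal sketch of such a stack-based splitter in C++. It is deliberately naive -- it ignores comments, CDATA sections, and angle brackets inside quoted attribute values -- and the class and callback names are just placeholders I made up, not part of any real library:

```cpp
#include <functional>
#include <stack>
#include <string>

// Feed characters one at a time; whenever the tag stack returns to
// empty, the accumulated buffer is handed to `emit` as one complete
// root node. Naive sketch: no comments, CDATA, or quoted '<'/'>'.
class RootSplitter {
public:
    explicit RootSplitter(std::function<void(const std::string&)> emit)
        : emit_(std::move(emit)) {}

    void feed(char c) {
        buf_ += c;
        if (c == '<') { tag_.clear(); in_tag_ = true; return; }
        if (!in_tag_) return;
        if (c != '>') { tag_ += c; return; }
        in_tag_ = false;
        if (tag_.empty() || tag_[0] == '?' || tag_[0] == '!')
            return;                       // declaration, comment, doctype
        if (tag_[0] == '/') {             // end tag: pop the matching start
            if (!stack_.empty()) stack_.pop();
        } else if (tag_.back() != '/') {  // start tag (not self-closing)
            stack_.push(tag_.substr(0, tag_.find_first_of(" \t\r\n")));
        }
        if (stack_.empty()) flush();      // a root node just completed
    }

private:
    void flush() {
        // Trim whitespace left over between records, then emit.
        size_t start = buf_.find_first_not_of(" \t\r\n");
        if (start != std::string::npos)
            emit_(buf_.substr(start));
        buf_.clear();
    }

    std::function<void(const std::string&)> emit_;
    std::string buf_, tag_;
    std::stack<std::string> stack_;
    bool in_tag_ = false;
};
```

The callback would typically prepend the XML declaration and hand the fragment to a real XML parser.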
In your case, your log file most likely contains a fixed root node that wraps around each logged event, and does not appear elsewhere in the XML. If so, you could simply use either the start of the root node (say,
<logEvent followed by whitespace,
/, or
>), or the end of the closing element of the root node (say,
</logEvent> ) as a separator. Read the log stream into a text buffer, and whenever you encounter the separator, flush the buffered string by sending it to an XML parser for further processing. You may wish to prepend a fixed header string like
<?xml version="1.0"?> to each string to make sure the XML parser is happy. This way the XML parser sees each event as a separate XML file.
For example, if your log stream was
Code:
<logEvent id="foo">
... XML data related to event foo ...
</logEvent>
<logEvent id="bar">
... XML data related to event bar ...
</logEvent>
<logEvent id="baz">
... XML data related to event baz ...
</logEvent>
your C++ splitting code would need to act like this GNU awk snippet:
Code:
gawk -v cmd="cat" 'BEGIN { RS = "</logEvent>[\t\n\v\f\r ]*" }
{ printf("<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n%s%s\n", $0, RT) | cmd ;
close(cmd) }' input-stream
That is,
cat will be called once per completed root node. (Use
cmd="date; cat" if you want to see that each root node really is split into a new document.)
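For reference, the same separator-based split could be sketched in C++ roughly like this. The separator and the `handle` callback are assumptions to be adapted to your actual log format and parser:

```cpp
#include <istream>
#include <string>

// Split a log stream on a fixed end-tag separator, prepend an XML
// declaration to each fragment, and hand it to a callback -- the C++
// analogue of the gawk one-liner above. Sketch only: assumes the
// separator never appears inside attribute values or text content.
template <typename Handler>
void split_log(std::istream& in, const std::string& sep, Handler handle) {
    const std::string header =
        "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n";
    std::string buf, line;
    while (std::getline(in, line)) {
        buf += line;
        buf += '\n';
        size_t pos;
        while ((pos = buf.find(sep)) != std::string::npos) {
            size_t end = pos + sep.size();
            std::string fragment = buf.substr(0, end);
            buf.erase(0, end);
            // Drop inter-record whitespace left at the front.
            size_t start = fragment.find_first_not_of(" \t\r\n");
            if (start == std::string::npos) start = fragment.size();
            handle(header + fragment.substr(start));
        }
    }
}
```

Each fragment the callback receives should then be a self-contained document that any XML parser can digest.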
To make it easier to adjust if the logging format changes, you might wish to write the splitting code so that the separator is read from a configuration file. You might also wish to allow more than one separator string.
There are a number of ways you could do the splitting in C++, but I personally prefer C (C99 or GNU C) for this kind of processing, using low-level I/O (unistd.h rather than stdio.h), because my data files are very large and I really need the code to be as efficient as possible. So I'm not the best person to help you with the actual C++ implementation.