Quick question on XML parsing.

vxc69 · 02-24-2010, 01:11 PM

Hello,

When I parse an XML file, should I rely on the order of elements?

For example say we have:

<book>
<author></author>
<title></title>
</book>

Should I rely on the above order?

Would the following still be valid:
<book>
<title></title>
<author></author>
</book>

I'm trying to find out if a well-formed XML document should have an ordered structure, or if it's still valid XML if it has no order.

I think I'm doing it wrong if I rely on the order, because order shouldn't be important. It wouldn't make sense if it were, right?

Thanks

Sergei Steshenko · 02-24-2010, 01:32 PM

AFAIK XML itself does not require elements to appear in any particular order.

There are also ready-made libraries for XML parsing (libxml2, for example), so you should probably use one of them.

tuxdev · 02-24-2010, 01:40 PM

XML itself doesn't care anything about the data except that it's formatted correctly. This sort of concern would be defined by the schema of your particular flavor of XML.
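
To make the schema point concrete, here is a minimal sketch using the JDK's javax.xml.validation API. The XSD is made up for the <book> example in this thread (it is not anyone's real schema); it uses xs:sequence, which requires <author> before <title>. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class BookOrderCheck {
    // Hypothetical schema for illustration: xs:sequence fixes the child order.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='book'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='author' type='xs:string'/>"
      + "<xs:element name='title' type='xs:string'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true if the document is well-formed and satisfies the schema above.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // schema violation, or the input is not well-formed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Both documents are well-formed; only the first matches the sequence.
        System.out.println(isValid("<book><author></author><title></title></book>"));
        System.out.println(isValid("<book><title></title><author></author></book>"));
    }
}
```

If the schema used xs:all instead of xs:sequence, either order would validate. So whether a parser may rely on the order depends on what the schema for that particular format promises.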

vxc69 · 02-24-2010, 03:34 PM

Quote:

Originally Posted by tuxdev

XML itself doesn't care anything about the data except that it's formatted correctly. This sort of concern would be defined by the schema of your particular flavor of XML.

Well, this is the problem. It's a huge XML file. To speed it up, once I find a particular parent node, to take just the information I want, I skip the parser a number of times so I get to the child node(s) I want in a particular parent node. This speeds it up immensely; however, the code isn't very nice, especially if the order changes in the future.

If I have a series of if statements to check every child node for what I want, it slows down.

This is a streaming pull parser.
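
One possible middle ground, sketched here assuming a StAX-style pull parser in Java (javax.xml.stream): instead of advancing the parser a fixed number of times, skip each unwanted child as a whole subtree and stop at the child with the wanted name. The class and method names are hypothetical, and the sketch assumes the parent contains only child elements (no loose text between them).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ChildByName {
    // Reader must be on a START_ELEMENT; consumes it and all of its descendants.
    static void skipElement(XMLStreamReader r) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Returns the text of the first direct child called `wanted`, or null if absent.
    static String childText(String parentXml, String wanted) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(parentXml));
            r.nextTag(); // move onto the parent's start tag
            while (r.nextTag() == XMLStreamConstants.START_ELEMENT) {
                if (wanted.equals(r.getLocalName())) {
                    return r.getElementText();
                }
                skipElement(r); // not the child we want, at whatever position
            }
            return null; // reached the parent's end tag without a match
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up fragment; the wanted child is deliberately not in a fixed position.
        String xml = "<parent><a>1</a><b><c>2</c></b><wanted>3</wanted></parent>";
        System.out.println(childText(xml, "wanted"));
    }
}
```

The cost is one string comparison per child, and the code no longer depends on where the wanted child happens to sit inside its parent.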

Performance or Reliability?

mattca · 02-24-2010, 04:04 PM

I think relying on the order would be a Bad Idea, unless something about the inherent nature of the data implies an order (e.g., a list of dates).

Quote:

Originally Posted by vxc69

To speed it up, once I find a particular parent node, to take just the information I want, I skip the parser a number of times so I get to the child node(s) I want in a particular parent node.

Hmmm.. not sure I understand exactly what you're dealing with here. But it sounds like you have a parent node that has multiple child nodes of the same type? And which child node you need changes?

Any chance of getting a snippet of your XML that demonstrates this?

Also, what language are you parsing this in?

Quote:

Originally Posted by vxc69

Performance or Reliability?

I say reliability. Performance isn't worth much if it doesn't work.

vxc69 · 02-24-2010, 05:21 PM

Quote:

Originally Posted by mattca

I say reliability. Performance isn't worth much if it doesn't work.

Well, not if the given XML is assured to have that order.

The XML is of this nature, and the file is a couple of gigabytes. I'm parsing it in Java using StAX:

Code:

<dingBatData>
  <dingBatEvent>
    <id>34</id>
    <name>LL(K)*</name>
    <apc>B1C9</apc>
    <killPos>
      <x>29.2</x>
      <y>32.1</y>
    </killPos>
  </dingBatEvent>
  <dingBatEvent>
    .
    .
    .
    <killPos>
      .
      .
    </killPos>
  </dingBatEvent>
  .
  .
  .
  .
</dingBatData>
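
For a document shaped like the one above, an order-independent StAX loop could look roughly like this. It is only a sketch: it assumes the wanted fields are <id> and the <x>/<y> inside <killPos>, and the class name, method name, and "id:x,y" output format are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DingBatReader {
    // Returns one "id:x,y" string per <dingBatEvent>, whatever order the children are in.
    static List<String> readEvents(String xml) {
        List<String> out = new ArrayList<>();
        String id = null, x = null, y = null;
        boolean inKillPos = false;
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Dispatch on the element name, not on its position.
                    switch (r.getLocalName()) {
                        case "id":      id = r.getElementText(); break;
                        case "killPos": inKillPos = true; break;
                        case "x":       if (inKillPos) x = r.getElementText(); break;
                        case "y":       if (inKillPos) y = r.getElementText(); break;
                        default:        break; // an element we don't need
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    switch (r.getLocalName()) {
                        case "killPos":
                            inKillPos = false;
                            break;
                        case "dingBatEvent":
                            out.add(id + ":" + x + "," + y);
                            id = x = y = null; // reset for the next event
                            break;
                        default:
                            break;
                    }
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // The second event is invented, with its children deliberately reordered.
        String xml = "<dingBatData>"
            + "<dingBatEvent><id>34</id><name>LL(K)*</name><apc>B1C9</apc>"
            + "<killPos><x>29.2</x><y>32.1</y></killPos></dingBatEvent>"
            + "<dingBatEvent><killPos><y>2.0</y><x>1.0</x></killPos><id>35</id></dingBatEvent>"
            + "</dingBatData>";
        System.out.println(readEvents(xml));
    }
}
```

For a multi-gigabyte file the reader would be created from a file stream rather than a string, and each event would be processed as it completes instead of being collected in a list. Whether the switch is measurably slower than counting parser steps is something to profile rather than assume.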

mattca · 02-24-2010, 05:41 PM

Quote:

Originally Posted by vxc69

Well, not if the given XML is assured to have that order.

Well then order has no impact on reliability, and your performance vs reliability question is meaningless in this context.

Quote:

Code:

<dingBatData>
  <dingBatEvent>
    <id>34</id>
    <name>LL(K)*</name>
    <apc>B1C9</apc>
    <killPos>
      <x>29.2</x>
      <y>32.1</y>
    </killPos>
  </dingBatEvent>
  <dingBatEvent>
    .
    .
    .
    <killPos>
      .
      .
    </killPos>
  </dingBatEvent>
  .
  .
  .
  .
</dingBatData>

I assume the nodes you're trying to avoid iterating through are "dingBatEvents"?

Unfortunately, I don't know much about parsing XML in Java. I've done a bit in PHP though, and was hoping you were using that.

nadroj · 02-24-2010, 06:38 PM

I haven't fully followed this thread but just wanted to make a few comments.

If you want to enforce the structure, content, ordering, etc., of an XML document, the way to do it is, as tuxdev said, with a schema. If the ordering of elements is very important, that is all the more evidence that reliability has higher priority than performance. The cornerstone of good software is quality. You could have a program that runs very fast but sometimes doesn't work. Alternatively, you could have a program that always works, but takes an unpredictable or inconvenient amount of time.

This could be compared to TCP vs. UDP in networking, where TCP is reliable with more overhead, and UDP is less reliable with less overhead. Each has its own application. You probably wouldn't prefer to use TCP to listen to streaming music. Also, you probably wouldn't prefer to use UDP in some critical web service API.