XML can't store HTML?!?

Travis86 · 08-20-2003, 10:17 AM

I'm trying to get XML to store HTML data. I figured that if I told it in the DTD that it was #PCDATA, it would just take it as text and wouldn't try to interpret it. Unfortunately, it thinks I'm bringing up tags I didn't mention in the DTD.

What do I do now? If I were to use a syntax like &lt;html&gt;(etc...) the file would be bigger, and I would have to do a bunch of text crunching. Why can't it just take it as #PCDATA?

Thanks.

lackluster · 08-20-2003, 10:20 AM

because they're tags, of course. if you want to store HTML stuff, either do what you suggested with &lt; &gt; or throw out the DTD ..... good luck

Travis86 · 08-20-2003, 12:09 PM

Well, yeah, they're tags, but can't I tell it they're not? What's a DTD for besides telling it what things are? And what do you mean by throw out the DTD?

lackluster · 08-20-2003, 01:51 PM

what exactly are you trying to do? if it's just your own XML data and your own XML parser, you don't really need the DTD. the DTD is just saying "this document is valid XML as long as it follows the rules of XML .... here are the elements which are valid for this XML document". that's why you can't throw random tags in there. On the other hand, if you omit the DTD then you're just saying "this document is valid XML as long as it follows the rules of XML .... the tags are arbitrary".
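
To make the no-DTD case concrete, here is a minimal sketch. Python and the sample `menu` documents are assumptions for illustration, not something from this thread. The standard library's ElementTree parser does not validate against a DTD, so it only checks the basic rules of XML: arbitrary tags are accepted, mismatched tags are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; neither one declares a DTD.
WELL_FORMED = "<menu><item><b>Soup</b> of the day</item></menu>"
MALFORMED = "<menu><item><b>Soup</item></menu>"  # <b> is never closed


def is_well_formed(xml_text: str) -> bool:
    """Return True if the non-validating parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(WELL_FORMED))
    print(is_well_formed(MALFORMED))
```

Note that in the accepted document `<b>` is still parsed as a child element of `<item>`, not stored as text, so dropping the DTD alone does not turn the HTML into plain character data.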

Travis86 · 08-20-2003, 03:37 PM

I'm trying to send the XML through PHP. I think I tried it without a DTD when I first started, but PHP requires a DTD.

However, I might try something like:

<!DOCTYPE menu [
<!ELEMENT menu ANY>
]>

hmmm.... Do you think I could still use attributes and things then?

I'd really just like to tell it not to interpret what I've marked as #PCDATA, but alas.

lackluster · 08-20-2003, 10:33 PM

i've never used PHP & XML together. just stick with your original workaround. is the html you're trying to store static? if so the following perl code might help (i purposely avoided modules so you won't need to install anything else):

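#!/usr/bin/perl
use strict;
use warnings;

# Read the whole HTML file into one string.
open(my $fh, '<', 'the_html_file.html') or die "Cannot open the_html_file.html: $!";
my $dirty = join '', <$fh>;
close($fh);

# Escape & first so the entities added below are not escaped a second time.
$dirty =~ s/&/&amp;/g;
$dirty =~ s/</&lt;/g;
$dirty =~ s/>/&gt;/g;
print $dirty;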

now that's totally untested, but it should work fine. if your html data is dynamic, you can use the same regex's in php. try to keep the DTD, it's good practice.
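
For the dynamic case, the same escaping idea can be sketched in Python (an assumption here, since the thread itself only uses Perl and PHP). The `menu`/`item` wrapper and the sample snippet are hypothetical. `xml.sax.saxutils.escape` replaces `&`, `<` and `>` with entities, and the parser turns them back into the original text when the document is read.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical HTML snippet to be stored as plain text inside XML.
html_snippet = '<p class="x">Fish &amp; chips</p>'


def wrap_escaped(html: str) -> str:
    """Escape &, < and > so the HTML is stored as character data."""
    return "<menu><item>" + escape(html) + "</item></menu>"


doc = wrap_escaped(html_snippet)
item = ET.fromstring(doc).find("item")

if __name__ == "__main__":
    print(doc)
    # The parser hands back the original HTML as text, with no child elements.
    print(item.text == html_snippet, len(item))
```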

let me know how it turned out for you.

Travis86 · 08-20-2003, 11:09 PM

I dunno. A little too dirty for me, but it might be what I'll have to do. I'll give this some more thought.

german · 08-23-2003, 10:00 AM

Taken from O'Reilly's Java and XML book:

"A CDATA section is used when a significant amount of data should be passed on to the calling application without any XML parsing."

<![CDATA[

<html></html>

]]>

CDATA sections keep all whitespace, all bad characters, etc. etc. but you MUST be 100% sure that the contents do not contain "]]>", because that sequence ends the section and the parser will treat whatever follows as markup.
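
If you can't be sure about "]]>", one common workaround is to split it across two CDATA sections. Below is a minimal sketch in Python (an assumption, as are the `page` element and the sample HTML): every "]]>" in the content is rewritten so that "]]" closes one section and ">" opens the next.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Hypothetical HTML that happens to contain "]]>" inside a script.
html_snippet = "<html><script>if (a[b[0]]>1) go();</script></html>"
doc = "<page>" + wrap_in_cdata(html_snippet) + "</page>"
root = ET.fromstring(doc)

if __name__ == "__main__":
    print(doc)
    # The parser joins the adjacent sections back into the original text.
    print(root.text == html_snippet)
```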

HTH

B.

cludwin · 08-23-2003, 04:25 PM

Yes that is correct,

PCDATA stands for parsed character data, which means that the data goes through the xml parser, so html tags in it are treated as markup and a validating parser will raise an error for any that aren't declared in the DTD.

CDATA is not parsed for markup, so what you put in is essentially what you get out. However, I would suggest that you try to avoid wrapping html in xml. A cleaner way is to encode only your data in xml, then use xsl to generate your markup (html) and use css for your style. If speed is a concern then implement a cache.
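
A rough sketch of that separation, with plain Python standing in for the xsl step (this is not XSLT, just the same idea of keeping only data in the xml and generating the html from it). The menu document, the `price` attribute and `render_menu` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data-only document: no HTML markup is stored in it.
MENU_XML = """<menu>
  <item price="4.50">Soup &amp; bread</item>
  <item price="7.00">Fish</item>
</menu>"""


def render_menu(xml_text: str) -> str:
    """Generate an HTML list from the data held in the XML."""
    root = ET.fromstring(xml_text)
    rows = [
        "<li>{} ({})</li>".format(
            escape(item.text or ""), escape(item.get("price", ""))
        )
        for item in root.findall("item")
    ]
    return "<ul>" + "".join(rows) + "</ul>"


if __name__ == "__main__":
    print(render_menu(MENU_XML))
```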

hope this helps,
cludwin

german · 08-25-2003, 12:04 AM

that is exactly the architecture I have been working with for ~2 years now, and it works extremely well. However, there are instances where you may wish to preserve someone else's preformatted HTML and you would have to parse it into XML, then back into HTML in a similar form which would prove a PITA, so CDATA sections would probably come in useful.

B.

Travis86 · 08-25-2003, 07:18 PM

Ah! Just what I was looking for. Thanks.