Of course, if you'd really read the links given, you'd understand that it's not a good idea to trust any regex-based solution for extracting XML/HTML data. That said, using xmlstarlet I came up with a solution that pretty much duplicates the desired output above:
xmlstarlet fo -H -R -Q parsing-html-the-cthulhu-way.html | xmlstarlet sel -T -t -o '<text>' -v '//div[@class="blogbody"]/*[position()>2 and position()<10]' -v '//div[@class="blogbody"]/blockquote/node()[not(self::*)]' -o '</text>' | grep -v '^$'
I have to admit that this particular case was rather tricky, due to the way the paragraphs, blockquotes, and pre tags nest, and to my own inexperience with XPath. Most of the time it shouldn't be quite this difficult.
To explain it though:
First, I downloaded the actual page source, rather than using the above cut&paste section. The headers needed to be included before it would correctly handle the unicode text.
xmlstarlet fo -H -R -Q parsing-html-the-cthulhu-way.html
The first run-through ensures that the page is formatted (fo) correctly in clean xhtml. See the documentation for details on the options.
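For reference, and going from memory of the xmlstarlet documentation (so double-check against your version), the options used here do roughly the following:

# -H  treat the input as HTML rather than strict XML
# -R  try to recover whatever is parsable from broken markup
# -Q  keep quiet about the warnings that recovery would otherwise print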
xmlstarlet sel -T -t -o '<text>' ... -o '</text>' |
The sel (select) command begins with -T for plain text output, and -t, which starts the template string. All the rest of the command is the template. The two -o options print literal text strings before and after the commands that do the extraction.
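If the template mechanism is new to you, a throwaway sketch on dummy input (not part of the solution) shows the idea:

echo '<r><a>one</a><b>two</b></r>' | xmlstarlet sel -T -t -o 'before: ' -v '//a' -o ' :after'

Each -o prints its argument literally and each -v prints the value selected by its XPath expression, so the text of <a> comes out wrapped between the two literal strings.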
-v '//div[@class="blogbody"]/*[position()>2 and position()<10]'
The first XPath expression locates the first <div> that has the class "blogbody", and then prints the text values of child elements 3-9. This gives us most of the text, but not the last line. That's the tricky part.
The last line we want is inside a <blockquote> that also contains a <pre> tag and a few other <p> elements. So this time we match the 2nd blockquote inside "blogbody" and, using a not() function match I found with Google, print its contents while excluding all of its child elements.
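For reference, this is the piece of the pipeline that does it:

-v '//div[@class="blogbody"]/blockquote/node()[not(self::*)]'

node() matches child nodes of any type, and [not(self::*)] filters out the ones that are elements, so what's left is the loose text sitting directly inside the blockquote rather than the contents of its <pre> and <p> children.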
Finally, I piped the output through grep to remove the extra blank lines. There's probably a way to do it through XPath, but I don't know how at this point.
The output I get:
Among programmers of any experience, it is generally regarded as A Bad Ideatm to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness:
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.
Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.
That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's … er … code.
This is all good fun, but the warning here is only partially tongue in cheek, and it is born of a very real frustration.
I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:
You might also be interested in the pyx command, which converts XHTML into a line-based format that can more safely be processed with regex.
xmlstarlet fo -H -R -Q -e utf-8 source.html | xmlstarlet pyx
(Unfortunately though, there seems to be a bug involved with this particular page that causes it to crash.)
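To get a feel for what pyx produces, try it on a trivial snippet:

echo '<p class="intro">Hi there</p>' | xmlstarlet pyx

which should print something close to (details may vary with your xmlstarlet version):

(p
Aclass intro
-Hi there
)p

Each node becomes one line with a one-character prefix: ( for a start tag, A for an attribute, - for character data, and ) for an end tag, which is what makes it reasonably safe to run line-oriented tools like grep or sed over it.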
In addition, there are a couple more tools you might consider. The html-xml-utils are a suite of small applications that can be very useful, particularly hxselect, which you can use to extract values from HTML based on tags or other CSS selectors. Another tool in the suite is hxpipe, which converts HTML/XML to ESIS format, the foundation for the pyx format mentioned above. It seems to be more robust too, so I highly recommend it when you need to do simple extraction jobs on arbitrary HTML, as it's less likely to run up against errors on poorly formed markup.
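I haven't tried these on this particular page, but typical usage looks something like the following (hxselect wants well-formed input, which is why hxnormalize -x goes first; the selector is only an example):

hxnormalize -x parsing-html-the-cthulhu-way.html | hxselect -c 'div.blogbody blockquote'
hxpipe parsing-html-the-cthulhu-way.html

The -c flag tells hxselect to print only the content of the matched elements rather than the tags themselves.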
Then there's always html2text and similar commands for general stripping of HTML tags.
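In the simplest case that's just something like (most versions accept a file argument or read stdin):

html2text parsing-html-the-cthulhu-way.html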
Or, as already mentioned, if you really want to go hardcore, switch to perl or another language that has real parsing ability built in.
But do try to learn how to work with HTML properly in any case.