LinuxQuestions.org
Old 12-23-2012, 01:30 PM   #1
corfuitl
Member
 
Registered: Mar 2012
Posts: 31

Rep: Reputation: Disabled
Clean up html file and generate an xml


hi,

I want to extract some text from an .html file, clean it up and generate a new xml file. I use the command

Code:
sed -n -e '/start/, /end/p'  < "$FILE" > /tmp/$$
in order to extract the text, and the command

Code:
sed -e 's/<[a-zA-Z\/][^>]*>//g' -e '/^$/d' </tmp/$$
in order to remove the unnecessary tags. Could you tell me how I can put the extracted text in tags using bash?
Thank you in advance!
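
For anyone trying this out, the two steps above can be run end to end on a tiny sample. This is only a sketch: the sample HTML, the `id="start"` / `id="end"` markers and the use of `mktemp` (instead of `/tmp/$$`) are assumptions made for illustration, not part of the original post.

```shell
#!/bin/sh
# Hypothetical sample input; the "start"/"end" ids stand in for whatever
# patterns delimit the wanted section in the real file.
FILE=$(mktemp)
TMP=$(mktemp)
cat > "$FILE" <<'EOF'
<p>menu</p>
<div id="start">
<p>First <b>paragraph</b>.</p>

<p>Second paragraph.</p>
<div id="end"></div>
</div>
<p>footer</p>
EOF

# Step 1: keep only the lines from the first "start" match to the next "end" match.
sed -n -e '/start/,/end/p' < "$FILE" > "$TMP"

# Step 2: drop anything that looks like a tag, then delete empty lines.
extracted=$(sed -e 's/<[a-zA-Z\/][^>]*>//g' -e '/^$/d' < "$TMP")
printf '%s\n' "$extracted"

rm -f "$FILE" "$TMP"
```

Note that the range in step 1 works on whole lines, so the marker lines themselves are included and only disappear here because step 2 strips them down to nothing.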
 
Old 12-23-2012, 05:43 PM   #2
Snark1994
Senior Member
 
Registered: Sep 2010
Location: Wales, UK
Distribution: Arch
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 345
Yay, I get to share the link again!

In all seriousness, unless I'm misunderstanding what you're doing then you would be better off using an XML parser in a language like perl, python or ruby.

Could we have a specific example of the (input) HTML file and the (output) XML file you want to generate? That way we can assess how to go about it the best way.

Thanks,
 
Old 12-23-2012, 06:30 PM   #3
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,569
Blog Entries: 1

Rep: Reputation: 1026
Quote:
Originally Posted by Snark1994
Yay, I get to share the link again!
LOL.

To the OP: As the above poster states: 1. show us what exactly you want to do (code samples), 2. Most probably XML parsers (google) are what you need.
 
Old 12-23-2012, 11:22 PM   #4
dugan
Senior Member
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 4,769

Rep: Reputation: 1467
Quote:
Originally Posted by Snark1994
Yay, I get to share the link again!
I thought it was going to be this link.
 
Old 12-24-2012, 02:15 AM   #5
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,569
Blog Entries: 1

Rep: Reputation: 1026
Quote:
Originally Posted by dugan
I thought it was going to be this link.
Another good one
 
Old 12-24-2012, 04:41 AM   #6
corfuitl
Member
 
Registered: Mar 2012
Posts: 31

Original Poster
Rep: Reputation: Disabled
hi!

thank you for your answers!

I have an html file and want to extract some strings between tags. Then, I want to clean up the unnecessary tags and create a new file with the extracted text wrapped in tags.
For example from the following text:

Code:
<h2 class="title"><a href="http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html" class="title-link">Parsing Html The Cthulhu Way</a></h2> 

<h3 class="date">November 15, 2009</h3> 

<p>
Among programmers of any experience, it is generally regarded as A Bad Idea<sup>tm</sup> to attempt to parse HTML with regular expressions. How bad of an idea? It apparently <a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">drove one Stack Overflow user to the brink of madness</a>:
<p>
<blockquote>
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
<p>
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.
<p>
Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The &lt;center&gt; cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.
</p></p></blockquote>
<p>
That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god <a href="http://en.wikipedia.org/wiki/Cthulhu">Cthulhu's</a> € er € <i>code</i>.
<p>
<img alt="kraken-cthulhu.jpg" class="at-xid-6a0120a85dcdae970b0120a86e32a6970b" height="405" src="http://codinghorror.typepad.com/.a/6a0120a85dcdae970b0120a86e32a6970b-pi" width="540" />
<p>
This is all good fun, but the warning here is only partially tongue in cheek, and it is born of <a href="http://oubliette.alpha-geek.com/2004/01/12/bring_me_your_regexs_i_will_create_html_to_break_them">a very real frustration</a>. 
<p>
<blockquote>
I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:
<p>
<pre>
 # pull out data between &lt;td&gt; tags
($table_data) = $html =~ /&lt;td&gt;(.*?)&lt;\/td&gt;/gis;
</pre>
I want to create:

Code:
<text>Among programmers of any experience, it is generally regarded as A Bad Ideatm to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness:
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.
Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.
That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's € er € code.
This is all good fun, but the warning here is only partially tongue in cheek, and it is born of a very real frustration.
I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:</text>
 
Old 12-24-2012, 05:15 AM   #7
Snark1994
Senior Member
 
Registered: Sep 2010
Location: Wales, UK
Distribution: Arch
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 345
In which case, you're fine using regex. It's as simple as:

Code:
echo "<text>$(sed -e 's/<[a-zA-Z\/][^>]*>//g' -e '/^$/d' </tmp/$$)</text>"
(the part inside $(...) is the second command you gave us)

Hope this helps,
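
The same idea can be wrapped in a small function so that the temporary file is not needed. A minimal sketch, assuming the extracted HTML arrives on stdin; the function name `wrap_text` and the sample input are made up for illustration.

```shell
#!/bin/sh
# wrap_text: hypothetical helper that strips tag-like text from stdin,
# drops empty lines and wraps the result in <text>...</text>.
wrap_text() {
    printf '<text>%s</text>\n' "$(sed -e 's/<[a-zA-Z\/][^>]*>//g' -e '/^$/d')"
}

result=$(printf '%s\n' '<p>Hello <i>world</i></p>' '' \
    '<p>The &lt;center&gt; cannot hold</p>' | wrap_text)
printf '%s\n' "$result"
```

Entities such as `&lt;` are left encoded by this approach, unlike the literal `<center>` shown in the desired output in post #6. Leaving them encoded is probably what you want if the result has to be well-formed XML.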
 
Old 12-24-2012, 12:02 PM   #8
corfuitl
Member
 
Registered: Mar 2012
Posts: 31

Original Poster
Rep: Reputation: Disabled
Thank you! It works perfectly!
 
Old 12-24-2012, 03:48 PM   #9
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1948
Of course if you'd really read the links given, you'd understand that it's not a good idea to trust any regex-based solution when extracting xml/html data.


Using xmlstarlet, I came up with a solution that pretty much duplicates the desired output above:

Code:
xmlstarlet fo -H -R -Q parsing-html-the-cthulhu-way.html | xmlstarlet sel -T -t -o '<text>' -v '//div[@class="blogbody"][1]/*[position()>2 and position()<10]' -v '//div[@class="blogbody"][1]/blockquote[2]/node()[not(self::*)]' -o '</text>' | grep -v '^$'
I have to admit that this particular case was rather tricky, due to the way the paragraphs, blockquotes, and pre tags nest, and to my own inexperience with xpath. Most of the time it shouldn't be quite this difficult.

To explain it though:

First, I downloaded the actual page source, rather than using the above cut&paste section. The headers needed to be included before it would correctly handle the unicode text.

Code:
xmlstarlet fo -H -R -Q parsing-html-the-cthulhu-way.html
The first run-through ensures that the page is formatted (fo) correctly in clean xhtml. See the documentation for details on the options.

Code:
xmlstarlet sel -T -t -o '<text>' ... -o '</text>' |
The sel (select) command begins with -T for plain text output, and -t, which starts the template string. All the rest of the command is the template. The two -o options print literal text strings before and after the commands that do the extraction.

Code:
-v '//div[@class="blogbody"][1]/*[position()>2 and position()<10]'
The first xpath expression locates the first <div> that has the class "blogbody", and then prints the text values of child elements 3-9. This gives us most of the text, but not the last line. That's the tricky part.

Code:
-v '//div[@class="blogbody"][1]/blockquote[2]/node()[not(self::*)]'
The last line we want is inside a <blockquote> that also contains a <pre> tag and a few other <p> elements. So this time we match the 2nd blockquote inside "blogbody", and using a not() function match I found with google, print its contents while excluding all of its child elements.

Finally, I piped the output through grep to remove the extra blank lines. There's probably a way to do it through xpath, but I don't know how at this point.
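
One thing to watch with the grep step: `grep -v '^$'` only removes lines that are completely empty. A small sketch with made-up sample text, showing a variant that also drops lines containing nothing but whitespace (whether the real page produces such lines is an assumption):

```shell
#!/bin/sh
# Made-up sample: one empty line and one line holding only spaces.
sample=$(printf 'first\n\n   \nsecond\n')

# Removes the empty line but keeps the spaces-only line.
only_empty=$(printf '%s\n' "$sample" | grep -v '^$')

# Removes both, using a POSIX character class for whitespace.
also_blank=$(printf '%s\n' "$sample" | grep -v '^[[:space:]]*$')

printf '%s\n' "$also_blank"
```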

The output I get:

Code:
<text>
Among programmers of any experience, it is generally regarded as A Bad Ideatm to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness:
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.
Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.
That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's … er … code.
This is all good fun, but the warning here is only partially tongue in cheek, and it is born of a very real frustration.
I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:
</text>
You might also be interested in the pyx command, which converts xhtml into a line-based format that can more safely be processed with regex.

Code:
xmlstarlet fo -H -R -Q -e utf-8 source.html | xmlstarlet pyx
(Unfortunately though, there seems to be a bug involved with this particular page that causes it to crash.)

In addition, there are a couple more tools you might consider. The html-xml-utils are a suite of small applications that can be very useful, particularly hxselect, which you can use to extract values from html based on tags or other CSS selectors. Another tool in the suite is hxpipe, which converts html/xml to ESIS format, the foundation for the pyx format mentioned above. It seems to be more robust too, so I highly recommend using this one when you need to do simple extraction jobs on arbitrary html, as it's less likely to run up against errors on poorly formed html.

Then there's always html2text and similar commands for general stripping of html tags.

Or, as already mentioned, if you really want to go hardcore, switch to perl or ruby or another language that has real parsing ability built in.

But do try to learn how to properly work with html in any case.
 
Old 12-25-2012, 12:55 PM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,494

Rep: Reputation: 850
Quote:
Originally Posted by David the H.
Of course if you'd really read the links given, you'd understand that it's not a good idea to trust any regex-based solution when extracting xml/html data.
Extracting certain tags is probably unsafe (the sed '/start/, /end/p') in that formatting changes to the HTML can break it. On the other hand, parser based solutions that pick out the nth element might also break if the site structure changes, so they're only a little bit more safe.

Extracting just the text from a marked up document is safe to do with regex, because the document is treated as a list of {text or tag} which is a regular language. Using sed isn't completely safe because it's line based, though.
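
To illustrate the line-based caveat: a tag that wraps across two lines is missed by the per-line substitution. One workaround, sketched here with a made-up two-line input, is to have sed collect the whole input into the pattern space before substituting. That holds everything in memory, so it is only reasonable for small files.

```shell
#!/bin/sh
# Made-up input: the opening <a ...> tag is split across two lines.
input='<a href="x"
   class="y">link</a> text'

# Line by line: the first line has no closing ">", so the split tag survives.
per_line=$(printf '%s\n' "$input" | sed -e 's/<[a-zA-Z\/][^>]*>//g')

# Whole input at once: keep appending the next line (N) until the last one,
# then run the substitution across the embedded newlines.
slurped=$(printf '%s\n' "$input" |
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[a-zA-Z\/][^>]*>//g')

printf 'per line: %s\nslurped:  %s\n' "$per_line" "$slurped"
```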

Quote:
I have to admit that this particular case was rather tricky, due to the way the paragraphs, blockquotes, and pre tags nest, and to my own inexperience with xpath. Most of the time it shouldn't be quite this difficult.
Here's a solution that tries to do a bit less node counting, although it still makes quite a few assumptions about the HTML (for instance, that the begin and end nodes are siblings):
Code:
xmlstarlet sel -T -t \
    --var beg='//div[@class="blogbody"]//*[contains(text(), "Among programm")]' \
    --var end='//div[@class="blogbody"]//*[contains(text(), "following code:")]' \
    -o '<text>' \
    -v '$beg' -v 'set:leading($beg/following-sibling::*, $end)' -v '$end/text()[1]' \
    -o '</text>' \
    parsing-html-the-cthulhu-way.xhtml | grep -v '^$'
Where parsing-html-the-cthulhu-way.xhtml is the output from the xmlstarlet fo command. The set:leading() function is listed at exslt.org.

Quote:
Finally, I piped the output through grep to remove the extra blank lines. There's probably a way to do it through xpath, but I don't know how at this point.
Probably doable, but way more awkward.

Last edited by ntubski; 12-25-2012 at 12:56 PM. Reason: s/xml/xmlstarlet
 
Old 12-25-2012, 03:28 PM   #11
Snark1994
Senior Member
 
Registered: Sep 2010
Location: Wales, UK
Distribution: Arch
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 345
@David the H.: To be fair, the last paragraph of the article says

Quote:
It's considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine. It's more important to understand the tools, and their strengths and weaknesses, than it is to knuckle under to knee-jerk dogmatism.
Given all the OP's matching is any tag, I don't see much wrong with that, as long as ey's aware that any text in angle brackets will also be removed.
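
A quick check of what the pattern actually removes may help here. In this sketch (sample strings invented for illustration, `strip_tags` a hypothetical helper name), anything from a `<` that is immediately followed by a letter or `/` up to the next `>` is dropped, whether or not it is a real tag, while a `<` followed by a space is left alone.

```shell
#!/bin/sh
strip_tags() {
    sed -e 's/<[a-zA-Z\/][^>]*>//g'
}

# Prose in angle brackets that starts with a letter is removed like a tag.
prose=$(printf '%s\n' 'mail <user at example dot org> now' | strip_tags)

# A "<" followed by a space does not match, so comparisons survive.
math=$(printf '%s\n' 'if 1 < 2 and 3 > 2' | strip_tags)

printf '%s\n%s\n' "$prose" "$math"
```

By the same logic, `<!-- comments -->` and `<!DOCTYPE ...>` lines are not touched, since `!` is neither a letter nor a slash.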
 
Old 12-26-2012, 10:32 AM   #12
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1948
Yes, yes. I admit that it's not always necessary to go the full parsing route. The first sentence in my last post was intended to be kind of tongue-in-cheek.

Relatively simple actions on well-formed input can certainly be done with regex tools, and may even be better in some cases. Faster and more convenient, certainly. But do note the well-formed caveat. The input has to be predictably regular if you want to safely use regular expressions (although I think tag nesting will always be a pain to handle).

That's why I also suggested the pyx/ESIS formats. Even if direct parsing isn't feasible, at least you can first re-format it into something that's safer and easier to work with.

In any case my post was, at least in part, just me learning how to do it myself, and then sharing the results.

(And special thanks to ntubski, by the way, for introducing me to xpath parsing in the first place, and continuing to help me refine my understanding of it. I'd been wondering how the variables option could be used.)
 