HTML scraping

meadensi · 06-08-2005, 09:56 PM

Hi,

OK, I want to systematically extract information from a series of web pages that all share the same format. Usually, I would fire up Visual Basic, parse the HTML with Microsoft's XML parser, and pull out the necessary nodes using the DOM or something like that. Some cleanup of the HTML may be required first, as it is often not well-formed XML.

Given that I am trying to learn to use Linux tools, how would I achieve this?

I presume some sort of bash script would be in order, but is there a Perl solution? I suppose I'd need an XML parser as well, because I think sed etc. might not suffice.

Any ideas?

Yours, trying to break the Microsoft habit,

Meadensi

carl.waldbieser · 06-08-2005, 11:07 PM

If you know any Perl or Python, both of those scripting languages have modules that do web scraping. Check out Python's ClientCookie, for example.

If you are more comfortable using the techniques you described (get the raw page HTML, parse it, process it), both languages have modules to accomplish that. For example, Python's urllib2 can download the raw HTML, and you can parse it with the built-in expat parser using SAX or DOM (or minidom, a kind of DOM lite).
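To make the download-then-parse idea concrete, here is a minimal sketch using minidom. It assumes Python 3, where urllib2's functionality lives in urllib.request, and it assumes the page is already well-formed, since the expat parser behind minidom raises an error on sloppy HTML. The names fetch, cell_texts and SAMPLE are made up for illustration, and the sample markup is hypothetical.

```python
from urllib.request import urlopen  # Python 3 home of what urllib2 provided
from xml.dom import minidom


def fetch(url):
    """Download a page and return its raw bytes (not called in this demo)."""
    with urlopen(url) as response:
        return response.read()


def cell_texts(markup, tag="td"):
    """Return the text of every <tag> element in well-formed markup."""
    dom = minidom.parseString(markup)
    results = []
    for node in dom.getElementsByTagName(tag):
        # Join only the direct text children of each element
        text = "".join(
            child.data for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        results.append(text.strip())
    return results


# Hypothetical sample page; a real one would come from fetch(some_url)
SAMPLE = """<html><body><table>
<tr><td>Widget</td><td>4.50</td></tr>
<tr><td>Gadget</td><td>12.00</td></tr>
</table></body></html>"""

print(cell_texts(SAMPLE))  # ['Widget', '4.50', 'Gadget', '12.00']
```

If the real pages are not well-formed, parseString will fail with a parse error, so some cleanup step or a more forgiving parser would be needed first.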

Of course, the simple shell commands curl or wget will also retrieve web pages for you, though you will probably have to pipe their output to some other utility to process the markup.
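As a rough idea of what that "other utility" could look like, here is a small Python sketch that reads markup from a stream and pulls out the page title. It uses the standard html.parser module, which tolerates unclosed tags, rather than an XML parser. The class and function names are invented for this example, and the io.StringIO object is only a stand-in for piped input.

```python
import io
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collect the text inside <title>, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_from_stream(stream):
    """Feed a text stream to the parser line by line and return the title."""
    parser = TitleGrabber()
    for line in stream:
        parser.feed(line)
    parser.close()
    return "".join(parser.parts).strip()


# Stand-in for piped input; note the unclosed <p> and <br> tags
fake_stdin = io.StringIO(
    "<html><head><title>Price List</title></head>\n"
    "<body><p>Hello<br>world\n</body></html>\n"
)
print(title_from_stream(fake_stdin))  # Price List
```

Passing sys.stdin instead of the StringIO object should let a script like this sit at the end of a curl or wget pipeline, though that wiring is left out here to keep the sketch self-contained.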

lowpro2k3 · 06-09-2005, 01:17 AM

Bla, you don't need no DOM to parse/strip HTML :) Obviously in Perl TMTOWTDI, but you might want to look at some modules on CPAN. More specifically, poke around this general section (bookmark it! :) ):

http://search.cpan.org/modlist/World_Wide_Web

The sections at the top you'd probably be interested in are:

CGI:: - maybe, maybe not. You can redirect to newfound links, but you can do that with LWP too.
HTML:: - for link processing capabilities (check out HTML::LinkExtractor)
HTTP:: - you usually use an HTTP::Request and an HTTP::Response object with LWP
LWP:: - there's always LWP::Simple, which works nicely. You might need more capabilities; read the main LWP CPAN page for more: http://search.cpan.org/~gaas/libwww-...803/lib/LWP.pm

And the oh-so-handy Data::Dumper is located here. Learn it, use it, love it :D

http://search.cpan.org/~ilyam/Data-D....121/Dumper.pm

Of course you can do it with XML, especially if you know how. I don't, so I can't help you there. LWP is really popular in Perl though, or at least it seems that way to me. You can write some pretty powerful web-bots with it.
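For anyone following along in Python rather than Perl, the link-processing step a web-bot needs can be sketched with the standard library alone. This is not a port of any CPAN module, just an illustration of the idea. LinkCollector, extract_links, PAGE and the example.com URL are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors that have no href or an empty one
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for every link found in the given HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page fragment with one absolute path, one relative path,
# and one anchor without an href
PAGE = ('<p><a href="/items/1">One</a> <a href="page2.html">Two</a> '
        '<a name="top">no href</a></p>')
print(extract_links(PAGE, "http://example.com/catalog/index.html"))
# ['http://example.com/items/1', 'http://example.com/catalog/page2.html']
```

A bot would typically feed each collected URL back into its download step, keeping a set of already-visited pages so it does not loop.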