LinuxQuestions.org
Old 06-08-2005, 09:56 PM   #1
meadensi
LQ Newbie
 
Registered: Feb 2005
Posts: 18

Rep: Reputation: 0
HTML scraping


Hi,

Ok, I want to systematically extract information from a series of web pages that are all formatted the same way. Usually, I would fire up Visual Basic, parse the HTML with Microsoft's XML parser, and pull out the necessary node using DOM or something like that. Some cleaning up of the HTML may be required, as it is often not XML compliant.

Given that I am trying to learn to use Linux tools, how would I achieve this?

I presume some sort of bash script would be in order, but is there a Perl solution? I'd need an XML parser as well, I suppose, because I think sed etc. might not suffice.

Any ideas?

Yours, trying to break the Microsoft habit,

Meadensi
 
Old 06-08-2005, 11:07 PM   #2
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
If you know any Perl or Python, both of those scripting languages have modules you can get that do web scraping. Check out Python's ClientCookie, for example.

If you are more comfortable using the techniques you described (get the raw page HTML, parse it, process it), both languages have modules to accomplish that. For example, Python's urllib2 can download the raw HTML, and you can parse it with the built-in expat parser using SAX or DOM (or minidom, a kind of DOM lite).
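
A minimal sketch of the fetch-then-parse approach described above, written for Python 3, where urllib2 became urllib.request. The `fetch` and `extract_by_class` helpers, the sample markup, and the `price` class name are illustrative assumptions, not anything from this thread. Note that minidom uses expat underneath and raises an error on markup that is not well-formed XML, so real pages may need cleaning first.

```python
from urllib.request import urlopen
from xml.dom import minidom


def fetch(url):
    """Download a page and return its body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_by_class(markup, tag, class_name):
    """Return the text of every <tag> element whose class attribute matches."""
    dom = minidom.parseString(markup)
    results = []
    for node in dom.getElementsByTagName(tag):
        if node.getAttribute("class") == class_name:
            # Join only the direct text children of the element.
            text = "".join(
                child.data for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            results.append(text.strip())
    return results


# Hypothetical, well-formed sample standing in for a downloaded page.
SAMPLE = """<html><body>
<p class="price">12.50</p>
<p class="note">ignored</p>
<p class="price">7.25</p>
</body></html>"""

if __name__ == "__main__":
    print(extract_by_class(SAMPLE, "p", "price"))
```

For a live page, something like `extract_by_class(fetch(url), "p", "price")` would be the equivalent call, provided the page parses as XML.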

Of course, the simple shell commands curl or wget will also retrieve web pages for you, though you will probably have to pipe their output to some other utility to process the markup.
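
One possible "other utility" for the pipe is a short Python filter. The sketch below uses the standard library's html.parser, which tolerates markup that is not well-formed XML. The script name `cells.py` and the choice of `<td>` elements are assumptions made for illustration only.

```python
import sys
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text found inside every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new cell starts even if the previous one was never closed.
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def table_cells(html):
    """Return the stripped text of each table cell in the given HTML."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


if __name__ == "__main__":
    # Only read standard input when something is piped in.
    if not sys.stdin.isatty():
        for cell in table_cells(sys.stdin.read()):
            print(cell)
```

Saved as `cells.py`, it could be used as `curl -s http://example.com/page.html | python3 cells.py`, where the URL is a placeholder.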
 
Old 06-09-2005, 01:17 AM   #3
lowpro2k3
Member
 
Registered: Oct 2003
Location: Canada
Distribution: Slackware
Posts: 340

Rep: Reputation: 30
Bla, you don't need no DOM to parse/strip HTML :) Obviously in Perl TMTOWTDI (there's more than one way to do it), but you might want to look at some modules on CPAN. More specifically, poke around this general section (bookmark it! :) ):

http://search.cpan.org/modlist/World_Wide_Web

The sections at the top you'd probably be interested in are:

CGI:: - maybe, maybe not. You can redirect to newfound links, but you can do that with LWP too.
HTML:: - for link processing capabilities (check out HTML::LinkExtractor)
HTTP:: - you usually use an HTTP::Request and an HTTP::Response object with LWP
LWP:: - there's always LWP::Simple, which works nicely. You might need more capabilities; read the main LWP CPAN page for more: http://search.cpan.org/~gaas/libwww-...803/lib/LWP.pm

And the oh-so-handy Data::Dumper is located here, learn it, use it, love it :D

http://search.cpan.org/~ilyam/Data-D....121/Dumper.pm



Of course you can do it with XML, especially if you know how. I don't, so I can't help you there. LWP is really popular in Perl though, at least it seems that way to me. You can write some pretty powerful web bots with it.
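
For comparison with the Perl modules above, here is a rough Python sketch of the link-extraction step of a simple web bot. It is not the API of HTML::LinkExtractor or LWP; it uses only Python's standard html.parser and urllib.parse, and the `LinkCollector` class, the base URL, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page fragment mixing relative and absolute links.
SAMPLE = (
    '<p><a href="/a.html">A</a> '
    '<a href="b.html">B</a> '
    '<a href="http://example.org/c">C</a></p>'
)

if __name__ == "__main__":
    for link in extract_links(SAMPLE, "http://example.com/dir/index.html"):
        print(link)
```

A bot would then fetch each returned URL in turn and repeat the extraction, ideally keeping a set of already visited links.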

Last edited by lowpro2k3; 06-09-2005 at 01:18 AM.
 
  

