LinuxQuestions.org
05-29-2004, 09:21 AM   #1
smaida
Member
 
Registered: Apr 2004
Location: Richmond, VA - USA
Distribution: Debian
Posts: 62

Rep: Reputation: 15
Parsing HTML using Perl


Hi,

I have a Perl script that uses WWW::Mechanize to connect to various web pages, search for specific links, and then return the content of the linked pages. I need to parse the content of the returned pages for certain lines of text and need a little help taking the next step.

Here is the code that loops through the links and gets the content of the resulting page.
Code:
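       foreach my $link (@links) {

           # URI() returns the link target as a URI object
           my $url = $link->URI();
           print $url;
           #print "\n";

           # Fetch the linked page with a fresh Mechanize object
           my $b = WWW::Mechanize->new;
           $b->get($site_name . $url);

           print $b->title();
           #print "\n\n";

           # Raw HTML of the fetched page
           my $html = $b->content;
           # ... parse $html here ...
       }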
An example of the returned HTML would be:

Code:
<hr>
The statistics were last updated <b>Saturday, 29 May 2004 at 10:03</b>
<!-- Begin `Daily' Graph (5 Minute -->
<hr>
<b>`Daily' Graph (5 Minute Average)</b><br>
<img VSPACE=10 WIDTH=500 HEIGHT=135 ALIGN=TOP 
     SRC="devsadsta1_cpu_util-day.png" ALT="day">
 <table CELLPADDING=0 CELLSPACING=0>
<tr>
  <td ALIGN=right><small>Max <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>53.0 %
   </small></td>
  <td WIDTH=5></td>

  <td ALIGN=right><small>Average <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>9.0 %
  </small></td>
  <td WIDTH=5></td>
  <td ALIGN=right><small>Current <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>0.0 %
  </small></td>
 </tr>
From the above code I need to retrieve the line containing the date and the line containing the Average % (in this case 9.0 %).

I am gathering information from many different sites, each with different information, so if someone could just get me started, I am sure I could manage the rest. Thank you for the help.

Shawn
 
05-29-2004, 10:58 AM   #2
rkef
Member
 
Registered: Mar 2004
Location: bursa
Posts: 110

Rep: Reputation: 15
HTML::Parser is good at this sort of thing; check the docs.
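
A minimal sketch of the HTML::Parser route, not tested against the real site: it collects the decoded text of a trimmed copy of the sample page and then matches on the plain text instead of the markup. The variable names ($html, @chunks, $text, $date, $avg) and the two patterns are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::Parser;

# Trimmed copy of the sample page from the first post.
my $html = <<'HTML';
<hr>
The statistics were last updated <b>Saturday, 29 May 2004 at 10:03</b>
<table CELLPADDING=0 CELLSPACING=0>
<tr>
  <td ALIGN=right><small>Max <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>53.0 %
   </small></td>
  <td ALIGN=right><small>Average <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>9.0 %
  </small></td>
</tr>
</table>
HTML

# Collect every text chunk; 'dtext' hands the handler entity-decoded text.
my @chunks;
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, shift }, 'dtext' ],
);
$parser->unbroken_text(1);   # do not split a text run across events
$parser->parse($html);
$parser->eof;

# Join the chunks and collapse whitespace, including the decoded &nbsp; (\xA0).
my $text = join ' ', @chunks;
$text =~ s/[\s\xA0]+/ /g;

# Pull the two values out of the flattened text.
my ($date) = $text =~ /last updated (.+?\d{1,2}:\d{2})/;
my ($avg)  = $text =~ /Average Load: ([\d.]+) %/;

print "date: $date\n";
print "average: $avg\n";
```

Matching on the label text ("Average Load:") rather than on cell position means the Max value does not have to be skipped by hand, though the patterns are still specific to this page layout.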

That or you can use an ugly regex hack to snag the date:
Code:
$ cat content |perl -e 'foreach (<>) { $date = $1 if /updated <b>([^<]*)<\/b>/; } print "$date\n";'
Saturday, 29 May 2004 at 10:03
$
Yanking the 9% would be similar, but you'd have to ignore the first result ("53.0%"). You would have to ignore it with HTML::Parser anyway, I guess.
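
One way to skip the Max value is to anchor the pattern on the word "Average" and take the first percentage that follows it. This is only a sketch against the sample rows quoted in the first post; $html and $average are illustrative names, and the /s modifier is there so that "." can cross the line break between the label cell and the value cell.

```perl
use strict;
use warnings;

# Sample rows from the first post.
my $html = <<'HTML';
  <td ALIGN=right><small>Max <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>53.0 %
   </small></td>
  <td WIDTH=5></td>

  <td ALIGN=right><small>Average <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>9.0 %
  </small></td>
HTML

# Start at "Average", lazily skip the markup, capture the first number
# that is followed by a percent sign.
my ($average) = $html =~ /Average\b.*?([\d.]+)\s*%/s;

print "$average\n";
```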

HTML parsing is pretty specialized; whatever you come up with will likely only work for the site/page you're targeting, I believe.

I hope that was helpful.

p.s. I haven't tried it, but I assume in your script you'd want to do something like "$b->content =~ /junk here/s" (it is /s that lets "." match across newlines; /m only changes where ^ and $ match. Maybe just use the parser).
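
For reference, a small sketch of how the modifiers behave when the text being captured spans a line break. In Perl, /s lets "." match a newline, while /m only affects ^ and $. The input string here is hypothetical (the date is split over two lines on purpose), and the variable names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical input: the date wraps onto a second line inside the <b> tag.
my $content = "last updated <b>Saturday, 29 May 2004\n at 10:03</b>\n";

my $pattern = 'updated <b>(.*?)</b>';

# "." does not match a newline unless /s is given; /m does not change that.
my $plain  = $content =~ /$pattern/  ? 'match' : 'no match';
my $with_m = $content =~ /$pattern/m ? 'match' : 'no match';
my $with_s = $content =~ /$pattern/s ? 'match' : 'no match';

print "no modifier: $plain\n";
print "/m: $with_m\n";
print "/s: $with_s\n";

# A negated character class such as [^<]* crosses newlines with no modifier.
my ($date) = $content =~ /updated <b>([^<]*)<\/b>/;
print "captured: $date\n";
```

In the script itself the same match can be applied directly to the page string, since content returns the HTML as a plain scalar.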

Last edited by rkef; 05-29-2004 at 11:02 AM.
 
05-29-2004, 01:20 PM   #3
smaida
Member
 
Registered: Apr 2004
Location: Richmond, VA - USA
Distribution: Debian
Posts: 62

Original Poster
Rep: Reputation: 15
I will take a look at HTML::Parser.

Thanks for the help.
Shawn
 
  


All times are GMT -5.
