Parsing HTML using Perl

smaida · 05-29-2004, 09:21 AM

Hi,

I have a Perl script that uses WWW::Mechanize to connect to various web pages, search for specific links, and then return the content of the linked pages. I need to parse the content of the returned pages for certain lines of text and need a little help taking the next step.

Here is the code that loops through the links and gets the content of the resulting page.

Code:

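       foreach my $link (@links) {

           # URI() returns the link target as a URI object
           my $url = $link->URI();
           print $url;
           #print "\n";

           # Fetch the linked page; $site_name holds the base of the site URL
           my $b = WWW::Mechanize->new;
           $b->get($site_name . $url);

           print $b->title();
           #print "\n\n";

           # Raw HTML of the fetched page, ready to be parsed
           my $html = $b->content;
       }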

An example of the returned HTML would be:

Code:

<hr>
The statistics were last updated <b>Saturday, 29 May 2004 at 10:03</b>
<!-- Begin `Daily' Graph (5 Minute -->
<hr>
<b>`Daily' Graph (5 Minute Average)</b><br>
<img VSPACE=10 WIDTH=500 HEIGHT=135 ALIGN=TOP 
     SRC="devsadsta1_cpu_util-day.png" ALT="day">
 <table CELLPADDING=0 CELLSPACING=0>
<tr>
  <td ALIGN=right><small>Max <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>53.0 %
   </small></td>
  <td WIDTH=5></td>

  <td ALIGN=right><small>Average <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>9.0 %
  </small></td>
  <td WIDTH=5></td>
  <td ALIGN=right><small>Current <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>0.0 %
  </small></td>
 </tr>

From the above code I need to retrieve the line containing the date and the line containing the Average % (in this case 9%).

I am gathering information from many different sites with different information, so if someone could just get me started, I am sure I could manage the rest. Thank you for the help.

Shawn

rkef · 05-29-2004, 10:58 AM

HTML::Parser is good at this sort of thing; check the docs.

That or you can use an ugly regex hack to snag the date:

Code:

$ cat content |perl -e 'foreach (<>) { $date = $1 if /updated <b>([^<]*)<\/b>/; } print "$date\n";'
Saturday, 29 May 2004 at 10:03
$

Yanking the 9% would be similar, but you'd have to ignore the first result ("53.0%"). You would have to ignore it with HTML::Parser anyway, I guess.
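One way to skip the "Max" value without counting matches is to anchor the regex on the "Average" label itself. The following is only a rough sketch run against a trimmed copy of the HTML posted above; the helper name extract_stats is made up for illustration, and the pattern assumes the number sits in the first <small> cell after the label, which may not hold for other pages.

```perl
use strict;
use warnings;

# Sample trimmed from the HTML posted earlier in the thread.
my $html = <<'HTML';
The statistics were last updated <b>Saturday, 29 May 2004 at 10:03</b>
<tr>
  <td ALIGN=right><small>Max <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>53.0 %
   </small></td>
  <td WIDTH=5></td>

  <td ALIGN=right><small>Average <font COLOR="#00cc00">&nbsp;Load:&nbsp;</font></small></td>
  <td ALIGN=left><small>9.0 %
  </small></td>
 </tr>
HTML

# extract_stats is a hypothetical helper, not part of any module.
sub extract_stats {
    my ($page) = @_;

    # Date: the text inside the <b>...</b> that follows "updated".
    my ($date) = $page =~ /updated\s*<b>([^<]*)<\/b>/;

    # Average: start at the "Average" label, skip ahead non-greedily to the
    # next <small> (the /s modifier lets "." cross newlines), then capture
    # the number in front of the "%".
    my ($average) = $page =~ /Average\b.*?<small>\s*([\d.]+)\s*%/s;

    return ($date, $average);
}

my ($date, $average) = extract_stats($html);
print "Date: $date\n";
print "Average load: $average %\n";
```

In the original script the same call could be made on $html inside the loop. Both values come back undefined if the page layout differs, so it is worth checking them before printing.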

HTML parsing is pretty specialized; whatever you come up with, it'll likely only be good for the site/page you're targeting anyway I believe.

I hope that was helpful.

p.s. I haven't tried it, but I assume in your script you'd want to do something like "$b->content =~ /junk here/m" (I believe the /m will allow you to search across multiple newlines? I forget! Maybe just use the parser.)
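On the modifier question: in Perl, /m only changes ^ and $ so they match at line boundaries, while /s is the one that lets "." match a newline. Character classes such as \s or [^<] match newlines with or without either modifier. A small sketch to show the difference, using a two-line string modelled on one table cell from the posted HTML (the variable names are just for illustration):

```perl
use strict;
use warnings;

# A value split across two lines, as in the posted HTML.
my $html = "<td ALIGN=left><small>9.0 %\n  </small></td>";

# No modifier: "." stops at the newline, so the match fails.
my $plain = $html =~ /<small>(.*)<\/small>/ ? 1 : 0;

# /m only affects ^ and $; "." still stops at the newline.
my $with_m = $html =~ /<small>(.*)<\/small>/m ? 1 : 0;

# /s lets "." match the newline, so the match succeeds.
my $with_s = $html =~ /<small>(.*)<\/small>/s ? 1 : 0;

# A negated character class crosses newlines without any modifier.
my $with_class = $html =~ /<small>([^<]*)<\/small>/ ? 1 : 0;

print "plain=$plain m=$with_m s=$with_s class=$with_class\n";
# prints: plain=0 m=0 s=1 class=1
```

So matching against the whole content string works across lines as long as the pattern uses /s or classes that include the newline.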

smaida · 05-29-2004, 01:20 PM

I will take a look at HTML::Parser.

Thanks for the help.
Shawn