Converting HTML to text

ikinnu · 07-29-2008, 09:28 AM

Hi,

I have written a Perl script that fetches a URL using LWP::Simple and stores the content in a scalar variable. I want to grep for particular lines and extract them into an array. I am unable to do that because the whole text content resides in a single scalar variable. Can someone please help?

chrism01 · 07-29-2008, 10:03 PM

It depends on what the 'line' separators are, but if, e.g., they are '<br>', you can use this:

Code:

# get 'lines' containing 'tty', separated by <br>
my $var1 = 'weertty<br> dshdhdfsh<p> kjfcdkl<br> ddtty<br>';
my @arr = split(/<br>/, $var1);    # one element per <br>-terminated 'line'
print "@arr\n";
my @arr1 = grep(/tty/, @arr);      # keep only the elements matching /tty/
print "@arr1\n";

Adjust to taste
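
As one example of such an adjustment, here is a minimal sketch that assumes the page mixes '<br>', '<br/>' and '<br />' in either letter case. The sample string and the variable names are made up for illustration; real content would come from the get() call.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the fetched page.
my $html = 'weertty<BR> dshdhdfsh<p> kjfcdkl<br/> ddtty<br />';

# Split on any <br> variant: case-insensitive, optional whitespace and slash.
my @lines = split /<br\s*\/?>/i, $html;

# Keep only the pieces that contain 'tty'.
my @matches = grep { /tty/ } @lines;

print "$_\n" for @matches;
```

Leading whitespace is left on each piece; strip it afterwards if it matters for your output.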

makyo · 07-30-2008, 05:35 AM

Hi.

Sometimes the page has lines that are short enough to read easily once they have been put into an appropriate structure. Here's an example that looks for the string "Models" at weather.gov:

Code:

#!/usr/bin/perl

# @(#) p2       Demonstrate string extraction bounded by newlines in scalar.

use warnings;
use strict;
use LWP::Simple;

my ($debug);
$debug = 1;    # set to 0 to silence the progress messages

my ( $chars, $content, $t1, @a );
my ( @occurrences, $hits );
my ($string) = "Models";
my ($url)    = "http://www.weather.gov/";
my ($line)   = 0;

$content = get($url);
die "Couldn't get it!" unless defined $content;
$chars = length($content);
print " Got $chars characters from $url\n" if $debug;

@a = split /\n/, $content;
$t1 = scalar @a;
print " Split content into $t1 line array.\n" if $debug;

@occurrences = grep /$string/, @a;
$hits = scalar @occurrences;
print " Got $hits for string $string\n" if $debug;
print " Extracted:\n";

foreach $t1 (@occurrences) {
  print "$t1\n";
}

exit(0);

Producing:

Code:

% ./p2
 Got 90367 characters from http://www.weather.gov/
 Split content into 683 line array.
 Got 6 for string Models
 Extracted:
    <td class="white" id="menuitem"><a href="/maps.php"><span class="yellow">Forecast Models</span></a><br />
       <a href="http://www.nco.ncep.noaa.gov/pmb/nwprod/analysis/">Numerical Models</a><br />
       Statistical Models...<br />
      <p class="bottomnav"><a href="/maps.php">Forecast Models</a></p>
      <span class="smalllink"><a href="http://www.nco.ncep.noaa.gov/pmb/nwprod/analysis/">Numerical Models</a></span><br />
      <span class="smalllink">Statistical Models</span><br />

Once the page is in the scalar, split is used to make one array entry for each line -- the text up to each newline, "\n", which split itself discards.

Then, as Chris did, the grep function is used to extract the lines containing the string of interest ... cheers, makyo
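
For comparison, here is a minimal sketch of a variation that skips the intermediate array and pulls the matching lines straight out of the scalar with a single regex. The sample content is a made-up stand-in for what get($url) returns, and \Q...\E is added on the assumption that the search string should be treated literally rather than as a pattern.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the page content returned by get($url).
my $content = "<p>Forecast Models</p>\n<p>Radar</p>\n<span>Numerical Models</span>\n";
my $string  = "Models";

# /m makes ^ and $ match at every line boundary inside the scalar;
# /g in list context returns the captured text of every match.
# \Q...\E quotes any regex metacharacters in $string.
my @occurrences = $content =~ /^(.*\Q$string\E.*)$/mg;

print scalar(@occurrences), " matching lines\n";
print "$_\n" for @occurrences;
```

Because "." does not match a newline by default, each capture stays within a single line, so the result is the same list that split followed by grep would give.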

ikinnu · 07-31-2008, 01:50 AM

Both ways work for me. Thanks a lot for the prompt reply as always.