LinuxQuestions.org
07-29-2008, 09:28 AM   #1
ikinnu
Member
 
Registered: Jun 2007
Posts: 64

Rep: Reputation: 15
converting html to text


Hi,

I have written a Perl script that fetches a URL using LWP::Simple and stores the content in a scalar variable. I want to grep for particular lines and extract them into an array, but I am unable to do that because the whole text content sits in a single scalar variable. Can someone please help?
 
07-29-2008, 10:03 PM   #2
chrism01
Guru
 
Registered: Aug 2004
Location: Brisbane
Distribution: Centos 6.2, Centos 5.8
Posts: 11,740

Rep: Reputation: 905
It depends on what the 'line' separators are, but if, for example, they are '<br>', you can use this:

Code:
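use strict;
use warnings;

# get 'lines' containing 'tty', separated by <br>
my $var1 = 'weertty<br> dshdhdfsh<p> kjfcdkl<br> ddtty<br>';
my @arr = split(/<br>/, $var1);    # one element per <br>-separated 'line'
print "@arr\n";
my @arr1 = grep(/tty/, @arr);      # keep only the elements matching 'tty'
print "@arr1\n";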
Adjust to taste
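
As a minimal sketch of one such adjustment: if the page mixes several kinds of break, a single split pattern can cover them all. The sample string and the choice of `<br>`, `<br />`, `<p>` and newline as separators are assumptions for illustration, not taken from the original poster's page.

```perl
use strict;
use warnings;

# Hypothetical sample: 'lines' separated by <br>, <br />, <p> or a newline.
my $html = "weertty<br> dshdhdfsh<p> kjfcdkl<br />\nddtty<br>";

# Split on any of the assumed separators (case-insensitive), trim
# surrounding whitespace, and drop pieces that end up empty.
my @lines = grep { length }
            map  { my $s = $_; $s =~ s/^\s+|\s+$//g; $s }
            split /<br\s*\/?>|<p>|\n/i, $html;

# Keep only the pieces containing 'tty'.
my @hits = grep { /tty/ } @lines;

print "$_\n" for @hits;
```

The trim step is optional; it only keeps the leading spaces after each separator out of the results.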
 
07-30-2008, 05:35 AM   #3
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 706

Rep: Reputation: 68
Hi.

Sometimes the page may have lines that are short enough to read easily once they have been put into an appropriate structure. Here's an example that looks for the string Models at weather.gov:
Code:
#!/usr/bin/perl

# @(#) p2       Demonstrate string extraction bounded by newlines in scalar.

use warnings;
use strict;
use LWP::Simple;

my ($debug);
$debug = 0;
$debug = 1;    # last assignment wins; comment this line out to silence progress messages

my ( $chars, $content, $t1, @a );
my ( @occurrences, $hits );
my ($string) = "Models";
my ($url)    = "http://www.weather.gov/";
my ($line)   = 0;

$content = get($url);
die "Couldn't get it!" unless defined $content;
$chars = length($content);
print " Got $chars characters from $url\n" if $debug;

@a = split /\n/, $content;
$t1 = scalar @a;
print " Split content into $t1 line array.\n" if $debug;

@occurrences = grep /$string/, @a;
$hits = scalar @occurrences;
print " Got $hits for string $string\n" if $debug;
print " Extracted:\n";

foreach $t1 (@occurrences) {
  print "$t1\n";
}

exit(0);
Producing:
Code:
% ./p2
 Got 90367 characters from http://www.weather.gov/
 Split content into 683 line array.
 Got 6 for string Models
 Extracted:
    <td class="white" id="menuitem"><a href="/maps.php"><span class="yellow">Forecast Models</span></a><br />
       <a href="http://www.nco.ncep.noaa.gov/pmb/nwprod/analysis/">Numerical Models</a><br />
       Statistical Models...<br />
      <p class="bottomnav"><a href="/maps.php">Forecast Models</a></p>
      <span class="smalllink"><a href="http://www.nco.ncep.noaa.gov/pmb/nwprod/analysis/">Numerical Models</a></span><br />
      <span class="smalllink">Statistical Models</span><br />
Once the page is in the scalar, split is used to make one array entry for each line -- that is, text ending in a newline, "\n".

Then, as Chris did, the grep function is used to extract the lines containing the string of interest ... cheers, makyo
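
The extracted lines still carry their markup. A minimal sketch of reducing them to plain text, assuming every tag opens and closes on the same line and no attribute value contains a `>`; the helper name `strip_tags` is hypothetical and not part of LWP::Simple. For real pages, a proper HTML parser module from CPAN is likely to be more robust than this regex.

```perl
use strict;
use warnings;

# Two sample lines copied from the output shown earlier in this post.
my @occurrences = (
    '      <p class="bottomnav"><a href="/maps.php">Forecast Models</a></p>',
    '       Statistical Models...<br />',
);

# strip_tags: hypothetical helper, a naive regex-based tag remover.
sub strip_tags {
    my ($line) = @_;
    $line =~ s/<[^>]*>//g;      # remove anything that looks like a tag
    $line =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $line;
}

my @text = map { strip_tags($_) } @occurrences;
print "$_\n" for @text;
```

This does not decode entities such as `&amp;`; it only removes the tags.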
 
07-31-2008, 01:50 AM   #4
ikinnu
Member
 
Registered: Jun 2007
Posts: 64

Original Poster
Rep: Reputation: 15
Both ways work for me. Thanks a lot for the prompt reply as always.
 
  

