LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Old 07-29-2008, 09:28 AM   #1
ikinnu
Member
 
Registered: Jun 2007
Posts: 64

Rep: Reputation: 15
converting html to text


Hi,

I have written a Perl script that fetches a URL using LWP::Simple and stores the content in a scalar variable. I want to grep for particular lines and extract them into an array. I am unable to do that because the whole text content sits in a single scalar variable. Can someone please help?
 
Old 07-29-2008, 10:03 PM   #2
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
It depends on what the 'line' separators are, but e.g. if they are '<br>' you can use this:

Code:
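# get 'lines' containing 'tty', separated by <br>
use strict;
use warnings;

my $var1 = 'weertty<br> dshdhdfsh<p> kjfcdkl<br> ddtty<br>';
my @arr  = split(/<br>/, $var1);    # break the scalar into pieces at each <br>
print "@arr\n";
my @arr1 = grep(/tty/, @arr);       # keep only the pieces that contain 'tty'
print "@arr1\n";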
Adjust to taste
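
If the page mixes `<br>`, `<BR/>` and `<br />`, the split pattern can be loosened. This is only a minimal sketch, assuming those are the only separator variants in play; the sample string and the names `$html`, `@pieces` and `@hits` are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data: three spellings of the <br> separator.
my $html = 'weertty<br> dshdhdfsh<BR/> kjfcdkl<br /> ddtty<br>';

# Split on <br>, <br/> or <br />, ignoring case.
my @pieces = split /<br\s*\/?>/i, $html;

# Trim leading and trailing whitespace from each piece (modifies @pieces in place).
s/^\s+|\s+$//g for @pieces;

# Keep only the pieces that contain 'tty'.
my @hits = grep { /tty/ } @pieces;

print "$_\n" for @hits;
```

With the sample above this should print the two pieces containing 'tty', one per line.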
 
Old 07-30-2008, 05:35 AM   #3
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

Sometimes the page has lines that are short enough to read easily once they have been put into an appropriate structure. Here's an example that looks for the string "Models" at weather.gov:
Code:
#!/usr/bin/perl

# @(#) p2       Demonstrate string extraction bounded by newlines in scalar.

use warnings;
use strict;
use LWP::Simple;

my ($debug);
$debug = 0;
$debug = 1;    # last assignment wins; comment out this line to silence debug output

my ( $chars, $content, $t1, @a );
my ( @occurrences, $hits );
my ($string) = "Models";
my ($url)    = "http://www.weather.gov/";
my ($line)   = 0;

$content = get($url);
die "Couldn't get it!" unless defined $content;
$chars = length($content);
print " Got $chars characters from $url\n" if $debug;

@a = split /\n/, $content;
$t1 = scalar @a;
print " Split content into $t1 line array.\n" if $debug;

@occurrences = grep /\Q$string\E/, @a;    # \Q...\E treats $string as literal text
$hits = scalar @occurrences;
print " Got $hits for string $string\n" if $debug;
print " Extracted:\n";

foreach $t1 (@occurrences) {
  print "$t1\n";
}

exit(0);
Producing:
Code:
% ./p2
 Got 90367 characters from http://www.weather.gov/
 Split content into 683 line array.
 Got 6 for string Models
 Extracted:
    <td class="white" id="menuitem"><a href="/maps.php"><span class="yellow">Forecast Models</span></a><br />
       <a href="http://www.nco.ncep.noaa.gov/pmb/nwprod/analysis/">Numerical Models</a><br />
       Statistical Models...<br />
      <p class="bottomnav"><a href="/maps.php">Forecast Models</a></p>
      <span class="smalllink"><a href="http://www.nco.ncep.noaa.gov/pmb/nwprod/analysis/">Numerical Models</a></span><br />
      <span class="smalllink">Statistical Models</span><br />
Once the page is in the scalar, split breaks it into an array with one entry per line, where a line is the text up to a newline, "\n".

Then, as Chris did, the grep function is used to extract the lines containing the string of interest ... cheers, makyo
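
The same split-and-grep steps can be tried without a network fetch. This is a rough sketch, assuming a small made-up sample in place of the downloaded page. The tag-stripping regex is a naive addition of mine, not part of the script above, and it will not cope with real-world HTML (a proper parser module is the safer route):

```perl
use strict;
use warnings;

# Made-up stand-in for the scalar returned by get($url).
my $content = join "\r\n",
    '<p class="bottomnav"><a href="/maps.php">Forecast Models</a></p>',
    '<span class="smalllink">Radar</span><br />',
    '<span class="smalllink">Statistical Models</span><br />';

my $string = "Models";

# One array entry per line; \r?\n also copes with CRLF line endings.
my @lines = split /\r?\n/, $content;

# \Q...\E makes the search string match literally.
my @occurrences = grep { /\Q$string\E/ } @lines;

# Naive tag removal, for illustration only.
my @text = map { my $t = $_; $t =~ s/<[^>]*>//g; $t } @occurrences;

print "$_\n" for @text;
```

Working on a copy inside the map block leaves @occurrences untouched, so the raw HTML lines are still available if the plain text turns out to be too lossy.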
 
Old 07-31-2008, 01:50 AM   #4
ikinnu
Member
 
Registered: Jun 2007
Posts: 64

Original Poster
Rep: Reputation: 15
Both ways work for me. Thanks a lot for the prompt reply as always.
 
  

