LinuxQuestions.org
06-13-2006, 12:46 PM   #1
rjcrews
Member
 
Registered: Apr 2004
Distribution: Debian
Posts: 193

Rep: Reputation: 30
Perl - Parsing HTML fails


Hi all -

I am trying to parse some HTML I grabbed with the WWW::Mechanize module.

Here is the code:
Code:
#!/usr/bin/perl

use strict;
use WWW::Mechanize;
use HTTP::Cookies;

my $outfile = "out.htm";
my $url = "http://mytestpage/";
my $mech = WWW::Mechanize->new();

$mech->cookie_jar(HTTP::Cookies->new());
$mech->get($url);
my $output_page = $mech->content();

ParseLines(@$output_page);
print "finsihed\n";

sub ParseLines
{

  my (@lines) = @_;
  #my ($rx);
  my $rx;
  my $line;
  foreach $line (@lines)
  {
        if($line =~ m/ReportTitle/){

          $rx="found it";

        }
  }

print $rx;
}
The Mechanize part works fine, but I get this error:
Code:
mybox:/usr/lib/cgi-bin# perl test5.pl
Can't use string ("<html>
        <head>
                <!--<meta htt") as an ARRAY ref while "strict refs" in use at test5.pl line 20.
The top of the HTML is:

Code:
<html>
        <head>
                <!--<meta http-equiv="REFRESH" content="; URL=">-->
Any ideas how to get the script to parse it?

Thanks
 
06-13-2006, 03:50 PM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Without having tried to actually run your code, I'll hazard a guess that the line

Code:
 
my (@lines) = @_;
should really be

Code:
  
my @lines = @_;
--- rod.
 
06-14-2006, 08:44 AM   #3
rjcrews
Member
 
Registered: Apr 2004
Distribution: Debian
Posts: 193

Original Poster
Rep: Reputation: 30
That change did not correct the problem.

The code works fine for other "files". I am trying to determine why the HTML chunk I grab cannot be parsed the same way. Is there something in there that Perl does not like?

Code:
<html>
        <head>
                <!--<meta http-equiv="REFRESH" content="; URL=">-->
                <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
                <title></title>
 
06-15-2006, 02:56 AM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,360

Rep: Reputation: 2751
Looks like it's not recognising the end-of-line char(s)? Maybe it's the difference between a page created/stored on MSWin/IIS vs. Linux?
 
06-15-2006, 08:47 PM   #5
rjcrews
Member
 
Registered: Apr 2004
Distribution: Debian
Posts: 193

Original Poster
Rep: Reputation: 30
I changed the argument being passed from an array dereference to a plain scalar, from

Code:
ParseLines(@$output_page);
to

Code:
ParseLines($output_page);
And the error goes away, but I am still having problems parsing it. The page is generated by an ASP script, so you may be correct. I am looking at removing all HTML tags from it and then trying again.
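
For context, a minimal sketch of why the first call failed and one possible way around it: `$mech->content()` returns a single string, so `@$output_page` asks Perl to treat that string as an array reference, which `use strict` rejects. Assuming the goal is a line-by-line scan, the string can be split on line endings before it is passed in. The sample page text and the `parse_lines` name below are made up for illustration, and `\r?\n` is used on the assumption that the server may send CRLF line endings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the string returned by $mech->content();
# the text is invented and uses Windows-style CRLF line endings.
my $output_page = "<html>\r\n<head>\r\n<title>ReportTitle</title>\r\n";

# Split the single string into a list of lines, accepting LF or CRLF.
my @lines = split /\r?\n/, $output_page;

# Return 1 if any line contains the marker text, 0 otherwise.
sub parse_lines {
    my @candidates = @_;
    foreach my $line (@candidates) {
        return 1 if $line =~ m/ReportTitle/;
    }
    return 0;
}

print parse_lines(@lines) ? "found it\n" : "not found\n";
```

If the page arrives as one long line with no line endings at all, the split yields a single element and the match still runs against the whole string.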

Thanks!
 
06-15-2006, 09:19 PM   #6
rjcrews
Member
 
Registered: Apr 2004
Distribution: Debian
Posts: 193

Original Poster
Rep: Reputation: 30
Using HTML::TokeParser::Simple I was able to remove basically everything, and parse it that way.

Thanks for the help. There was a problem with the end of line, or lack thereof. (When I used HTML::TokeParser::Simple, I received one long line.) I added line breaks in place of the tags, and we are good to go!
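
The idea of putting line breaks where the tags were can be sketched roughly as follows. This is not the HTML::TokeParser::Simple code used above: it is a crude regex stand-in that needs only core Perl, and it will mishandle edge cases such as a `>` inside an attribute value. The `tags_to_lines` name and the sample HTML are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Crude regex stand-in (an assumption, not the module-based approach):
# replace comments and tags with newlines, then keep the non-blank lines.
sub tags_to_lines {
    my ($html) = @_;
    $html =~ s/<!--.*?-->/\n/gs;   # comments first, since they may contain '>'
    $html =~ s/<[^>]*>/\n/g;       # then each remaining tag becomes a line break
    # Split on LF or CRLF and drop lines holding only whitespace.
    return grep { /\S/ } split /\r?\n/, $html;
}

# Invented sample: a page delivered as one long line.
my $html = '<html><head><!--<meta http-equiv="REFRESH">-->'
         . '<title>ReportTitle</title></head><body><p>Hello</p></body></html>';

my @text = tags_to_lines($html);
print "$_\n" for @text;
```

For real pages, a proper parser such as the module mentioned above is the more robust choice; the regex version is only meant to show the shape of the transformation.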
 
  

