Linux - Software
This forum is for Software issues. Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
I'm trying to build a tool that will analyze sports box-score stats. What I have right now is a script that does the analysis when given the exact address of the game. No good!
Let's say I have a list like this:
Jones, Colorado
Brown, Dallas
Smith, Detroit
The script looks in the player list, and for every player it checks whether his team is on today's scoreboard. If it is, it goes inside, dumps the whole page, and goes back. As I said, I already have this analysis part (besides the "go back" thing, since I'm using the exact game address).
As a lynx newbie, I would really appreciate your help. (No need to explain the script-side actions.)
Thanks, Ben.
Is your question about how to follow the links to the box scores? If so, I'd rephrase the problem as: how to find an anchor tag whose content is "Box Score", and then download the page referenced by the link.
This involves parsing the HTML to locate <A> tags, saving both the HREF attribute value and the content of the tag. Then loop through each saved tag and test for the proper content. When found, download the referenced page. To do this, I would not use Lynx, but rather any of these tools:
1. wget - my first choice only because I'm familiar with it and I've used it for this kind of task.
2. curl
3. In Perl (you said 'script' but didn't specify a language), LWP::UserAgent
Any of these can be used to capture the content of a given URL, which you evidently already know how to handle (but in Perl, I would use HTML::Parser).
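To make the "qualify each anchor by its visible text" idea concrete, here is a minimal sketch using Python's built-in html.parser (Python rather than Perl only to keep the example short and self-contained; the HTML fragment is made up, not ESPN's real markup):

```python
# Sketch: collect the href of every <a> tag whose visible text
# contains "Box Score", ignoring all other links on the page.
from html.parser import HTMLParser

class BoxScoreLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag we are currently inside
        self.links = []     # qualified links found so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only record the link if its visible text qualifies it
        if self._href and "box score" in data.lower():
            self.links.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

# Made-up stand-in for the scoreboard page:
html = '<a href="/game1/boxscore">Box Score</a> <a href="/img1">photo</a>'
finder = BoxScoreLinkFinder()
finder.feed(html)
print(finder.links)   # ['/game1/boxscore']
```

The same state-machine shape (remember the href at the start tag, test the text, forget it at the end tag) carries over directly to HTML::Parser callbacks in Perl.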
--- rod.
Interesting. I first visited the page referenced in your post when I made my first reply. Later, I re-visited, and the format of the page had completely changed. Is this behavior known to you? Does it happen daily or on some other schedule? This will make it difficult to create a scraper. Perhaps a different source for your information would work better. Like maybe http://nhl.com?
--- rod.
The only changes are that yesterday's games are removed and today's games appear, which works in favor of my purpose... I've never seen any format change, today or lately.
I do use a Perl script and want the whole environment to be Windows-friendly. I don't know if wget has a Windows version (same for curl). I will check the module you're recommending, and if you've got some shortcuts for a lazy guy...
OK, it works!
Not a perfect way, and for some reason some box scores are searched twice, but that's no problem to overcome. Here's the code if you want it (without my secret analysis recipe):
Code:
#!/usr/bin/perl -w
chdir("c:/lynx");
open SCORES, "lynx.exe -dump http://scores.espn.go.com/nhl/scoreboard |" or die "$!\n";
while (<SCORES>) {
    if (/boxscore/i) {
        s/\s+\d+\.\s+//;    # strip the leading " NN. " link index from the lynx dump
        chomp;              # drop the trailing newline before using $_ as a URL
        open GAME, "lynx.exe -dump $_ |" or die "$!\n";
        while (<GAME>) {
            if (/\d\s\d\s\d/) {
                print;
            }
        }
        close GAME;
    }
}
close SCORES;
Nice. Concise. But...
It will break if the formatting of the HTML text changes, especially with regard to insertion or removal of newlines. Your script relies on there being one and only one complete URL of interest per line of text. That's why I recommend HTML::Parser, which is much more robust against this kind of trap and a whole host of others. There is a useful discussion of this subject in the LQ thread "Awk scripting and usage of regex to locate a hyperlink".
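To see the trap concretely, the sketch below (Python, with a made-up fragment) splits one anchor across lines, which defeats a line-at-a-time regex but not a real parser:

```python
# One anchor, deliberately split across lines (made-up markup):
import re
from html.parser import HTMLParser

html = '<a href="/boxscore?id=1">\nBox Score\n</a>'

# Line-at-a-time scan, as a dump-and-grep style script effectively does:
line_hits = []
for line in html.splitlines():
    if "Box Score" in line:
        m = re.search(r'href="([^"]+)"', line)
        if m:
            line_hits.append(m.group(1))
# line_hits stays empty: the href and the link text never share a line.

# A parser keeps state across the whole document, so the split is harmless:
class Finder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
    def handle_data(self, data):
        if self._href and "Box Score" in data:
            self.links.append(self._href)
    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

finder = Finder()
finder.feed(html)
print(line_hits, finder.links)   # [] ['/boxscore?id=1']
```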
Lynx seems to have a feature that I didn't know of: '-dump', which makes life easier. wget does the same (and a lot more, including recursive gets). My grab of the HTML (using wget) doesn't look like your 'boxscore' regex would work; perhaps Lynx reformats its output differently.
Okay, so I can now support my original proposal with a defensible argument. The 'extra' links are in the little sidebars next to the player face images on the ESPN page. My original proposal was to qualify each link with the text visible to the user, in this case 'Box Score'. The text adjacent to the images does not contain the phrase 'Box Score', so would not be found by my method.
Sorry to make this sound like a pi55ing contest. Must be the hockey thing. Go Canucks. I gather you're an Avalanche fan.
FWIW, here's my offering in all of its glory.
Code:
#!/usr/bin/perl -w
#
# LQbengoavs.pl
#
#========================================
#
use HTML::Parser;

my $currentUrl;
my @urlsToVisit;

# ============== Callback handler for tag starts ================
sub start_handler {
    my $tag  = shift;
    my $attr = shift;
    my $self = shift;
    my %attrList = %{$attr};
    # We only want to look for '<a>' tags with an HREF attribute
    if ( $tag eq "a" && exists $attrList{"href"} ) {
        $currentUrl = $attrList{"href"};
    }
    else {
        $currentUrl = undef;
    }
}

# ============== Callback handler for tagged text ================
sub text_handler {
    my $text = shift;
    my $self = shift;
    if ( defined($currentUrl) && $text =~ m/Box Score/i ) {
        push @urlsToVisit, $currentUrl;
    }
    return;
}

# ============= Main starts here ============
my $baseUrl = $ARGV[0];
$baseUrl =~ s/(?<=[^:\/])\/.+$//;    # keep only the scheme://host part

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "tagname,attr,self" );
$p->handler( text  => \&text_handler,  "text,self" );

open( ESPN, "wget -q -O - $ARGV[0] |" ) || die "Cannot open $ARGV[0] : $! \n";
$p->parse_file(*ESPN) || die $!;
close( ESPN );

foreach my $url (@urlsToVisit) {
    if ( $url !~ m/^http:\/\//i ) {
        $url = $baseUrl . $url;
    }
    print $url, "\n";
    open( GAME, "wget -q -O - $url |" ) || die "Cannot open Game '$url' : $! \n";
    while (<GAME>) {
        #
        # I can't seem to tell what parts you are trying to grab here.
        #
    }
    close( GAME );
}
Oh, so it's the three stars of the evening that create the Sisyphean runs...
I just check, with an array, whether a box score has already been parsed.
Using modules is cheating!
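The double-fetch check mentioned above is usually done with a lookup structure rather than scanning an array each time: a %seen hash in Perl, or a set in Python. A minimal sketch (URLs are made up):

```python
# Skip already-processed box-score URLs with a set: O(1) per lookup,
# versus rescanning a list on every iteration.
seen = set()
urls = ["/box/1", "/box/2", "/box/1", "/box/3", "/box/2"]
to_fetch = []
for url in urls:
    if url in seen:
        continue          # this box score was already handled
    seen.add(url)
    to_fetch.append(url)
print(to_fetch)   # ['/box/1', '/box/2', '/box/3']
```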