LinuxQuestions.org
Old 11-23-2007, 07:58 AM   #1
bengoavs
LQ Newbie
 
Registered: Jun 2006
Posts: 20

Rep: Reputation: 0
Question Lynx experts, step inside...


Hi all,

I'm trying to build a tool that will analyze sports box score stats. What I have right now is a script that does the analysis when given the exact address of the game. No good!

Let's say I have a list like this:
Jones, Colorado
Brown, Dallas
Smith, Detroit

Now I want my script to go to the main box score page on ESPN (e.g. http://scores.espn.go.com/nhl/scoreboard), where scores appear like this:

Team a 4
Team b 3
boxscore (link)

The script looks through the player list and, for every player, checks whether his team is on the scoreboard today. If it is, the script follows the box score link, dumps the whole page and goes back. As I said, I already have the analysis part (apart from the go-back step, since I'm currently using the exact game address).
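The matching step described above can be sketched in plain Perl. This is only an illustration under assumptions: the sub names `players_by_team` and `teams_playing` are made up, and the scoreboard text is invented sample data rather than real ESPN output.

```perl
use strict;
use warnings;

# Group "Player, Team" lines by team name.
sub players_by_team {
    my @lines = @_;
    my %by_team;
    for my $line (@lines) {
        my ( $player, $team ) = $line =~ /^\s*([^,]+?)\s*,\s*(.+?)\s*$/ or next;
        push @{ $by_team{$team} }, $player;
    }
    return %by_team;
}

# Return the teams from the list whose name appears in the scoreboard text.
sub teams_playing {
    my ( $scoreboard, @teams ) = @_;
    return grep { $scoreboard =~ /\b\Q$_\E\b/i } @teams;
}

my %by_team    = players_by_team( "Jones, Colorado", "Brown, Dallas", "Smith, Detroit" );
my $scoreboard = "Colorado 4\nDetroit 3\nBoxscore\n";    # invented sample text
my @playing    = teams_playing( $scoreboard, keys %by_team );

for my $team ( sort @playing ) {
    print "$team: @{ $by_team{$team} }\n";
}
```

Only the teams found on the scoreboard are printed, so players whose team is idle today are skipped.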

As a lynx newbie I will really appreciate your help. (No need to explain the script side actions)
Thanks, Ben.

Last edited by bengoavs; 11-23-2007 at 11:37 AM.
 
Old 11-23-2007, 08:46 AM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Since this is partly a hockey question...

Is your question about how to follow the links to the box scores? If so, I'd rephrase the problem as "how to find an anchor tag whose content is 'Box Score', and then download the page referenced by the link".
This involves parsing the HTML to locate <A> tags, saving both the HREF attribute value and the content of the tag. Then, loop through each saved tag and test for the proper content. When found, download the referenced page. To do this, I would not use Lynx, but rather any of these tools:
1. wget - my first choice only because I'm familiar with it and I've used it for this kind of task.
2. curl
3. In perl (you said 'script' but no language specified), LWP::UserAgent
Any of these can be used to capture the content of a given URL, which you evidently already know how to handle (but in Perl, I would use HTML::Parser).
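For the LWP::UserAgent option, a minimal fetch might look like the sketch below. The sub name `fetch_page` and the 30 second timeout are assumptions for illustration. LWP::UserAgent is not part of core Perl and has to be installed from CPAN.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch $url and return the page body, or die with the HTTP status line.
sub fetch_page {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new( timeout => 30 );    # timeout value is an arbitrary choice
    my $res = $ua->get($url);
    die "Cannot fetch $url: ", $res->status_line, "\n" unless $res->is_success;
    return $res->decoded_content;
}

# Only hit the network when a URL is given on the command line.
if (@ARGV) {
    my $html = fetch_page( $ARGV[0] );
    print length($html), " characters fetched from $ARGV[0]\n";
}
```

Because the fetch happens inside Perl, no external wget or curl binary is needed.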
--- rod.

Last edited by theNbomr; 11-23-2007 at 08:47 AM.
 
Old 11-23-2007, 10:58 AM   #3
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Interesting. I first visited the page referenced in your post when I made my first reply. Later, I re-visited, and the format of the page had completely changed. Is this behavior known to you? Does it happen daily or on some other schedule? This will make it difficult to create a scraper. Perhaps a different source for your information would work better. Like maybe http://nhl.com?
--- rod.
 
Old 11-23-2007, 12:09 PM   #4
bengoavs
LQ Newbie
 
Registered: Jun 2006
Posts: 20

Original Poster
Rep: Reputation: 0
The only changes that happen are that yesterday's games are removed and today's games appear, which suits my purpose... I've never seen any format change, today or lately.

I do use a Perl script and want the whole environment to be Windows friendly. I don't know if wget has a Windows version (same for curl). I will check the module you're recommending, and if you've got some shortcuts for a lazy guy,
Quote:
(Speak code to me!)
It'd be great.

Thanks a lot
 
Old 11-23-2007, 01:42 PM   #5
bengoavs
LQ Newbie
 
Registered: Jun 2006
Posts: 20

Original Poster
Rep: Reputation: 0
OK It works
It's not perfect, and for some reason some boxscores are searched twice, but that's easy to work around. Here's the code if you want it (without my secret analysis recipe).
Code:
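#!/usr/bin/perl
use strict;
use warnings;

chdir "c:/lynx" or die "Cannot chdir to c:/lynx: $!\n";

# lynx -dump prints the rendered page followed by a numbered list of its links
open my $scores, "lynx.exe -dump http://scores.espn.go.com/nhl/scoreboard |"
    or die "$!\n";
while ( my $link = <$scores> ) {
    next unless $link =~ /boxscore/i;
    chomp $link;                     # drop the newline before reusing the line as a URL
    $link =~ s/^\s*\d+\.\s+//;       # strip the leading reference number, e.g. " 12. "

    # Quote the URL so characters such as '&' are not interpreted by the shell
    open my $game, "lynx.exe -dump \"$link\" |" or die "$!\n";
    while ( my $line = <$game> ) {
        # Print lines that contain three single digits separated by whitespace
        print $line if $line =~ /\d\s\d\s\d/;
    }
    close $game;
}
close $scores;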

Last edited by bengoavs; 11-23-2007 at 01:43 PM.
 
Old 11-23-2007, 03:15 PM   #6
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Nice. Concise. But...
It will break if the formatting of the HTML text changes, especially with regard to insertion or removal of newlines. Your script relies on the existence of one and only one complete URL of interest per line of text. That's why I recommend the use of HTML::Parser, which is much more robust against this kind of trap and a whole host of others. There is a useful discussion of this subject in the LQ thread "Awk scripting and usage of regex to locate a hyperlink".
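To illustrate the newline trap, here is a small sketch that runs HTML::Parser over an inline string instead of a live page. The markup, the `gameId` URLs and the sub name `links_with_text` are all invented for the example. Two links share one line and one tag is split across lines, which a line-by-line regex would struggle with.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the HREF of every link whose visible text matches $pattern.
sub links_with_text {
    my ( $html, $pattern ) = @_;
    my ( $href, $text, @found );

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub {
            my ( $tag, $attr ) = @_;
            ( $href, $text ) = ( $attr->{href}, '' ) if $tag eq 'a';
        }, 'tagname,attr' ],
        # Collect all text seen while inside a link, even across nested tags
        text_h => [ sub { $text .= $_[0] if defined $href }, 'dtext' ],
        end_h  => [ sub {
            return unless $_[0] eq 'a' && defined $href;
            push @found, $href if $text =~ $pattern;
            undef $href;
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    return @found;
}

# Invented markup, not real ESPN HTML
my $html = <<'HTML';
<a href="/nhl/recap?gameId=1">Recap</a> <a href="/nhl/boxscore?gameId=1">Box Score</a>
<a
  href="/nhl/boxscore?gameId=2"><b>Box
Score</b></a>
HTML

print "$_\n" for links_with_text( $html, qr/Box\s+Score/i );
```

The pattern uses `\s+` between the two words so that a line break inside the link text still matches.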

Lynx seems to have a feature that I didn't know of, '-dump', which makes life easier. wget does the same (and a lot more, including recursive gets). Judging by my grab of the HTML (using wget), your 'boxscore' regex doesn't look like it would work; perhaps Lynx reformats its output differently.
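For reference, 'lynx -dump' prints the rendered text with [n] markers in front of each link, followed by a numbered "References" list of the URLs. The sketch below pairs the markers with their URLs so a link can be picked by its visible label. The sample dump and the sub name `urls_for_label` are invented, and the real ESPN output may be laid out differently.

```perl
use strict;
use warnings;

# Return URLs from the "References" list whose [n] marker labels text matching $pattern.
sub urls_for_label {
    my ( $dump, $pattern ) = @_;
    my ( $body, $refs ) = split /^References\s*$/m, $dump, 2;
    return unless defined $refs;

    # Lines such as "  12. http://..." map reference numbers to URLs
    my %url_for = $refs =~ /^\s*(\d+)\.\s+(\S+)/mg;

    my @urls;
    while ( $body =~ /\[(\d+)\]([^\[\n]*)/g ) {
        my ( $n, $label ) = ( $1, $2 );
        push @urls, $url_for{$n} if $label =~ $pattern && exists $url_for{$n};
    }
    return @urls;
}

# Invented sample in the general shape of lynx -dump output
my $dump = <<'DUMP';
   Colorado 4
   Detroit 3
   [1]Box Score  [2]Recap

References

   1. http://scores.espn.go.com/nhl/boxscore?gameId=1
   2. http://scores.espn.go.com/nhl/recap?gameId=1
DUMP

print "$_\n" for urls_for_label( $dump, qr/Box\s*Score/i );
```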

--- rod.
 
Old 11-23-2007, 03:20 PM   #7
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Ahh, now I see. Lynx does behave much differently than I expected. So your script should be fine. Kind of feels like cheating, though...
--- rod.
 
Old 11-23-2007, 04:52 PM   #8
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Okay, so I can now support my original proposal with a defensible argument. The 'extra' links are in the little sidebars next to the player face images on the ESPN page. My original proposal was to qualify each link with the text visible to the user, in this case 'Box Score'. The text adjacent to the images does not contain the phrase 'Box Score', so would not be found by my method.
Sorry to make this sound like a pi55ing contest. Must be the hockey thing. Go Canucks. I gather you're an Avalanche fan.

FWIW, here's my offering in all of its glory.
Code:
#! /usr/bin/perl
#
#   LQbengoavs.pl
#
#   Usage: LQbengoavs.pl <scoreboard URL>
#
#========================================
#

use strict;
use warnings;
use HTML::Parser;

my $currentUrl;     # HREF of the <a> element we are currently inside, if any
my @urlsToVisit;    # links whose visible text contains 'Box Score'

# ============== Callback handler for tag starts ================
sub start_handler {
    my ( $tag, $attr ) = @_;

    # We only want to look for '<a>' tags with an HREF attribute.
    # Other start tags (e.g. <b> nested inside the link) leave $currentUrl alone.
    if ( $tag eq "a" ) {
        $currentUrl = $attr->{href};    # undef when there is no HREF
    }
    return;
}

# ============== Callback handler for tag ends ================
sub end_handler {
    my ($tag) = @_;

    # Leaving the link: text that follows no longer belongs to it
    $currentUrl = undef if $tag eq "a";
    return;
}

# ============== Callback handler for tagged text ================
sub text_handler {
    my ($text) = @_;

    if ( defined($currentUrl) && $text =~ m/Box Score/i ) {
        push @urlsToVisit, $currentUrl;
    }
    return;
}

# ============= Main starts here ============
my $pageUrl = $ARGV[0] or die "Usage: $0 <scoreboard URL>\n";

# Reduce e.g. http://host/path/page to http://host for resolving relative links
( my $baseUrl = $pageUrl ) =~ s/(?<=[^:\/])\/.*$//;

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "tagname,attr" );
$p->handler( end   => \&end_handler,   "tagname" );
$p->handler( text  => \&text_handler,  "text" );

# Quote the URLs so characters such as '&' are not interpreted by the shell
open( my $espn, "wget -q -O - \"$pageUrl\" |" ) || die "Cannot open $pageUrl : $! \n";
$p->parse_file($espn) || die $!;
close($espn);

foreach my $url (@urlsToVisit) {
    if ( $url !~ m{^https?://}i ) {
        $url = $baseUrl . $url;
    }

    print $url, "\n";
    open( my $game, "wget -q -O - \"$url\" |" ) || die "Cannot open Game '$url' : $! \n";
    while (<$game>) {
        #
        #   I can't seem to tell what parts you are trying to grab here.
        #
    }
    close($game);
}
--- rod.
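As a side note on the base URL handling in the script above: the regex approach only covers links that start with a slash. A possible alternative, sketched here as an assumption rather than as part of the original script, is the CPAN URI module, whose `new_abs` resolves a link against the page it was found on. The `gameId` values and the sub name `absolute_url` are invented.

```perl
use strict;
use warnings;
use URI;

# Resolve $link (absolute or relative) against the URL of the page it came from.
sub absolute_url {
    my ( $link, $page_url ) = @_;
    return URI->new_abs( $link, $page_url )->as_string;
}

my $page = "http://scores.espn.go.com/nhl/scoreboard";
for my $link ( "/nhl/boxscore?gameId=1", "boxscore?gameId=2", "http://nhl.com/" ) {
    print absolute_url( $link, $page ), "\n";
}
```

Links that are already absolute pass through unchanged, so no separate http:// test is needed.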
 
Old 11-23-2007, 05:38 PM   #9
bengoavs
LQ Newbie
 
Registered: Jun 2006
Posts: 20

Original Poster
Rep: Reputation: 0
Oh, so it's the three stars of the eve that cause the Sisyphean runs...
I just use an array to check whether a boxscore has already been parsed.
Using modules is cheating !
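The "already parsed" check mentioned above can be sketched without any modules. This version uses a hash rather than an array, which is a swapped-in choice that makes each lookup a single step. The sub name `unique_urls` and the sample links are invented.

```perl
use strict;
use warnings;

# Return the list with repeats dropped, keeping the first occurrence of each URL in order.
sub unique_urls {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Invented sample: the same box score linked from the scoreboard and from a sidebar
my @links = (
    "http://scores.espn.go.com/nhl/boxscore?gameId=1",
    "http://scores.espn.go.com/nhl/boxscore?gameId=2",
    "http://scores.espn.go.com/nhl/boxscore?gameId=1",
);

print "$_\n" for unique_urls(@links);
```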

Sorry mate,
Avalanche win

Last edited by bengoavs; 11-23-2007 at 05:40 PM.
 
  


All times are GMT -5. The time now is 07:09 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration