Linux - Software
This forum is for Software issues. Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
I'm trying to build a tool that will analyze sports box-score stats. What I have right now is a script that does the analysis when given the exact address of the game. No good!
Let's say I have a list like this:
Jones, Colorado
Brown, Dallas
Smith, Detroit
The script looks in the player list, and for every player it checks whether his team is on today's scoreboard. If it is, it goes inside, dumps the whole page, and goes back. As I said, I already have this analysis part (besides the "go back" thing, since I'm using the exact game address).
As a lynx newbie, I would really appreciate your help. (No need to explain the script-side actions.)
Thanks, Ben.
Is your question about how to follow the links to the box scores? If so, I'd rephrase the problem as: how to find an anchor tag whose content is "Box Score", and then download the page referenced by the link.
This involves parsing the HTML to locate <A> tags, saving both the HREF attribute value and the content of the tag. Then loop through each saved tag and test for the proper content. When found, download the referenced page. To do this, I would not use Lynx, but rather any of these tools:
1. wget - my first choice only because I'm familiar with it and I've used it for this kind of task.
2. curl
3. In Perl (you said 'script' but didn't specify a language), LWP::UserAgent
Any of these can be used to capture the content of a given URL, which you evidently already know how to handle (but in Perl, I would use HTML::Parser).
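To make the "qualify each anchor by its visible text" idea concrete, here is a minimal sketch using Python's built-in html.parser (Python rather than Perl only to keep the example short and self-contained; the HTML fragment is made up, not ESPN's real markup):

```python
# Sketch: collect the href of every <a> tag whose visible text
# contains "Box Score", ignoring all other links on the page.
from html.parser import HTMLParser

class BoxScoreLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag we are currently inside
        self.links = []     # qualified links found so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only record the link if its visible text qualifies it
        if self._href and "box score" in data.lower():
            self.links.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

# Made-up stand-in for the scoreboard page:
html = '<a href="/game1/boxscore">Box Score</a> <a href="/img1">photo</a>'
finder = BoxScoreLinkFinder()
finder.feed(html)
print(finder.links)   # ['/game1/boxscore']
```

The same state-machine shape (remember the href at the start tag, test the text, forget it at the end tag) carries over directly to HTML::Parser callbacks in Perl.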
--- rod.
Interesting. I first visited the page referenced in your post when I made my first reply. Later, I re-visited, and the format of the page had completely changed. Is this behavior known to you? Does it happen daily or on some other schedule? This will make it difficult to create a scraper. Perhaps a different source for your information would work better. Like maybe http://nhl.com?
--- rod.
The only changes are that yesterday's games are removed and today's games appear, which works in favor of my purpose... I've never seen any format change, today or lately.
I do use a Perl script and want the whole environment to be Windows-friendly. I don't know if wget has a Windows version (same for curl). I will check the module you're recommending, and if you've got some shortcuts for a lazy guy...
OK, it works!
Not a perfect way, and for some reason some box scores are searched twice, but that's no problem to overcome. Here's the code if you want it (without my secret analysis recipe):
Code:
#!/usr/bin/perl -w
chdir("c:/lynx");
open SCORES, "lynx.exe -dump http://scores.espn.go.com/nhl/scoreboard |" or die "$!\n";
while (<SCORES>) {
    if (/boxscore/i) {
        s/\s+\d+\.\s+//;    # strip the leading " NN. " link index from the lynx dump
        chomp;              # drop the trailing newline before using $_ as a URL
        open GAME, "lynx.exe -dump $_ |" or die "$!\n";
        while (<GAME>) {
            if (/\d\s\d\s\d/) {
                print;
            }
        }
        close GAME;
    }
}
close SCORES;
Nice. Concise. But...
It will break if the formatting of the HTML text changes, especially with regard to insertion or removal of newlines. Your script relies on there being one and only one complete URL of interest per line of text. That's why I recommend HTML::Parser, which is much more robust against this kind of trap and a whole host of others. There is a useful discussion of this subject in the LQ thread "Awk scripting and usage of regex to locate a hyperlink".
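To see the trap concretely, the sketch below (Python, with a made-up fragment) splits one anchor across lines, which defeats a line-at-a-time regex but not a real parser:

```python
# One anchor, deliberately split across lines (made-up markup):
import re
from html.parser import HTMLParser

html = '<a href="/boxscore?id=1">\nBox Score\n</a>'

# Line-at-a-time scan, as a dump-and-grep style script effectively does:
line_hits = []
for line in html.splitlines():
    if "Box Score" in line:
        m = re.search(r'href="([^"]+)"', line)
        if m:
            line_hits.append(m.group(1))
# line_hits stays empty: the href and the link text never share a line.

# A parser keeps state across the whole document, so the split is harmless:
class Finder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
    def handle_data(self, data):
        if self._href and "Box Score" in data:
            self.links.append(self._href)
    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

finder = Finder()
finder.feed(html)
print(line_hits, finder.links)   # [] ['/boxscore?id=1']
```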
Lynx seems to have a feature that I didn't know of: '-dump', which makes life easier. wget does the same (and a lot more, including recursive gets). My grab of the HTML (using wget) doesn't look like your 'boxscore' regex would work; perhaps Lynx reformats its output differently.
Okay, so I can now support my original proposal with a defensible argument. The 'extra' links are in the little sidebars next to the player face images on the ESPN page. My original proposal was to qualify each link with the text visible to the user, in this case 'Box Score'. The text adjacent to the images does not contain the phrase 'Box Score', so would not be found by my method.
Sorry to make this sound like a pi55ing contest. Must be the hockey thing. Go Canucks. I gather you're an Avalanche fan.
FWIW, here's my offering in all of its glory.
Code:
#!/usr/bin/perl -w
#
# LQbengoavs.pl
#
#========================================
#
use HTML::Parser;

my $currentUrl;
my @urlsToVisit;

# ============== Callback handler for tag starts ================
sub start_handler {
    my $tag  = shift;
    my $attr = shift;
    my $self = shift;
    my %attrList = %{$attr};
    # We only want to look for '<a>' tags with an HREF attribute
    if ( $tag eq "a" && exists $attrList{"href"} ) {
        $currentUrl = $attrList{"href"};
    }
    else {
        $currentUrl = undef;
    }
}

# ============== Callback handler for tagged text ================
sub text_handler {
    my $text = shift;
    my $self = shift;
    if ( defined($currentUrl) && $text =~ m/Box Score/i ) {
        push @urlsToVisit, $currentUrl;
    }
    return;
}

# ============= Main starts here ============
my $baseUrl = $ARGV[0];
$baseUrl =~ s/(?<=[^:\/])\/.+$//;    # keep only the scheme://host part

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "tagname,attr,self" );
$p->handler( text  => \&text_handler,  "text,self" );

open( ESPN, "wget -q -O - $ARGV[0] |" ) || die "Cannot open $ARGV[0] : $! \n";
$p->parse_file(*ESPN) || die $!;
close( ESPN );

foreach my $url (@urlsToVisit) {
    if ( $url !~ m/^http:\/\//i ) {
        $url = $baseUrl . $url;
    }
    print $url, "\n";
    open( GAME, "wget -q -O - $url |" ) || die "Cannot open Game '$url' : $! \n";
    while (<GAME>) {
        #
        # I can't seem to tell what parts you are trying to grab here.
        #
    }
    close( GAME );
}
Oh, so it's the three stars of the evening that create the Sisyphean runs...
I just check, with an array, whether a box score has already been parsed.
Using modules is cheating!
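The double-fetch check mentioned above is usually done with a lookup structure rather than scanning an array each time: a %seen hash in Perl, or a set in Python. A minimal sketch (URLs are made up):

```python
# Skip already-processed box-score URLs with a set: O(1) per lookup,
# versus rescanning a list on every iteration.
seen = set()
urls = ["/box/1", "/box/2", "/box/1", "/box/3", "/box/2"]
to_fetch = []
for url in urls:
    if url in seen:
        continue          # this box score was already handled
    seen.add(url)
    to_fetch.append(url)
print(to_fetch)   # ['/box/1', '/box/2', '/box/3']
```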