LinuxQuestions.org
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 04-13-2005, 10:04 AM   #1
pld
Member
 
Registered: Jun 2003
Location: Southern US
Distribution: Ubuntu 5.10
Posts: 206

Rep: Reputation: 30
curl scrape javascript output


Hi all,

I've Googled all morning and not really found a good answer to this question (though I may just not have formed a good search string).

I am trying to do a simple scrape of www.nyse.com for the % change in a stock price. Nothing fancy, just a simple data scrape for a single piece of info. However, I am having a problem I have not run into before: the nyse.com page that lists a company's stock info uses extensive JavaScript to write out the data, and I have no clue how to handle it!

curl returns the page source, but unparsed, so it is a bunch of JavaScript. Viewing the page source in Firefox shows the same thing. But if I highlight just the part I am interested in and choose "View Selection Source", I can see the data I am looking for.

This makes me think that the page needs to be parsed first, perhaps by an intermediate page(?), which would then output the data somewhere for me to read.

Before I delve into this rather unsightly hack, I was wondering if anyone knew an easier way to do this. Is there a switch I am missing in curl, for instance, that would do this for me?
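(For anyone finding this later: there is no such switch. curl only transfers bytes over HTTP and never executes JavaScript, so the rendered value is simply absent from what it downloads. A minimal illustration of the failure mode; the HTML snippet and variable name are made up, not the real page:)

```shell
# curl only transfers bytes; it never runs JavaScript, so the rendered
# value is absent from the downloaded source. Simulated here with a
# stand-in for what `curl -s "$url"` would return from such a page:
html='<script>document.write(pctChange);</script>'
echo "$html" | grep -o '+[0-9.]*%' || echo "value not in raw HTML"
```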

TIA ( as usual )
pld
 
Old 04-13-2005, 10:09 AM   #2
ahwkong
Member
 
Registered: Aug 2004
Location: Australia
Distribution: Fedora
Posts: 282

Rep: Reputation: 30
Ok. Is it a page like this one: http://www.nyse.com/about/listed/lcd...ml?ticker=MSFT
and are you after the "Change" field in the topmost table, which shows the fields "Symbol", "Last Trade", "Change", and "Volume"?
 
Old 04-13-2005, 10:14 AM   #3
pld
Member
 
Registered: Jun 2003
Location: Southern US
Distribution: Ubuntu 5.10
Posts: 206

Original Poster
Rep: Reputation: 30
That is almost exactly what I mean. Sorry I forgot to include an example URL, but that one will work just fine; it is exactly what I am looking for.

Sure do appreciate any help here...

It's the JavaScript document.write()'s that are tripping me up. I don't know how to get those variables resolved to values when scraping...
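One partial workaround: if the page's JavaScript embeds the value as a string literal, it can sometimes be recovered by pattern-matching the raw source that curl returns. A sketch, assuming a hypothetical variable name pctChange; the real page may compute the value rather than embed it:

```shell
# If the value sits in the JS source as a literal assignment, a regex
# on the *unrendered* source can still pull it out. The variable name
# is hypothetical; adapt the pattern to what the real page contains.
src='var pctChange = "+1.25%";'       # stand-in for: src=$(curl -s "$url")
echo "$src" | sed -n 's/.*var pctChange = "\([^"]*\)".*/\1/p'
```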
 
Old 04-13-2005, 11:38 AM   #4
ahwkong
Member
 
Registered: Aug 2004
Location: Australia
Distribution: Fedora
Posts: 282

Rep: Reputation: 30
Very interesting automation question...



That table (let's call it the "data table") is created at runtime by a JavaScript function, dtlPageMarketData().

So we can either:
1) hack that JS function (e.g. by understanding how it gets its data), or
2) capture the rendered HTML for analysis.

Well, 1 is obviously not a nice solution: once the developers change anything internally, the hack breaks.

2 is difficult too, because we can only see the data if the browser supports JavaScript. That excludes command-line tools such as lynx. But it is not entirely undoable.

I find the following combination can extract a value with "minimum" pain:
1) use either Firefox or Konqueror to view the page
2) print the page to PostScript
3) convert the PostScript to text (ps2ascii)
4) use a custom script to parse out the values
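Steps 3-4 can be sketched as below, once the page has been printed to page.ps from the browser. The sample text and the awk field number are guesses at what ps2ascii might produce from that table; they would need adapting to the real layout:

```shell
# Steps 2-3 turn the rendered page into plain text:
#   ps2ascii page.ps > page.txt
# Step 4 then parses that text. Simulated below with sample text in
# the table's field order; the real ps2ascii layout will differ, so
# the field number is a guess to adjust:
printf 'Symbol Last Trade Change Volume\nMSFT 25.01 +1.25%% 1000\n' > page.txt
awk '/^MSFT/ { print $3 }' page.txt    # third field: percent change
```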

Firefox has a ' firefox -remote "action()" ' interface for controlling a running instance. It can be used to direct Firefox to the NYSE page. However, there is no remote "print" message, so the process cannot be fully automated.

Konqueror is more advanced because it supports DCOP, a protocol for controlling a running program. But again, the print function is missing.

Actually, it would not be that hard to add this console-driven print capability to the browsers. But since printing is missing, I cannot see a way to fully automate the process outlined above.

There is a Perl library that may support JavaScript: libwww.

In the end, I would suggest you use Yahoo's financial information. It may be slower in real-time reporting, but it is excellent for historical data.
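For reference, Yahoo offers a CSV quote download that avoids the JavaScript problem entirely. The URL and format codes below are from memory, so verify them before relying on this; the parsing runs on a sample line of the shape such a call returns:

```shell
# Yahoo's CSV quote interface (details from memory, verify them):
#   curl -s 'http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=sl1p2'
# with format codes s=symbol, l1=last trade, p2=percent change.
# A returned line looks roughly like the sample below.
line='"MSFT",25.01,"+1.25%"'          # stand-in for the curl output
echo "$line" | cut -d, -f3 | tr -d '"'    # third field: percent change
```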
 
Old 04-13-2005, 03:52 PM   #5
pld
Member
 
Registered: Jun 2003
Location: Southern US
Distribution: Ubuntu 5.10
Posts: 206

Original Poster
Rep: Reputation: 30
Glad I wasn't missing something trivial for a change!

I have been plugging away at this all day to no avail. In the end I just went and scraped Yahoo! as suggested, and it works like a peach, but it does leave me wondering about the ability to scrape a page whose content is generated by JavaScript...

I am going to crack open the JS function and take a peek. It may be relatively simple to use, but as you said, it would break quickly with any change by the developers...

I appreciate your help taking a look at this. I'll post back with any results I come up with, for posterity's sake...
 