LinuxQuestions.org


pld 04-13-2005 11:04 AM

curl scrape javascript output
 
Hi all,

I've googled all morning and not really found a good answer to this question (though I may just not have formed a good Google search string).

I am trying to do a simple scrape of www.nyse.com for the % change in a stock price. Nothing fancy, just a simple data scrape for a single piece of info. However, I have hit a problem I have not run into before: the nyse.com page that lists the stock info for a company uses extensive JavaScript to write out the data, and I have no clue how to handle it!

curl returns the page source, but unparsed, so it is just a bunch of JavaScript. Viewing the page source in Firefox shows the same thing. But if I highlight just what I am interested in and choose "View Selection Source", I can see the data I am looking for.

This makes me think that I need to have the page parsed first, perhaps with an intermediate page(?), and then output the data somewhere for me to read.

Before I delve into this rather unsightly hack, I was wondering if anyone knows an easier way to do this. Is there a switch I am missing somewhere in curl, for instance, that would do this for me?

TIA ( as usual )
pld

ahwkong 04-13-2005 11:09 AM

OK. Is it a page like this one: http://www.nyse.com/about/listed/lcd...ml?ticker=MSFT
and are you after the "Change" field in the topmost table, which shows these fields: "Symbol", "Last Trade", "Change", "Volume"?

pld 04-13-2005 11:14 AM

That is almost exactly what I mean. Sorry I forgot to put an example URL in, but that one will work just fine. So, that is exactly what I am looking for :)

Sure do appreciate any help here...

It's the JavaScript document.write() calls that are screwing me up. I don't know how to get those variables resolved to values when scraping...

ahwkong 04-13-2005 12:38 PM

Very interesting automation question...

That table (let's call it the "data table") is created at runtime by a JavaScript function, dtlPageMarketData().

So we can either
1) hack that JS function (e.g. by understanding how it gets the data), or
2) capture the rendered HTML for analysis.

Well, 1 is obviously not a nice solution: once the developers make any internal change, the hack breaks.
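
For what it's worth, option 1 could look roughly like the Python sketch below. It assumes the page source assigns the values to plain JavaScript string variables before document.write() uses them. The variable names (lastTrade, change), the numbers, and the sample source are all made up for illustration; I have not confirmed how dtlPageMarketData() actually gets its data, so the real page may look quite different.

```python
import re

# Hypothetical page source: the variable names and values are invented
# for this sketch and are NOT taken from the real nyse.com page.
SAMPLE_SOURCE = """
<script>
var lastTrade = "25.04";
var change = "-0.31";
document.write("<td>" + lastTrade + "</td><td>" + change + "</td>");
</script>
"""

# Matches simple string assignments such as: var name = "value";
ASSIGNMENT = re.compile(r'var\s+(\w+)\s*=\s*"([^"]*)"\s*;')


def extract_js_vars(source):
    """Return a dict of the simple string assignments found in the source."""
    return dict(ASSIGNMENT.findall(source))


if __name__ == "__main__":
    values = extract_js_vars(SAMPLE_SOURCE)
    print(values.get("change"))
```

This only reads literal assignments; it does not run any JavaScript, so anything the function computes or fetches at runtime would stay out of reach.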

2 is difficult too, because we can only see the data if the browser supports JavaScript. That excludes some command-line tools such as lynx. But it is not entirely undoable.

I find the following combination can extract a value with "minimum" pain:
1) use either firefox or konqueror to view the page
2) print the page as postscript
3) convert postscript into text (ps2ascii)
4) use a custom script to parse and extract values
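
Steps 3 and 4 could be sketched in Python roughly as follows. This assumes the browser's PostScript output was saved to a file and that the text ps2ascii produces keeps each table row on one line starting with the ticker symbol. The sample text, its layout, and the numbers are invented for illustration, and the helper names (ps_to_text, percent_change) are just names I picked.

```python
import re
import subprocess

# Hypothetical ps2ascii output: the layout and values are invented here,
# the real converted page may be arranged differently.
SAMPLE_TEXT = """\
Symbol  Last Trade  Change  Volume
MSFT  25.04  -0.31 (-1.22%)  1,234,567
"""

# A signed number followed by a percent sign, e.g. -1.22%
PERCENT = re.compile(r"[-+]?\d+(?:\.\d+)?%")


def ps_to_text(ps_path):
    """Step 3: run ps2ascii on a PostScript file and return its text output."""
    result = subprocess.run(
        ["ps2ascii", ps_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def percent_change(text, ticker):
    """Step 4: return the first percentage on the row that starts with ticker."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == ticker:
            match = PERCENT.search(line)
            if match:
                return match.group(0)
    return None


if __name__ == "__main__":
    print(percent_change(SAMPLE_TEXT, "MSFT"))
```

With a real file it would be used as percent_change(ps_to_text("page.ps"), "MSFT"), where page.ps is whatever name the printed page was saved under.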

Firefox has ' firefox -remote "action()" ' to control the running application. It can be used to direct Firefox to the NYSE page. However, there is no remote "print" action, so the process cannot be fully automated.

Konqueror is more advanced because it supports DCOP, a protocol for controlling a running program. Again, though, the print function is missing.

Actually, it would not be that hard to add this console-command print capability to the browsers. But since print is missing, I cannot see a way to fully automate the process outlined above.

There is also a Perl library, libwww, that may support JavaScript.

In the end, I would suggest you use Yahoo's financial information. It may be slower in real-time reporting, but it is excellent for historical data.

pld 04-13-2005 04:52 PM

Glad I wasn't missing something trivial for a change!

I have been plugging away at this all day to no avail. In the end I just went and scraped Yahoo! as suggested, and it works like a peach, but it does leave me wondering about the ability to scrape a page whose content is generated by JavaScript...

I am going to crack open the JS function and take a peek. It may be something relatively simple to use, but as you stated, it would break quickly with any changes by the developers...

I appreciate your help taking a look at this. I'll post back with any results that I come up with, for posterity's sake...


All times are GMT -5.