Old 08-10-2005, 09:53 PM   #1
Baix
Member
 
Registered: Jun 2004
Distribution: Gentoo, LFS, Slackware
Posts: 203

Rep: Reputation: 30
Python: Logging into a site


Hi all,
I'm pretty new to Python, but so far I've made some pretty cool things with it. My latest project is similar to one I've done before (go to myipaddress.com, look for the line with the IP address, then strip away the surrounding source code); this time, however, I'm trying to do it for www.mypoints.com. I want to be able to run the program and have it tell me how many points I have, based on the number it finds in the source code. The problem is that you first need to log in in order to view how many points you have.

I've looked at a lot of solutions, but they don't make much sense to me. Some involve filling out the form with ClientForm, some say you can just give the username and password to urllib2, while still others talk about capturing the cookie and then reusing it.

Hope I'm making sense. Thanks in advance!

Edit: Oh, also the program must run smoothly on Windows, as I'm making it for a Windows user.

Last edited by Baix; 08-10-2005 at 09:54 PM.
 
Old 08-11-2005, 06:35 AM   #2
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Re: Python: Logging into a site

What version of Python are you using?
I don't have an account at the site you mention, so I don't know exactly how it works, but it looks like you just want to log into the site and scrape some data off the web page, correct? They look like they are using some sort of form-based authentication, but you will have to do some experimenting to find out what is required to log in (from a technical standpoint).

If you turn off cookies in your browser, can you still log into the site and see the info you want to get? If not, the site may be trying to maintain some sort of persistent state using cookies, in which case you will probably need something like ClientCookie.

If you don't need to have cookies enabled, can you bookmark the page with the info on it, close the browser, open a new browser, and navigate directly to that page using the bookmark? If so, you can probably just use urllib with the URL in your bookmark. (If you are on a GNU/Linux system, you could try just getting the page with "wget" to see if you can get the info you want as a test).
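The same sanity check is easy to do from Python itself; a minimal sketch (the bookmarked URL here is just a placeholder -- substitute your own):

Code:
import urllib

#Fetch the bookmarked page directly -- no cookies involved.
#The URL below is hypothetical; use the one from your actual bookmark.
f = urllib.urlopen("http://www.mypoints.com/your-bookmarked-page")
print f.read()
f.close()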
 
Old 08-11-2005, 10:09 AM   #3
Baix
Member
 
Registered: Jun 2004
Distribution: Gentoo, LFS, Slackware
Posts: 203

Original Poster
Rep: Reputation: 30
Looks like cookies must be enabled; otherwise I am redirected to another page before I even get a chance to log on, which says: "Please enable cookies and retry your request."

I'm using Python v2.3.5. And yes, all I really intend to do at this point is take a few lines from the source of the site and strip away the HTML code so I'm left with the line "You have xx points."

I'll look into ClientCookie, but if you have any tips on how to use it, I'd appreciate it. Thanks for the reply!
 
Old 08-11-2005, 07:02 PM   #4
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Well, it has been some time since I tried this with Python < 2.4. In 2.4, cookie handling is built into urllib2, and cookielib was added to the standard library. I can show you what you need to do in 2.4, but for earlier versions you will have to download the ClientCookie module and possibly make some slight adjustments.

Code:
import urllib
import urllib2
import cookielib

#Create empty cookie jar.
cj = cookielib.LWPCookieJar()
#Install cookie handler for urllib2.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
#For ClientCookie module(?)
# opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
# ClientCookie.install_opener(opener)

#Create the initial request -- this is like when you first browse to the page.  Since the cookie jar
#started out empty, it will be as if you had initially cleared the cookies from your browser.
#Cookies may be set at this point.
request = urllib2.Request("http://www.mypoints.com", None)
f = urllib2.urlopen(request)
f.close()
#Now you have to make a request as if you had submitted the form on the page.
#ClientForm would be good for this, but I don't have the docs handy.  I will just do it the hard way.  Assume
#the form action is "http://www.mypoints.com/login.cgi" and the method is POST.
#Further assume the names of the login and password fields are "login" and "password".
data = urllib.urlencode({"login": "your-login", "password" : "your-password"})
request = urllib2.Request("http://www.mypoints.com/login.cgi", data)
f = urllib2.urlopen(request)
#I am assuming that at this point you land on the screen you want to scrape.
#If not, you will have to request the page you want to scrape at this point.

#Read the page.
html = f.read()
f.close()

#Parse the html here (html contains the page markup).
I hope that makes sense.
 
Old 08-11-2005, 08:45 PM   #5
Baix
Member
 
Registered: Jun 2004
Distribution: Gentoo, LFS, Slackware
Posts: 203

Original Poster
Rep: Reputation: 30
I assumed cookielib was supposed to be ClientCookie, since it came up saying cookielib didn't exist:

Code:
#!/usr/bin/python
import urllib
import urllib2
import ClientCookie

#Create empty cookie jar.
cj = ClientCookie.LWPCookieJar()
#Install cookie handler for urllib2.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
#For ClientCookie module(?)
# opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
# ClientCookie.install_opener(opener)

#Create the initial request -- this is like when you first browse to the page.  Since the cookie jar
#started out empty, it will be as if you had initially cleared the cookies from your browser.
#Cookies may be set at this point.
request = urllib2.Request("http://www.mypoints.com", None)
f = urllib2.urlopen(request)
f.close()
#Now you have to make a request as if you had submitted the form on the page.
#ClientForm would be good for this, but I don't have the docs handy.  I will just do it the hard way.  Assume
#the form action is "http://www.mypoints.com/login.cgi" and the method is POST.
#Further assume the names of the login and password fields are "login" and "password".
data = urllib.urlencode({"email": "me@gmail.com", "password" : "pass"})
request = urllib2.Request("http://www.mypoints.com/login.cgi", data)
f = urllib2.urlopen(request)
#I am assuming that at this point you land on the screen you want to scrape.
#If not, you will have to request the page you want to scrape at this point.

#Read the page.
html = f.read()
f.close()

#Parse the html here (html contains the page markup).

print html
I'm not sure if I was supposed to change 'login' to 'email', but I'm pretty sure that's its name on the form. It doesn't matter, though, as I didn't get far enough to try :-/

Code:
Traceback (most recent call last):
  File "./myPoints.py", line 7, in ?
    cj = ClientCookie.LWPCookieJar()
AttributeError: 'module' object has no attribute 'LWPCookieJar'
 
Old 08-11-2005, 09:40 PM   #6
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Quote:
I'm not sure if I was supposed to change 'login' to 'email', but I'm pretty sure that's its name on the form. It doesn't matter, though, as I didn't get far enough to try :-/

Code:
Traceback (most recent call last):
  File "./myPoints.py", line 7, in ?
    cj = ClientCookie.LWPCookieJar()
AttributeError: 'module' object has no attribute 'LWPCookieJar'
cookielib is the name of the library in Python 2.4. I mentioned that you might have to make some minor alterations to get it to work with ClientCookie (in Python < 2.4). Try the following for Python 2.3:

Code:
cj = ClientCookie.CookieJar()
Also, check out the docs at http://wwwsearch.sourceforge.net/ClientCookie/doc.html to find examples and differences.

I pulled up the main page in Firefox and opened the DOM Inspector. The form you want has the following ACTION:
https://www.mypoints.com/emp/u/login.do

The form fields are:
action (hidden field -- value is "login")
email (text field)
password (password field)
proceed (hidden field -- value is "Sign In")

Am I explaining the concepts so you can understand them?

EDIT: I signed up for an account and fooled around with it until I got it to work under Python 2.4. It's not too hard to do. If you need further help with specific points, please let me know.

Last edited by carl.waldbieser; 08-11-2005 at 10:54 PM.
 
Old 08-11-2005, 11:01 PM   #7
Baix
Member
 
Registered: Jun 2004
Distribution: Gentoo, LFS, Slackware
Posts: 203

Original Poster
Rep: Reputation: 30
I think it would be easier for both of us if I upgraded to 2.4, so I did. I only have limited experience with Python, but so far I've done pretty much everything I've set out to do. Unfortunately, while I do have some past experience with programming, I'm clueless when it comes to HTTP, cookies, web protocols, etc.

Thanks for bearing with me, carl.waldbieser, I appreciate it. I'm looking at the differences between 2.3 and 2.4 and trying to figure out how they may be involved in this. I'm also trying to interpret the ClientForm and ClientCookie docs, but it looks like this is where my lack of HTTP know-how shows.
 
Old 08-12-2005, 05:01 PM   #8
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
OK, here is the really quick HTTP + cookies overview. It is way oversimplified -- you can read the RFCs if you really want the nitty-gritty (like if you wanted to code your own web server in C from scratch):

Q) What is HTTP?
A) It is a protocol layered on top of TCP. It is mostly a line-based protocol, which means that lines of text are exchanged as the basic messages. A basic HTTP transaction consists of a REQUEST, which is initiated by the client, and a RESPONSE, which is the reply the server sends. HTTP is a *stateless* protocol, which means that after the REQUEST-RESPONSE, it is as if the client and the server never knew each other. This state of affairs is somewhat at odds with traditional programming models, where some sort of state or context is established and the program transitions from one state to another in response to external input.
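If you want to watch one bare REQUEST/RESPONSE happen from Python, here is a rough sketch using the low-level httplib module (it just fetches the front page and dumps the status line and headers; response.getheaders() needs Python 2.4):

Code:
import httplib

#One complete HTTP transaction: connect, send a REQUEST, read the RESPONSE.
conn = httplib.HTTPConnection("www.mypoints.com")
conn.request("GET", "/")
response = conn.getresponse()
print response.status, response.reason   #e.g. 200 OK, or a redirect code
print response.getheaders()              #the header name-value pairs
conn.close()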

Q) What are cookies (with respect to HTTP)?
A) Cookies are a way to maintain state across HTTP transactions. You can think of them like global variables that are stored in your web browser. Since HTTP didn't have any state mechanism built into the original protocol, cookies had to be shoe-horned into the transmission somehow. It was decided that cookies would be transmitted in the HTTP headers.
An HTTP transmission (either a REQUEST or a RESPONSE) has a bit at the beginning which consists of headers. The headers are basically name-value pairs, one pair per line. In theory, they can be whatever you want them to be. For example:

foo: 2-3-5-7-11-13
xyzzy: magic
plugh: A hollow voice rings through the cavern!

In practice, there are certain headers that are commonly used to provide some useful information to the parties involved:

Content-Length: 785
User-Agent: Mozilla Firefox

Cookies are just another header. The value associated with the cookie header is actually a bunch of name-value pairs itself, delimited in some fashion. The client is free to interpret this header however it wants to, but most browsers will honor a request from a server in the same domain as the cookie to store or modify its value.
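You can see the cookie headers for yourself from Python; a quick sketch:

Code:
import urllib2

f = urllib2.urlopen("http://www.mypoints.com")
print f.info()   #dumps the RESPONSE headers, including any Set-Cookie lines
f.close()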

Q) So how does this apply to the problem at hand?
A) Well, how does a website force a user to log in if it doesn't maintain some sort of state? It can't! Think about it: suppose all it did was redirect you to a particular URL once you entered your email and password. If there were no concept of state, you could just bookmark that URL and go there directly any time you wanted. In fact, you could post the URL on a forum, and anybody who wanted to could pull up that page in their browser.
So to make the whole login concept work, the server uses cookies to establish a *session*. Once you log in, the server hands out some unique identifier to your browser via a cookie. Internally, the server keeps a database of who is currently using which identifier it handed out. On the next REQUEST your browser makes, the server can check the cookie and make sure it is still you and not someone else. Of course, since you don't have to explicitly log off, the server also sets some sort of expiration date for the identifier it hands out, so after some period of time it can reclaim that identifier.
This is why you can't simply use urllib2 to pull the web page you want directly. The correct cookie headers must first be passed to the server to establish the session that lets the server know it is OK to deal with you. If you could accurately predict what the cookie headers would be, you could pass them in, but since the whole point of the session is to make the handed-out identifier very hard to guess, you have to simulate logging in, just like you would via a browser.
So the cookielib module adds some extra machinery to urllib2. First you create a "cookie jar". This is basically like a dictionary that stores all your cookies. It lets you do things like load and save them to and from cookie files. For our purposes, the cookie state only has to last a brief moment in time -- just long enough to log in and read the points.
Once you have the cookie jar, you have to register the cookie processor with urllib2. This basically lets urllib2 know to use a particular cookie jar object when it does its magic. So after you initiate a REQUEST with urllib2.urlopen(), you get a RESPONSE back, and your cookie jar has cookies set in it just like your browser would have set its own.
Now you can make further REQUESTs that use the same cookie jar. It is like you are passing the same global variables back and forth to some far-away function.
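To see this in action, you can peek inside the jar after the first request; a small sketch:

Code:
import urllib2
import cookielib

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

f = urllib2.urlopen("http://www.mypoints.com")
f.close()

#The jar now holds whatever cookies the server set in its RESPONSE.
for cookie in cj:
    print cookie.name, "=", cookie.value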

That is the essence of how the whole thing works. Is it any clearer now? I know it is a lot of technical jargon to swallow all at once.
 
Old 08-12-2005, 09:52 PM   #9
Baix
Member
 
Registered: Jun 2004
Distribution: Gentoo, LFS, Slackware
Posts: 203

Original Poster
Rep: Reputation: 30
I can't thank you enough for your explanation and your help; it has really helped. I was looking through the code, and I now understand pretty much what's going on. Unfortunately, it still doesn't seem to be working, as it incorrectly reports "You have 0 points."

Code:
#!/usr/bin/python
import urllib
import urllib2
import cookielib

#Create empty cookie jar.
cj = cookielib.LWPCookieJar()
#Install cookie handler for urllib2.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
#For ClientCookie module(?)
# opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
# ClientCookie.install_opener(opener)

#Create the initial request -- this is like when you first browse to the page.  Since the cookie jar
#started out empty, it will be as if you had initially cleared the cookies from your browser.
#Cookies may be set at this point.
request = urllib2.Request("http://www.mypoints.com", None)
f = urllib2.urlopen(request)
f.close()
#Now make a request as if you had submitted the form on the page.
#The form action is "https://www.mypoints.com/emp/u/login.do" and the method is POST.
data = urllib.urlencode({"email": "me@gmail.com", "password" : "mypass"})
request = urllib2.Request("https://www.mypoints.com/emp/u/login.do", data)
f = urllib2.urlopen(request)
#At this point you should be logged into the screen you want to scrape.
#If not, request the page you want to scrape now.

#Read the page.
html = f.read()
f.close()

#Parse the html here (html contains the page markup).
#for now just print the html and I'm grepping through it
print html

Once again, I truly appreciate how you've taken your time out to help. In the meantime, believe me, I'll be reading away, trying my best to learn as much as I can from this.


-Baix

Edit: I've just found this script which looks like it may hold some clues I can use.

Edit: There had been a bunch of stuff here about an error... but I solved it, so no need to worry about that.

Last edited by Baix; 08-12-2005 at 11:23 PM.
 
Old 08-12-2005, 11:13 PM   #10
Baix
Member
 
Registered: Jun 2004
Distribution: Gentoo, LFS, Slackware
Posts: 203

Original Poster
Rep: Reputation: 30
This was my attempt from scratch; it's pretty much a combination of everything I've found... yet it doesn't work.

Code:
#!/usr/bin/python

import urllib
import urllib2
import cookielib

urlopen = urllib2.urlopen
cj = cookielib.LWPCookieJar()
Request = urllib2.Request

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

theurl = 'http://www.mypoints.com/'
txdata = urllib.urlencode({"email": "foo@gmail.com", "password": "foobar"})
txheaders = {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

req = Request(theurl, txdata, txheaders)
handle = urlopen(req)
html = handle.read()
#page = handle.geturl() # page will equal http://www.mypoints.com/emp/u/index.vm due to redirect

handle.close()

print html
I assume that maybe this isn't working because it only submits the form and gets the cookie, but I need to send another request and use it? This doesn't sound quite right to me, so maybe I'm completely off.

Last edited by Baix; 08-12-2005 at 11:15 PM.
 
Old 08-12-2005, 11:47 PM   #11
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Quote:
I assume that maybe this isn't working because it only submits the form and gets the cookie, but I need to send another request and use it? This doesn't sound quite right to me, so maybe I'm completely off.
You are correct with that guess. However, there is another slight problem. There are two hidden fields on the login page that have to be sent to the form-processing page along with your email and password. Even though you don't see them on the page, the processing logic is checking for them. (I missed one of them the first time I tried it, and it kept going back to the first page.)

Here is what I came up with. If you have questions about what I am doing, feel free to ask.

Code:
import urllib
import urllib2
import cookielib
import HTMLParser

#Class used to parse HTML to be scraped.
class MyParser(HTMLParser.HTMLParser):
	def __init__(self):
		HTMLParser.HTMLParser.__init__(self)
		self.data_type = ""
	def handle_data(self, data):
		if not self.data_type:
			if data.lower() == "point balance":
				self.data_type = "balance"
			elif data.lower() == "points available to redeem":
				self.data_type = "points available to redeem"
			elif data.lower() == "pending points":
				self.data_type = "pending points"
		else:
			print "%s: %s" % (self.data_type, data)
			self.data_type = ""

#Set up email and password.
email = "yourmail@yourserver"
password = "your-password"
			
#Create empty cookie jar.
cj = cookielib.LWPCookieJar()
#Install cookie handler for urllib2.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

#Create the initial request -- this is like when you first browse to the page.  Since the cookie jar
#started out empty, it will be as if you had initially cleared the cookies from your browser.
#Cookies may be set at this point.
request = urllib2.Request("http://www.mypoints.com", None)
f = urllib2.urlopen(request)
f.close()

#Now make a request as if you had submitted the form on the page.
#Notice that two hidden fields plus the email and password fields are sent to the form-processing page.
data = urllib.urlencode({"action": "login", "email": email, "password" : password, "proceed" : "Sign In"})
request = urllib2.Request("https://www.mypoints.com/emp/u/login.do", data)
f = urllib2.urlopen(request)

#Read the page.
html = f.read()
f.close()

#Parse the html here (html contains the page markup). 
parser = MyParser()
parser.feed(html)
 
Old 08-13-2005, 12:31 AM   #12
Baix
Member
 
Registered: Jun 2004
Distribution: Gentoo, LFS, Slackware
Posts: 203

Original Poster
Rep: Reputation: 30
It worked perfectly. Does this mean that all I've really been missing this whole time were those two hidden fields? I'll have to be more observant next time. While what you did with the HTML parsing is beyond me, it looks like pretty cool stuff, and I'm interested enough to try and figure out what's going on there. I would've just grabbed the lines and done some 'my_points = line[x:-y]'.

One thing I did notice is that these lines seem to be unnecessary as far as I can tell, and taking them out shaves off about 2 seconds:

Code:
#Create the initial request -- this is like when you first browse to the page.  Since the cookie jar
#started out empty, it will be as if you had initially cleared the cookies from your browser.
#Cookies may be set at this point.
request = urllib2.Request("http://www.mypoints.com", None)
f = urllib2.urlopen(request)
f.close()
Every time I read through a script like that, I can only assume the author is some kind of genius. Thanks a ton for all your help.

Edit:
When you get the time, I've been wondering something: would there be any way to store the output of the HTML parser as a variable (or variables)? I want to throw the output into a Tkinter window, but I've just realized it's not as easy as it sounds, since the parser does all its printing inside a loop.

Also, how did you find out you needed the two hidden fields? Did you just look through the source, or was there more to it? I noticed you mentioned the DOM Inspector and was wondering if that had anything to do with it.

Just to make sure I had followed everything that had happened, I decided to try to log on to linuxquestions.org via Python, and it worked. Are these hidden fields some kind of weak security, or do they serve a purpose? For example, lq.org has one named 's' which is set to "" and didn't seem to serve any purpose other than giving me grief while I tried to figure out how to set 's' to nothing.
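For anyone else fighting the same thing: urlencode turned out to take an empty string like any other value, and a quick check in the interpreter settled it:

Code:
import urllib

#An empty value encodes as a bare "s=".
print urllib.urlencode({"s": ""})   #prints: s=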

Sorry this went from a short little thank-you to a list of even more questions; I feel I've begun to exceed my fair share of questioning.

Last edited by Baix; 08-13-2005 at 01:47 AM.
 
Old 08-13-2005, 10:16 AM   #13
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Quote:
One thing I did notice is that these lines seem to be unnecessary as far as I can tell, and taking them out shaves off about 2 seconds:

Code:
#Create the initial request -- this is like when you first browse to the page.  Since the cookie jar
#started out empty, it will be as if you had initially cleared the cookies from your browser.
#Cookies may be set at this point.
request = urllib2.Request("http://www.mypoints.com", None)
f = urllib2.urlopen(request)
f.close()
Well, that was just a little laziness on my part. I wasn't sure if the login page set some sort of cookie initially that would be needed later, so I just threw that in without checking if I could go directly to the login-processing page.

Quote:
When you get the time I've been wondering something, would there be anyway to store the output of html parser as a variable(s?), I want to try to throw the output into a Tkinter window but I've just realized that it's not as easy as just throwing the output from the parser into a simple Tkinter window as the parser is doing all the printing through a loop.
Yes. What you can do is just initialize the HTML parser class with a dictionary. Then instead of printing the results, you can just store the name-value pairs in the dictionary. Once the parsing is finished, you can access the dictionary, loop through the keys, and put the results in whatever variables you like.
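A rough sketch of that idea, pushing the dictionary into a bare-bones Tkinter window (the results here are hard-coded placeholders standing in for what the parser collected):

Code:
import Tkinter

#Stand-in for the dictionary the parser filled in.
results = {"balance": "2650 Points", "pending points": "399 Points"}

root = Tkinter.Tk()
root.title("MyPoints")
for key in results:
    Tkinter.Label(root, text="%s: %s" % (key, results[key])).pack(anchor="w")
root.mainloop()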

Quote:
Also, how did you find out you needed the two hidden fields? Did you just look through the source or was there more too it, I noticed you mentioned the DOM Inspector and was wondering if that had anything to do with it.

Just to make sure I had followed everything that had happened I decided to try to log on to linuxquestions.org via python and it worked . Are these hidden fields some kind of weak security or do they serve a purpose? For example lq.org has one named 's' which is set to "" and didn't seem to serve any purpose other than to serve me grief trying to figure out how to set 's' to nothing.
There are lots of techniques you can use to find the hidden fields. One of the simplest is, as you mention, just looking through the HTML source of the page for <input> tags. In Firefox, there is a handy DOM Inspector (I think it is an optional part of the install on Windows) and other useful tools. On GNU/Linux, you go to Tools -> Page Info -> Forms tab. In the top section you select a form, and the bottom section shows you all the inputs for that form, their types, and their values.

If you ever start programming dynamic web pages with tools like PHP, JSP, or ASP, you find that hidden fields tend to be useful for setting "modes" for different pages. For example, you have probably filled out a form, submitted it, and then had it come back saying, "You need to fill in such-and-such field which you skipped." Well, the basic HTML on both of those pages was almost identical -- the differences the second time were that the error message was displayed and the stuff you entered before was already populated in the fields. So to reuse the code, you have something like this pseudocode:

Code:
if formfield "mode" is value "filled-out":
   if there are errors on the form:
      write error message
      write HTML with previous form variables
   else:
      process form and navigate to successful page...
else:
   write HTML with blank inputs.
   write hidden input "mode" with value "filled-out"
Now the way this works: the first time you come to the page, this script runs, the hidden field "mode" does not exist, so the blank form is written out. When you submit it, the form ACTION is set to submit the page to *itself*. The second time through the script, the "mode" field is set, so the top code block is executed.
All these flags could be set on the URL, too. In fact, if you look at the URL for this page, you will see variables like "action" and "postid". These are other ways of establishing some sort of statefulness for the web site.

The basic difference between hidden fields and putting the variables on the URL is that URLs can be bookmarked. Sometimes your site doesn't make sense if a user can jump directly to some part without establishing a prior context. Putting the info in a hidden field prevents casually doing this. However, it cannot be relied upon as a security feature, as it is not hard to discover these variables and POST the data yourself (using Python, for example!).
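If you would rather dig the fields out programmatically than through the browser, here is a quick-and-dirty sketch with the same HTMLParser module used above (real-world HTML can trip this parser up, so treat it as a starting point):

Code:
import urllib2
import HTMLParser

class InputLister(HTMLParser.HTMLParser):
    #Prints the type, name, and value of every <input> tag, hidden or not.
    def handle_starttag(self, tag, attrs):
        if tag == "input":
            d = dict(attrs)
            print "%s: name=%s value=%s" % (d.get("type", "text"), d.get("name"), d.get("value"))

html = urllib2.urlopen("http://www.mypoints.com").read()
InputLister().feed(html)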
 
Old 08-13-2005, 03:18 PM   #14
Baix
Member
 
Registered: Jun 2004
Distribution: Gentoo, LFS, Slackware
Posts: 203

Original Poster
Rep: Reputation: 30
Here was my attempt at throwing in a dictionary; it doesn't really work well...

Code:
#!/usr/bin/python

import urllib
import urllib2
import cookielib
import HTMLParser

results = {}

#Class used to parse HTML to be scraped.
class MyParser(HTMLParser.HTMLParser):
	def __init__(self):
		HTMLParser.HTMLParser.__init__(self)
		self.data_type = ""
	def handle_data(self, data):
		if not self.data_type:
			if data.lower() == "point balance":
				self.data_type = "balance"
			elif data.lower() == "points available to redeem":
				self.data_type = "points available to redeem"
			elif data.lower() == "pending points":
				self.data_type = "pending points"
		else:
			global results
			results[self.data_type] = data
			print results
			self.data_type = ""
	
#Set up email and password.
email = "foo"
password = "foo"
			
#Create empty cookie jar.
cj = cookielib.LWPCookieJar()
#Install cookie handler for urllib2.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

#Create the initial request -- this is like when you first browse to the page.  Since the cookie jar
#started out empty, it will be as if you had initially cleared the cookies from your browser.
#Cookies may be set at this point.
request = urllib2.Request("http://www.mypoints.com", None)
f = urllib2.urlopen(request)
f.close()

#Now make a request as if you had submitted the form on the page.
#Notice that two hidden fields plus the email and password fields are sent to the form-processing page.
data = urllib.urlencode({"action": "login", "email": email, "password" : password, "proceed" : "Sign In"})
request = urllib2.Request("https://www.mypoints.com/emp/u/login.do", data)
f = urllib2.urlopen(request)

#Read the page.
html = f.read()
f.close()

#Parse the html here (html contains the page markup). 
parser = MyParser()
results = {}
results = parser.feed(html)
print results
Here's the results:
Code:
{'balance': '2650 Points'}
{'balance': '2650 Points', 'points available to redeem': '2251 Points'}
{'pending points': '399 Points', 'balance': '2650 Points', 'points available to redeem': '2251 Points'}
None
A lot of it is probably unnecessary and redundant scaffolding from me trying to figure out how it's supposed to work. The first three prints of the dictionary are from the 'print results' in the loop, and the 'None' -- which is where I need the variable to work -- is from the final 'print results'. This probably isn't very close to the right way; I've tried experimenting with 'return this' and 'global that', but I think I'm missing something pretty basic.
 
Old 08-13-2005, 03:49 PM   #15
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Actually, again you are pretty close. The only technical problem you have is this line:
Code:
results = parser.feed(html)
The parser.feed() method does not return a value, so when the parser finishes, you clobber the dictionary with a "None" value.

Now from a design point of view, I don't care to gratuitously use global variables. What I would probably do is something like:
Code:
#Class used to parse HTML to be scraped.
class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.data_type = ""
        self.results = {}
    def handle_data(self, data):
        if not self.data_type:
            if data.lower() == "point balance":
                self.data_type = "balance"
            elif data.lower() == "points available to redeem":
                self.data_type = "points available to redeem"
            elif data.lower() == "pending points":
                self.data_type = "pending points"
        else:
            self.results[self.data_type] = data
            self.data_type = ""
And then later:
Code:
parser.feed(html)
for key in parser.results:
     print "%s: %s" % (key, parser.results[key])
This means that the parser and its results stay together, and you could have multiple parsers and not worry about mixing up your globals.

Another technique would be something like this:
Code:
class MyParser2(HTMLParser.HTMLParser):
    def __init__(self, results):
        HTMLParser.HTMLParser.__init__(self)
        self.data_type = ""
        self.results = results
    #Other methods omitted for brevity.
And to use it:
Code:
results = {}
parser = MyParser2(results)
parser.feed(html)
for key in results:
     print "%s: %s" % (key, results[key])
This technique is a little more useful if you want to pass information into the parser. For example, instead of hard-coding the lines to look for in the class, you could check the dictionary to see if it has a key that matches the current text node you are looking at. Make sense?
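A sketch of that last idea: seed the dictionary with the labels you care about, and let the parser fill in whatever it finds. (MyParser3 is just an illustrative name, and the html string here is a toy page standing in for the real logged-in markup.)

Code:
import HTMLParser

class MyParser3(HTMLParser.HTMLParser):
    def __init__(self, results):
        HTMLParser.HTMLParser.__init__(self)
        self.data_type = ""
        self.results = results
    def handle_data(self, data):
        if not self.data_type:
            #Only react to labels the caller asked about.
            if data.lower() in self.results:
                self.data_type = data.lower()
        else:
            self.results[self.data_type] = data
            self.data_type = ""

#Seed the keys you want; the parse fills in the values.
results = {"point balance": None, "pending points": None}
#A toy page standing in for the real (logged-in) markup:
html = "<td>Point Balance</td><td>2650 Points</td><td>Pending Points</td><td>399 Points</td>"
parser = MyParser3(results)
parser.feed(html)
print results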
 
  

