Old 08-11-2005, 04:42 AM   #1
davholla
Member
 
Registered: Jun 2003
Location: London
Distribution: Linux Mint 13 Maya
Posts: 729

Rep: Reputation: 32
urllib2 python


I am trying to use urllib2 to download some web pages; unfortunately they are password protected.

In this example :-

Code:
import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')

What do realm and host refer to ?

What I am trying to do is download some pages from a site and produce a report. The problem is downloading the pages, as the site is a PHP application that is password protected.

Last edited by davholla; 08-11-2005 at 04:47 AM.
 
Old 08-11-2005, 09:46 PM   #2
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Re: urllib2 python

Quote:
Originally posted by davholla
I am trying to use urllib2 to download some web pages; unfortunately they are password protected.

...

What do realm and host refer to ?

What I am trying to do is download some pages from a site and produce a report. The problem is downloading the pages, as the site is a PHP application that is password protected.
If you scan halfway down this page, http://httpd.apache.org/docs/1.3/howto/au, it explains what a realm is in the context of basic authentication.

Does the site actually use basic authentication? Lots of sites just use form-based authentication because it blends in with the web page.
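If it does turn out to be basic authentication, here is a minimal sketch of how realm and host plug into the handler (the 'Helpdesk' realm string and the intranetatwork URL are placeholders; use whatever your server actually sends back in its WWW-Authenticate header):

Code:
# Minimal sketch of basic authentication with urllib2.
# 'Helpdesk' (the realm) and the intranetatwork URLs are made-up values.
import urllib2

auth_handler = urllib2.HTTPBasicAuthHandler()
# realm = the quoted string the server sends in its
#         "WWW-Authenticate: Basic realm=..." header
# uri   = the server (or URL prefix) these credentials apply to
auth_handler.add_password('Helpdesk', 'http://intranetatwork/',
                          'username', 'password')
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
print urllib2.urlopen('http://intranetatwork/index.php').read()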
 
Old 08-16-2005, 09:06 AM   #3
davholla
Member
 
Registered: Jun 2003
Location: London
Distribution: Linux Mint 13 Maya
Posts: 729

Original Poster
Rep: Reputation: 32
Carl,

Thanks for that. I think the site uses form-based authentication :-
Code:
<h1>Helpdesk Login</h1>
Authorised xxxxx staff and clients may login here.<br><br>
<form target="_self" action="index.php" method="POST" id="form_form">
  <input type='hidden' name='node' value='578'>
  <input type='hidden' name='form_refresh' value='0'>
I tried a script that you posted somewhere else :-

Code:
import urllib
import urllib2
import cookielib

#Create empty cookie jar.
cj = cookielib.LWPCookieJar()
#Install cookie handler for urllib2.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
#For ClientCookie module(?)
# opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
# ClientCookie.install_opener(opener)

#Create initial request -- This is like when you first browse to the page. Since the cookie jar was empty, it will
#be like you initially cleared them from your browser.
#Cookies may be set at this point.

request = urllib2.Request("http://intranetatwork/", None)
f = urllib2.urlopen(request)
f.close()
#Now you have to make a request like you submitted the form on the page.
#ClientForms would be good for this, but I don't have the docs handy. I will just do it the hard way. Assume
#the form action is "http://www.mypoints.com/login.cgi" and the method is POST.
#Further assume the names of the login and password fields are "login" and "password".
data = urllib.urlencode({"login": "loginname", "password" : "password"})
request = urllib2.Request("http:/intranetatwork/index.php?node=2371&pagetree=&mode=ticket_view&objectid=26121", data)
f = urllib2.urlopen(request)
#I am assuming that at this point you log into the screen you want to scrape.
#If not, you will have to request the page you want to scrape at this point.

#Read the page.
html = f.read()
f.close()
newfile = open("newfile.html",'w')
newfile.write(html)
newfile.close()

#Parse the html here (html contains the page markup)
print 'finished'

And I just downloaded the welcome page, not the page at :-
http://intranetatwork/index.php?node=2371&pagetree=&mode=ticket_view&objectid=26121

Any ideas ?
 
Old 08-16-2005, 10:06 PM   #4
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Quote:
Originally posted by davholla
Carl,

Thanks for that. I think the site uses form-based authentication.

...

And I just downloaded the welcome page, not the page at :-
http://intranetatwork/index.php?node=2371&pagetree=&mode=ticket_view&objectid=26121

Any ideas ?
Well, since this is presumably a page on your local intranet, you are going to have to do some of the legwork here. You need to work out the minimum set of information you have to send to the web server in order to log in and reach the page you want.

I mentioned in that other post that there are numerous ways to detect these things. To recap, data can be passed to the web server in the form of (each channel is sketched below in urllib2 terms):
+ HTTP Headers
+ Cookies (which are just a specialized form of header)
+ The URL (key-value pairs after the "?")
+ POSTed data (form data, also key-value pairs).
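A rough sketch of those four channels with urllib2; the URL, header value, and field names here are invented for illustration:

Code:
# Rough sketch: the four channels in urllib2 terms (all names invented).
import urllib
import urllib2
import cookielib

# Cookies: handled automatically once a cookie jar is installed.
cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# The URL: key-value pairs after the "?".
url = "http://intranetatwork/index.php?node=578"

# POSTed data: form fields, urlencoded into the request body.
data = urllib.urlencode({"login": "loginname", "password": "password"})

# HTTP headers: added to the request explicitly.
request = urllib2.Request(url, data)
request.add_header("User-Agent", "Mozilla/5.0")

f = urllib2.urlopen(request)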

The thing to do is to use whatever tools you have available and observe the transmissions that occur when you successfully log in. Then you start limiting the information sent. If you still succeed, you don't require that info. If you fail, that info was required.

For example, start by erasing all the cookies in your browser and disabling cookies right before you hit the "submit" button. If you can still log in, you don't need cookies. If you can't, you do need cookies.

For URL and form data, the idea is the same. However, since your browser doesn't have an option to disable individual form fields, you have to use some other techniques.

A simple technique is to just save a copy of the HTML on your PC and modify the source code. Remove an <input> tag and load the page in your browser and try to submit it. Does it still work? Using tools like netcat, web proxies, etc. helps you see what is being sent as well as allowing you to try to send your own messages.
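One quick way to watch the request urllib2 itself sends, without any extra tools, is the standard debuglevel option (a sketch; the intranet URL is a placeholder):

Code:
# Dump the raw headers urllib2 sends and receives to stdout.
import urllib2

opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
urllib2.urlopen("http://intranetatwork/")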

Feel free to ask with more specific questions.
 
Old 08-17-2005, 03:07 AM   #5
davholla
Member
 
Registered: Jun 2003
Location: London
Distribution: Linux Mint 13 Maya
Posts: 729

Original Poster
Rep: Reputation: 32
Thanks, I checked and yes, I do need cookies.

Looking at the code, I think there is some JavaScript involved in this.

My employer uses this browser-based system to record calls. Now I want to be able to extract useful information from it, ideally without having to download the web pages, but I am beginning to think that it is impossible.

I can't do this
Quote:
A simple technique is to just save a copy of the HTML on your PC and modify the source code. Remove an <input> tag and load the page in your browser and try to submit it. Does it still work?
If I just save the page and then log into the local copy, I get this message :-
Quote:
You tried to access the address file://localhost/C:/Python24/index.php, which is currently unavailable. Please make sure that the Web address (URL) is correctly spelled and punctuated, then try reloading the page.
 
Old 08-17-2005, 05:23 PM   #6
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Quote:
Originally posted by davholla
Thanks, I checked and yes, I do need cookies.

Looking at the code, I think there is some JavaScript involved in this.

...
JavaScript can complicate matters, as it can add tags to the HTML dynamically, change the form ACTION, etc.
Try using tools like this:
http://tinyproxy.sourceforge.net/
or this (Windows only)
http://www.proxomitron.info/
to let you see what is being sent when you hit submit.
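If setting up a full proxy is more than you need, a few lines of Python can stand in for netcat: listen on a port, point your browser (or the form's action in a saved copy of the page) at http://localhost:8080/, and print whatever it sends. A throwaway sketch; the port is arbitrary:

Code:
# Throwaway listener: prints whatever the browser sends, then exits.
# A large POST may arrive in more than one chunk; this is just for a quick look.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 8080))
server.listen(1)
conn, addr = server.accept()
print conn.recv(65536)   # raw request line, headers, and any form data
conn.close()
server.close()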
 
  

