Old 08-11-2005, 04:42 AM   #1
davholla
Member
 
Registered: Jun 2003
Location: London
Distribution: Linux Mint 13 Maya
Posts: 729

Rep: Reputation: 32
urllib2 python


I am trying to use urllib2 to download some web pages; unfortunately they are password protected.

In this example :-

Code:
import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')

What do realm and host refer to ?

What I am trying to do is download some pages from a site and produce a report. The problem is downloading the pages, as the site is a PHP application that is password protected.

Last edited by davholla; 08-11-2005 at 04:47 AM.
 
Old 08-11-2005, 09:46 PM   #2
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Re: urllib2 python

Quote:
Originally posted by davholla
I am trying to use urllib2 to download some web pages; unfortunately they are password protected.

...

What do realm and host refer to ?

What I am trying to do is download some pages from a site and produce a report. The problem is downloading the pages, as the site is a PHP application that is password protected.
If you scan halfway down this page, http://httpd.apache.org/docs/1.3/howto/au, it explains what a realm is in the context of basic authentication.

Does the site actually use basic authentication? Lots of sites just use form-based authentication because it blends in with the web page.
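If it does turn out to be basic authentication, here is a minimal sketch of how realm and host plug into the handler (the 'Helpdesk' realm string and the intranetatwork URL are placeholders; use whatever your server actually sends back in its WWW-Authenticate header):

Code:
# Minimal sketch of basic authentication with urllib2.
# 'Helpdesk' (the realm) and the intranetatwork URLs are made-up values.
import urllib2

auth_handler = urllib2.HTTPBasicAuthHandler()
# realm = the quoted string the server sends in its
#         "WWW-Authenticate: Basic realm=..." header
# uri   = the server (or URL prefix) these credentials apply to
auth_handler.add_password('Helpdesk', 'http://intranetatwork/',
                          'username', 'password')
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
print urllib2.urlopen('http://intranetatwork/index.php').read()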
 
Old 08-16-2005, 09:06 AM   #3
davholla
Member
 
Registered: Jun 2003
Location: London
Distribution: Linux Mint 13 Maya
Posts: 729

Original Poster
Rep: Reputation: 32
Carl,

Thanks for that. I think the site uses form-based authentication :-
Code:
<h1>Helpdesk Login</h1>
Authorised xxxxx staff and clients may login here.<br><br>
<form target="_self" action="index.php" method="POST" id="form_form">
  <input type='hidden' name='node' value='578'>
  <input type='hidden' name='form_refresh' value='0'>
I tried a script that you posted somewhere else :-

Code:
import urllib
import urllib2
import cookielib

#Create empty cookie jar.
cj = cookielib.LWPCookieJar()
#Install cookie handler for urllib2.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
#For ClientCookie module(?)
# opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
# ClientCookie.install_opener(opener)

#Create initial request -- This is like when you first browse to the page. Since the cookie jar was empty, it will
#be like you initially cleared them from your browser.
#Cookies may be set at this point.

request = urllib2.Request("http://intranetatwork/", None)
f = urllib2.urlopen(request)
f.close()
#Now you have to make a request like you submitted the form on the page.
#ClientForms would be good for this, but I don't have the docs handy. I will just do it the hard way. Assume
#the form action is "http://www.mypoints.com/login.cgi" and the method is POST.
#Further assume the names of the login and password fields are "login" and "password".
data = urllib.urlencode({"login": "loginname", "password" : "password"})
request = urllib2.Request("http:/intranetatwork/index.php?node=2371&pagetree=&mode=ticket_view&objectid=26121", data)
f = urllib2.urlopen(request)
#I am assuming that at this point you log into the screen you want to scrape.
#If not, you will have to request the page you want to scrape at this point.

#Read the page.
html = f.read()
f.close()
newfile = open("newfile.html",'w')
newfile.write(html)
newfile.close()

#Parse the html here (html contains the page markup)
print 'finished'

And I just downloaded the welcome page, not the page at :-
http://intranetatwork/index.php?node=2371&pagetree=&mode=ticket_view&objectid=26121

Any ideas ?
 
Old 08-16-2005, 10:06 PM   #4
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Quote:
Originally posted by davholla
Carl,

Thanks for that. I think the site uses form-based authentication.

...

And I just downloaded the welcome page, not the page at :-
http://intranetatwork/index.php?node=2371&pagetree=&mode=ticket_view&objectid=26121

Any ideas ?
Well, since this is presumably a page on your local intranet, you are going to have to do some of the legwork here. You need to work out the minimum set of information you have to send to the web server in order to log in and reach the page you want.

I mentioned in that other post that there are numerous ways to detect these things. To recap, data can be passed to the web server in the form of (each channel is sketched below in urllib2 terms):
+ HTTP Headers
+ Cookies (which are just a specialized form of header)
+ The URL (key-value pairs after the "?")
+ POSTed data (form data, also key-value pairs).
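A rough sketch of those four channels with urllib2; the URL, header value, and field names here are invented for illustration:

Code:
# Rough sketch: the four channels in urllib2 terms (all names invented).
import urllib
import urllib2
import cookielib

# Cookies: handled automatically once a cookie jar is installed.
cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# The URL: key-value pairs after the "?".
url = "http://intranetatwork/index.php?node=578"

# POSTed data: form fields, urlencoded into the request body.
data = urllib.urlencode({"login": "loginname", "password": "password"})

# HTTP headers: added to the request explicitly.
request = urllib2.Request(url, data)
request.add_header("User-Agent", "Mozilla/5.0")

f = urllib2.urlopen(request)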

The thing to do is to use whatever tools you have available and observe the transmissions that occur when you successfully log in. Then you start limiting the information sent. If you still succeed, you don't require that info. If you fail, that info was required.

For example, start by erasing all the cookies in your browser and disabling cookies right before you hit the "submit" button. If you can still log in, you don't need cookies. If you can't, you do need cookies.

For URL and form data, the idea is the same. However, since your browser doesn't have an option to disable individual form fields, you have to use some other techniques.

A simple technique is to just save a copy of the HTML on your PC and modify the source code. Remove an <input> tag and load the page in your browser and try to submit it. Does it still work? Using tools like netcat, web proxies, etc. helps you see what is being sent as well as allowing you to try to send your own messages.
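One quick way to watch the request urllib2 itself sends, without any extra tools, is the standard debuglevel option (a sketch; the intranet URL is a placeholder):

Code:
# Dump the raw headers urllib2 sends and receives to stdout.
import urllib2

opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
urllib2.urlopen("http://intranetatwork/")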

Feel free to ask with more specific questions.
 
Old 08-17-2005, 03:07 AM   #5
davholla
Member
 
Registered: Jun 2003
Location: London
Distribution: Linux Mint 13 Maya
Posts: 729

Original Poster
Rep: Reputation: 32
Thanks, I checked and yes, I do need cookies.

Looking at the code, I think there is some JavaScript involved in this.

My employer uses this browser-based system to record calls. Now I want to be able to extract useful information from it, ideally without having to download the web pages, but I am beginning to think that it is impossible.

I can't do this
Quote:
A simple technique is to just save a copy of the HTML on your PC and modify the source code. Remove an <input> tag and load the page in your browser and try to submit it. Does it still work?
If I just save the page and then log into the local copy, I get this message :-
Quote:
You tried to access the address file://localhost/C:/Python24/index.php, which is currently unavailable. Please make sure that the Web address (URL) is correctly spelled and punctuated, then try reloading the page.
 
Old 08-17-2005, 05:23 PM   #6
carl.waldbieser
Member
 
Registered: Jun 2005
Location: Pennsylvania
Distribution: Kubuntu
Posts: 197

Rep: Reputation: 32
Quote:
Originally posted by davholla
Thanks, I checked and yes, I do need cookies.

Looking at the code, I think there is some JavaScript involved in this.

...
JavaScript can complicate matters, as it can add tags to the HTML dynamically, change the form ACTION, etc.
Try using tools like this:
http://tinyproxy.sourceforge.net/
or this (Windows only)
http://www.proxomitron.info/
to let you see what is being sent when you hit submit.
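If setting up a full proxy is more than you need, a few lines of Python can stand in for netcat: listen on a port, point your browser (or the form's action in a saved copy of the page) at http://localhost:8080/, and print whatever it sends. A throwaway sketch; the port is arbitrary:

Code:
# Throwaway listener: prints whatever the browser sends, then exits.
# A large POST may arrive in more than one chunk; this is just for a quick look.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 8080))
server.listen(1)
conn, addr = server.accept()
print conn.recv(65536)   # raw request line, headers, and any form data
conn.close()
server.close()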
 
  

