LinuxQuestions.org
Old 11-17-2004, 03:18 PM   #1
fiomba
Member
 
Registered: Sep 2004
Posts: 63

Rep: Reputation: 15
Python: loading page from forum


I am almost a newbie in Linux and very much a newbie in Python.

For that reason, I am starting with someone else's programs (from Dive Into Python).

Code:
#! /usr/bin/python

import urllib
     
sock = urllib.urlopen("http://www.linuxquestions.org/") 
htmlSource = sock.read()                         
sock.close() 

sock2 = urllib.urlopen("http://www.linuxquestions.org/questions/search.php?s") 
htmlSource2 = sock2.read()                            
sock2.close()
This small program loads two HTML pages:
- the main page of the LQ forum (into htmlSource)
- the search page for finding my own posts (the LQ forum has no "See your posts" link), into htmlSource2.

After that you can print the HTML source on screen (not very useful), save it to a file, or both, for further analysis (with Mozilla).

Going further, I tried to load the search output directly, that is, the list of my own posts, but the HTML I get back is NOT the list (it is empty!).
It seems that this list is very short-lived (although it works perfectly if I load it by hand!).

Does anyone know the answer? Or do you know whether Python can emulate mouse movement, mouse clicks, or key presses?
 
Old 11-17-2004, 11:31 PM   #2
CroMagnon
Member
 
Registered: Sep 2004
Location: New Zealand
Distribution: Debian
Posts: 900

Rep: Reputation: 33
What URL did you use to view your own posts? Was it this one?
Code:
http://www.linuxquestions.org/questions/search.php?s=&action=showresults&searchid=3012926
 
Old 11-18-2004, 04:44 AM   #3
fiomba
Member
 
Registered: Sep 2004
Posts: 63

Original Poster
Rep: Reputation: 15
Something like that... except for the number, which depends on the user or on the sequential search number.
But I am not interested in the parameters of search.php...

Last edited by fiomba; 11-18-2004 at 04:56 AM.
 
Old 11-18-2004, 07:46 AM   #4
CroMagnon
Member
 
Registered: Sep 2004
Location: New Zealand
Distribution: Debian
Posts: 900

Rep: Reputation: 33
Quote:
But I am not interested in the parameters of search.php...
Then I really don't understand what you're trying to do. You do realise there is no browser running when you use urllib, right? There is no mouse, no form, only the raw HTML you request from the server. If you want to use a python program to make sensible LQ requests, then you need to know how to post the correct data to the correct url. If you ask the server to run a search with no details, it's not very surprising if it gives you nothing back.
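
To make "post the correct data to the correct url" concrete, here is a minimal sketch. It is written for Python 3, where the functions that this thread's Python 2 code calls on `urllib` live in `urllib.parse` and `urllib.request`. The field names are the ones that come up later in this thread for the LQ search form. That the server accepts exactly these three is an assumption, and `build_search_request` is a made-up helper name. The sketch only builds the request and sends nothing.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "http://www.linuxquestions.org/questions/search.php"


def build_search_request(username):
    """Build (but do not send) a POST request for the forum search form."""
    fields = [
        ("searchuser", username),
        ("searchdate", "365"),
        ("action", "simplesearch"),
    ]
    # Form data is sent as a URL-encoded byte string in the request body.
    body = urlencode(fields).encode("ascii")
    # Supplying a body makes urllib issue a POST instead of a GET.
    return Request(SEARCH_URL, data=body)


request = build_search_request("fiomba")
print(request.get_method())
print(request.data.decode("ascii"))
```

Requesting the bare URL with no body is a plain GET, so the server sees no form fields at all. That is consistent with the empty result described above.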
 
Old 11-18-2004, 12:48 PM   #5
fiomba
Member
 
Registered: Sep 2004
Posts: 63

Original Poster
Rep: Reputation: 15
Quote:
I really don't understand what you're trying to do
Perhaps my English is so poor...

If I try to load an HTML page from a program, I am perfectly aware that I am not using a browser for this task!
In every forum I have found that people want to know 'why' I am asking something instead of trying to give an answer...
Anyway, what I am trying to write is a Python program that accesses all the forums of which I am a member and displays my posts (if they have an answer).

The only useful thing in your reply is:

Quote:
If you ask the server to run a search with no details, it's not very surprising if it gives you nothing back.
It is true that I gave no details ...

Code:
sock2 = urllib.urlopen("...URL address...") 
htmlSource2 = sock2.read()                            
sock2.close()
But the same "...URL address...", pasted into Mozilla's location bar, gave me the correct page, that is, the server's answer (the list of my posts)!
 
Old 11-18-2004, 05:19 PM   #6
CroMagnon
Member
 
Registered: Sep 2004
Location: New Zealand
Distribution: Debian
Posts: 900

Rep: Reputation: 33
Quote:
In every forum I have found that people want to know 'why' I am asking something instead of trying to give an answer...
This is because your English is a little hard to understand, and doesn't always seem to properly describe the problem, or includes confusing elements (like asking to control keyboard and mouse - this makes it sound like you are trying to control a browser). For us to understand the problem, it helps to understand what you are trying to achieve, and we ask questions to clarify our own understanding. Please remember that you are very 'close' to the task you are trying to perform, and you understand it completely, but you might be assuming knowledge on our part. Also, people are not perfect and sometimes we misunderstand what we read - in cases like this, you should try to be polite, as you are still asking others for help - alienating them will never be useful.

Here is some python code that retrieves what you're looking for...

Code:
import urllib
import re

# Without these parameters, the search form just displays the search page again - not what we want
params = []
params.append( ("s", "") )
params.append( ("searchuser", "fiomba") )
params.append( ("exactname", "yes") )
params.append( ("query", "") )
params.append( ("excquery", "") )
params.append( ("optquery", "") )
params.append( ("phrquery", "") )
params.append( ("forumchoice", "-1") )
params.append( ("titleonly", "") )
params.append( ("showposts", "") )
params.append( ("searchdate", "365") )
params.append( ("beforeafter", "after") )
params.append( ("sortby", "lastpost") )
params.append( ("sortorder", "descending") )
params.append( ("action", "simplesearch") )
params.append( ("Submit", "Perform Search") )
encparams = urllib.urlencode( params )

# Post the data to the search page
url = "http://www.linuxquestions.org/questions/search.php"
sock = urllib.urlopen( url, encparams )
html = sock.read()
sock.close()

# This first page is the 'redirect' page, so we use a regex to pull out the refresh URL
m = re.compile( 'url=[^"]*', re.I ).search( html )
# Prepend the site name because the link is relative, and chop the URL= portion of the regex match
url = "http://www.linuxquestions.org/questions/" + html[m.start():m.end()][4:]
sock = urllib.urlopen( url )
html = sock.read()
sock.close()

f = open( "fiomba.html", "w" )
f.write( html )
f.close()
 
Old 11-18-2004, 06:57 PM   #7
fiomba
Member
 
Registered: Sep 2004
Posts: 63

Original Poster
Rep: Reputation: 15
I thank you very much! Your code worked perfectly!
I fooled myself because the same URL worked perfectly (without passing any further parameters) when I loaded it with Mozilla.

With your technique, instead, there is no problem (perhaps the server uses other ways when the request comes from a program instead of a full-featured browser...).

Instead of using separate instructions to load the HTML page and save it to a file, I found this in the Python Library Reference:
Code:
urllib.urlretrieve( url_for_accessing_the_html, path_and_file_to_save_the_loaded_file )
Besides that, if you want to load a particular post, you must of course modify the URL so that it points to that post on LQ.
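
A small sketch of the same one-call download in Python 3, where the function is `urllib.request.urlretrieve` (the Python 3 documentation lists it as a legacy interface). The helper name `save_page` and the file names are made up for illustration. The sketch uses a local `file://` URL so that it runs without network access.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def save_page(url, destination):
    """Fetch url and write the response body to destination in one call."""
    filename, _headers = urlretrieve(url, destination)
    return filename


# Create a local page to stand in for the forum page.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.html"
source.write_text("<html><body>my posts</body></html>", encoding="utf-8")

# A file:// URL goes through the same code path as an http:// one.
saved = save_page(source.as_uri(), os.path.join(workdir, "copy.html"))
print(Path(saved).read_text(encoding="utf-8"))
```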
 
Old 11-18-2004, 07:10 PM   #9
CroMagnon
Member
 
Registered: Sep 2004
Location: New Zealand
Distribution: Debian
Posts: 900

Rep: Reputation: 33
Quote:
perhaps the server uses other ways when the request comes from a program instead of a full-featured browser...
I don't know for sure what LQ is doing, but I noticed I had a problem if I pasted the URL into a new browser session. I searched for all your posts with Opera, then pasted the URL into Firefox, and it didn't work - I got the message "please enter some search terms", or something like that. It may be that searches are associated with a PHP session or something similar.
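
If the search results really are tied to a session (this is only the guess made above and is not confirmed for LQ), a program would need to return the server's cookies on later requests, as a browser does. A minimal Python 3 sketch of an opener that remembers cookies follows; `make_session_opener` is a made-up helper name and no request is sent here.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session_opener():
    """Return an opener that remembers cookies between requests, plus its jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session_opener()
# Each opener.open(url) call stores any cookies the server sets in the jar and
# sends them back on later calls made through the same opener.
print(len(jar))
```

Fetching the search page and then the results page through the same `opener` would keep both requests in one session, if the server uses cookies for that.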
 
Old 11-22-2004, 04:04 PM   #10
fiomba
Member
 
Registered: Sep 2004
Posts: 63

Original Poster
Rep: Reputation: 15
To extend the Python routine you suggested for listing posts, I have started examining the code to understand it.

Did you use "search.php" to get the various parameters and the corresponding values?
Trying to minimize the number of parameters, I found that only 3 are necessary (probably because the others take default values):

Code:
params = []
params.append( ("searchuser", "fiomba") )
params.append( ("searchdate", "365") )
params.append( ("action", "simplesearch") )

encparams = urllib.urlencode( params )
...
Obviously searchuser must be present if I want to list my own posts.
If searchdate is not present, the output is empty.
The third parameter is stranger, because leaving it out causes the following error:
Code:
Traceback (most recent call last):
  File "./get_LQ8.py", line 42, in ?
    url = "http://www.linuxquestions.org/questions/" + html[m.start():m.end()][4:]
AttributeError: 'NoneType' object has no attribute 'start'
Being a newbie in Python, I have difficulty understanding the code, especially:
Code:
m = re.compile( 'url=[^"]*', re.I ).search( html )
url = "http://www.linuxquestions.org/questions/" + html[m.start():m.end()][4:]
 
Old 11-22-2004, 05:11 PM   #11
CroMagnon
Member
 
Registered: Sep 2004
Location: New Zealand
Distribution: Debian
Posts: 900

Rep: Reputation: 33
The third parameter tells the server which button you pressed (whether you pressed the "Perform Search" or "Reset Fields"), so it knows what action to take.

Code:
m = re.compile( 'url=[^"]*', re.I ).search( html )
url = "http://www.linuxquestions.org/questions/" + html[m.start():m.end()][4:]
This is regular expression code - just used to easily pull out the part we're interested in. The interim page that loads first has an HTML refresh command to tell your browser to wait a few seconds then load the next page. Since Python won't do that automatically, we have to grab the URL from the page and load it ourselves. If you are not familiar with regular expressions, there are some very good introductions to them out on the web.

Code:
m = re.compile( 'url=[^"]*', re.I )
# This returns a regex object that searches for the string url=, followed by 
# as many characters as possible that don't include quote marks
# the HTML we want looks like this:
# <meta http-equiv="Refresh" content="1; URL=search.php?s=&action=showresults&searchid=3039984&sortby=lastpost&sortorder=descending">
# and this is right near the top.  We are relying on this being the first instance of URL=
# the re.I sets the regex to case-insensitive.

m = re.compile( 'url=[^"]*', re.I ).search( html )
# Since we are not interested in the re object, but rather the match, we apply the search
# method immediately to the compiled regex object.  This searches the 'html' variable for
# the pattern listed, and returns a match object (check the docs for the python re module).
# The important part is that if the pattern is found, m.start() returns the start index of our
# substring, and m.end() returns the index just past its end.  If the search() method fails, m
# will not be a valid match object, but instead just None, which gives an error when we try to
# use start() or end()

url = "http://www.linuxquestions.org/questions/" + html[m.start():m.end()][4:]
# So here we get the portion of the html variable that matches our regex
# ( html[m.start() : m.end()] )
# And then remember that the string we searched for included URL= on the front, so
# we strip that off with the [4:] on the end.
I hope this explains things more clearly
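
A variation on the explanation above, sketched in Python 3: a capture group returns the part after `url=` directly, so the `[4:]` slice is not needed, and a check for `None` avoids the AttributeError when the page has no refresh tag. The function name `extract_refresh_url` is made up, and `urljoin` is swapped in for the manual string concatenation. The sample tag is a shortened form of the one quoted above.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the thread, with parentheses around the part we want.
REFRESH_PATTERN = re.compile(r'url=([^"]*)', re.IGNORECASE)


def extract_refresh_url(html, base_url):
    """Return the absolute refresh target found in html, or None if absent."""
    match = REFRESH_PATTERN.search(html)
    if match is None:
        return None
    # group(1) holds only the text captured after "url=".
    return urljoin(base_url, match.group(1))


sample = ('<meta http-equiv="Refresh" content="1; '
          'URL=search.php?s=&action=showresults&searchid=3039984">')
base = "http://www.linuxquestions.org/questions/search.php"

print(extract_refresh_url(sample, base))
print(extract_refresh_url("<html>no refresh here</html>", base))
```

Returning `None` lets the caller decide what to do when the server sends back something other than the redirect page, for example when a required form field is missing.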
 
Old 11-23-2004, 06:29 AM   #12
fiomba
Member
 
Registered: Sep 2004
Posts: 63

Original Poster
Rep: Reputation: 15
I thank you again! You could have answered "R.T.F.M." (in this case the Python Library Reference)!
It must be a characteristic of most newbies (laziness)...

Out of curiosity (and also so that I don't bother you with more questions when trying to extend your technique to other forum servers...), did you use "search.php" to get the various parameters and the corresponding values?

One lingering doubt... why is it necessary to apply a regex? As I understand it, the interim page
must be refreshed by the server with the required list. I tried to print this interim page but it is not possible. Perhaps it's PHP code ...
I am trying to understand, but I don't know PHP either...

None of this would have been necessary (though it is very instructive!) if I had known a way to emulate the mouse and keyboard (as I was accustomed to doing in XP with macro programs like AutoIt or VBA macro recording in Excel...).

Last edited by fiomba; 11-23-2004 at 03:03 PM.
 
Old 11-23-2004, 03:59 PM   #13
CroMagnon
Member
 
Registered: Sep 2004
Location: New Zealand
Distribution: Debian
Posts: 900

Rep: Reputation: 33
Quote:
Out of curiosity (and also so that I don't bother you with more questions when trying to extend your technique to other forum servers...), did you use "search.php" to get the various parameters and the corresponding values?
I'm not sure I understand the question... I had to look at the HTML source for the search page to find out what fields it was going to post with the form. This will be different for most online forums, though you could write an HTML parser and try to guess which field is correct for the username.
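
A rough Python 3 sketch of the "HTML parser" idea, using the standard `html.parser` module to list the fields a form would post. The class and function names are made up. The sample form is a hypothetical fragment loosely modelled on the field names in this thread and is not the real LQ search page. Guessing which field holds the username is left out.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect (name, value) pairs from input, select and textarea tags."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                # A missing or valueless "value" attribute is treated as empty.
                self.fields.append((name, attributes.get("value") or ""))


def list_form_fields(html):
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical form fragment for illustration only.
sample_form = """
<form action="search.php" method="post">
  <input type="text" name="searchuser" value="">
  <input type="hidden" name="action" value="simplesearch">
  <input type="submit" name="Submit" value="Perform Search">
</form>
"""

for name, value in list_form_fields(sample_form):
    print(name, repr(value))
```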

Quote:
One lingering doubt... why is it necessary to apply a regex? As I understand it, the interim page
must be refreshed by the server with the required list. I tried to print this interim page but it is not possible. Perhaps it's PHP code ...
The server can't refresh a page... HTTP is always initiated by the client, so the client must be the one to request the refresh URL from the server, so we must find that URL and ask for it. It isn't necessary to use regex, except for simplicity of the code (we could use python's string scanning and splitting functions to do the same thing).
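
For comparison, a sketch of the same extraction with plain string methods instead of a regex (Python 3; the function name `extract_refresh_target` is made up). It assumes, as the regex version does, that the first `url=` in the page is the refresh target and that the value ends at the next double quote.

```python
def extract_refresh_target(html):
    """Return the text after the first 'url=' (any case) up to the next quote."""
    marker = "url="
    # Lower-casing keeps indices aligned for ASCII markup, which is assumed here.
    start = html.lower().find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = html.find('"', start)
    if end == -1:
        end = len(html)
    return html[start:end]


sample = ('<meta http-equiv="Refresh" content="1; '
          'URL=search.php?s=&action=showresults&searchid=3039984">')

print(extract_refresh_target(sample))
print(extract_refresh_target("<html>no refresh here</html>"))
```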

What do you mean by printing the interim page? Just in Python (i.e: "print html" just above the regex section)? That should show you the HTML source for the interim page as expected...
 
Old 11-24-2004, 10:51 AM   #14
fiomba
Member
 
Registered: Sep 2004
Posts: 63

Original Poster
Rep: Reputation: 15
I keep on thanking you...

Maybe I have solved my problem, and I will leave a deeper look into client-server communication for when I have more experience.
To recall my goal...
I wanted to have the list of all my posts in the various forums. Now I have found a solution (not very elegant)... but it works!
I save the URL of every post list in a file, then load each URL from my program and check whether there are any replies.

The program shows me only the posts that have a reply, so I am no longer obliged to go through all the forums or to rely on e-mail notifications.
There are still some problems, but I think I can handle them...

Bye... and thank you again!

Last edited by fiomba; 11-24-2004 at 10:53 AM.
 
  


All times are GMT -5.
