LinuxQuestions.org
01-10-2009, 06:14 PM   #1
snowman81
Member
 
Registered: Aug 2006
Location: Michigan
Distribution: Ubuntu
Posts: 271

Rep: Reputation: 30
Python web page parser


I found a guide online for writing a simple program that parses a web page. The example outputs the full path of every image on a given page. I want to change it so that it parses a page for all instances of an IP address/port combo. For instance, say a page has a listing like this:
Code:
 
this one is 127.0.0.1:80
this one is 127.0.1.1:80
this one is 127.0.2.1:80
It should output only the IP address/port combos:
Code:
127.0.0.1:80
127.0.1.1:80
127.0.2.1:80
Anyway, the code opens a page and does error checking, and I think I have that part figured out. The part that I think needs to change is this:
Code:
matches = sre.findall('<img .*src="(.*?)"', website_text)

for match in matches:
        if match[:7] != "http://":
                if match[0] == "/":
                        slash = ""
                else:
                        slash = "/"
                match_set.add(dir + slash + match)
        else:
                match_set.add(match)

match_set = list(match_set)
match_set.sort()

for item in match_set:
        print item
If you need me to, I can post the whole thing. Since I do not need the output to be in the form of a web address, I figured I could strip out most of the above code and maybe do something like
Code:
 \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\{1,4}\b
but it didn't really work when I tried it. Any suggestions?
 
01-11-2009, 06:26 AM   #2
maroonbaboon
Senior Member
 
Registered: Aug 2003
Location: Sydney
Distribution: debian
Posts: 1,495

Rep: Reputation: 48
Quote:
Originally Posted by snowman81
I figured I could strip out most of the above code and maybe do something like
Code:
 \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\{1,4}\b
but it didn't really work when I tried it. Any suggestions?
There is a typo in your regex: the 'd' is missing in front of {1,4}. This seems to work:

Code:
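import re
# Raw string, so the backslashes reach the regex engine unchanged.
foo = re.compile(r"\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,4})\s")
bar = "looking for 100.10.20.30:40 in this string."
# search() returns the first match, or None if nothing matches.
ip = foo.search(bar)
print ip.groups()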
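To see why the original pattern failed, here is a small check, written as a sketch in Python 3 syntax (the thread itself uses Python 2 print statements). In the mistyped version, `\{` is an escaped literal brace, so the pattern looks for the literal text `{1,4}` after the colon instead of one to four digits. The variable names are illustrative only.

```python
import re

sample = "this one is 127.0.0.1:80"

# Mistyped pattern from the first post: "\{" is a literal brace, so this
# looks for the literal text ":{1,4}" after the address.
typo = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\{1,4}\b")

# Corrected pattern: "\d{1,4}" matches one to four digits for the port.
fixed = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,4}\b")

typo_result = typo.search(sample)
fixed_result = fixed.search(sample)

print(typo_result)            # None: the sample contains no literal "{1,4}"
print(fixed_result.group(0))  # 127.0.0.1:80
```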
If you are only interested in links, Python's HTML parser module can find all the links for you.
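As a rough illustration of the HTML parser route, here is a minimal sketch in Python 3, where the module is `html.parser` (in Python 2 it was named `HTMLParser`). The parser does not collect links on its own; you subclass it and record the attributes you care about. The class name `LinkCollector` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive already converted to lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


sample_html = '<p><a href="http://example.com/">one</a> <A HREF="/two.html">two</A></p>'
collector = LinkCollector()
collector.feed(sample_html)
collector.close()
print(collector.links)  # ['http://example.com/', '/two.html']
```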
 
01-11-2009, 11:01 AM   #3
snowman81
Member
 
Registered: Aug 2006
Location: Michigan
Distribution: Ubuntu
Posts: 271

Original Poster
Rep: Reputation: 30
Yep, that looks exactly like what I need. I can't believe I missed something simple like that. Thanks.

Is
Code:
print ip.groups()
the only way to output the results? I tried just
Code:
print ip
but it wouldn't work. It seems that when you print the groups, it adds parentheses and a comma. I need it to be just the IP/port combo.

Last edited by snowman81; 01-11-2009 at 11:30 AM.
 
01-11-2009, 12:25 PM   #4
bgeddy
Senior Member
 
Registered: Sep 2006
Location: Liverpool - England
Distribution: slackware64 13.37 and -current, Dragonfly BSD
Posts: 1,810

Rep: Reputation: 231
I've made a slight change to your code so that it finds multiple matches, even when they are next to each other. Beware: this version does not assume that an IP string is surrounded by whitespace.

Code:
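import re
# Raw string, so the backslashes reach the regex engine unchanged.
foo = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,4})")
bar = "looking for 100.10.20.30:40 10.1.20.2:33 in this string."
# findall() returns a list with the text of every match.
ip = foo.findall(bar)
for finds in ip:
    print finds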
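Putting the pieces together, the following is one possible sketch of the kind of script described in the first post: pull every IP/port combo out of a page's text, drop duplicates, and print them sorted. It is written in Python 3 syntax, and the function names `extract_endpoints` and `fetch_page` are hypothetical. Two choices here are assumptions that do not come from the thread: lookarounds keep a match from starting or ending in the middle of a longer run of digits, and the port may have up to five digits because valid ports go up to 65535. Octet and port ranges are checked in plain Python rather than in the regex.

```python
import re
from urllib.request import urlopen

# (?<![\d.]) and (?!\d) stop a match from beginning or ending inside a longer number.
ENDPOINT_RE = re.compile(
    r"(?<![\d.])(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})(?!\d)"
)


def extract_endpoints(text):
    """Return a sorted list of unique 'ip:port' strings found in text."""
    found = set()
    for match in ENDPOINT_RE.finditer(text):
        octets = [int(part) for part in match.groups()[:4]]
        port = int(match.group(5))
        # Keep only addresses whose octets and port are in range.
        if all(octet <= 255 for octet in octets) and 0 < port <= 65535:
            found.add(match.group(0))
    return sorted(found)


def fetch_page(url):
    """Download a page and return its text (assumes UTF-8; adjust as needed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


sample = """this one is 127.0.0.1:80
this one is 127.0.1.1:80
this one is 127.0.2.1:80
this one is 127.0.0.1:80 again, and 999.1.1.1:80 is not a valid address"""

for endpoint in extract_endpoints(sample):
    print(endpoint)
```

For a real page, the text would come from something like `extract_endpoints(fetch_page(url))`. Note that `sorted` orders the results as strings, as the `sort()` call in the original script did, not by numeric address.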
 
01-11-2009, 06:03 PM   #5
maroonbaboon
Senior Member
 
Registered: Aug 2003
Location: Sydney
Distribution: debian
Posts: 1,495

Rep: Reputation: 48
Quote:
Originally Posted by snowman81
Is
Code:
print ip.groups()
the only way to output the results? I tried just
Code:
print ip
but it wouldn't work.
It's been a while since I did any serious Python programming. I was delighted to see that the library documentation has taken a quantum leap upwards in quality.

http://docs.python.org/library/re.html

should tell you everything you need to know. Or just use the suggestion in the post above.
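For the question quoted above, here is a short sketch (Python 3 syntax, same pattern as in the earlier post) of the difference between two match object methods. `groups()` returns a tuple of all captured groups, which is why the output shows parentheses and a trailing comma, while `group(1)` returns the first captured group as a plain string. Printing the match object itself only shows its representation, not the matched text.

```python
import re

foo = re.compile(r"\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,4})\s")
bar = "looking for 100.10.20.30:40 in this string."

ip = foo.search(bar)
if ip is not None:             # search() returns None when nothing matches
    all_groups = ip.groups()   # tuple: ('100.10.20.30:40',)
    first_group = ip.group(1)  # plain string: '100.10.20.30:40'
    print(all_groups)
    print(first_group)
```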
 
01-11-2009, 07:18 PM   #6
snowman81
Member
 
Registered: Aug 2006
Location: Michigan
Distribution: Ubuntu
Posts: 271

Original Poster
Rep: Reputation: 30
Ok, thanks for all the help, the program runs as designed now.
 
All times are GMT -5. The time now is 10:05 PM.