I found a guide online for writing a simple program that parses a webpage. The example outputs the full path of every image on a particular webpage. I want to change it so that it parses a webpage for all instances of an IP address/port combo. For instance, say a page has a listing like this:
Code:
this one is 127.0.0.1:80
this one is 127.0.1.1:80
this one is 127.0.2.1:80
It should output only the IP address/port combos:
Code:
127.0.0.1:80
127.0.1.1:80
127.0.2.1:80
Anyway, the code opens a page and does error checking, and I think I have that part figured out. The code that I think needs to change is this:
Code:
matches = sre.findall('<img .*src="(.*?)"', website_text)
for match in matches:
    if match[:7] != "http://":
        if match[0] == "/":
            slash = ""
        else:
            slash = "/"
        match_set.add(dir + slash + match)
    else:
        match_set.add(match)
match_set = list(match_set)
match_set.sort()
for item in match_set:
    print item
If you need me to, I can post the whole thing. Since I do not need the output to be in the form of a web address, I figured I could strip out most of the above code and maybe do something like
Code:
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\{1,4}\b
but it didn't really work when I tried it. Any suggestions?
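For reference, here is a minimal sketch of what the stripped-down version might look like. It rests on a few assumptions: it uses the standard `re` module in place of `sre`, the pattern puts `\d` in front of the port quantifier (without it, `\{1,4}` only matches the literal text `{1,4}`), and it allows up to five digits because ports go up to 65535. The names `IP_PORT`, `find_ip_ports`, and `sample` are made up for illustration, and `sample` stands in for `website_text`.

```python
import re

# Assumed pattern: four groups of 1-3 digits separated by dots, a colon,
# then a 1-5 digit port. It does NOT check that each octet is <= 255
# or that the port is <= 65535.
IP_PORT = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}\b')


def find_ip_ports(text):
    """Return the unique IP:port strings found in text, sorted as strings."""
    return sorted(set(IP_PORT.findall(text)))


# Hypothetical sample standing in for website_text from the original script.
sample = """this one is 127.0.0.1:80
this one is 127.0.1.1:80
this one is 127.0.2.1:80"""

for item in find_ip_ports(sample):
    print(item)
```

Since there are no relative paths to resolve, the whole `http://` and slash handling can go away; the set is only there to drop duplicates before sorting. Note that the sort is a plain string sort, not a numeric sort by octet.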