[SOLVED] Python regular expression, can't get it to work

madsovenielsen · 08-23-2010, 05:17 AM

Hello

I am trying to scan a website for HTTP references (links) with this script:

Code:

from urllib import urlopen
import re

current_site = urlopen("http://en.wikipedia.org/wiki/").read()

search = re.search('href="[a-zA-Z0-9]"', current_site)

print search.group(0)
raw_input("Pause")

It doesn't work properly.

I get the following error message:

Code:

Traceback (most recent call last):
  File "C:\Users\admin\Desktop\crawler.py", line 8, in <mo
    print search.group(0)
AttributeError: 'NoneType' object has no attribute 'group'

I have googled the error, but I am not able to find anything helpful.
Is the regular expression wrong?

Any help is greatly appreciated.

/mads

grail · 08-23-2010, 06:28 AM

Hi mads ... so I did a little testing and you hit the nail on the head: your regex is wrong.

By using:

Quote:

href="[a-zA-Z0-9]"

You will only match something like:

Code:

href="A"

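Which, of course, never occurs anywhere in the source of your website. Since nothing matches, re.search returns None, and calling .group(0) on None raises the AttributeError you are seeing.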
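To see the failure in isolation, here is a minimal sketch that runs the same pattern against made-up HTML instead of the real page. The `sample` and `no_short_href` strings are assumptions for illustration, not actual Wikipedia source.

```python
import re

# Hypothetical snippets standing in for the downloaded page source.
sample = '<a href="http://creativecommons.org">CC</a> <a href="A">one letter</a>'
no_short_href = '<a href="http://creativecommons.org">CC</a>'

pattern = 'href="[a-zA-Z0-9]"'

# Matches only when exactly one alphanumeric character sits between the quotes.
single = re.search(pattern, sample)
print(single.group(0))

# No one-character href here, so re.search returns None instead of a match object.
missing = re.search(pattern, no_short_href)
print(missing)

# Checking for None before calling .group() avoids the AttributeError.
if missing is None:
    print("no match")
```

Testing the result for None before using it is the usual guard with re.search.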

However, if you change it to:

Code:

href="[a-zA-Z0-9]

You will now match something like:

Code:

href="h

Obviously not hugely helpful, as you are probably looking for something like this as a match:

Code:

http://creativecommons.org
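One possible way to get the whole address, as a rough sketch: let the pattern run up to the closing quote with `[^"]+` and put the part you want in a capture group. The `sample` string and the choice to keep only http/https links are my assumptions, not something taken from the real page.

```python
import re

# Hypothetical page fragment with absolute and relative links.
sample = ('<a href="http://creativecommons.org">CC</a> '
          '<a href="/wiki/Main_Page">Main</a> '
          '<a href="https://example.org/page?id=1">Example</a>')

# href=" then http:// or https://, then every character up to the closing quote.
# The parentheses capture just the URL, without the href="..." wrapper.
link_pattern = re.compile(r'href="(https?://[^"]+)"')

# findall returns a list of the captured groups (empty list if nothing matches).
links = link_pattern.findall(sample)
for link in links:
    print(link)
```

Unlike re.search, re.findall returns an empty list rather than None when nothing matches, so the loop simply does nothing in that case.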

I hope this gets you going.

ghostdog74 · 08-23-2010, 09:54 PM

With Python, regex should be the last thing to come to mind. It is not Perl: Python's string manipulation is easy to use, and in most cases there is no need for regex. But here's a simple example:

Code:

...
current_site = urlopen("http://en.wikipedia.org/wiki/").read()
data = current_site.split("</a>")
for content in data:
    if "href" in content:
        d = content.split('href="')[1:]
        print d
        # for i in d[::2]:
        #     print i
        # do the rest of string manipulation as deemed fit.
...

Note: To do HTML parsing, if possible always use a parser, like BeautifulSoup.
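BeautifulSoup is a separate install, so as a rough sketch of the parser approach using only the standard library, something like the following could collect the href values. Assumptions: this is written for Python 3 (`html.parser`; in the Python 2 used above the module is named `HTMLParser`), and `LinkCollector` and `sample` are made-up names for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up fragment; note the second link uses single quotes and an extra attribute.
sample = ('<p><a href="http://creativecommons.org">CC</a> '
          "<a class=\"x\" href='/wiki/Main_Page'>Main</a></p>")

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

The parser copes with single-quoted attributes and extra attributes before href, which both the regex and the split approach would miss.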