LinuxQuestions.org
Old 08-23-2010, 05:17 AM   #1
madsovenielsen
Member
 
Registered: Aug 2009
Posts: 183

Rep: Reputation: 15
Python regular expression, can't get it to work


Hello

I am trying to scan a website for http references (links)

with this script:

Code:
from urllib import urlopen
import re

current_site = urlopen("http://en.wikipedia.org/wiki/").read()

search = re.search('href="[a-zA-Z0-9]"', current_site)

print search.group(0)
raw_input("Pause")
It doesn't work properly.

I get the following error message:
Code:
Traceback (most recent call last):
  File "C:\Users\admin\Desktop\crawler.py", line 8, in <mo
    print search.group(0)
AttributeError: 'NoneType' object has no attribute 'group'
I have googled the error, but I am not able to find anything helpful.
Is the regular expression wrong?

Any help is greatly appreciated.

/mads
 
Old 08-23-2010, 06:28 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Hi mads ... so I did a little testing and you hit the nail on the head: your regex is wrong.

By using:
Quote:
href="[a-zA-Z0-9]"
You will only match something like:
Code:
href="A"
Which, of course, never appears anywhere in the source of that page.

However, if you change it to:
Code:
href="[a-zA-Z0-9]
You will now match something like:
Code:
href="h
Obviously not hugely helpful as you are probably looking for something like this as a match:
Code:
http://creativecommons.org
I hope this gets you going.
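
To make the point above concrete, here is a minimal sketch (written for Python 3, unlike the Python 2 script in the question). The sample HTML string is made up for illustration and stands in for the downloaded page. It shows that the original pattern finds nothing, that a quantified character class such as [^"]+ captures the whole URL, and that checking for None before calling .group() avoids the AttributeError from the traceback. It is one possible pattern, not the only reasonable one.

```python
import re

# Hypothetical sample HTML standing in for the downloaded page.
html = '<a href="http://creativecommons.org">CC</a> <a href="/wiki/Main_Page">Main</a>'

# The original pattern allows exactly ONE alphanumeric character between the quotes,
# so it finds nothing here and re.search returns None.
single_char = re.search(r'href="[a-zA-Z0-9]"', html)

# [^"]+ means "one or more characters that are not a double quote";
# the parentheses capture just the URL part.
links = re.findall(r'href="([^"]+)"', html)

# re.search returns None when nothing matches, so test before calling .group().
first = re.search(r'href="([^"]+)"', html)
first_link = first.group(1) if first else None

print(single_char)
print(links)
print(first_link)
```

re.findall returns a list (empty when nothing matches), so it never raises the 'NoneType' error that re.search(...).group(0) does.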
 
Old 08-23-2010, 09:54 PM   #3
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
With Python, regex is the last thing that should come to mind. It is not Perl, and Python's string manipulation is easy to use; in most cases there is no need for regex. But here's a simple example:

Code:
...
current_site = urlopen("http://en.wikipedia.org/wiki/").read()
# Split the page on closing anchor tags.
data = current_site.split("</a>")
for content in data:
    if "href" in content:
        # Each piece starts right after href=" and still carries the rest of the tag.
        d = content.split('href="')[1:]
        print d
        # for i in d[::2]:
        #     print i
        # do the rest of string manipulation as deemed fit.
...
Note: To do HTML parsing, if possible always use a parser, like BeautifulSoup.
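
As a rough illustration of the parser approach, the sketch below uses Python 3's standard-library html.parser instead of BeautifulSoup, so it runs without installing anything. The class name LinkCollector and the sample HTML are invented for this example; BeautifulSoup would do the same job with less code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample HTML standing in for the downloaded page.
html = ('<p><a href="http://creativecommons.org">CC</a> and '
        '<a class="x" href="/wiki/Main_Page">Main</a></p>')

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

Unlike the split-based loop, the parser does not depend on attribute order or on where the closing tags fall.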

Last edited by ghostdog74; 08-23-2010 at 09:56 PM.
 
  

