LinuxQuestions.org

indian 09-14-2005 12:32 PM

Regular Expressions in Python
 
Hi,

I am looking to split a complete URL like www.google.com/index.html into the main URL (www.google.com) and the remaining URL (/index.html).

How can I do this in Python?

Thanks

Hko 09-14-2005 01:07 PM

Code:

import urlparse
url = 'http://www.google.com/index.html'
spliturl = urlparse.urlparse(url)
print spliturl


shanenin 09-14-2005 01:27 PM

I was just playing with it a little; it seems to split some URLs strangely:
Code:

>>> urlparse.urlparse('http://www.linuxquestions.org/questions/showthread.php?s=&threadid=363336')
('http', 'www.linuxquestions.org', '/questions/showthread.php', '', 's=&threadid=363336', '')
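
For reference, the six values in that tuple are the scheme, network location, path, parameters, query string, and fragment, so the empty strings are just the parts this URL does not have. A minimal sketch of reading them by name, assuming Python 3, where the module lives at urllib.parse rather than urlparse:

```python
from urllib.parse import urlparse

url = "http://www.linuxquestions.org/questions/showthread.php?s=&threadid=363336"
parts = urlparse(url)

# Each element of the result is also available as a named attribute.
print(parts.scheme)    # http
print(parts.netloc)    # www.linuxquestions.org
print(parts.path)      # /questions/showthread.php
print(parts.params)    # (empty string)
print(parts.query)     # s=&threadid=363336
print(parts.fragment)  # (empty string)
```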


indian 09-14-2005 01:45 PM

How does urlparse work? I mean, if I put in www.google.com/index.html then it gives some blank values.

shanenin 09-14-2005 02:03 PM

This function seems a little cleaner; it just splits the URL into two parts, as you need:
Code:

def parse_url(url):

    extensions = ('.com', '.net', '.uk', '.biz', '.gov', '.org')
    for i in extensions:
        if url.find(i) != -1:
            new_url = url.replace(i, i+"!@#$%") # this adds a unique delimiter
            split_url = new_url.split("!@#$%")  # this line splits it at the newly created delimiter
            return split_url

I am sure there are some flaws in this method I missed :-)
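
Two flaws that stand out: the suffix list cannot cover every domain, and a path that happens to contain one of the suffixes gets split as well. A minimal sketch that avoids both by splitting at the first "/" after any scheme, written for Python 3; split_at_first_slash is a hypothetical name, not something from the standard library:

```python
def split_at_first_slash(url):
    """Hypothetical helper: split 'host/path' into ('host', '/path')."""
    # Drop a leading scheme such as "http://" if one is present.
    if "://" in url:
        url = url.split("://", 1)[1]
    # partition() splits at the first "/" only; if there is no "/",
    # slash and rest are both empty strings.
    host, slash, rest = url.partition("/")
    return host, slash + rest


print(split_at_first_slash("www.google.com/index.html"))
print(split_at_first_slash("http://www.google.co.in/docs/index.html"))
```

This works for any domain suffix, but it assumes the host is everything before the first "/" and does nothing special for ports or user names in the URL.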

Hko 09-14-2005 02:10 PM

Quote:

Originally posted by indian
How does urlparse work? I mean, if I put in www.google.com/index.html then it gives some blank values.

Yes, that's because it expects something like "http://", "ldap://" or "ftp://" at the start of the string.
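
A short comparison shows the difference. This is a sketch for Python 3 (urllib.parse instead of the urlparse module); the last case relies on the fact that a leading "//" alone is enough for the host to be recognized:

```python
from urllib.parse import urlparse

# No scheme and no leading "//": everything ends up in the path.
no_scheme = urlparse("www.google.com/index.html")
print(no_scheme.netloc)  # (empty string)
print(no_scheme.path)    # www.google.com/index.html

# With a scheme, host and path are separated as expected.
with_scheme = urlparse("http://www.google.com/index.html")
print(with_scheme.netloc)  # www.google.com
print(with_scheme.path)    # /index.html

# A leading "//" marks the start of the host part without naming a scheme.
slashes_only = urlparse("//www.google.com/index.html")
print(slashes_only.netloc)  # www.google.com
print(slashes_only.path)    # /index.html
```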

indian 09-14-2005 10:40 PM

Thanks shanenin, it is working :)

Anyway, another thing I am not able to do is get the file name. For example, given the URL www.google.com/docs/index.html, I want to break it into www.google.com/docs/ and index.html.

I can't think how to use delimiters to get the file name :)

shanenin 09-14-2005 11:00 PM

I am not sure I am fully following you, but you could use the split method again like this, choosing '/' as the delimiter:
Code:

>>> "http://www.google.com/docs/index.html".split('/')
['http:', '', 'www.google.com', 'docs', 'index.html']

Code:

url = "http://www.google.com/docs/index.html"
split_url = url.split('/')
file_name = split_url[-1]  # element -1 is the last one in the list
print file_name

Or as a function:
Code:

def url_file(url):
    split_url = url.split('/')
    return split_url[-1]
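
To get both pieces asked for (www.google.com/docs/ and index.html) in one step, splitting at the last '/' only is another option. A minimal sketch for Python 3; split_dir_and_file is a hypothetical name:

```python
def split_dir_and_file(url):
    """Hypothetical helper: return (everything up to the last '/', file name)."""
    # rpartition() splits at the last "/" and keeps the separator,
    # so it can be glued back onto the directory part.
    head, slash, file_name = url.rpartition("/")
    return head + slash, file_name


print(split_dir_and_file("www.google.com/docs/index.html"))
print(split_dir_and_file("http://www.google.com/docs/index.html"))
```

If the URL contains no '/' at all, the whole string comes back as the file name, and a query string such as ?id=1 stays attached to it, so this is only suitable for simple URLs like the ones in this thread.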



All times are GMT -5.