Old 02-18-2011, 11:08 AM   #1
zcrxsir88
Member
 
Registered: Oct 2004
Location: Cardiff-by-the-Sea, CA
Distribution: Fedora X & RHEL X.X
Posts: 51

Rep: Reputation: 18
Python IndexError: list index out of range (Web Scraper)


Hey all!

Having a bit of an issue with Python while trying to write a script to download every rar file on a webpage.

The script successfully downloads any link that doesn't contain any spaces, etc. But when it hits a URL like:

http://www.insidepro.com/dictionaries/Belarusian (Classical Spelling).rar

It fails... I'm sure this is something simple, but I'm so new to Python I'm not sure what to do!

Thank you in advance.



Code:
import urllib2
import os

os.system("curl http://www.insidepro.com/eng/download.shtml|grep -i rar|cut -d '\"' -f 2 > temp.out ")

infile = open('temp.out', 'r')

for url in infile:
        print url
#url = "http://download.thinkbroadband.com/10MB.zip"

        #url = target

        file_name = url.split('/')[-1]
        u = urllib2.urlopen(url)
        f = open(file_name, 'w')
        meta = u.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (file_name, file_size)

        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break

            file_size_dl += block_sz
            f.write(buffer)
            status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            status = status + chr(8)*(len(status)+1)
            print status,

        f.close()
 
Old 02-18-2011, 11:34 AM   #2
pgroover
Member
 
Registered: Sep 2005
Location: Colorado
Distribution: Ubuntu
Posts: 56

Rep: Reputation: 16
Not really familiar with Python, but my first thought would be that the spaces should either be escaped, or the entire URL encapsulated within quotes.

Just my .02.
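
For illustration, here is a minimal sketch of that idea in Python 2, percent-encoding the problem URL from the first post with urllib.quote before handing it to urllib2 (just the shape of it, not a drop-in fix for the script):
Code:
import urllib2
from urllib import quote

raw_url = "http://www.insidepro.com/dictionaries/Belarusian (Classical Spelling).rar"

# keep ':' and '/' intact; spaces become %20, parentheses %28/%29
safe_url = quote(raw_url, safe=":/")
print safe_url

response = urllib2.urlopen(safe_url)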
 
Old 02-18-2011, 11:35 AM   #3
pgroover
Member
 
Registered: Sep 2005
Location: Colorado
Distribution: Ubuntu
Posts: 56

Rep: Reputation: 16
Oh yeah, I forgot to mention it, but you could also look at the file it's attempting to download when it does get one with spaces.
 
Old 02-18-2011, 11:41 AM   #4
zcrxsir88
Member
 
Registered: Oct 2004
Location: Cardiff-by-the-Sea, CA
Distribution: Fedora X & RHEL X.X
Posts: 51

Original Poster
Rep: Reputation: 18
Escaping with Quotes

Tried escaping with quotes... didn't work either!
 
Old 02-18-2011, 11:53 AM   #5
pgroover
Member
 
Registered: Sep 2005
Location: Colorado
Distribution: Ubuntu
Posts: 56

Rep: Reputation: 16
Have you tried looking at the filename it attempts to download when met with a URL with spaces?
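
For example, a quick check along those lines, assuming the temp.out produced by the original script (repr() makes trailing newlines and spaces visible):
Code:
# print exactly what each loop iteration would pass to urlopen()
# and the filename the script would derive from it
for url in open('temp.out'):
    print repr(url)                  # note the trailing '\n' and any spaces
    print repr(url.split('/')[-1])   # the name passed to open(file_name, 'w')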
 
Old 02-18-2011, 10:58 PM   #6
Dogs
Member
 
Registered: Aug 2009
Location: Houston
Distribution: Slackware 13.37 x64
Posts: 105

Rep: Reputation: 25
Look at the filename assignment. This may help you.

To make this one more versatile, you can construct your os.system() line from user input.



Code:
import urllib2
import os

# look here, too. Raw input would be your http://whatever.notcom
os_system_line = 'curl ' + raw_input() + ' | grep -i rar | cut -d \'"\' -f2 > temp.out'

os.system(os_system_line)

infile = open('temp.out', 'r')

for url in infile:
        print url
#url = "http://download.thinkbroadband.com/10MB.zip"

        #url = target

        #Remember that Linux doesn't like spaces so much, and that Python strings are immutable, so operations on strings will return strings which can be further operated on.
        file_name = url.replace('http://www.insidepro.com/Dictionaries', '').replace(' ', '_')

        u = urllib2.urlopen(url)
        f = open(file_name, 'w')
        meta = u.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (file_name, file_size)

        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break

            file_size_dl += block_sz
            f.write(buffer)
            status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            status = status + chr(8)*(len(status)+1)
            print status,

        f.close()
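
A quick illustration of the replace() chain above, using the URL from the first post: each call returns a new string (strings are immutable), so the calls can be stacked. Note that the prefix passed to replace() has to match the case of the links in temp.out; the example URL in the first post uses lowercase "/dictionaries/".
Code:
url = "http://www.insidepro.com/dictionaries/Belarusian (Classical Spelling).rar"
name = url.replace('http://www.insidepro.com/dictionaries', '').replace(' ', '_')
print name    # /Belarusian_(Classical_Spelling).rar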

Last edited by Dogs; 02-18-2011 at 11:33 PM.
 
Old 02-18-2011, 11:57 PM   #7
zcrxsir88
Member
 
Registered: Oct 2004
Location: Cardiff-by-the-Sea, CA
Distribution: Fedora X & RHEL X.X
Posts: 51

Original Poster
Rep: Reputation: 18
Still not working... Oh well.

Nope... still craps the bed with a bunch of different errors...

I was trying to do this as a project rather than using bash scripting, but I guess trying to reinvent the wheel for fun is an exercise in futility when you don't completely understand the programming language at hand.

So back to the basics...wget it is!

Thank you all for the help! I really appreciate it!

-V
 
Old 02-19-2011, 04:44 AM   #8
bgeddy
Senior Member
 
Registered: Sep 2006
Location: Liverpool - England
Distribution: slackware64 13.37 and -current, Dragonfly BSD
Posts: 1,810

Rep: Reputation: 232Reputation: 232Reputation: 232
If you're parsing HTML (or XML) documents you really should look at BeautifulSoup. It makes parsing HTML for web scraping a real doddle. I've bashed together a little Python script that should do what you want, downloading all the dictionaries from the page you showed in your original code. As you can see it's very small, as BeautifulSoup does all the hard work. Anyway, here it is:
Code:
from urllib2 import urlopen, quote
from BeautifulSoup import BeautifulSoup

page = urlopen("http://www.insidepro.com/eng/download.shtml")
soup = BeautifulSoup(page)
for item in soup.findAll('a', href=True):
    this_href = item["href"]
    if  u"/dictionaries/" in this_href:
        local_file = this_href.split("/")[-1]
        remote_file = quote(this_href, safe=":/")
        print "downloading: " + remote_file +" to: " + local_file
        rfile = urlopen(remote_file)
        with open(local_file, "w") as lfile:
            lfile.write(rfile.read())
This was put together very quickly and there is no error checking in it, so it obviously needs some work if it is to be used in production. It works, though.
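
As one possible direction for that error checking, a small sketch wrapping the download step from the script above in try/except (the function name fetch and the messages are just illustrative):
Code:
import urllib2

def fetch(remote_file, local_file):
    # HTTPError is a subclass of URLError, so it has to be caught first
    try:
        rfile = urllib2.urlopen(remote_file)
        with open(local_file, "w") as lfile:
            lfile.write(rfile.read())
    except urllib2.HTTPError, e:
        print "server returned %d for %s" % (e.code, remote_file)
    except urllib2.URLError, e:
        print "could not reach %s: %s" % (remote_file, e.reason)
    except IOError, e:
        print "could not write %s: %s" % (local_file, e)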

Last edited by bgeddy; 02-19-2011 at 03:39 PM. Reason: Tidying up
 
  

