Python IndexError: list index out of range (Web Scraper)
Look at the filename assignment. This may help you.
To make this one more versatile, you can construct your os.system() line from user input:
import os
import urllib2

# Look here, too. The raw input would be your http://whatever.notcom
os_system_line = "curl " + raw_input() + " | grep -i rar | cut -d '\"' -f2 > temp.out"
os.system(os_system_line)

infile = open('temp.out', 'r')
for url in infile:
    url = url.strip()  # drop the trailing newline read from temp.out
    #url = "http://download.thinkbroadband.com/10MB.zip"
    #url = target
    # Remember that Linux doesn't like spaces so much, and that Python strings
    # are immutable, so operations on strings return new strings which can be
    # further operated on. lstrip('/') keeps the file out of the root directory.
    file_name = url.replace('http://www.insidepro.com/Dictionaries', '').replace(' ', '_').lstrip('/')
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')  # binary mode, these are archives
    meta = u.info()
    # getheaders() returns a list; [0] raises IndexError if the header is missing
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,
    f.close()
infile.close()
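For anyone following along on Python 3, here is a rough sketch of the same block-by-block copy that does not fall over when the server sends no Content-Length header, which is one way to end up with an IndexError on the `[0]` lookup. The helper names `parse_content_length` and `copy_in_blocks` are made up for this example, and the in-memory streams only stand in for the HTTP response and the output file. With a real response from `urllib.request.urlopen`, the headers object supports the same `.get("Content-Length")` call used here.

```python
import io


def parse_content_length(headers):
    """Return Content-Length as an int, or None if it is absent or malformed."""
    value = headers.get("Content-Length")
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def copy_in_blocks(source, dest, total=None, block_size=8192):
    """Copy source to dest one block at a time; return the bytes copied."""
    copied = 0
    while True:
        block = source.read(block_size)
        if not block:
            break
        dest.write(block)
        copied += len(block)  # count real bytes, the last block may be short
        if total:
            print("%10d [%6.2f%%]" % (copied, copied * 100.0 / total))
        else:
            print("%10d bytes" % copied)  # size unknown, so no percentage
    return copied


# Hypothetical stand-ins for the response body, its headers and the local file.
payload = io.BytesIO(b"x" * 20000)
local_file = io.BytesIO()
size = parse_content_length({"Content-Length": "20000"})
copy_in_blocks(payload, local_file, total=size)
```

Passing `total=None` simply switches the progress line to a byte count, so a missing header becomes a cosmetic issue rather than a crash.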
Nope..still craps the bed with a bunch of different errors...
I was trying to do this as a project rather than using bash scripting, but I guess trying to reinvent the wheel for fun is an exercise in futility when you don't completely understand the programming language at hand.
So back to the basics...wget it is!
Thank you all for the help! I really appreciate it!
If you're parsing HTML (or XML) documents, you really should look at BeautifulSoup. It makes parsing HTML for web scraping a real doddle. I've bashed together a little Python script that should do what you want: download all the dictionaries from the page you showed in your original code. As you can see, it's very small, as BeautifulSoup does all the hard work. Anyway, here it is:
from urllib2 import urlopen, quote
from BeautifulSoup import BeautifulSoup

page = urlopen("http://www.insidepro.com/eng/download.shtml")
soup = BeautifulSoup(page)
# look at every <a> tag that has an href attribute
for item in soup.findAll('a', href=True):
    this_href = item["href"]
    if u"/dictionaries/" in this_href:
        local_file = this_href.split("/")[-1]
        remote_file = quote(this_href, safe=":/")
        print "downloading: " + remote_file + " to: " + local_file
        rfile = urlopen(remote_file)
        # binary mode so the archive is written unchanged
        with open(local_file, "wb") as lfile:
            lfile.write(rfile.read())
This was put together very quickly and there is no error checking in it so obviously it needs some work if it is to be used in production. It works though.
Last edited by bgeddy; 02-19-2011 at 04:39 PM.
Reason: Tidying up
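As a possible starting point for the error checking mentioned above, here is a rough Python 3 sketch of just the link-picking step. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages; that substitution, the names `LinkCollector` and `dictionary_links`, the case-insensitive match, and the example.com URLs are all assumptions made for illustration, not part of the script above. It resolves relative links against the page URL and skips links that have no file name, and it does not download anything.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import quote, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def dictionary_links(html, base_url, marker="/dictionaries/"):
    """Return (remote_url, local_name) pairs for links containing marker."""
    parser = LinkCollector()
    parser.feed(html)
    pairs = []
    for href in parser.hrefs:
        # Case-insensitive match is an assumption; the script above is exact.
        if marker not in href.lower():
            continue
        # Percent-encode spaces and resolve relative links against the page.
        remote = urljoin(base_url, quote(href, safe=":/"))
        local = basename(urlsplit(href).path).replace(" ", "_")
        if local:  # a link ending in "/" has no file name to save to
            pairs.append((remote, local))
    return pairs


# Made-up page fragment; no network access is needed to try this.
sample = '<a href="/dictionaries/My List.rar">x</a> <a href="/eng/about.shtml">y</a>'
for remote, local in dictionary_links(sample, "http://example.com/eng/download.shtml"):
    print("would download: " + remote + " to: " + local)
```

Each pair could then be handed to whatever download routine you prefer, with a try/except around the actual fetch.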