Old 06-06-2017, 10:09 AM   #1
metallica1973
Senior Member
 
Registered: Feb 2003
Location: Washington D.C
Posts: 2,171

Rep: Reputation: 60
Python Web Page Scraping Urls Creating A Dictionary


I have thrown in the towel and can't figure out how to do this. I have a directory of HTML files containing URLs that I need to scrape (loop through) and add to a dictionary. An example of the output that I want:
Code:
bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com
The code I have so far is:
Code:
import os
from bs4 import BeautifulSoup

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join(subdir, tut)  # join against subdir so files in subdirectories resolve correctly
            content = open(fpath, "r").read()
            soup = BeautifulSoup(content, 'lxml')  # 'file' shadows a built-in, so call it 'soup'
            for links in soup.find_all('a'):
                urls = links.get('href')
                print "HTML Files: {}\nUrls: {}\n".format(tut, urls)
This produces the correct output, for the most part:
Code:
HTML Files: bigbadwolf.html
Urls: https://www.blah.com

HTML Files: bigbadwolf.html
Urls: https://www.blahblah.com

HTML Files: bigbadwolf.html
Urls: https://www.blahblahblah.com

HTML Files: maryhadalittlelamb.html
Urls: http://www.red.com

HTML Files: maryhadalittlelamb.html
Urls: https://www.redyellow.com

HTML Files: maryhadalittlelamb.html
Urls: http://www.zigzag.com
but I want it in a dictionary with this format:
Code:
bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com
As you can see, there will be several URLs inside an HTML document, so a single key can map to many values (URLs). I have tried many variations of the code below but can't get a single key to hold more than one URL.
Code:
tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join(subdir, tut)
            content = open(fpath, "r").read()
            soup = BeautifulSoup(content, 'lxml')
            for links in soup.find_all('a'):
                urls = links.get('href')
                tut_links[tut] = urls  # overwrites the previous value on every iteration
produces:
Code:
bigbadwolf.html: https://www.blah.com
maryhadalittlelamb.html: http://www.red.com
time.html: https://www.est.com
...
...
...
Can someone please shed some light on what I am trying to do?

Last edited by metallica1973; 06-06-2017 at 11:41 AM.
 
Old 06-06-2017, 12:48 PM   #2
norobro
Member
 
Registered: Feb 2006
Distribution: Debian Sid
Posts: 637

Rep: Reputation: 248
In C++ the container to use would be std::multimap. A web search for Python multimap came up with this: https://docs.python.org/2/library/co...ns.defaultdict

I have not used defaultdict but it seems to do what you want.
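Something like this, I think, is how it would look here: defaultdict(list) hands back an empty list for a missing key, so you can append without initializing anything first (an untested sketch, assuming the same directory layout as in your post):
Code:
import os
from collections import defaultdict
from bs4 import BeautifulSoup

tut_links = defaultdict(list)  # a missing key starts out as an empty list

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            content = open(os.path.join(subdir, tut), "r").read()
            soup = BeautifulSoup(content, 'lxml')
            for links in soup.find_all('a', href=True):
                tut_links[tut].append(links.get('href'))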
 
Old 06-06-2017, 01:38 PM   #3
metallica1973
Senior Member
 
Registered: Feb 2003
Location: Washington D.C
Posts: 2,171

Original Poster
Rep: Reputation: 60
Thanks for the help. Another person suggested a different solution that worked:

Code:
tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []  # start an empty list of urls for this file
            fpath = os.path.join(subdir, tut)
            content = open(fpath, "r").read()
            soup = BeautifulSoup(content, 'lxml')
            for links in soup.find_all('a'):
                urls = links.get('href')
                tut_links[tut].append(urls)  # append instead of overwrite
We simply initialize each key with an empty list and then append every URL to that list.
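For what it's worth, dict.setdefault would do the same job without the separate initialization line; a minimal sketch of just the inner loop, reusing the same names as above:
Code:
for links in soup.find_all('a'):
    urls = links.get('href')
    # creates the empty list on first access, then appends to it
    tut_links.setdefault(tut, []).append(urls)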
 
Old 06-07-2017, 02:00 PM   #4
metallica1973
Senior Member
 
Registered: Feb 2003
Location: Washington D.C
Posts: 2,171

Original Poster
Rep: Reputation: 60
I wanted to add that I was getting a lot of duplicate values, caused by the same URLs appearing several times within an HTML document:
Code:
bigbadwolf.html:

'https://www.blah.com',
'https://www.blah.com',
'https://www.blah.com',
'http://www.blahblah.com',
'http://www.blahblah.com',
'http://www.blahblahblah.com',
'http://www.blahblahblah.com'
so I had to give my script an enema to clean things up, as in:
Code:
tut_links = {}

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join(subdir, tut)
            content = open(fpath, "r").read()
            soup = BeautifulSoup(content, 'lxml')
            for links in soup.find_all('a', href=True):
                urls = links.get('href')
                # startswith('http' or 'https') only ever tested 'http';
                # a tuple checks both schemes properly
                if urls.startswith(('http://', 'https://')):
                    tut_links[tut].append(urls)

# removes duplicate urls from each dictionary value list
for dup in tut_links.values():
    dup[:] = list(set(dup))
Worked like a champ:
Code:
'bigbadwolf.html' : ['https://www.blah.com', 'http://www.blahblah.com', 'http://www.blahblahblah.com']
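One caveat: list(set(dup)) does not preserve the order in which the links appeared in the document. If that order matters, a seen-set pass that keeps the first occurrence of each URL would be the usual workaround (a small sketch against the same tut_links dictionary):
Code:
# order-preserving de-duplication: keep only the first occurrence of each url
for dup in tut_links.values():
    seen = set()
    dup[:] = [u for u in dup if u not in seen and not seen.add(u)]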

Last edited by metallica1973; 06-07-2017 at 02:37 PM.