LinuxQuestions.org
Old 03-23-2014, 11:27 AM   #1
methodtwo
Member
 
Registered: May 2007
Posts: 146

Rep: Reputation: 18
Basic web scraping question (mechanize + BeautifulSoup)


Hi there
I have some web scraping code that uses Python mechanize and BeautifulSoup. I need to feed the text (HTML) of a web page retrieved by mechanize to BeautifulSoup. Whenever I copy and paste the HTML from "page source" in Firefox, the code works. But whenever I do:
Code:
file("my_htmlfile.txt","w").write(self.br.open(site_url+'page.aspx').read())
my_html = open('./my_htmlfile.txt', 'r')
soup = BeautifulSoup(my_html)
Or:
Code:
myfile = open('./script.html','w')
myfile.write(response.read())
Or:
Code:
soup = BeautifulSoup(response.get_data())
Then the code doesn't work, even though it does work when I copy and paste from "page source" in Firefox. I know you probably don't want to debug the whole thing for me. I was just asking in case there is anything obvious I'm missing about what I'm feeding to BeautifulSoup when I do it programmatically.
Thank you for reading and for any replies I might get.
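
One thing that can be checked in isolation is the save-then-reload step. In the second snippet above the file is opened for writing but never closed, so the data may not have been flushed to disk by the time something reads it back. Below is a minimal sketch of that step using only the standard library (written for Python 3, whereas the snippets above look like Python 2). `fake_response` is a hypothetical stand-in for the mechanize response, assumed only to have a `read()` method, and the temporary path replaces the original file names.

```python
import io
import os
import tempfile


def save_and_reload(response, path):
    """Write response.read() to path, then read the same bytes back."""
    data = response.read()
    # The with-block closes (and therefore flushes) the file on exit.
    with open(path, "wb") as out:
        out.write(data)
    with open(path, "rb") as src:
        return src.read()


# Hypothetical stand-in for the mechanize response object.
fake_response = io.BytesIO(b"<html><body><p>hello</p></body></html>")

tmp_dir = tempfile.mkdtemp()
html_path = os.path.join(tmp_dir, "page.html")
reloaded = save_and_reload(fake_response, html_path)

print(len(reloaded), "bytes reloaded")
# The BytesIO stand-in has already been consumed, so this prints b''.
print(fake_response.read())
```

The bytes returned by `save_and_reload` could then be passed to `BeautifulSoup(...)` in place of the open file handle. Whether this is the actual cause of the failure here is not something the thread confirms.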
 
Old 03-31-2014, 04:27 PM   #2
cin_
Member
 
Registered: Dec 2010
Posts: 281

Rep: Reputation: 24
diff the working and failing

What do you mean by 'when I copy and paste'?

Do you mean copying and pasting into a file and saving it as an .htm, or pasting the page source as a string into your program?

What does my_htmlfile.txt look like? Is it the same as your copied and pasted page source?
If you create my_htmlfile.txt and diff it against the file saved from page source, is there any output?

Code:
# diff my_htmlfile.txt FROM_SOURCE.htm
#
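
The same comparison can be sketched in Python with the standard library's `difflib`, which may be convenient if the check is to run from inside the scraping script. The two strings below are made-up stand-ins for the contents of my_htmlfile.txt and FROM_SOURCE.htm, not real output from the site, and `html_diff` is a hypothetical helper name.

```python
import difflib


def html_diff(fetched, from_browser):
    """Return the unified-diff lines between two HTML strings."""
    return list(
        difflib.unified_diff(
            fetched.splitlines(),
            from_browser.splitlines(),
            fromfile="my_htmlfile.txt",
            tofile="FROM_SOURCE.htm",
            lineterm="",
        )
    )


# Made-up sample contents; in practice, read both files from disk.
fetched = "<html>\n<body>\n<p>one</p>\n</body>\n</html>"
from_browser = "<html>\n<body>\n<p>two</p>\n</body>\n</html>"

changes = html_diff(fetched, from_browser)
for line in changes:
    print(line)
```

An empty list means the two inputs are identical line for line, which corresponds to `diff` printing nothing.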
You may need to manipulate the input before running it through BeautifulSoup().

Last edited by cin_; 03-31-2014 at 07:37 PM. Reason: gramm`err
 
  


All times are GMT -5.
