LinuxQuestions.org


methodtwo 03-23-2014 12:27 PM

Basic web scraping question(mechanize+BeautifulSoup)
 
Hi there
I have some web scraping code that uses Python mechanize and BeautifulSoup. I need to feed the text (HTML) of a web page retrieved by mechanize to BeautifulSoup. Whenever I copy and paste the HTML from "page source" in Firefox, the code works. But whenever I do:
Code:

file("my_htmlfile.txt","w").write(self.br.open(site_url+'page.aspx').read())
my_html = open('./my_htmlfile.txt', 'r')
soup = BeautifulSoup(my_html)

Or:
Code:

myfile = open('./script.html','w')
myfile.write(response.read())

Or:
Code:

soup = BeautifulSoup(response.get_data())

Then the code doesn't work, even though it does work when I copy and paste from "page source" in Firefox. I know you probably don't want to debug my whole thing for me. I was just asking in case there was anything obvious I was missing in terms of what I'm feeding to BeautifulSoup when I do it programmatically.
Thank you for reading and for any replies I might get.

cin_ 03-31-2014 05:27 PM

diff the working and failing
 
What do you mean by 'when i copy and paste'?

Do you mean copying and pasting into a file and then saving it as an .htm, or pasting the page source as a string into your program?

What does my_htmlfile.txt look like? Is it the same as your copied-and-pasted page source?
If you create my_htmlfile.txt and diff it against the file saved from page source, is there any output?

Code:

# diff my_htmlfile.txt FROM_SOURCE.htm
#
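
If you would rather do the comparison from Python, here is a minimal sketch using the standard library's difflib. It assumes Python 3 and that both versions are available as text. The function name diff_html is made up for illustration, and the file names are just the ones used in this thread.

```python
import difflib


def diff_html(fetched_text, pasted_text):
    """Return unified-diff lines between the fetched and the pasted HTML.

    Hypothetical helper: an empty list means the two inputs are identical
    line by line.
    """
    return list(difflib.unified_diff(
        fetched_text.splitlines(),
        pasted_text.splitlines(),
        fromfile="my_htmlfile.txt",   # what mechanize wrote
        tofile="FROM_SOURCE.htm",     # what was pasted from "page source"
        lineterm="",
    ))


if __name__ == "__main__":
    fetched = "<html>\n<p>one</p>\n</html>"
    pasted = "<html>\n<p>two</p>\n</html>"
    # Lines starting with "-" exist only in the fetched text,
    # lines starting with "+" only in the pasted text.
    for line in diff_html(fetched, pasted):
        print(line)
```

If this prints nothing for your two files, the input is the same and the problem is somewhere else.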

If the two differ, you may need to manipulate the fetched input before running it through BeautifulSoup().
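
What kind of manipulation is needed depends on what the diff shows, so the following is only a sketch of the idea, not a known fix for this page. It assumes Python 3, that the fetched data arrives as bytes, and that the differences are things like a byte-order mark or Windows line endings. The name normalize_html and the default encoding are assumptions for illustration.

```python
import codecs


def normalize_html(raw, encoding="utf-8"):
    """Turn fetched data into plain text before handing it to a parser.

    Hypothetical helper: the encoding default is an assumption, check the
    page's real encoding before relying on it.
    """
    if isinstance(raw, bytes):
        # Drop a UTF-8 byte-order mark if the server sent one.
        if raw.startswith(codecs.BOM_UTF8):
            raw = raw[len(codecs.BOM_UTF8):]
        # Undecodable bytes become U+FFFD instead of raising an error.
        text = raw.decode(encoding, errors="replace")
    else:
        text = raw
    # Use "\n" line endings throughout so the text diffs cleanly.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.strip()


if __name__ == "__main__":
    sample = b"\xef\xbb\xbf<html>\r\n<body>hi</body>\r\n</html>\r\n"
    print(repr(normalize_html(sample)))
```

The result could then be passed on, for example as BeautifulSoup(normalize_html(response.get_data())), and diffed again against the pasted page source to see whether anything still differs.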


All times are GMT -5.