methodtwo 03-23-2014 12:27 PM

Basic web scraping question (mechanize + BeautifulSoup)
Hi there
I have some web scraping code that uses Python mechanize and BeautifulSoup. I need to feed the text (HTML) of a web page retrieved by mechanize to BeautifulSoup. Whenever I copy and paste the HTML from "page source" in Firefox, the code works. But whenever I do:

my_html = open('./my_htmlfile.txt', 'r')
soup = BeautifulSoup(my_html)


myfile = open('./script.html','w')


soup = BeautifulSoup(response.get_data())
Then the code doesn't work, even though when I copy and paste from "page source" in Firefox the code does work. I know you probably don't want to debug my whole thing for me. I was just asking in case there was anything obvious I was missing in terms of what I'm feeding to BeautifulSoup when I do it programmatically?
Thank you for reading and for any replies I might get.

cin_ 03-31-2014 05:27 PM

diff the working and failing
what do you mean 'when i copy and paste'?

like copying and pasting into a file and then saving it as a .htm, or copying and pasting the page source as a string into your program?

what does my_htmlfile.txt look like? the same as your copied-and-pasted page source?
if you create my_htmlfile.txt and diff it against the file saved from page source, is there any output?


# diff my_htmlfile.txt FROM_SOURCE.htm
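
if you'd rather do the same comparison from inside python, here is a rough sketch using the standard library's difflib. the function name diff_html and the sample strings are made up for illustration; in practice you would read the two files (or pass in the text you got from mechanize) instead.

```python
import difflib


def diff_html(fetched, pasted):
    """Return unified-diff lines between two HTML strings (empty list if identical)."""
    return list(difflib.unified_diff(
        fetched.splitlines(),
        pasted.splitlines(),
        fromfile="my_htmlfile.txt",   # label only; nothing is read from disk
        tofile="FROM_SOURCE.htm",     # label only; nothing is read from disk
        lineterm="",
    ))


if __name__ == "__main__":
    # Hypothetical sample strings standing in for the two versions of the page.
    fetched = "<html>\n<body>\n<p>hello</p>\n</body>\n</html>"
    pasted = "<html>\n<body>\n<p>hello world</p>\n</body>\n</html>"
    for line in diff_html(fetched, pasted):
        print(line)
```

an empty result means the two inputs are line-for-line identical, so the problem would be somewhere other than the html itself.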

you may need to manipulate the input before running it through BeautifulSoup()
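
what "manipulate" means depends on what the diff shows, and nothing in this thread says yet. as a guess at a few common culprits (raw bytes instead of text, a leading byte order mark, windows line endings), a minimal sketch could look like the following. normalize_html is a hypothetical helper and the utf-8 default is an assumption; use whatever encoding the page actually declares.

```python
def normalize_html(data, encoding="utf-8"):
    """Decode bytes to text, drop a leading BOM, and unify line endings."""
    if isinstance(data, bytes):
        # Assumption: the page is UTF-8; undecodable bytes are replaced, not fatal.
        data = data.decode(encoding, errors="replace")
    data = data.lstrip("\ufeff")  # remove a byte order mark if one is present
    return data.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    # Hypothetical raw bytes, similar in shape to what a fetched page might contain.
    raw = b"\xef\xbb\xbf<p>hi</p>\r\n"
    print(repr(normalize_html(raw)))
```

the cleaned-up string is what you would then hand to BeautifulSoup() in place of the raw response data.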

