Hi jstilby,
When I tried a wget usage similar to yours, with just the User-Agent header specified as you had, I got a similar result.
After several thousand bytes, the other end closed the connection; wget retried and got the remaining data.
Although both streams of data were labeled "application/x-gzip", the first was actually text, while the second was binary.
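One way to spot this kind of mismatch after a download is to let gzip test the file itself rather than trusting the content type. This is only a minimal sketch: the function name is_real_gzip and the temporary file names are made up for illustration, and the demo builds its own sample files instead of contacting any server.

```shell
# Succeeds only if the file passes gzip's own integrity test.
# (is_real_gzip is a hypothetical helper name, not part of wget or gzip.)
is_real_gzip() {
  gzip -t "$1" 2>/dev/null
}

# Demo with two local sample files: one plain text, one really gzipped.
tmp=$(mktemp -d)
printf 'plain text\n' > "$tmp/fake.txt.gz"
printf 'plain text\n' | gzip -c > "$tmp/real.txt.gz"

for f in "$tmp/fake.txt.gz" "$tmp/real.txt.gz"; do
  if is_real_gzip "$f"; then
    echo "$f: gzip data"
  else
    echo "$f: not gzip data"
  fi
done

rm -rf "$tmp"
```

A file that starts as text and only later turns into binary data, like the one described above, should fail this test as well.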
They may or may not be trying to prevent retrieval by scripts, but the intentions of the folks who build a web site often aren't that specific. They may just want the site to be used in a certain way, and will do things such as check that the page which referred to a page on their site was itself another page on their site.
So the Referer header can sometimes be needed to get correct/expected results when trying to grab data from a site by some means other than a web browser.
In this case, I added that header, and still got the same result.
Finally, I eavesdropped on the connection between the browser and the site. I then added ALL the headers the browser sent to the wget command line. That worked.
If you have a new enough version of wget, the --header option can be used repeatedly, with each use adding a different header to what wget sends.
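To make the repeated --header usage easier to read, it can be wrapped in a small shell function with one header per line. This is only a sketch: the function name fetch_with_headers, the WGET override variable, and the example.invalid URL are assumptions made for illustration, and only three of the browser's headers are shown. Setting WGET=echo gives a dry run that prints the command instead of contacting the server.

```shell
# Fetch a URL while sending several extra headers, one --header per header.
# WGET may be overridden (e.g. WGET=echo) to print the command instead of running it.
# (fetch_with_headers and WGET are hypothetical names used only in this sketch.)
fetch_with_headers() {
  fetch_url=$1
  fetch_out=$2
  "${WGET:-wget}" \
    --header='User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0' \
    --header='Accept-Encoding: gzip, deflate' \
    --header='Referer: https://www.redhat.com/archives/enterprise-watch-list/' \
    "$fetch_url" -O "$fetch_out"
}

# Dry run with a placeholder URL: shows the arguments wget would receive.
WGET=echo fetch_with_headers 'http://example.invalid/archive.txt.gz' archive.txt.gz
```

Dropping the WGET=echo prefix and substituting the real archive URL would perform the actual download.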
That effectively resulted in this very long single command line:
wget --header='Host: www.redhat.com' --header='User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0' --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' --header='Accept-Language: en-US,en;q=0.7,en;q=0.3' --header='Accept-Encoding: gzip, deflate' --header='Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7' --header='Connection: keep-alive' --header='Referer: https://www.redhat.com/archives/enterprise-watch-list/' --header='Pragma: no-cache' --header='Cache-Control: no-cache' http://www.redhat.com/archives/enter...ptember.txt.gz -O rh_ewl_2011_Sept.txt.gz
I tried getting the most recent several months of data in that fashion, one file at a time, and each attempt was successful.