LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Old 09-11-2011, 02:31 AM   #1
jstilby
LQ Newbie
 
Registered: Sep 2011
Posts: 7

Rep: Reputation: Disabled
Very weird wget/curl output - what should I do?


Hi,
I'm trying to write a script to download Red Hat's errata digest.
It comes in .txt.gz format, and I can get it easily with Firefox from: http://www.redhat.com/archives/enterprise-watch-list/

HOWEVER: the output is VERY strange when downloading it in a script. I seem to get a file of the same size, but it's partly text and partly binary! It contains the first message in the digest, followed by garbled data that I can only assume is the rest of the .gz file.
Here is the basic request:

wget http://www.redhat.com/archives/enter...11-July.txt.gz

I think this is an attempt by Red Hat to block people who try to retrieve the errata by script... so I tried messing with the user-agent ID string. No luck; the output is the same. Here is an example of what I tried:

wget -U "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3" http://www.redhat.com/archives/enter...11-July.txt.gz

curl also gives incorrect output: only the text of the first message. It probably tosses out the garbled binary data.

curl --silent http://www.redhat.com/archives/enter...11-July.txt.gz

curl -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" http://www.redhat.com/archives/enter...11-July.txt.gz


This is really annoying. Again, Firefox gets it fine as a .gz file. What should I do?

Thanks in advance....
 
Old 09-14-2011, 01:17 PM   #2
rigor
Member
 
Registered: Sep 2003
Location: 19th moon, Planet Covid, Another Galaxy; Not Yours
Posts: 705

Rep: Reputation: Disabled
Hi jstilby,

When I tried a wget usage similar to yours, with just the User-Agent header specified as you had it, I got a similar result.
After several thousand bytes, the connection was closed by the other end; wget retried and got the remaining data.
Although both responses were labeled "application/x-gzip", the first part was actually text, while the second was binary.

They may or may not be trying to prevent retrieval by scripts. But the intentions of the folks who build a web
site often aren't that specific. They may just want the site to be used in a certain way, and will do things
such as check that the page which referred to a page on their site was itself another page on their site.

So the Referer header can sometimes be needed to get correct/expected results if trying to grab data from a site
by some means other than through a web browser.
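For illustration, here is roughly what supplying a Referer looks like with both tools. This is just a sketch: the full archive URL is reconstructed from the index page linked in the first post, the exact filename is an assumption, and the commands are wrapped in a function so nothing is fetched until you actually call it.

```shell
# Full archive URL reconstructed from the index page in the first post;
# the 2011-July.txt.gz filename is assumed for illustration.
REF='https://www.redhat.com/archives/enterprise-watch-list/'
URL="${REF}2011-July.txt.gz"

# Wrapped in a function so nothing is downloaded until you invoke it:
fetch_with_referer() {
    # wget accepts arbitrary headers via --header
    wget --header="Referer: ${REF}" "$URL"

    # curl has a dedicated -e/--referer option for the same header
    curl -e "$REF" -O "$URL"
}
```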

In this case, I added that header, and still got the same result.

Finally, I eavesdropped on the connection between the browser and the site. I then added to the wget command line, ALL
the headers the browser sent. That worked.

If you have a new enough version of wget, the --header option can be used repeatedly, and each usage adds
a different header to what wget sends.

That effectively resulted in this very long single command line:

Quote:
wget --header='Host: www.redhat.com' --header='User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0' --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' --header='Accept-Language: en-US,en;q=0.7,en;q=0.3' --header='Accept-Encoding: gzip, deflate' --header='Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7' --header='Connection: keep-alive' --header='Referer: https://www.redhat.com/archives/enterprise-watch-list/' --header='Pragma: no-cache' --header='Cache-Control: no-cache' http://www.redhat.com/archives/enter...ptember.txt.gz -O rh_ewl_2011_Sept.txt.gz
I tried getting the most recent several months of data in that fashion, one file at a time, and each attempt was successful.
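To grab several months in one go, that same command can be generated in a small loop. The sketch below only prints one wget command per month (pipe the output to sh to actually run them); the 2011-<Month>.txt.gz naming pattern is an assumption based on the archive listing, and you would add the remaining --header options from the long command above as needed.

```shell
# Emit one wget command per month; pipe the output to sh to execute.
# The 2011-<Month>.txt.gz naming is assumed from the archive listing;
# only a few of the headers from the full command above are shown here.
BASE='http://www.redhat.com/archives/enterprise-watch-list'
UA='Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0'

for month in July August September; do
    f="2011-${month}.txt.gz"
    printf "wget --header='Host: www.redhat.com' --header='User-Agent: %s' --header='Referer: %s/' '%s/%s' -O '%s'\n" \
        "$UA" "$BASE" "$BASE" "$f" "$f"
done
```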
 
  

