LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Old 09-11-2011, 02:31 AM   #1
jstilby
LQ Newbie
 
Registered: Sep 2011
Posts: 7

Rep: Reputation: Disabled
Very weird wget/curl output - what should I do?


Hi,
I'm trying to write a script to download Red Hat's errata digest.
It comes in .txt.gz format, and I can get it easily with Firefox from: http://www.redhat.com/archives/enterprise-watch-list/

HOWEVER: the output is VERY strange when downloading it in a script. I seem to get a file of the same size, but it's partly text and partly binary! It contains the first message in the digest, followed by garbled data that I can only assume is the rest of the .gz file.
Here is the basic request:

wget http://www.redhat.com/archives/enter...11-July.txt.gz

I think this is an attempt by Red Hat to block people who try to retrieve the errata by script... so I tried messing with the user-agent ID string. No luck; the output is the same. Here is an example of what I tried:

wget -U "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3" http://www.redhat.com/archives/enter...11-July.txt.gz

curl also gives incorrect output: only the text of the first message. It probably tosses out the garbled binary data.

curl --silent http://www.redhat.com/archives/enter...11-July.txt.gz

curl -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" http://www.redhat.com/archives/enter...11-July.txt.gz


This is really annoying. Again, Firefox gets it fine as a .gz file. What should I do?

Thanks in advance....
 
Old 09-14-2011, 01:17 PM   #2
rigor
Member
 
Registered: Sep 2003
Location: 19th moon, Planet Covid, Another Galaxy; Not Yours
Posts: 705

Rep: Reputation: Disabled
Hi jstilby,

When I tried a wget usage similar to yours, with just the User-Agent header specified as you had it, I got a similar result.
After several thousand bytes, the connection was closed by the other end; wget retried and got the remaining data.
Although both responses were labeled "application/x-gzip", the first part was actually text, while the second was binary.

They may or may not be trying to prevent retrieval by scripts. But the intentions of the folks who build a web
site often aren't that specific. They may just want the site to be used in a certain way, and will do things
such as check that the page which referred to a page on their site was itself another page on their site.

So the Referer header can sometimes be needed to get correct/expected results if trying to grab data from a site
by some means other than through a web browser.
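For illustration, here is roughly what supplying a Referer looks like with both tools. This is just a sketch: the full archive URL is reconstructed from the index page linked in the first post, the exact filename is an assumption, and the commands are wrapped in a function so nothing is fetched until you actually call it.

```shell
# Full archive URL reconstructed from the index page in the first post;
# the 2011-July.txt.gz filename is assumed for illustration.
REF='https://www.redhat.com/archives/enterprise-watch-list/'
URL="${REF}2011-July.txt.gz"

# Wrapped in a function so nothing is downloaded until you invoke it:
fetch_with_referer() {
    # wget accepts arbitrary headers via --header
    wget --header="Referer: ${REF}" "$URL"

    # curl has a dedicated -e/--referer option for the same header
    curl -e "$REF" -O "$URL"
}
```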

In this case, I added that header, and still got the same result.

Finally, I eavesdropped on the connection between the browser and the site. I then added to the wget command line, ALL
the headers the browser sent. That worked.

If you have a new enough version of wget, the --header option can be used repeatedly, and each usage adds
a different header to what wget sends.

That effectively resulted in this very long single command line:

Quote:
wget --header='Host: www.redhat.com' --header='User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0' --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' --header='Accept-Language: en-US,en;q=0.7,en;q=0.3' --header='Accept-Encoding: gzip, deflate' --header='Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7' --header='Connection: keep-alive' --header='Referer: https://www.redhat.com/archives/enterprise-watch-list/' --header='Pragma: no-cache' --header='Cache-Control: no-cache' http://www.redhat.com/archives/enter...ptember.txt.gz -O rh_ewl_2011_Sept.txt.gz
I tried getting the most recent several months of data in that fashion, one file at a time, and each attempt was successful.
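To grab several months in one go, that same command can be generated in a small loop. The sketch below only prints one wget command per month (pipe the output to sh to actually run them); the 2011-<Month>.txt.gz naming pattern is an assumption based on the archive listing, and you would add the remaining --header options from the long command above as needed.

```shell
# Emit one wget command per month; pipe the output to sh to execute.
# The 2011-<Month>.txt.gz naming is assumed from the archive listing;
# only a few of the headers from the full command above are shown here.
BASE='http://www.redhat.com/archives/enterprise-watch-list'
UA='Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0'

for month in July August September; do
    f="2011-${month}.txt.gz"
    printf "wget --header='Host: www.redhat.com' --header='User-Agent: %s' --header='Referer: %s/' '%s/%s' -O '%s'\n" \
        "$UA" "$BASE" "$BASE" "$f" "$f"
done
```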
 
  

