Old 03-14-2012, 05:22 PM   #1
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Rep: Reputation: 17
wget --reject


I am using wget for a recursive download with --accept and --reject rules.


Code:
wget -o "$LOG" --user-agent="$AGENT" --load-cookies "$COOKIES" \
     --wait "$FREQUENCY" --random-wait --recursive --level 5 --timestamping \
     --no-parent --no-directories --adjust-extension \
     --restrict-file-names=unix --convert-links --domains="$DOMAIN" \
     --accept "$ACCEPT" --reject "$REJECT" "$URL"
I notice from the log file that links are downloaded before they are rejected:

Code:
HTTP request sent, awaiting response... 200 OK
Saving to: `msg.cfm?catid=26&threadid=12134&STARTPAGE=1.html'

     0K .......... 

Last-modified header missing -- time-stamps turned off.
2012-03-14 13:45:50 (1.21 MB/s) - `msg.cfm?catid=26&threadid=12134&STARTPAGE=1.html' saved [29849]

Removing msg.cfm?catid=26&threadid=12134&STARTPAGE=1.html since it should be rejected.
I expected links that match the reject pattern not to be downloaded in the first place. I want to avoid these needless downloads.

Is there any way I can achieve this?
 
Old 03-15-2012, 01:56 AM   #2
ruario
Senior Member
 
Registered: Jan 2011
Location: Oslo, Norway
Distribution: Slackware
Posts: 2,557

Rep: Reputation: 1761
I suspect the issue is that, since you are downloading recursively, Wget always downloads HTML pages first so it can scan them for further links, no matter what (the manual has a section mentioning this).

Quote:
Originally Posted by wget manual
Note that these two options do not affect the downloading of HTML files (as determined by a ‘.htm’ or ‘.html’ filename suffix). This behavior may not be desirable for all users, and may be changed for future versions of Wget.

Note, too, that query strings (strings at the end of a URL beginning with a question mark, ‘?’) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file. It is expected that a future version of Wget will provide an option to allow matching against query strings.

Finally, it's worth noting that the accept/reject lists are matched twice against downloaded files: once against the URL's filename portion, to determine if the file should be downloaded in the first place; then, after it has been accepted and successfully downloaded, the local file's name is also checked against the accept/reject lists to see if it should be removed. The rationale was that, since ‘.htm’ and ‘.html’ files are always downloaded regardless of accept/reject rules, they should be removed after being downloaded and scanned for links, if they did match the accept/reject lists. However, this can lead to unexpected results, since the local filenames can differ from the original URL filenames in the following ways, all of which can change whether an accept/reject rule matches:
  • If the local file already exists and ‘--no-directories’ was specified, a numeric suffix will be appended to the original name.
  • If ‘--adjust-extension’ was specified, the local filename might have ‘.html’ appended to it.
  • If Wget is invoked with ‘-E -A.php’, a filename such as ‘index.php’ will be accepted, but upon download will be named ‘index.php.html’, which no longer matches, and so the file will be deleted.
  • Query strings do not contribute to URL matching, but are included in local filenames, and so do contribute to filename matching.

This behavior, too, is considered less-than-desirable, and may change in a future version of Wget.
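
A possible workaround, if your wget is new enough: release 1.14 and later add --accept-regex/--reject-regex, which are matched against the complete URL (query string included) before any request is made, so rejected pages are never fetched at all. A minimal sketch; the pattern is only an illustration based on your log excerpt, and note that pages rejected this way are not scanned for further links either:

Code:
# --reject-regex filters on the full URL *before* download,
# unlike -R/--reject, which filters on the filename afterwards.
# The pattern below is illustrative only.
wget --recursive --level 5 --no-parent \
     --reject-regex 'msg\.cfm\?.*STARTPAGE=' \
     "$URL"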
 
1 member found this post helpful.
Old 03-15-2012, 01:59 AM   #3
ruario
Senior Member
 
Registered: Jan 2011
Location: Oslo, Norway
Distribution: Slackware
Posts: 2,557

Rep: Reputation: 1761
While searching I also found this similar thread elsewhere. One of the replies suggests HTTrack as an alternative.
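
For what it's worth, HTTrack's +/- URL filters are applied before a page is fetched, so excluded URLs are never downloaded at all. A minimal sketch (untested; the output directory and pattern are illustrative):

Code:
# -r5 limits the mirror depth; the quoted '-' filter excludes
# matching URLs before they are requested. Pattern is an example.
httrack "$URL" -O ./mirror -r5 "-*msg.cfm*STARTPAGE=*"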
 
Old 03-15-2012, 01:57 PM   #4
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Many thanks for confirming my suspicions with regard to --reject.

I have encountered another issue with --wait: in order to avoid redundant downloads of rejected pages, I generated a list of URLs and ran wget with --input-file and without recursion. I found that the --wait time was ignored.

Is the --wait parameter only used for recursive downloads?



I looked at HTTrack half a year ago. Maybe I need to revisit it.
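
In case it helps anyone else, here is a sketch of the two-pass idea. The spider pass is only one way to build the URL list, and the reject pattern is illustrative:

Code:
# Pass 1: spider recursively; pages are fetched to extract links
# but not kept, and every visited URL appears in the log.
wget --spider --recursive --level 5 --no-parent -o spider.log "$URL"

# Filter the visited URLs down to the ones worth keeping.
grep -o 'http://[^ ]*' spider.log | grep -v 'STARTPAGE=' | sort -u > urls.txt

# Pass 2: non-recursive batch download of the filtered list.
wget --wait "$FREQUENCY" --random-wait --input-file urls.txt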
 
Old 03-16-2012, 02:37 AM   #5
ruario
Senior Member
 
Registered: Jan 2011
Location: Oslo, Norway
Distribution: Slackware
Posts: 2,557

Rep: Reputation: 1761
--wait seems to work correctly for me in conjunction with --input-file (using wget 1.12 on 32-bit Slackware).
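
A quick way to check it yourself, if anyone wants to reproduce this (file names illustrative):

Code:
# urls.txt holds a handful of URLs, one per line. With --wait=5
# the request timestamps wget logs should be ~5 seconds apart.
wget --wait=5 --input-file=urls.txt -o wait-test.log
grep '^--' wait-test.log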
 
Old 03-17-2012, 11:01 AM   #6
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
You are right, Ruarí: I checked the log file and found that the --wait time was applied.

What had confused me was that all files had the same timestamp. The reason is the point at which --convert-links is processed: with the recursive download the conversion happened after every download, whereas with my non-recursive batch download all files are downloaded first and the conversions are applied afterwards, so every file ends up with the same modification time.
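
The ordering is easy to confirm by comparing the log with the files on disk (the log name here is illustrative):

Code:
# The 'saved [...]' lines carry the per-file download times, which
# do honour --wait; the on-disk mtimes are all set later, when
# --convert-links rewrites every file at the end of the batch.
grep 'saved \[' batch.log
ls -l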
 
  

