wget question

kpachopoulos · 07-23-2005, 09:18 AM

I'm trying to mirror an HTTP repository, but the HTML files get downloaded as well, which I don't want.
I tried "wget -lalala --reject *.html url", "wget -lalala --reject html url" and "wget -lalala --delete-after *.html url", but none of them works.
Any ideas?

rjlee · 07-23-2005, 10:00 AM

One minor point: an unquoted *.html is expanded by the shell to the names of all the .html files in the current directory (if there are any), so the string '*.html' never reaches wget. Put it in quotes if that's what you meant to pass.
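To see the quoting difference for yourself, here is a small sketch that needs no network access. The scratch directory and the file names a.html and b.html are made up for the demonstration, and printf stands in for wget so that you can see exactly which arguments the command would receive.

```shell
# Work in a throwaway directory so the demonstration touches nothing else.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Two hypothetical files for the pattern to match.
touch a.html b.html

# Unquoted: the shell replaces the pattern with the matching file names.
unquoted=$(printf '%s ' *.html)

# Quoted: the pattern reaches the command as the literal string.
quoted=$(printf '%s ' '*.html')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

The first line of output lists the two file names, the second shows the pattern itself, which is what a command needs to receive if it is meant to do its own matching.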

wget finds the links to follow inside the HTML files, so you can't exclude those files with a reject option: --reject stops files from being downloaded in the first place, which would leave wget nothing to recurse through. From the info page on recursive accept/reject:

Quote:

Note that these two options do not affect the downloading of HTML files; Wget must load all the HTMLs to know where to go at all--recursive retrieval would make no sense otherwise.

I suggest that you look at deleting the HTML files after wget has run. This should do it (although it's not tested):

Code:

find . -iname "*.html" -exec rm '{}' ';'
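If you'd rather check what is going to be removed before anything is deleted, something along these lines may be safer. It is only a sketch: the directory and the three file names are invented stand-ins for whatever wget actually downloaded, and it assumes a find that supports -iname (GNU and BSD find both do).

```shell
# Hypothetical stand-in for a directory that wget has just filled.
mirror_dir=$(mktemp -d)
mkdir -p "$mirror_dir/sub"
touch "$mirror_dir/index.html" "$mirror_dir/sub/Page.HTML" "$mirror_dir/sub/data.tar.gz"

# Dry run: only list the regular files that match, case-insensitively.
find "$mirror_dir" -type f -iname '*.html' -print

# Real run: delete the same matches. The trailing '+' hands many file
# names to each rm call instead of starting one rm per file.
find "$mirror_dir" -type f -iname '*.html' -exec rm -- '{}' +

# Count what is left; tr strips the padding some wc versions add.
remaining=$(find "$mirror_dir" -type f | wc -l | tr -d ' ')
echo "files left: $remaining"
```

The -type f test keeps find from handing a directory whose name happens to end in .html to rm, and the pattern is quoted so that it is find, not the shell, that does the matching.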