How do I tell wget not to follow links in a file?
I am using wget to mirror a site and I want to tell it not to follow links in a particular file.
To give an idea of the site architecture, suppose the site http://example.com has a page http://example.com/links1.html and another page http://example.com/links2.html. These pages contain links to many different directories on example.com, and the two links pages also link to each other. I want to recursively download everything linked from /links1.html, but nothing from /links2.html.
--exclude-directories won't work because links1 and links2 are in the same directory. --reject won't work because wget deletes the rejected page only after it has already enqueued all the links on it. (!)
This seems like it should be trivially easy. How hard can it be to tell a downloader not to download a file? But I'm darned if I can find a solution. Can anybody suggest anything?
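Roughly, the commands I have been trying look like this (the excluded path is just a placeholder):
Code:
# links1 and links2 sit in the same directory, so excluding that directory would exclude links1 too:
wget -m --exclude-directories=/some/dir http://example.com/links1.html
# links2.html is still fetched and its links are queued before the rejected file is deleted:
wget -m --reject=links2.html http://example.com/links1.html
|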
If links1.html links to links2.html and also to links3.html, then I see no way of requesting, in one wget command, to download links1.html and links2.html but not links3.html.
Please clarify. |
I looked at the wget man page; here is the extract of what I think is your solution:
Quote:
|
links1 contains links to a bunch of directories and to links2. I am already using -m (which implies -r -l inf) to descend (infinitely) down the directories linked to in links1, but I don't want to follow any of the links contained in links2. So I want to say "Download links1 and follow all of its links, but do not download or follow links from links2".
Hope that makes it clearer. This seems like something that should be well within wget's core functionality, so I'm still baffled as to why it isn't easier.
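In wget terms, here is what I have now and what I am missing (as I understand the man page, -m just turns on the options shown in the comment):
Code:
# current command; -m is shorthand for -r -N -l inf --no-remove-listing:
wget -m http://example.com/links1.html
# what I want is the same recursion, with links2.html neither downloaded nor followed --
# as far as I can tell, no existing option expresses that.
|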
I just checked with the following two commands:
Code:
wget -r http://babelfish.yahoo.com/translate_txt
wget -rl 1 http://babelfish.yahoo.com/translate_txt
The first one gets me 19 files, and the other gets me 7. So please try those two commands and tell us how it goes. |
Quote:
To use that page as an example, what I'm trying to do is to say "download pages up to 5 levels of recursion deep, following all the links from http://babelfish.yahoo.com/translate_txt, except for the privacy page or anything linked from it". The links there look a little funny, so I'm not sure how well that example will actually work on that page, but that's the idea.
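In other words, something like this, if only the "except" part existed (the depth option is real; the exclusion is the missing piece):
Code:
# five levels of recursion from the starting page:
wget -r -l 5 http://babelfish.yahoo.com/translate_txt
# ...with no built-in way to add "but don't download or follow the privacy page".
|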
Okay, to make this clearer, I put up a little demo at http://fangjaw.com/wgettest/links1.html. An outline of the file structure looks like this:
Code:
wgettest/
I want a single wget command that downloads only the files links1.html, A/index.html, and A/a.html. That is, I want to recursively download everything from links1, except I don't want to download or follow anything from links2. In reality, links1 and links2 link to many more directories, so I can't include or exclude them all by name. Should be simple, right? Can anybody do it? Feel free to try it on the demo -- since it's only 40K, I hope my server can handle it!...
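For anyone who wants to give it a shot, the plain recursive fetch looks like this (it grabs too much, which is the problem):
Code:
# follows every link it finds, so it also pulls links2.html and everything links2 points to:
wget -r -np http://fangjaw.com/wgettest/links1.html
# what I want to end up with is just:
#   wgettest/links1.html
#   wgettest/A/index.html
#   wgettest/A/a.html
|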
Actually, it's simpler to show the link hierarchy rather than the directory hierarchy. So here's what the link structure of the site looks like:
Code:
links1.html |
If links1 has a bunch of links, and only one of them points towards links2, can't you just exclude directories with the -X switch?
Quote:
Are you trying to download from the root directory on the site (i.e. example.com/*)? That, I imagine, would make it harder to specifically exclude links2, especially if both the root directory AND a subdirectory have objects that point to the links2 page. Although -X might still work there, too. As for -X's syntax, it's not clear to me whether it needs to be relative ("links2.html") or absolute ("http://example.com/links2.html"). Just try them both ways and see which one works. |
Just tried these two:
Code:
wget -r -np -X links2.html http://fangjaw.com/wgettest/links1.html
This is the crux of the problem, I think: as I understand it, -X does the correct thing, but will only do it for directories. (That is, if links2 were a directory rather than an html file, something like "wget -X links2/" would work as desired.) -R is a similar option for files, but wget -R links2.html will still follow all of links2's links. The fact that wget is prepared to do exactly what I want for directories makes me think that there must be some way to do it for files. This also seems too fundamental to be a bug in such a stable program -- surely it's more likely that I'm overlooking something? But who knows, maybe I should report it as a bug and see what happens?...
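Spelled out, here is the contrast as I understand it (the directory path in the first command is hypothetical, since links2 is really a file on the demo site):
Code:
# if links2 were a directory, -X would skip it entirely -- nothing in it downloaded or followed:
wget -r -np -X /wgettest/links2 http://fangjaw.com/wgettest/links1.html
# with a file, -R only deletes links2.html after the fact; its links have already been queued:
wget -r -np -R links2.html http://fangjaw.com/wgettest/links1.html
|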
This issue has now been reported as a bug over at https://savannah.gnu.org/bugs/index.php?33044 -- if anybody has any ideas, it might be worth leaving a comment there.
|