LinuxQuestions.org


fang2415 03-30-2011 01:38 AM

How do I tell wget not to follow links in a file?
 
I am using wget to mirror a site and I want to tell it not to follow links in a particular file.

To give an idea of the site architecture, suppose the site http://example.com has a page http://example.com/links1.html and another page http://example.com/links2.html. These pages contain links to many different directories on example.com, and both links pages contain links to each other. I want to download the links recursively from /links1.html only and not from /links2.html.

--exclude-directories won't work because links1 and links2 are in the same directory. --reject won't work because wget deletes the rejected page only after it has enqueued all the links on it. (!)

This seems like it should be trivially easy. How hard can it be to tell a downloader not to download a file? But I'm darned if I can find a solution.

Can anybody suggest anything?

16pide 03-30-2011 03:29 AM

If links1.html links to both links2.html and links3.html, then I see no way to tell a single wget command to download links1.html and links2.html but not links3.html.

Please clarify.

16pide 03-30-2011 03:35 AM

I looked at the wget man page; here is the extract that I think is your solution:
Quote:

-r
--recursive
Turn on recursive retrieving.

-l depth
--level=depth
Specify recursion maximum depth level depth. The default maximum depth is 5.

fang2415 03-30-2011 03:56 AM

links1 contains links to a bunch of directories and to links2. I am already using -m (which turns on -r -l inf, along with -N and --no-remove-listing) to descend (infinitely) down the directories linked to in links1, but I don't want to follow any of the links contained in links2. So I want to say "Download links1 and follow all of its links, but do not download or follow links from links2".

Hope that makes it clearer. This seems like something that should be well within wget's core functionality, so I'm still baffled as to why it isn't easier.

16pide 03-31-2011 03:53 AM

I just checked with the following commands:
wget -r http://babelfish.yahoo.com/translate_txt
wget -rl 1 http://babelfish.yahoo.com/translate_txt

The first one gets me 19 files, and the other gets me 7. So please try those 2 commands and tell us how it goes.

fang2415 03-31-2011 09:39 PM

Quote:

Originally Posted by 16pide (Post 4309336)
I just checked with the following commands:
wget -r http://babelfish.yahoo.com/translate_txt
wget -rl 1 http://babelfish.yahoo.com/translate_txt

The first one gets me 19 files, and the other gets me 7. So please try those 2 commands and tell us how it goes.

Yep, that makes perfect sense, because your first example keeps following links to the default recursion level of 5, whereas your second one only follows each link 1 level down. My problem isn't the level of recursion, it's which pages' links are being followed.

To use that page as an example, what I'm trying to say is "download pages to 5 levels of recursion, starting from all the links on http://babelfish.yahoo.com/translate_txt, except for the privacy page or anything linked from it". The links there look a little funny, so I'm not sure how well that example will actually work on that page, but that's the idea.
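The difference between limiting depth and choosing which pages' links get followed can be sketched in a few lines of Python. This is only a toy model of a breadth-first crawl, not wget's code; the page names and the `LINKS` graph are made up for illustration and do not describe the real Babel Fish site.

```python
from collections import deque

# Hypothetical link graph for illustration only.
LINKS = {
    "start": ["about", "privacy"],
    "about": ["team"],
    "privacy": ["legal"],
    "team": [],
    "legal": ["archive"],
    "archive": [],
}

def crawl(start, max_depth):
    """Breadth-first crawl following links at most max_depth levels below start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # the page is kept, but its links are not followed
        for target in LINKS[page]:
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

print(sorted(crawl("start", 1)))
print(sorted(crawl("start", 5)))
```

In this model the depth limit trims every branch equally: "privacy" is fetched at any depth of 1 or more, so no choice of depth can drop that one branch while keeping the others.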

fang2415 03-31-2011 10:45 PM

Okay, to make this clearer, I put up a little demo at http://fangjaw.com/wgettest/links1.html. An outline of the file structure looks like this:
Code:

wgettest/
    links1.html
    links2.html
    A/
        index.html
        a.html
    B/
        index.html
        b.html

links1 contains links to A/index and to links2. links2 contains a link to B/index. The index files for each directory just contain links to the other files in the directory, i.e., wgettest/A/index.html just links to a.html, and the same for B. Have a look at the site if that helps make things clearer.

I want a single wget command that downloads only the files links1.html, A/index.html, and A/a.html. That is, I want to recursively download everything from links1, except I don't want to download or follow anything from links2. In reality, links1 and links2 link to many more directories, so I can't include or exclude them all by name.

Should be simple, right? Can anybody do it? Feel free to try it on the demo -- since it's only 40K, I hope my server can handle it!...

fang2415 04-01-2011 02:57 AM

Actually, it's simpler to show the link hierarchy rather than the directory hierarchy. So here's what the link structure of the site looks like:

Code:

      links1.html
      |        \
      |        XXX
      |          \
A/index.html        links2.html
      |              |
  a.html          B/index.html
                        |
                      b.html

I want to download everything in the tree except the branch below the "XXX". Surely wget can do this?
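To pin down the behavior being asked for, here is a minimal Python sketch of a crawl over the link tree above that never fetches an excluded page and therefore never sees its links. It is a model of the desired result, not of how wget works; the function name `crawl_excluding` and the `LINKS` dictionary are illustrative assumptions.

```python
from collections import deque

# Link structure of the demo site as drawn in the tree above.
LINKS = {
    "links1.html": ["A/index.html", "links2.html"],
    "links2.html": ["B/index.html"],
    "A/index.html": ["A/a.html"],
    "A/a.html": [],
    "B/index.html": ["B/b.html"],
    "B/b.html": [],
}

def crawl_excluding(start, excluded):
    """Crawl from start, never fetching or parsing any page in excluded."""
    fetched = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        fetched.append(page)
        for target in LINKS[page]:
            if target in excluded or target in seen:
                continue  # excluded pages are dropped before they are queued
            seen.add(target)
            queue.append(target)
    return fetched

print(crawl_excluding("links1.html", {"links2.html"}))
# prints ['links1.html', 'A/index.html', 'A/a.html']
```

Because links2.html is filtered out before it is queued, B/index.html and b.html are never discovered, which matches the three files wanted from the demo.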

Captain Obvious 04-01-2011 07:33 AM

If links1 has a bunch of links, and only one of them points towards links2, can't you just exclude it with the -X (exclude directories) switch?

Quote:

wget -r -np -X links2.html http://example.com/links1.html
I added -np for "no-parent" for those annoying links that go back upwards in the tree and cause you to download the whole site.

Are you trying to download from the root directory on the site? (i.e. example.com/*) That, I imagine, would make it harder to specifically exclude links2, especially if both the root directory AND a subdirectory have objects that point to page links2. Although -X still might work there, too.

As for -X's syntax, it's not clear to me whether it needs to be relative ("links2.html") or absolute ("http://example.com/links2.html"). Just try them both ways and see which one works.

fang2415 04-01-2011 08:21 PM

Just tried these two:
Code:

wget -r -np -X links2.html http://fangjaw.com/wgettest/links1.html
wget -r -np -X http://fangjaw.com/wgettest/links2.html http://fangjaw.com/wgettest/links1.html

...and they both download all six files. (Feel free to try it out on my server.)

This is the crux of the problem, I think: as I understand it, -X does the correct thing, but will only do it for directories. (That is, if links2 were a directory rather than an html file, something like "wget -X links2/" would work as desired.)

-R is a similar option for files, but wget -R links2.html will still follow all of links2's links.

The fact that wget is prepared to do exactly what I want for directories makes me think that there must be some way to do it for files. This also seems too fundamental to be a bug in such a stable program -- surely it's more likely that I'm overlooking something? But who knows, maybe I should report it as a bug and see what happens?...
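The gap between the two behaviors can be modeled in Python. The sketch below is an assumption-laden toy, not wget's implementation: with `prune_before_fetch=False` it imitates the -R behavior described in this thread (the rejected page is parsed, its links are queued, and only then is the page discarded), and with `prune_before_fetch=True` it imitates what -X does for directories. The `crawl` function and its parameters are hypothetical names.

```python
from collections import deque

# Link structure of the demo site at fangjaw.com/wgettest/.
LINKS = {
    "links1.html": ["A/index.html", "links2.html"],
    "links2.html": ["B/index.html"],
    "A/index.html": ["A/a.html"],
    "A/a.html": [],
    "B/index.html": ["B/b.html"],
    "B/b.html": [],
}

def crawl(start, rejected, prune_before_fetch):
    """Return the pages kept on disk under one of the two rejection strategies."""
    saved = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        is_rejected = page in rejected
        if is_rejected and prune_before_fetch:
            continue  # never fetched, so its links are never seen
        # The page is fetched and parsed; its links are queued either way.
        for target in LINKS[page]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
        if not is_rejected:
            saved.append(page)  # a rejected page is discarded only after parsing
    return saved

print(crawl("links1.html", {"links2.html"}, prune_before_fetch=False))
print(crawl("links1.html", {"links2.html"}, prune_before_fetch=True))
```

In the first mode links2.html itself is not kept, but B/index.html and B/b.html still arrive because they were queued before the page was thrown away. Later wget releases (1.14 onward) added --accept-regex and --reject-regex, which match against the complete URL; whether they prune before fetching in a given version is worth checking against that version's manual.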

fang2415 04-11-2011 04:44 PM

This issue has now been reported as a bug over at https://savannah.gnu.org/bugs/index.php?33044 -- if anybody has any ideas it might be a good idea to leave a comment there.


All times are GMT -5.