Old 03-30-2011, 01:38 AM   #1
fang2415
Member
 
Registered: Jan 2007
Posts: 159

Rep: Reputation: 15
How do I tell wget not to follow links in a file?


I am using wget to mirror a site and I want to tell it not to follow links in a particular file.

To give an idea of the site architecture, suppose the site http://example.com has a page http://example.com/links1.html and another page http://example.com/links2.html. These pages contain links to many different directories on example.com, and both links pages contain links to each other. I want to download the links recursively from /links1.html only and not from /links2.html.

--exclude-directories won't work because links1 and links2 are in the same directory. --reject won't work because wget deletes the rejected page only after it has enqueued all the links on it. (!)

This seems like it should be trivially easy. How hard can it be to tell a downloader not to download a file? But I'm darned if I can find a solution.

Can anybody suggest anything?
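
(To make the --reject complaint above concrete, here is an illustrative sketch of the behaviour using the hypothetical example.com layout; this command is not one proposed in the thread.)
Code:
# -R/--reject matches file names, but the rejected HTML page is still fetched,
# parsed, and its links queued before it is deleted -- so the links2 branch
# gets downloaded anyway.
wget -m -np -R 'links2.html' http://example.com/links1.html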
 
Old 03-30-2011, 03:29 AM   #2
16pide
Member
 
Registered: Jan 2010
Posts: 418

Rep: Reputation: 83
If links1.html links to links2.html and also to links3.html, then I see no way to ask, in one wget command, for links1.html and links2.html but not links3.html.

Please clarify.
 
Old 03-30-2011, 03:35 AM   #3
16pide
Member
 
Registered: Jan 2010
Posts: 418

Rep: Reputation: 83
I looked at the wget man page; here is the extract that I think is your solution:
Quote:
-r
--recursive
Turn on recursive retrieving.

-l depth
--level=depth
Specify recursion maximum depth level depth. The default maximum depth is 5.
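
(For illustration only, a depth-limited recursive fetch would look like the command below; the URL is made up.)
Code:
# recurse, but stop two link-hops away from the starting page
wget -r -l 2 http://example.com/links1.html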
 
Old 03-30-2011, 03:56 AM   #4
fang2415
Member
 
Registered: Jan 2007
Posts: 159

Original Poster
Rep: Reputation: 15
links1 contains links to a bunch of directories and to links2. I am already using -m (which is the same as -rl inf) to descend (infinitely) down the directories linked to in links1, but I don't want to follow any of the links contained in links2. So I want to say "Download links1 and follow all of its links, but do not download or follow links from links2".

Hope that makes it more clear. This seems like something that should be well within wget's core functionality, so I'm still baffled as to why it isn't easier.
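
(For reference, the wget manual describes -m/--mirror as shorthand for -r -N -l inf --no-remove-listing, so the command in use here is roughly the sketch below; the URL is hypothetical.)
Code:
# mirror: recursive, infinite depth, with timestamping
wget -m http://example.com/links1.html
# roughly equivalent to
wget -r -N -l inf --no-remove-listing http://example.com/links1.html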
 
Old 03-31-2011, 03:53 AM   #5
16pide
Member
 
Registered: Jan 2010
Posts: 418

Rep: Reputation: 83
I just checked with the following commands:
wget -r http://babelfish.yahoo.com/translate_txt
wget -rl 1 http://babelfish.yahoo.com/translate_txt

The first one gets me 19 files and the other gets me 7. So please try those two commands and tell us how it goes.
 
Old 03-31-2011, 09:39 PM   #6
fang2415
Member
 
Registered: Jan 2007
Posts: 159

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by 16pide View Post
I just checked with the following commands:
wget -r http://babelfish.yahoo.com/translate_txt
wget -rl 1 http://babelfish.yahoo.com/translate_txt

The first one gets me 19 files and the other gets me 7. So please try those two commands and tell us how it goes.
Yep, that makes perfect sense, because your first example keeps following links to the default recursion level of 5, whereas your second one only follows each link 1 level down. My problem isn't the level of recursion, it's which pages' links are being followed.

To use that page as an example, what I'm trying to do is to say "download pages to 5 levels of recursion on all the links from http://babelfish.yahoo.com/translate_txt, except for the privacy page or anything linked from it". The links there look a little funny so I'm not sure how well that example will actually work on that page, but that's the idea.
 
Old 03-31-2011, 10:45 PM   #7
fang2415
Member
 
Registered: Jan 2007
Posts: 159

Original Poster
Rep: Reputation: 15
Okay, to make this clearer, I put up a little demo at http://fangjaw.com/wgettest/links1.html. An outline of the file structure looks like this:
Code:
wgettest/
    links1.html
    links2.html
    A/
        index.html
        a.html
    B/
        index.html
        b.html
links1 contains links to A/index and to links2. links2 contains a link to B/index. The index files for each directory just contain links to the other files in the directory, i.e., wgettest/A/index.html just links to a.html, and the same for B. Have a look at the site if that helps make things clearer.

I want a single wget command that downloads only the files links1.html, A/index.html, and A/a.html. That is, I want to recursively download everything from links1, except I don't want to download or follow anything from links2. In reality, links1 and links2 link to many more directories, so I can't include or exclude them all by name.

Should be simple, right? Can anybody do it? Feel free to try it on the demo -- since it's only 40K, I hope my server can handle it!...
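
(A possible two-step workaround, not suggested anywhere in the thread: fetch links1.html on its own, strip the unwanted link, then let wget recurse from whatever remains. This sketch assumes GNU sed and that the page uses simple relative href="..." markup.)
Code:
# 1. fetch only the top page
wget -q -O links1.html http://fangjaw.com/wgettest/links1.html
# 2. delete the line that links to links2.html (assumes the link sits on its own line)
sed -i '/links2\.html/d' links1.html
# 3. recurse from the remaining links; -F treats the input file as HTML,
#    --base resolves its relative links against the original location
wget -r -np -F -i links1.html --base=http://fangjaw.com/wgettest/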
 
Old 04-01-2011, 02:57 AM   #8
fang2415
Member
 
Registered: Jan 2007
Posts: 159

Original Poster
Rep: Reputation: 15
Actually, it's simpler to show the link hierarchy rather than the directory hierarchy. So here's what the link structure of the site looks like:

Code:
      links1.html
       |         \
       |         XXX
       |           \
A/index.html        links2.html
       |               |
   a.html           B/index.html
                        |
                      b.html
I want to download everything in the tree except the branch below the "XXX". Surely wget can do this?
 
Old 04-01-2011, 07:33 AM   #9
Captain Obvious
LQ Newbie
 
Registered: Sep 2009
Distribution: Ubuntu Oneiric Ocelot
Posts: 8

Rep: Reputation: 0
If links1 has a bunch of links, and only one of them points to links2, can't you just exclude directories with the -X switch?

Quote:
wget -r -np -X links2.html http://example.com/links1.html
I added -np for "no-parent" for those annoying links that go back upwards in the tree and cause you to download the whole site.

Are you trying to download from the root directory on the site? (i.e. example.com/*) That, I imagine, would make it harder to specifically exclude links2, especially if both the root directory AND a subdirectory have objects that point to page links2. Although -X still might work there, too.

As for -X's syntax, it's not clear to me whether it needs to be relative ("links2.html") or absolute ("http://example.com/links2.html"). Just try them both ways and see which one works.
 
Old 04-01-2011, 08:21 PM   #10
fang2415
Member
 
Registered: Jan 2007
Posts: 159

Original Poster
Rep: Reputation: 15
Just tried these two:
Code:
wget -r -np -X links2.html http://fangjaw.com/wgettest/links1.html
wget -r -np -X http://fangjaw.com/wgettest/links2.html http://fangjaw.com/wgettest/links1.html
...and they both download all six files. (Feel free to try it out on my server.)

This is the crux of the problem, I think: as I understand it, -X does the correct thing, but will only do it for directories. (That is, if links2 were a directory rather than an html file, something like "wget -X links2/" would work as desired.)

-R is a similar option for files, but wget -R links2.html will still follow all of links2's links.

The fact that wget is prepared to do exactly what I want for directories makes me think that there must be some way to do it for files. This also seems too fundamental to be a bug in such a stable program -- surely it's more likely that I'm overlooking something? But who knows, maybe I should report it as a bug and see what happens?...
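
(A note from outside the thread's timeframe: wget 1.14 and later add --accept-regex/--reject-regex, which match the complete URL before retrieval, so a rejected page is never downloaded and its links are never queued. A minimal sketch against the demo above, assuming a new enough wget:)
Code:
# unlike -R, --reject-regex filters the URL before download,
# so links2.html is never fetched and nothing below it is followed
wget -r -np --reject-regex 'links2\.html' http://fangjaw.com/wgettest/links1.html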
 
Old 04-11-2011, 04:44 PM   #11
fang2415
Member
 
Registered: Jan 2007
Posts: 159

Original Poster
Rep: Reputation: 15
This issue has now been reported as a bug at https://savannah.gnu.org/bugs/index.php?33044 -- if anybody has any ideas, it might be worth leaving a comment there.
 
  

