Linux - General
This Linux forum is for general Linux questions and discussion. If it is Linux related and doesn't seem to fit in any other forum, then this is the place.
I am trying to download a whole webpage, including the files the page links to. The linked files are txt, jpg, and pdf, and they are linked from a couple of my friends' webpages. The links are direct, so if you click on them in a browser the files show up straight away.
I want to create a mirror of the whole lot for backup purposes.
I'm having a world of trouble trying to get it to work.
The second line feeds this page back to wget (the -i option reads URLs from an input file). -F tells wget that the file isn't just a bunch of links but an HTML file (--force-html), so it looks for the links within tags.
The -B option tells wget that any relative links (those that don't start with http://) should have http://sitename added to the front (B for base).
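Putting that together, a minimal sketch of the two-step approach being described, assuming the first command simply downloads the page and with sitename/page.html standing in as placeholders for the real URL:
Code:
# step 1: fetch the page itself
wget http://sitename/page.html
# step 2: feed the saved page back to wget; -i reads URLs from the file,
# -F (--force-html) parses it as HTML and follows the links in its tags,
# -B (--base) prepends http://sitename/ to any relative links
wget -F -B http://sitename/ -i page.html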
Super old thread. Still coming up in searches on Google. Crucify me. ;-)
I was receiving errors rather than the desired content using the previously specified method, so I thought I would share the method I used for a similar task.
If the desired content is a list of hyperlinks, such as the links on an "all pages" page on MediaWiki, you can scrape that text (manually is good enough for me) and put the list into a file.
Then, use that file like this to gather the linked content.
Code:
# fetch each entry in file_name (one page title per line, no spaces)
for i in $(cat file_name); do wget "http://sitename/index.php/$i"; done
I just researched necro'ing etiquette, as I never know whether it is better to make a new post or use the current one even if it's a little old. Now on to the topic..
There is a page on a site that has a lot of links to other pages on the site, but at the bottom there are LOTS of links that are irrelevant to what I'm looking for; it's all the "about this site" stuff and basically a site map that I don't want.
The problem is that, from what I can tell, I'm not getting all of the links, and many of the files are saved as the 4-8 digit number at the end of the link that points to the longer link. Is there a way to save the output with the name from the longer link? From the example above it is now saving as 2382.html; is it possible to save it as server-room-survival-kit.html?
wget can only form the filename from the link; that is what it requests. The web server may then internally redirect or rewrite the URL, but wget doesn't "see" any of that: it simply saves what is ultimately returned under the name it asked for, or the name you told it to save as. That is not a problem for a single file, but when mirroring a page and all its links, you can really only save as what is in the links.
Additionally, many web pages nowadays write their links after they load, using JavaScript. wget won't see those links at all, which might account for you not getting all of the links.
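That said, if the server answers the short numeric URL with a real HTTP redirect to the descriptive URL, wget has a couple of options that may pick up the longer name; whether they help depends entirely on how the site is set up, so treat this as something to try rather than a guarantee (the URLs below are placeholders):
Code:
# name the saved file after the final URL of a redirect chain
wget --trust-server-names http://sitename/2382
# or honour a Content-Disposition header, if the server sends one
wget --content-disposition http://sitename/2382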