I would like to mirror some websites using wget and need a little help
I am getting better at using Linux now. I have installed Ubuntu 16.04 LTS.
What I would like to do is set up a directory to store several websites. What matters most to me is to either 1. download everything from a site, or 2. concentrate on images only.
I would do either 1 or 2 depending on the website.
I looked at a site-mirroring tool on Windows called HTTrack, which is not too difficult to use, but the Linux version is somewhat more complicated.
There are considerations such as user-agent settings and so on.
I came upon wget and noticed that its command line is easier to use, so I am interested in trying it out.
My main question is based on what I have said so far: can someone give me an example of a reliable wget command to 1. download everything from a site, but nothing from external sites, and 2. download just images of specific types, for example GIF and PNG?
So I am looking for two command examples.
Please understand that I know how to use Google, but there are so many examples out there that it's just overwhelming. I'm simply trying to set up a mirroring system for offline use on my computer.
As an aside, if you know of other good software for this that has a GUI, please let me know.
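Something like the following is what I'm imagining, pieced together from skimming the man page - I have not tested these, the site address is just a placeholder, and I may well have the switches wrong, so please correct me:

# 1. Mirror a whole site, staying on that site (wget stays on the starting host by default)
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://www.example.com/

# 2. Crawl the same site but keep only certain image types
wget -r -l inf --no-parent -A gif,png https://www.example.com/

Is that roughly the right direction?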
Yes, thank you - at this point I am searching the Internet, but the reason I am asking here is so I don't have to read through so much information. The application is not too difficult, but Linux can do so much. I am looking for examples I can start from.
So, for example: what would be the optimal switches, and where can I find a list of user-agent strings I can use? I'm looking into all of this now, of course, but help would be appreciated.
I have a more specific question - I can't find anything on this!
In HTTrack, I could list a page such as
-somepage.html
and it would then ignore all the links starting from that page.
I tried this with wget by adding
-R somepage.html
and it ignores the page itself, but it still downloads all the directories linked on that page. How can I stop wget from downloading everything linked on that page while mirroring the rest of the site?
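For reference, this is roughly the command I'm running (the URL is just a placeholder, and the other switches may not be exactly what I have):

wget --mirror --convert-links --no-parent -R somepage.html https://www.example.com/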
Actually, the first three to ten search hits explain exactly how to mirror a site, especially using wget.
I suspect the issue here is what you also see in those same results - "This isn't always easy", "There are a few pitfalls", "You may run into a wget loop", "Active content is an exception" - whereas what you wished for was to run "wget *" and have it all just work.
If you're hosting a mirror of a site as a form of assistance to the original website, they send you the content updates when they update their site.
If you're hosting a mirror of a site just because you want to, then you have to monitor the original site for changes and keep up with them, or risk not being a true mirror.
Also, if you want to avoid repeating a long list of parameters on every wget command, learn to build and use a ~/.wgetrc file to save your user-agent and header preferences. You can also tweak wget globally with /etc/wgetrc settings.
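As a rough sketch, a per-user ~/.wgetrc could look something like this (the values are just examples - check the wgetrc section of the wget manual for the full list of options):

# ~/.wgetrc - per-user defaults; /etc/wgetrc applies system-wide
user_agent = Mozilla/5.0 (X11; Linux x86_64)
header = Accept-Language: en-US,en;q=0.5
wait = 1
tries = 3

After that, every wget run picks these up without you having to repeat them on the command line.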