LinuxQuestions.org > Forums > Linux Forums > Linux - Software
Old 05-19-2016, 10:53 AM   #1
linuxus3r
LQ Newbie
 
Registered: May 2016
Posts: 8

Rep: Reputation: 8
I would like to mirror some websites using wget and need a little help


I am getting better at using Linux now, and I have installed Ubuntu 16.04 LTS.

What I would like to do is set up a directory to store several websites. What matters most to me is being able to either 1. download everything from a site, or 2. concentrate on images only.

I would do either 1 or 2 depending on the website.

I looked at a site-mirroring tool on Windows called HTTrack, which is not too difficult to use, but the Linux version is somewhat more complicated.

There are considerations such as user-agent settings and so on.

I came upon wget and noticed that its command line is easier to use, so I am interested in trying it out.

My main question is based on what I have said so far. Can someone give me an example of a reliable wget command to 1. download everything from a site, but nothing from external sites, and 2. download just images of specific types, for example GIF and PNG?

So I am looking for 2 command examples.
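
To show the kind of thing I mean, here is roughly what I have pieced together from searching so far; I honestly don't know if these flags are right, which is part of why I'm asking (example.com is just a placeholder):
Code:
# 1. my guess at mirroring a whole site; wget stays on the same
#    host by default, so external sites should not be followed
wget --mirror --page-requisites --convert-links --no-parent http://example.com/

# 2. my guess at recursively grabbing only certain image types
wget -r --no-parent -A gif,png http://example.com/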

Please understand that I know how to Google, but there are so many examples that it's just too overwhelming. I'm simply trying to set up a mirror system for offline use on my computer.

As an aside, if you know of other good software for this that has a GUI, please let me know.

Thank you, ...
 
Old 05-19-2016, 11:08 AM   #2
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by linuxus3r View Post
Can someone give me an example of a reliable wget command to 1. download everything from a site, but nothing from external sites, and 2. download just images of specific types, for example GIF and PNG?
Welcome to LQ!
Buckle up if you think I'm gonna let you tell me you can't find an example on Google.
https://duckduckgo.com/?q=mirror+a+w...th+wget&ia=web
https://duckduckgo.com/?q=mirror+a+w...th+curl&ia=web
One of the pages from either search will have what you need.

Last edited by Habitual; 05-19-2016 at 11:10 AM.
 
Old 05-19-2016, 11:19 AM   #3
linuxus3r
LQ Newbie
 
Registered: May 2016
Posts: 8

Original Poster
Rep: Reputation: 8
Yes, thank you. At this point I am searching the Internet, but the reason I am asking here is to avoid having to read through so much information. The application is not too difficult, but Linux has so many options. I am looking for examples I can start from.

So, for example: what would be the optimal switches, and where can I find a list of user-agent strings I can use? I'm looking for all of this now, of course, but help would be appreciated.
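
For instance, I gather the user agent can be set on the command line like this (the string is just one I copied from my browser, and the URL is a placeholder):
Code:
# identify as a desktop Firefox instead of announcing wget
wget --user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" http://example.com/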
 
Old 05-19-2016, 12:10 PM   #4
linuxus3r
LQ Newbie
 
Registered: May 2016
Posts: 8

Original Poster
Rep: Reputation: 8
I have a more specific question, and I can't find anything on this!


In HTTrack, I could list a page such as

-somepage.html

and it then ignores all links starting from that page.

I tried this with wget by adding

-R somepage.html

and it ignores the page itself, but it still downloads all the directories listed on that page. How can I stop wget from downloading everything linked from that page while mirroring the rest of the site?
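
For reference, the full command I'm experimenting with looks something like this (example.com stands in for the real site):
Code:
# -R deletes somepage.html after the crawl, but wget still fetches
# it temporarily to scan it for links, so the pages it links to
# get downloaded anyway
wget --mirror -R somepage.html http://example.com/
From the manual it looks like -X/--exclude-directories can skip whole directory trees, but that would mean listing every directory from that page by hand.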
 
Old 05-19-2016, 12:30 PM   #5
jamison20000e
Senior Member
 
Registered: Nov 2005
Location: ...uncanny valley... infinity\1975; (randomly born:) Milwaukee, WI, US( + travel,) Earth&Mars (I wish,) END BORDER$!◣◢┌∩┐ Fe26-E,e...
Distribution: any GPL that works well on freest; has been KDE, CLI, Novena but open... http://goo.gl/NqgqJx &c ;-)
Posts: 4,744
Blog Entries: 2

Rep: Reputation: 1533
Hi.

DownThemAll (or a similar browser add-on) may work, or see: https://www.gnu.org/software/wget/manual/wget.html
Code:
man wget
I've always used
Code:
wget -r -np http...
then
Code:
cd
ls
plus a search for "wget examples".
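
A fuller version of what I usually run might look like this (flags straight from the manual; the URL is a placeholder):
Code:
# -r recurse, -np never ascend to the parent directory,
# -k convert links for offline browsing, -p fetch page requisites
wget -r -np -k -p http://example.com/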

Have fun!

Last edited by jamison20000e; 05-19-2016 at 12:35 PM. Reason: fixed link+
 
Old 05-19-2016, 02:57 PM   #6
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,788
Blog Entries: 13

Rep: Reputation: 4831
Actually, the first three to ten hits show exactly how to mirror a site, especially using wget.

I suspect the issue here is what you also see in those same results: "This isn't always easy", "There are a few pitfalls", "You may run into a wget loop", "active content is an exception". What you wished for was to run "wget *" and have it all just work.

If you're hosting a mirror of a site as a form of assistance to the original website, they give you the content updates when they update their site.

If you're hosting a mirror of a site just because you want to, then you have to monitor the original site for changes and keep up with them, or risk not being a true mirror.
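
For example, a cron entry along these lines would re-sync nightly (the path and URL are placeholders; --mirror turns on timestamping, so only files newer than your local copies get fetched again):
Code:
# crontab entry: re-sync the mirror every night at 2:00 AM
0 2 * * * wget --mirror -P /srv/mirrors http://example.com/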
 
Old 05-24-2016, 08:57 AM   #7
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 4,278

Rep: Reputation: 1693
Also, if you want to avoid having to repeat insane parameters in every wget command, learn to build and use a .wgetrc file to save your user-agent and header preferences. You can also tweak wget globally via /etc/wgetrc settings.
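
Something along these lines in ~/.wgetrc, for example (the setting names follow the long option names, so double-check them against your wget's man page):
Code:
# ~/.wgetrc -- defaults applied to every wget run
user_agent = Mozilla/5.0 (X11; Linux x86_64)
header = Accept-Language: en-us
tries = 3
wait = 1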
 
  

