Old 01-25-2018, 08:03 AM   #1
Time4Linux
LQ Newbie
 
Registered: Jan 2018
Posts: 21

Rep: Reputation: Disabled
wget command line help


Hello!

I am new to Ubuntu and need to use wget to grab photos from webpages.

wget looks very advanced and I just can't get my head around it, as I'm not comfortable with this non-GUI environment. I need a "kick start" to get going.

I would like to know what the command line would be for the following example scenario:

On the webpage "http://user.albumsite.com/album" there are thumbnails linking to images on an image server on albumsite.com. (Located for example at "img23.albumsite.com" and filenames will of course often vary and not necessarily follow a pattern.)
There are other objects on the page, like image banners, ads, pngs and gifs which I don't want. These are often located at "http://www.albumsite.com" so I might want to block objects from there.
(Sometimes there will also be .mp4 videos on the page, which I may want.)

The thing is, this webpage may be updating with new photos, so I'd like to know what to input for wget to check, let's say every 20, 30 or 60 seconds for new photos on this particular page?

Finally, I want it to download to "D:\MyPhotos\user\album".

What's the most convenient way to have the command line ready for other webpages/albums? Just copy/paste and edit for new URLs and folders ("user2/album2")?

I'd really appreciate the help and not having to plough through all the manuals and guides for this software, at least until I've gotten started.

Lastly, if there's another software or Firefox browser add-on that will do this for me perhaps even quicker, I'd love to know.
(I tried Video DownloadHelper, but couldn't figure out how to download all images at once and also have it check for new photos unattended.)

Thank you!
 
Old 01-25-2018, 09:41 AM   #2
sidzen
Member
 
Registered: Feb 2014
Location: GMT-7
Distribution: Slackware64, xenialpup64, Slacko5.7
Posts: 204

Rep: Reputation: 36
OP - "Lastly, if there's another software ..."

PMirrorget in Puppy Linux (i.e. Puppy Slacko64)

chron is one way to go for a scheduled check
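For example, a crontab entry like the one below would run a (hypothetical) fetch script once a minute; note that cron cannot go below one-minute granularity, so 20- or 30-second checks are out:
Code:
# edit your crontab with: crontab -e
* * * * * $HOME/bin/grab-album.sh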

Best wishes! Persist!
 
Old 01-25-2018, 10:16 AM   #3
Time4Linux
LQ Newbie
 
Registered: Jan 2018
Posts: 21

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by sidzen View Post
OP - "Lastly, if there's another software ..."

PMirrorget in Puppy Linux (i.e. Puppy Slacko64)

chron is one way to go for a scheduled check

Best wishes! Persist!
Thanks for the reply. It confused me even more, though.
I would need more details than that. Remember that I'm a newbie at this.
I wish to see the full wget command line at the terminal prompt for what I'm asking for, including my example URL and all the other options and values which I specified.

I googled your tips:
"chron" is not a valid wget command AFAIK, and the command "cron", which I'm assuming you meant, seems to do other things than what I need.
And are you suggesting me to install Puppy Linux?
I will only be running Ubuntu 16.04.3 LTS, because I need it to work with/for other things...

Also, I'm not looking to mirror a webpage (and definitely not a whole site), just grab the images from one album page.

Please give me more help! Thanks.

Last edited by Time4Linux; 01-25-2018 at 10:20 AM.
 
Old 01-25-2018, 10:23 AM   #4
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 2,332

Rep: Reputation: 511
Look at
Code:
man wget
and
Code:
man curl
Something like
Code:
wget http://....jpg -O file.jpg
Quote:
so I'd like to know what to input for wget to check, let's say every 20, 30 or 60 seconds for new photos on this particular page?
wget is not a web scraper; it's a download manager.

You'll need to use bash and friends, Python, QtWebEngine, Beautiful Soup, or just look at the source of the web page to collect the links to the .jpg files that you want.

To check something every 30 seconds (which will probably get you banned from a website after a while), loop on it.
Example:
Code:
num=1
while :; do
    echo "$num"
    num=$(($num + 1))
    sleep 1 #set sleep time here
done
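For instance, the same loop pattern could wrap the download itself. A minimal sketch, using the example album URL from the first post (-N only re-downloads the page when the server reports it has changed):
Code:
while :; do
    # fetch the album page; -N skips the download if it hasn't changed
    wget -N "http://user.albumsite.com/album"
    sleep 30    # check every 30 seconds
done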
Quote:
Finally, I want it to download to "D:\MyPhotos\user\album".
This is Microsoft Windows speak. It has nothing to do with wget or Linux.
If you want a directory under home called album,
Code:
mkdir ~/album
Then
Code:
wget http:.....com -O ~/album/file.jpg
You have some studying to do first.
Read the man page for wget, read about bash scripting, bash core tools like cd, rm, mkdir, etc.

Man pages are also available online; use Google:
https://linux.die.net/man/1/wget
http://mywiki.wooledge.org/BashFAQ

Quote:
What's the most convenient way to have the command line ready for other webpages/albums? Just copy/paste and edit for new URLs and folders ("user2/album2")?
Make yourself a script.
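For example (just a sketch; the script name and paths are placeholders), something that takes the URL and the destination directory as arguments:
Code:
#!/bin/bash
# usage: ./grab.sh http://user.albumsite.com/album ~/MyPhotos/user/album
url="$1"
dir="$2"
mkdir -p "$dir"           # create the target directory if it doesn't exist
wget -P "$dir" "$url"     # -P downloads into that directory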
Quote:
I'd really appreciate the help and not having to plough through all the manuals and guides for this software, at least until I've gotten started.
You'll find that members on this forum will be glad to help you once you show some initiative. They probably won't write scripts for you; it wouldn't do you any good.

Quote:
Lastly, if there's another software or Firefox browser add-on that will do this for me perhaps even quicker,
You can look at Firefox's Firebug, WebKit's web engine, WebEngine's remote debugging... and then there is Python, urllib, Selenium, Beautiful Soup, etc.

Welcome to LQ.

Edit: Spelling and format

Last edited by teckk; 01-25-2018 at 10:31 AM.
 
Old 01-25-2018, 03:17 PM   #5
Time4Linux
LQ Newbie
 
Registered: Jan 2018
Posts: 21

Original Poster
Rep: Reputation: Disabled
Well, if wget isn't the right tool for what I need, then I wish I had named the topic differently.

So my question is, wget or not, which is the best tool in Ubuntu (and with Firefox?) to do what I described in the original post?

Since I won't know how the photos being uploaded to the page are named, I can't "study the code" to get a program to find the photos. The program needs to figure this out on its own, and the operation must be unattended.
All I will see is what server the photos are on, nothing more.
If the album is already complete, it's another matter and I can figure out ways to get those albums myself with whatever browser add-on, etc.
But the main thing here is to grab all images from an updating webpage.

Currently, I'm on Windows and I'm using Internet Download Manager for this. It has some bugs and isn't always reliable, but it still meets my needs fully and is really quick. For example I can tell IDM to check a page once a minute, which is usually sufficient.
So, I need something similar or that does the job similarly, for Ubuntu.

IMO that's a quite straightforward question, so I can do without all the super-techy stuff, since I've told you that I lack the skills and knowledge.
Why not use my example and tell me what you would have used and done, in Ubuntu?

If I'm expected to "figure out" everything myself and just get a huge list of software and whatever, then what is the point of asking for help?
I don't have the time, energy or capacity to read up on all of these very advanced programs. I only want to become an Ubuntu user with as little hassle as possible. And I'm not going to use it for that many things anyway.
 
Old 01-25-2018, 07:02 PM   #6
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,513

Rep: Reputation: 1004
Wait a minute

Quote:
Originally Posted by Time4Linux View Post
I wish to see the full wget command line at the terminal prompt for what I'm asking for, including my example URL and all the other options and values which I specified.

Please give me more help! Thanks.
Well, you can always ask for everything you wish for. Basically, you're saying: I only want to do this one little thing, so why should I have to read the manual? Can't I just come here and have someone do it for me?

No, you can't. But I will tell you that wget downloads way more files than a person usually wants. You just have to delete the extra files.

Some of the operators that might be of interest are:

--user-agent=
-r
-nc
-k
--random-wait --wait
-erobots=
-l
--span-hosts
-np

I'll give you some hints about those. User agent is your browser's identity string. I usually set that to "", which means 'nothing'.

Random wait is explained in the man page.

-e robots= isn't a stand-alone option; -e executes a .wgetrc-style command, and robots can be set to on or off, depending on whether you want wget to honor the site's robots.txt file.

There is no way a program can know which files you want from a site. It can only do what you tell it to do. Globbing, i.e. wildcards, doesn't work with wget for HTTP, but curl supports URL globbing. The catch is that curl just generates the URLs client-side, so the files must actually exist at predictable names on the server.

Globbing examples: *.jpg for all jpgs; [a-z][a-z]??.tar for any files that begin with 2 lower case letters followed by any 2 characters, with a tar extension.
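To make that concrete (purely as an illustration, with your example album URL), a recursive, image-only fetch could look something like:
Code:
# -r/-l 1: recurse one level; -np: stay out of parent dirs; --span-hosts: allow img23.albumsite.com
# -nc: skip files already downloaded; -A: accept only these extensions; the waits keep it polite
wget -r -l 1 -np --span-hosts -nc -A jpg,jpeg,mp4 \
     --user-agent="" -e robots=off --wait=1 --random-wait \
     -P ~/MyPhotos/user/album http://user.albumsite.com/album
And curl-style globbing, with a made-up filename pattern, would be:
Code:
curl -O "http://img23.albumsite.com/album/photo[001-100].jpg"
Since you said the filenames usually don't follow a pattern, the recursive wget approach is the more realistic of the two.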

Last edited by AwesomeMachine; 01-25-2018 at 07:15 PM.
 
Old 01-25-2018, 07:22 PM   #7
Keith Hedger
Senior Member
 
Registered: Jun 2010
Location: Wiltshire, UK
Distribution: Linux From Scratch, Slackware64, Partedmagic
Posts: 2,716

Rep: Reputation: 680
Check out httrack. It can mirror whole websites, restrict downloads to a single domain or file type, change links in HTML pages to point to the local downloaded copies, and it has a reasonably easy GUI interface. It can also resume downloads and fetch just the bits that have been added.

Haven't used it for a while so can't give detailed instructions, but it's worth looking at.
 
Old 01-26-2018, 06:56 AM   #8
Time4Linux
LQ Newbie
 
Registered: Jan 2018
Posts: 21

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by AwesomeMachine View Post
...
You're writing in riddles.
"-r", "-nc", etc. But what do they mean and do? I have the wget manual which I could look in. I have, and that's why I'm here, because I can't interpret it to a real situation let alone know how to effectively combine arguments.

Trial and error in a command line program is different from a GUI one, because I can't tell what the individual operators really do (if you don't tell me) and how they work or don't work together.

While I'm open to reading up on things, I have no interest in spending days and weeks "trying stuff" when I know there are people with the experience and knowledge I could ask for help, who should know well enough what to input or use (if not wget!) to get the desired effect I have described in detail.

I know that the server does not allow directory browsing or listing, so the program needs to get the image links from the album page, from the image thumbnails that direct to the full size images.
That is how and why Internet Download Manager has worked for me: it checks the album page for image links and grabs JPGs larger than e.g. 10 KB, while also re-checking once a minute (which I set it to) for new links to images.


Quote:
Originally Posted by Keith Hedger View Post
Check out httrack. It can mirror whole websites, restrict downloads to a single domain or file type, change links in HTML pages to point to the local downloaded copies, and it has a reasonably easy GUI interface. It can also resume downloads and fetch just the bits that have been added.

Haven't used it for a while so can't give detailed instructions, but it's worth looking at.
Thanks, but can it really work to monitor one page that is being updated with images in real time? I believe that feature must be part of the program, or it will not work properly.
At least when I tried it, I couldn't get it to do what I wanted in the fashion I wanted. I was quite confused by the many settings.
If anyone has some experience with this program, I'd be grateful.

I'll just say it again, though: The key here is to be able to monitor one webpage for updates and to do it automatically.

Last edited by Time4Linux; 01-26-2018 at 06:59 AM.
 
Old 01-26-2018, 08:06 AM   #9
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 13,071

Rep: Reputation: 4133
Quote:
Originally Posted by Time4Linux View Post
You're writing in riddles.
"-r", "-nc", etc. But what do they mean and do? I have the wget manual which I could look in. I have, and that's why I'm here, because I can't interpret it to a real situation let alone know how to effectively combine arguments.
So those are the flags of wget which modify how and what it will download. You need to read the man page for their meanings and choose the ones you need to do what you want. I cannot tell you "the solution" because I do not know the requirements exactly, but you can try any combination of those flags...
 
Old 01-26-2018, 09:42 AM   #10
Time4Linux
LQ Newbie
 
Registered: Jan 2018
Posts: 21

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by pan64 View Post
So those are the flags of wget which modify how and what it will download. You need to read the man page for their meanings and choose the ones you need to do what you want. I cannot tell you "the solution" because I do not know the requirements exactly, but you can try any combination of those flags...
I have lightly skimmed the GNU Wget 1.18 manual. Is that "the man page"?
I told you, I don't understand it; it's all bits and pieces and I can't grasp the whole concept or what to combine.
I also looked at some videos and more concrete guides, but they didn't really help me. That's why I wish to know the exact command line I need to use.
It's also very abstract to me compared to GUI software, quite naturally.

I thought my example was very detailed about just what I wanted to do.
What I don't want is for the program to crawl other pages linked from it; it should only stick to the album page.
What more info do you require?

I can send the link to an example of an album page, if that helps?


Most of all, I have problems reading instructions in general. Not because I'm lazy, but because of some disabilities. It feels like people here just assume I'm too lazy to learn from manuals and that I don't "feel like" doing the research myself. I'm a regular computer user who is still curious and excited about Linux and cares about the security it has to offer compared to Windows. All I want is a "kick start" on this particular issue. I was hoping that isn't asking too much here, among experienced long-time users.

Last edited by Time4Linux; 01-26-2018 at 09:43 AM. Reason: grammar...
 
Old 01-26-2018, 10:19 AM   #11
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 2,332

Rep: Reputation: 511
You want a web scraper that will get the data you wish from a web page, and then you want to check for changes every so often.
After you get the list of URLs, you want to use a download manager like wget to retrieve them.

Firstly, you are asking how to take content off a page that the webmaster has not made available for download. Look at this forum's TOS. You are kinda talking about hacking. No one is going to write you a script.

So, some general guidelines,

If the web page has .jpgs listed in simple HTML links, then something as simple as:
Code:
wget http://user.albumsite.com/album -O - | grep "jpg"
or
Code:
wget http://user.albumsite.com/album -O - | grep "href" | grep "jpg"
will show you lines with jpg in them. You can further parse that with cut, sort, awk, sed, etc.

You could also download the page source to a file and view it in a simple text editor, or use your editor's search function.
Code:
wget http://www...com -O MyFile.html
If that page is using JavaScript, AJAX, JSON, etc., then there won't be direct links posted in the page source. You will have to use something that runs JavaScript, like WebKit, WebEngine, PhantomJS, Node.js, etc.

Then you will need to loop on that page every X seconds to check for changes, maybe make a list out of the results.

Take this page for example. If I wanted to make a list of the image file references on it and save it to $list, you could make a little script with bash and friends, even if it isn't pretty.

Code:
url="https://www.linuxquestions.org/questions/showthread.php?p=5811693"
Spider 1 level (slow)
Code:
wget -r --spider -l1 -A gif "$url" 2>&1 | grep -Eio http.+gif
Then there is lynx(cli browser)
Code:
lynx -image_links -dump "$url" | grep '\. https\?://.*\.\(gif\|jpg\|png\)$'
Or just parse the page source with bash and friends
Code:
wget "$url" -O - | grep ".gif" | grep -oP 'src=\K[^ ]+'
Redirect to file
Code:
command >> file.txt
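Strung together, a rough sketch of the whole thing (using the example album URL from the first post, and assuming the .jpg links sit in the page source as plain URLs rather than being generated by JavaScript) might look like:
Code:
#!/bin/bash
url="http://user.albumsite.com/album"
dir="$HOME/MyPhotos/user/album"
mkdir -p "$dir"
while :; do
    # pull the page and keep only the .jpg URLs, one per line
    wget -q "$url" -O - | grep -oE 'https?://[^" ]+\.jpg' > /tmp/links.txt
    # -nc skips anything already fetched on an earlier pass
    wget -nc -P "$dir" -i /tmp/links.txt
    sleep 60    # wait a minute between checks so you don't hammer the site
done
That is only a starting point; the grep pattern in particular will need adjusting to the real page.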
Quote:
I'll just say it again, though: The key here is to be able to monitor one webpage for updates and to do it automatically.
Then I'll say it again. Write yourself a script that does what you want.

Ask questions when you have tried and get stuck.

Last edited by teckk; 01-26-2018 at 10:21 AM.
 
2 members found this post helpful.
Old 01-26-2018, 03:58 PM   #12
sidzen
Member
 
Registered: Feb 2014
Location: GMT-7
Distribution: Slackware64, xenialpup64, Slacko5.7
Posts: 204

Rep: Reputation: 36
+1 teckk

The mindset required for M$ is not the mindset required for Linux.

I know -- I live too close to Redmond!
 
Old 01-26-2018, 04:27 PM   #13
Shadow_7
Senior Member
 
Registered: Feb 2003
Distribution: debian
Posts: 3,928
Blog Entries: 1

Rep: Reputation: 832
$ wget -c -i FileOfURLs.txt

I used to use that a lot on dial-up, to leech larger content when using someone else's broadband wifi. Instead of doing 15 MB an hour I could do 300 MB an hour at a state rest stop (or from the parking lot of the library). Although we have developed a bit around here, with a much closer truck stop now. And we finally have a WISP too, so I'm no longer slumming it in third-world telecoms. Not by much, though: it still takes HOURS to get most DVD installer images near 4 GB, but that's still better than half a month. At least laptop batteries last longer than an hour these days.
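For this thread's use case, the URL list could come straight from one of the grep one-liners shown earlier (the pattern and filenames are just an illustration):
Code:
# build the list from the album page, then download it with resume support (-c)
wget -q http://user.albumsite.com/album -O - | grep -oE 'https?://[^" ]+\.jpg' > FileOfURLs.txt
wget -c -i FileOfURLs.txt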
 
Old 01-26-2018, 07:11 PM   #14
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,513

Rep: Reputation: 1004
The manual for all commands is at
Code:
$ man command
Substitute the command name for the word command.
 
Old 01-26-2018, 09:09 PM   #15
Time4Linux
LQ Newbie
 
Registered: Jan 2018
Posts: 21

Original Poster
Rep: Reputation: Disabled
Thanks for the substantial reply, and I appreciate everyone taking the time to help an old dog like me.

Is there no web scraper I can use without using wget?
It seems like an awful lot of work in wget, even if I managed to make a script.

I could again take as an example what I'm currently using: a pretty awfully made GUI program for Windows, but one which still does the job in about five mouse clicks, including re-checking once a minute.

But in Linux, I'd have to make lists, sort them, have wget download from that list and then have it(?) check continuously and repeat?
Will that really work, and can I write a script for it in one command line? Or... I guess I would run already-made scripts, but have to modify them for new URLs?

I tried browsing the album site in question with JavaScript off, and no images were visible. However, testing both your commands generated a list of the available JPGs (just as they are linked in the HTML code, so no mystery there).

I really don't think this qualifies as hacking, though.
It's not an admin uploading the photos, but users of the site. It's a public site for public photos (welcome to the Internet). I'd like a program to download the photos from whatever album has already finished uploading or is still being uploaded, so I can do other things meanwhile and have the program check for new photos without my presence.

Well. Even though I am trying to read your post, I got stuck after the first part.
The prompt told me "written to STDOUT", and then I tried to find out more about how to sort that data with sort and awk, which I didn't really understand.
So how does "bash and friends" work to make this HTML list, and how do I have it update with new links? I'm assuming bash is a command you write in the terminal. More programming, though. =/
I can't connect the dots too well here regarding the last commands. I just get more and more confused looking at them. O_o
Is it vital that I do all those steps?
If there's a shortcut, that would be good. Could that Lynx browser do this quicker?

Nice to know about the man command call, but a pity the language is so advanced and "compressed" (I'm not a native English speaker).

I'm afraid I'm a bit too tired right now to write anything more, or use my brain.
 
  

