LinuxQuestions.org > Linux - Newbie > wget command line help
(https://www.linuxquestions.org/questions/linux-newbie-8/wget-command-line-help-4175622376/)

Time4Linux 01-25-2018 08:03 AM

wget command line help
 
Hello!

I am new to Ubuntu and need to use wget to grab photos from webpages.

wget looks very advanced and I just can't get my head around it as I'm not comfortable with this non-GUI environment. I need a "kick start" to get going.

I would like to know what the command line would be for the following example scenario:

On the webpage "http://user.albumsite.com/album" there are thumbnails linking to images on an image server on albumsite.com. (Located for example at "img23.albumsite.com" and filenames will of course often vary and not necessarily follow a pattern.)
There are other objects on the page, like image banners, ads, pngs and gifs which I don't want. These are often located at "http://www.albumsite.com" so I might want to block objects from there.
(Sometimes there will also be .mp4 videos on the page, which I may want.)

The thing is, this webpage may be updating with new photos, so I'd like to know what to input for wget to check, let's say every 20, 30 or 60 seconds for new photos on this particular page?

Finally, I want it to download to "D:\MyPhotos\user\album".

What's the most convenient way to have the command line ready for other webpages/albums? Just copy/paste and edit for new URLs and folders ("user2/album2")?

I'd really appreciate the help and not having to plough through all the manuals and guides for this software, at least until I've gotten started.

Lastly, if there's other software or a Firefox browser add-on that will do this for me perhaps even quicker, I'd love to know.
(I tried Video DownloadHelper, but couldn't figure out how to download all images at once and also for it to check for new photos, unattended.)

Thank you!

sidzen 01-25-2018 09:41 AM

OP - "Lastly, if there's another software ..."

PMirrorget in Puppy Linux (i.e. Puppy Slacko64)

chron is one way to go for a scheduled check
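
For example, a crontab line like this (the script path is only a placeholder) runs a check script every minute, which is cron's smallest interval; for sub-minute checks a loop is needed instead:

Code:

# hypothetical crontab entry: run a fetch/check script once a minute
* * * * * /home/user/bin/check-album.sh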

Best wishes! Persist!

Time4Linux 01-25-2018 10:16 AM

Quote:

Originally Posted by sidzen (Post 5811319)
OP - "Lastly, if there's another software ..."

PMirrorget in Puppy Linux (i.e. Puppy Slacko64)

chron is one way to go for a scheduled check

Best wishes! Persist!

Thanks for the reply. It confused me even more, though.
I would need more details than that. Remember that I'm a newbie at this.
I wish to see the full wget command line at the terminal prompt for what I'm asking for, including my example URL and all the other options and values which I specified.

I googled your tips:
"chron" is not a valid wget command AFAIK, and the command "cron", which I'm assuming you meant, seems to do other things than what I need.
And are you suggesting that I install Puppy Linux?
I will only be running Ubuntu 16.04.3 LTS, because I need it to work with/for other things...

Also, I'm not looking to mirror a webpage (and definitely not a whole site), just grab the images from one album page.

Please give me more help! Thanks.

teckk 01-25-2018 10:23 AM

Look at
Code:

man wget
and
Code:

man curl
Something like
Code:

wget http://....jpg -O file.jpg
Quote:

so I'd like to know what to input for wget to check, let's say every 20, 30 or 60 seconds for new photos on this particular page?
wget is not a web scraper, it's a download manager.

You'll need to use bash and friends, Python, QtWebEngine, Beautiful Soup, or just look at the source of the web page to collect the links to the .jpg files that you want.

To check something every 30 seconds (which will probably get you banned from a website after a while), loop on it.
Example:
Code:

num=1
while :; do
    echo "$num"
    num=$(($num + 1))
    sleep 1 #set sleep time here
done
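
A rough variant of that loop (untested; the URL is just the OP's example) that re-fetches the page every 30 seconds and reports when its content has changed:
Code:

url="http://user.albumsite.com/album"
old=""
while :; do
    # hash the current page content so changes are easy to detect
    new=$(wget -q -O - "$url" | md5sum)
    if [ "$new" != "$old" ]; then
        echo "page changed at $(date)"
        old="$new"
    fi
    sleep 30 #set check interval here
done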

Quote:

Finally, I want it to download to "D:\MyPhotos\user\album".
This is Microsoft Windows speak. It has nothing to do with wget or Linux.
If you want a directory under home called album,
Code:

mkdir ~/album
Then
Code:

wget http:.....com -O ~/album/file.jpg
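If you end up grabbing more than one file, wget's -P (--directory-prefix) option saves everything under a chosen directory while keeping the remote file names. A small sketch (the host and path are only the OP's example, not a real URL):
Code:

wget -P ~/album "http://img23.albumsite.com/path/photo.jpg"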
You have some studying to do first.
Read the man page for wget, read about bash scripting, bash core tools like cd, rm, mkdir, etc.

Man pages are also available online; use Google:
https://linux.die.net/man/1/wget
http://mywiki.wooledge.org/BashFAQ

Quote:

What's the most convenient way to have the command line ready for other webpages/albums? Just copy/paste and edit for new URLs and folders ("user2/album2")?
Make yourself a script.
Quote:

I'd really appreciate the help and not having to plough through all the manuals and guides for this software, at least until I've gotten started.
You'll find that members on this forum will be glad to help you as you show some initiative. They probably won't write scripts for you, though; that wouldn't do you any good.

Quote:

Lastly, if there's another software or Firefox browser add-on that will do this for me perhaps even quicker,
You can look at Firefox's Firebug, WebKit's web engine, WebEngine's remote debugging... and then there is Python, urllib, Selenium, Beautiful Soup, etc.

Welcome to LQ.

Edit: Spelling and format

Time4Linux 01-25-2018 03:17 PM

Well, if wget isn't the right tool for what I need, then I wish I had named the topic differently.

So my question is, wget or not, which is the best tool in Ubuntu (and with Firefox?) to do what I described in the original post?

Since I will not know how the photos are named that are being uploaded to the page, I can't "study the code" to get a program to find the photos. The software needs to figure this out on its own, and the operation must be unattended.
All I will see is what server the photos are on, nothing more.
If the album is already complete, it's another matter and I can figure out ways to get those albums myself with whatever browser add-on, etc.
But the main thing here is to grab all images from an updating webpage.

Currently, I'm on Windows and I'm using Internet Download Manager for this. It has some bugs and isn't always reliable, but it still meets my needs fully and is really quick. For example I can tell IDM to check a page once a minute, which is usually sufficient.
So, I need something similar or that does the job similarly, for Ubuntu.

IMO that's quite a straightforward question, so I could do without all the super techy stuff, since I have told you that I'm lacking the skills and knowledge.
Why not use my example and tell me what you would have used and done, in Ubuntu?

If I'm expected to "figure out" everything myself and just get a huge list of software and whatever, what is then the point of asking for help?
I don't have time, energy or capacity to read up on all of these very advanced programs. I only want to become an Ubuntu user with as little hassle as possible. And I'm not going to use it for that many things anyway.

AwesomeMachine 01-25-2018 07:02 PM

Wait a minute
 
Quote:

Originally Posted by Time4Linux (Post 5811332)
I wish to see the full wget command line at the terminal prompt for what I'm asking for, including my example URL and all the other options and values which I specified.

Please give me more help! Thanks.

Well, you can always ask for everything you wish for. Basically, you're saying: I only want to do this one little thing, so why should I have to read the manual? Can't I just come here and have someone do it for me?

No, you can't. But I will tell you that wget downloads way more files than what a person usually wants. You just have to delete the extra files.

Some of the operators that might be of interest are:

--user-agent=
-r
-nc
-k
--random-wait --wait
-erobots=
-l
--span-hosts
-np

I'll give you some hints about those. User agent is your browser's identity string. I usually set that to "", which means 'nothing'.

Random wait is explained in the man page.

erobots is not listed in the manual as a single option; it is -e robots=off (or on), depending on whether you want wget to honour the site's robots.txt file.
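
Putting several of those together, a rough sketch (untested; the host names and save path are only the OP's examples, and the exact flags will need tuning for the real site) could look like:

Code:

wget -r -l1 -np -nc -H \
     --domains=albumsite.com --exclude-domains=www.albumsite.com \
     -A "jpg,jpeg,mp4" \
     --wait=1 --random-wait \
     --user-agent="" \
     -e robots=off \
     -P ~/MyPhotos/user/album \
     "http://user.albumsite.com/album"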

There is no way a program can know which files you want from a site. It can only do what you tell it to do. Globbing, i.e. wildcards, doesn't work with wget, but it does with curl. However, the website must also support globbing if you want to use it.

Globbing examples: *.jpg for all jpgs; [a-z][a-z]??.tar for any files that begin with 2 lower case letters followed by any 2 characters, with a tar extension.
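
With curl, the brackets are expanded on the client side, so this only helps when the file names really do follow a pattern (the OP says they often won't); the host and path below are only illustrative:
Code:

# fetch photo001.jpg ... photo120.jpg, keeping the remote names
curl -O "http://img23.albumsite.com/album/photo[001-120].jpg"

# or choose the local names yourself; #1 expands to the matched number
curl "http://img23.albumsite.com/album/photo[001-120].jpg" -o "photo_#1.jpg"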

Keith Hedger 01-25-2018 07:22 PM

Check out HTTrack: it can mirror whole websites, restrict downloads to a single domain, file type, etc., change links in HTML pages to point to the local downloaded copies, and it has a reasonably easy GUI. It can also resume downloads and just get the bits that have been added.

Haven't used it for a while so can't give detailed instructions, but it's worth looking at.

Time4Linux 01-26-2018 06:56 AM

Quote:

Originally Posted by AwesomeMachine (Post 5811504)
...

You're writing in riddles.
"-r", "-nc", etc. But what do they mean and do? I have the wget manual which I could look in. I have, and that's why I'm here, because I can't interpret it to a real situation let alone know how to effectively combine arguments.

Trial and error in a command line program is different from a GUI one, because I can't tell what the individual operators really do (if you don't tell me) and how they work or don't work together.

While I'm open to reading up on things, I have no interest in spending days and weeks "trying stuff" when I know there are people with the experience and knowledge whom I could ask for help, and who should know well enough what to input or use (if not wget!) to get the desired effect which I have written about in detail.

I know that the server does not allow directory browsing or listing, so the program needs to get the image links from the album page, from the image thumbnails that direct to the full size images.
That is how and why Internet Download Manager has worked for me: it checks the album page for image links and grabs JPGs larger than e.g. 10 KB, while also re-checking once a minute (which I set it to) for new links to images.


Quote:

Originally Posted by Keith Hedger (Post 5811508)
Check out HTTrack: it can mirror whole websites, restrict downloads to a single domain, file type, etc., change links in HTML pages to point to the local downloaded copies, and it has a reasonably easy GUI. It can also resume downloads and just get the bits that have been added.

Haven't used it for a while so can't give detailed instructions, but it's worth looking at.

Thanks, but can it really work to monitor one page that is updating with images in real time? I believe that feature must be part of the program, or it will not work properly.
At least, when I tried it, I couldn't get it to do what I wanted in the way I wanted. I was quite confused by the many settings.
If anyone has some experience with this program, I'd be grateful.

I'll just say it again, though: The key here is to be able to monitor one webpage for updates and to do it automatically.

pan64 01-26-2018 08:06 AM

Quote:

Originally Posted by Time4Linux (Post 5811669)
You're writing in riddles.
"-r", "-nc", etc. But what do they mean and do? I have the wget manual which I could look in. I have, and that's why I'm here, because I can't interpret it to a real situation let alone know how to effectively combine arguments.

Those are wget flags which modify how and what it will download. You need to read the man page for what they mean and choose the ones you really need to do what you need. I cannot tell you "the solution" because I do not know the requirements exactly, but you can try any combination of those flags...

Time4Linux 01-26-2018 09:42 AM

Quote:

Originally Posted by pan64 (Post 5811693)
Those are wget flags which modify how and what it will download. You need to read the man page for what they mean and choose the ones you really need to do what you need. I cannot tell you "the solution" because I do not know the requirements exactly, but you can try any combination of those flags...

I have lightly skimmed the GNU Wget 1.18 manual. Is that "the man page"?
I told you, I don't understand it and it's all bits and pieces and I can't grasp the whole concept and what to combine.
I also looked at some videos and a more concrete guide, but they didn't really help me. That's why I wish to know the exact command line I need to use.
It's also very abstract to me, compared to a GUI software, quite naturally.

I thought my example was very detailed about just what I wanted to do.
What I don't want the program to do is crawl other pages linked from it; it should stick only to the album page.
What more info do you require?

I can send the link to an example of an album page, if that helps?


Most of all I have problems reading instructions in general. Not because I'm lazy, but because of some disabilities. It's frustrating when people here just assume I'm too lazy to learn from manuals and that I don't "feel like" doing the research myself. I'm a regular computer user who is still curious/excited about Linux and cares about the security it has to offer, compared to Windows. All I want is a "kick start" with this particular issue. I was hoping that isn't asking for too much, here, among experienced long-time users.

teckk 01-26-2018 10:19 AM

You want a web scraper that will get the data you wish from a web page. Then you want to check for changes every so often.
After you get the list of URLs, you then want to use a download manager like wget to retrieve them.

Firstly, you are asking how to take content off a page that the web admin has not made available for download. Look at this forum's TOS. You are kinda talking about hacking. No one is going to write you a script.

So, some general guidelines,

If the web page has .jpgs listed in simple HTML links, then something as simple as:
Code:

wget http://user.albumsite.com/album -O - | grep "jpg"
or
Code:

wget http://user.albumsite.com/album -O - | grep "href" | grep "jpg"
will show you the lines with "jpg" in them. You can further parse that with cut, sort, awk, sed, etc.
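
For example, to end up with just the image URLs, one per line (untested; the pattern assumes absolute links in double quotes):
Code:

wget -q -O - "http://user.albumsite.com/album" | grep -oE 'http[^"]*\.jpg' | sort -u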

You could also download the page source to file and use a simple text editor to view or use the search function of your editor.
Code:

wget http://www...com -O MyFile.html
If that page is using JavaScript, Ajax, JSON, etc., then there won't be direct links posted on the page. You will have to use something that runs JavaScript, like... WebKit, WebEngine, PhantomJS, Node.js, etc.

Then you will need to loop on that page every X seconds to check for changes, maybe make a list out of the results.

Take this page for example. If I wanted to make a list of the image file references on it and save them, you could make a little script with bash and friends, even if it isn't pretty.

Code:

url="https://www.linuxquestions.org/questions/showthread.php?p=5811693"
Spider 1 level (slow)
Code:

wget -r --spider -l1 -A gif "$url" 2>&1 | grep -Eio http.+gif
Then there is lynx (a CLI browser):
Code:

lynx -image_links -dump "$url" | grep '\. https\?://.*\.\(gif\|jpg\|png\)$'
Or just parse the page source with bash and friends
Code:

wget "$url" -O - | grep ".gif" | grep -oP 'src=\K[^ ]+'
Redirect to file
Code:

command >> file.txt
Quote:

I'll just say it again, though: The key here is to be able to monitor one webpage for updates and to do it automatically.
Then I'll say it again. Write yourself a script that does what you want.
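
For what it's worth, here is a rough sketch of such a script, built only from the pieces above (untested; the URL, directory and grep pattern are assumptions based on the OP's example, and it only works if the .jpg links appear directly in the page source):

Code:

#!/bin/bash
# Poll one album page and fetch any new .jpg links found on it.
url="http://user.albumsite.com/album"
dir="$HOME/MyPhotos/user/album"
list=$(mktemp)
mkdir -p "$dir"

while :; do
    # collect the image links currently on the page, one per line
    wget -q -O - "$url" | grep -oE 'http[^"]*\.jpg' | sort -u > "$list"
    # -nc skips files that are already in $dir
    wget -q -nc -P "$dir" -i "$list"
    sleep 60 #re-check once a minute
done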

Ask questions when you have tried and get stuck.

sidzen 01-26-2018 03:58 PM

+1 teckk

Mindset requirement for M$ is not equal to the mindset required for Linux.

I know -- I live too close to Redmond!

Shadow_7 01-26-2018 04:27 PM

$ wget -c -i FileOfURLs.txt

I used to use that a lot on dialup, to leech some larger content when using someone else's broadband wifi. So instead of doing 15MB an hour I could do 300MB an hour at a state rest stop (or from the parking lot of the library). Although we have developed a bit around here, with a much closer truck stop now. But we finally have a WISP too, so I'm no longer slumming it in 3rd-world telecoms. Although not by much; it still takes HOURS to get most DVD installer images near 4GB, but that's still better than half a month. At least laptop batteries last longer than an hour these days.

AwesomeMachine 01-26-2018 07:11 PM

The manual for all commands is at
Code:

$ man command
Substitute the command name for the word command.

Time4Linux 01-26-2018 09:09 PM

Thanks for the substantial reply and I appreciate everyone taking their time to help an old dog like me.

Is there no web scraper I can use without using wget?
It seems like an awful lot of work in wget, even if I managed to make a script?

I could again take the example of what I'm currently using: a pretty awfully made GUI program for Windows, but one which still does it in a matter of about 5 clicks of a mouse, including re-checking once a minute.

But in Linux, I'd have to make lists, sort them, have wget download from that list, and then have it(?) check continuously and repeat?
Will that really work and can I write a script for it in one command line? Or... I guess I will run already made scripts, but having to modify them for new URLs?

I tried browsing the album site in question with javascript off, and no images were visible. However testing both your commands generated a list of the available jpgs. (Just like they are linked in the HTML code, so no mystery there.)

I really don't think this qualifies as hacking, though.
There's not an admin uploading photos, but users of the site. It's a public site for public photos (welcome to the Internet). I'd like a program to download the photos from whatever album there has already finished uploading or is still being uploaded, so I can do other things meanwhile and have the program check for new photos without my presence.

Well. Even though I am trying to read your post, I got stuck after the first part.
The prompt told me "written to STDOUT" and then I tried to find out more about how to sort that data with "sort" and "awk", which I didn't really understand.
So how does "bash and friends" work, to make this html list, and how do I have it update with new links? I'm assuming bash is a command you write in the terminal. More programming, though. =/
I can't connect the dots too well here, regarding the last commands. I just get more and more confused looking at them. O_o
Is it vital that I do all those steps?
If there's a shortcut that would be good. Could that Lynx browser do this quicker?

Nice to know about the man command call, but a pity the language is so advanced and "compressed" (I'm not a native English speaker).

I'm afraid I'm a bit too tired right now to write anything more, or use my brain.

