wget command line help

ondoho · 01-27-2018, 04:38 AM

why don't you just download ALL images from the whole site and then sort it out afterwards?
wget can do that, and it's simple.
http://dt.iki.fi/download-filetype-wget/

Time4Linux · 01-27-2018, 06:36 AM

Quote:

Originally Posted by ondoho

why don't you just download ALL images from the whole site and then sort it out afterwards?
wget can do that, and it's simple.
http://dt.iki.fi/download-filetype-wget/

"From the whole site" would mean one billion photos. From an album web page, is only what I'm looking for.

Here's what I used:

Code:

wget --recursive --no-directories --level=2 --accept jpg \
http://user.albumsite.com/album/ ; rm robots.txt

However I don't know if or where to any files were downloaded. It seems no image files of any sort were found.
Like I wrote, they are not on the same server, but on an image server which doesn't allow directory browsing.
So this won't find the links to the images.

pan64 · 01-27-2018, 10:48 AM

looks like you need to learn more about wget. I see 2 different ways: a) use wget and try options as long as you get what you want.
b) use wget to read a page, analize it an based on that wget something else - and repeat this step until you reach your goal.

Anyway you need to learn more about wget and other tools to be able to go further. Unfortunately I cannot give you full solution, but fortunately that will not help you because you need to understand how it works to be able to fine tune it.

teckk · 01-27-2018, 12:44 PM

To the OP
At least you are persistent, you must want those images badly.
The forum gets threads like this every so often. Do a search, you'll see that the answer is the same.
Either use something like httrack or Mirror the website(Which will take forever and use lots of bandwidth), acceot only items...etc.

Code:

wget -mpkP <directory> <url>

Or look at the page source and see what it is doing

Code:

wget <url> -O MyPage.html

A download manager won't follow scripts. All that you are going to get is the source without scripts run. But I've told you that already.

Have you bothered to look at the source of that page to see if you can determine what is going on?

Your easiest solution:
What web browser are you using? Firefox has an extension called firebug that will help you, webkit browsers have web inspector, webengine browsers have remote debugging. Load the page and look at the requests the page makes in the web inspector window.

Or you can script something that will follow page scripts.
Python is a natural for that. So is phantomjs. Depends on what you already have installed.
For example, if Qt and webkit or webengine are already installed for your desktop and browser then take advantage of it

Python3 PyQt5 QtWebkit example:

Code:

#!/usr/bin/env python
#Get source with scripts run

import sys
import signal

from PyQt5.QtCore import Qt, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebKitWidgets import QWebPage

agent = 'Mozilla/5.0 (Windows NT 6.2; x86_64; rv:56.0) Gecko/20100101 Firefox/56.0'

class Source_W_Scripts(QWebPage):
    def __init__(self, url, file):
        QWebPage.__init__(self)
        self._url = url
        self._file = file
        
    def userAgentForUrl(self, Qurl):
        return (agent)

    def get_it(self):
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.loadFinished.connect(self.finished_loading)
        self.mainFrame().load(QUrl(self._url))

    def finished_loading( self, result ):
        with open(self._file, 'w') as f:
            f.write(self.mainFrame().toHtml())
        sys.exit(0)

def main():
    url = input('Enter/Paste url for source: ')
    out_file = input('Enter output file name: ')
    app = QApplication([]) #(sys.argv)
    dloader = Source_W_Scripts(url, out_file)
    dloader.get_it()
    sys.exit(app.exec_())

if __name__ == '__main__':
    main()

Python3 PyQt5 QtWengine example:

Code:

#! /usr/bin/env python

#Get source with scripts run using Python3/PyQt5/qt5-webengine
#Usage:
    #script.py <url> <local filename>
    #or script.py and answer prompts

import sys
from PyQt5.QtWebEngineWidgets import (QWebEnginePage, 
        QWebEngineProfile, QWebEngineView)
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

#Update user agents here
#Windows 10 Firefox 56
a = ('Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0)'
        ' Gecko/20100101 Firefox/56.0')

class Source(QWebEnginePage):
    def __init__(self, url, _file):
        self.app = QApplication([]) #(sys.argv)
        QWebEnginePage.__init__(self)
        
        self.agent = QWebEngineProfile(self)
        self.agent.defaultProfile().setHttpUserAgent(a) #set ua here
        
        self._file = _file
        self.load(QUrl(url))
        self.loadFinished.connect(self.on_load_finished)
        self.app.exec_()
        
    def on_load_finished(self):
        self.html = self.toHtml(self.write_it)

    def write_it(self, data):
        self.html = data
        with open (self._file, 'w') as f:
            f.write (self.html)
        print ('\nFinished\nFile saved to ' + (self._file))
        self.app.quit()
     
def main():
    #Open with arguments or prompt for input
    if len(sys.argv) > 2:
        url = (sys.argv[1])
        _file = (sys.argv[2])
    else:
        url = input('Enter/Paste url for source: ')
        _file = input('Enter output file name: ')
    Source(url, _file)
    
if __name__ == '__main__':
    main()

Or a small lite phantomjs example:

Code:

#! /usr/bin/env bash

#Get source with scripts run using phantomjs

agent="Mozilla/5.0 (Windows NT 6.2; WOW64; rv:56.0) \
Gecko/20100101 Firefox/56.0"

read -p "Enter url to get: " url
read -p "Enter output file name: " f_name

#! /usr/bin/env phantomjs
var page = require('webpage').create();
page.settings.userAgent = '$agent'
page.settings.loadImages = false 
var fs = require('fs');
var output = '$f_name';
page.open('$url', function() { 
  fs.write(output,page.content,'w');
  phantom.exit();
});

wget is a download manager, not a scraper. There isn't something with a magic button for you to push that will search a web site for content that is not made available to viewer by the admin, is hidden behind scripts, that gets called by redirects.

Time4Linux · 01-27-2018, 02:13 PM

Quote:

Originally Posted by pan64

...

I was hoping to learn more about wget through asking questions here. Not by being told to "go and study it...!"
As I have explained in almost all my posts, I don't understand the commands and descriptions well enough. It's too fragmented for me and I can't connect the dots.
I didn't expect this to cause so many posts about it.

Quote:

Originally Posted by teckk

...

I don't know how badly I want the images, but it's become a hobby of mine to browse the photo site as some photos are interesting (photography is one of my interests), but also I want to become a "part time" Linux user, so then, I want a substitute for what I'm using in Windows.
I'm primarily a Windows user because I've used it all my life and Linux I have used too little to know much about it.
That's why I can't really grasp all the command stuff and whatnot.

Yes, I have "bothered" to look at the HTML, I have written about it too. Even in my first post, as I wrote that the links point to an image server. The images aren't embedded in the page (because it would be huge). But that's common sense and practice.
I'm using Firefox. Since it came out, basically.
But what can Firebug do for me, in terms of downloading?
I don't know what studying scripts could do for me, if I don't know how to tell a program to use them.
I thought that is what grabbers are for.
And, contrary to what you're saying, for Windows, there is a program called Internet Download Manager that does just what I'm asking for here.
It's not "magic". IDM is simply given the URL and reloads the page once a minute (or whatever you schedule it to) and finds the new image links as they become available (in the HTML code).

I tried httrack and I doesn't seem that program will work for me the way I need it to.
It is very slow to download and it doesn't look like I can set it to keep checking the page without reloading the project manually.
Also impossible to mirror a photo site which has millions of users and over a billion photos. I thought I had explained that too, that what I don't want to do is mirror anything, except the album page in question.

It feels like your answers are becoming more and more about riddles and advanced programming language and that's totally beyond me.
Maybe you don't want to help and maybe then I stick to Windows, for at least this purpose. I had hoped to test this on Linux though. But it's as if you expect an absolute beginner to learn and adapt and find out without being given that much information.

I've also suggested posting a sample album page (or at least I could post sample html code) so you could go from there. But, oh well. I kind of feel like giving this up. At least asking about it here. So maybe I'll do what I usually do: learn on my own and by myself and not ask other people. That's how we do in the west.

AwesomeMachine · 01-27-2018, 05:50 PM

You can set the locale in Linux to your native language. The command switches I listed are all in the man page. If your going to use wget, you have to read the man page. Many times all the man pages are installed, for many languages.

Code:

$ locate wget | grep man

If you find your language, copy and paste the path to the file and open it with zcat.

Code:

$ zcat /usr/share/man/man1/wget.1.gz

Perhaps you're not grasping your role, and that of those you ask for help, here on LQ. You can ask, "What's a good program to download a lot of files from a website?"

We will answer, wget or curl.

If you ask, how do I use wget or curl to download what I want?

We'll say, read the man page, and give us more specific questions. Otherwise, people would ask, "How do I do Linux?" Basically, you've told us that you want to download image files, and then check for new images. There are countless ways to do that with wget. But if you don't read the manual, you won't know what we are recommending.

And if one of us types out a command line for you, it probably won't be what you need, because we're only guessing. If you want to automatically check for new images, there are five or ten ways to do that, or more!

LQ is for when you get stuck on a certain area of a problem. It is not a substitute for your own mind. Nor should it be! Otherwise we'd all be riding bicycles with training wheels all our lives.

Time4Linux · 01-27-2018, 08:55 PM

It's not only what I'm asking for, but how I'm asking for it.
I'm not so sure I would understand wget that much better if it was in my language, since we use a lot of English as well or words that looks like English. I was never too good in any command line environment and ended up just messing around.
That is why I thought asking humans about it may help me to understand.
My problem is that I have a very hard time focusing on reading and I can barely read quite simple instructions nowadays. At least not when I don't understand a fragment of it. It's a problem for me to sort the information as well so I get lost in the text and what to look at or for.

You have still not asked me for more details and I can still post the html code from one of the album pages, if that would help. No way I could figure this out better and by myself since I haven't used Linux much at all and I've never used wget.
I've stumbled when I was to do loads of things in Linux.

I was convinced that how I described what I wanted to do was precise enough for you guys to know which commands to use.
Or, maybe better for a newbie, which GUI software or add-on. I didn't think the task was at all that complex, since a pretty dumb Windows program can do this just fine. And since many say that Linux is so totally superior to Windows, I thought there must be a billion solutions to this and one of them should be fairly simple even for a beginner.

I'm still here because I'd be thrilled to solve this and get somewhere. I would feel more urge to learn wget and whatever else if you threw me a (better?) bone and explained more visually. Most codes look like hieroglyphs to me, apart from those I've seen before in DOS and which are probably irrelevant here.

How can I be more specific?
As I see it there's nothing particular about the HTML code that isn't covered in my comprised example. But maybe posting it is the only way in this case...?

AwesomeMachine · 01-27-2018, 10:39 PM

Hi,

OK, it sounds like you have some sort of learning disability. That's not to say you are not smart. You just have problems navigating a sea of text. If we can break it down into bite-sized pieces, that might work better for you.

But no one knew you have a handicap. There are also medications that work for ADD. Typical wget command

Code:

$ wget -r --user-agent="" -erobots=off -k -nc -np --random-wait -w 3 -l 2 --span-hosts http://example.com/thumbnails

That means

-r (recursive) descend into directories

--user-agent="" sets the browser id string to blank, useful for avoiding rejection by sites that specify only certain browsers can be used, also makes wget appear to be a browser rather than a robot

-erobots=off do not observe the robots.txt file

-k convert all links to local links

-nc (no clobber) do not download or overwrite existing files

-np (no parent) do not ascend up the directory tree into the parent directory

--random-wait -w 3 (inject random wait times between downloads of between 1.5 qand 4.5 seconds)

-l 2 descend 2 levels deep in the directory tree

--span-hosts if a link leads to a different website, follow it

So, if you just tweak that command line a bit to suit you needs, and put it in crond for every half hour, it will download all the files once, and after that only the new files. But if the site doesn't want robots, eventually you'll get blocked. So, just to be safe, and not too much of a pest, download updates once every few days.

pan64 · 01-28-2018, 12:30 AM

Quote:

Originally Posted by Time4Linux

I was hoping to learn more about wget through asking questions here. Not by being told to "go and study it...!"
As I have explained in almost all my posts, I don't understand the commands and descriptions well enough. It's too fragmented for me and I can't connect the dots.
I didn't expect this to cause so many posts about it.

To make it clear: I suggest you to read the man page because I can hardly explain better. So it is completely irrelevant if you read the explanation here, at LQ or directly on your monitor (locally) as a man page.
Yes, I understand it is not a trivial task, and there is no problem if you have questions. But please be specific and do not ask how wget works (in general).

You can also go for wget tutorials like this: https://www.thegeekstuff.com/2009/09...esome-examples (there are a lot out there), if that was easier to understand.

From the other hand you cannot make it wrong, so just test any combination of flags, that will not hurt and you will be able to understand (better) what's going on.

Shadow_7 · 01-28-2018, 05:13 AM

It's open source. Everything you need to know can be gleaned from the manpage, or by reading the source code. Which would probably yield an answer faster than random forum hoping for random expert to find the one of millions of threads and to chime in with the answer you desire. It might happen, plus / minus a lifetime, or not.

ondoho · 01-28-2018, 07:11 AM

Quote:

Originally Posted by pan64

To make it clear: I suggest you to read the man page because I can hardly explain better.

for me there's another even more important aspect:
I suggest you to read man pages, because if I explained it to you, I would have to read the man pages myself first.
We don't keep all this knowledge in our heads, you know.

maybe you need a nicer terminal?
does it colorize man output? it should. colors help.
does the color scheme look pleasing to the eye?
is your command prompt visually enhanced so that you can easily spot where the output of one command ends, and the next begins?
is the font easily readable for you?

i found that without these things, using the terminal is indeed painful.

ondoho · 01-28-2018, 07:13 AM

Quote:

Originally Posted by Time4Linux

You have still not asked me for more details and I can still post the html code from one of the album pages, if that would help. No way I could figure this out better and by myself since I haven't used Linux much at all and I've never used wget.
I've stumbled when I was to do loads of things in Linux.

I was convinced that how I described what I wanted to do was precise enough for you guys to know which commands to use.
Or, maybe better for a newbie, which GUI software or add-on.

i'm not promising anything, but first of all you would have to tell us what website this is and what exactly you want to filter out and download.

Time4Linux · 01-29-2018, 08:09 AM

Thanks for being understanding about my disadvantages.

So it seems I need to understand how the website works in order to download from it.
I have been hesitant to post a link to an album, because I don't want to infringe some user's rights or privacy, even if their albums are open to the world.

The site in question is a Czech photo site called Rajce.net
Some albums are listed by one category here:
http://alba.rajce.idnes.cz/?category=2

The html code with scripts and all is quite long, so I might have to make a pastebin of it or whatever if you want to see it.

IDM for Windows which I'm using has no problems with following thumbnail links and downloading the full size images linked from each album page, and it's not rejected even when I download 10 files at once and in rapid pace. I can download 1000 images from one page in 30 seconds. Plus I'm able to do that for several album pages at once. So it seems the server is tolerant. Therefore I don't think I need to be that gentle with wait times.

But it's another story to do this operation from a command line, apparently.
At least when you don't know how servers communicate and all that jazz.

This one didn't work, however:

Code:

wget -r --user-agent="" -erobots=off -k -nc -np --random-wait -w 3 -l 2 --span-hosts http://example.com/thumbnails

No image files. It only downloaded index.html files, css files and similar and because of using levels it downloaded loads of folders. I'm glad I could terminate the process with closing the terminal...

There are no colors in the terminal (is that the same as the "man"... thing?). Just white on black.
But yes, colors help.

I have an issue with getting my backside into gear these days with most things. Otherwise I could check and experiment lots more.

Anyway. I think maybe the next step here, if any, is to post html code?
If you think it's OK I will.
If I had some IRL buddy to do this with, I could pester them.
Maybe some good video tutorial could help, but I don't think they exist for this very purpose since it's so specific.

ondoho · 01-31-2018, 01:00 AM

Quote:

Originally Posted by Time4Linux

http://alba.rajce.idnes.cz/?category=2

i had a look, clicked on some random album with not too many pictures in it.
unfortunately, the image links are hidden behind javascript (but ultimately on the same domain).
this is where wget's abilities end.
i know that tools to parse js on the commandline exist, but have never used them.

Quote:

IDM for Windows which I'm using has no problems with following thumbnail links and downloading the full size images linked from each album page, and it's not rejected even when I download 10 files at once and in rapid pace.

i know that similar add-ons exist for firefox.
please investigate.

Shadow_7 · 01-31-2018, 05:00 AM

In iceweasel (unbranded firefox) Hamburger Menu -> Save Page is what I've used at times to have text stuff to parse with other tools. For small sites with mostly static content, wget -r URL does a valiant effort.