LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   wget command line help (https://www.linuxquestions.org/questions/linux-newbie-8/wget-command-line-help-4175622376/)

Time4Linux 01-31-2018 03:54 PM

Quote:

Originally Posted by ondoho (Post 5813670)
i had a look, clicked on some random album with not too many pictures in it.
unfortunately, the image links are hidden behind javascript (but ultimately on the same domain).
this is where wget's abilities end.
i know that tools to parse js on the commandline exist, but have never used them.


i know that similar add-ons exist for firefox.
please investigate.

See my second reply, about iceweasel! :) (I wrote some stuff here before having an epiphany...)

Quote:

Originally Posted by Shadow_7 (Post 5813747)
In iceweasel (unbranded Firefox), Hamburger Menu -> Save Page is what I've used at times to have text to parse with other tools. For small sites with mostly static content, wget -r URL makes a valiant effort.

(Sounds time consuming and complicated for that reason. I won't have time to save a page over and over just to get wget or whatever to work with the links. It needs to A. do it automatically and B. check continuously and rapidly.)

I have tried one thing that gets halfway there: using the add-on FlashGot together with wget.
It downloaded only the photos I wanted.
The problem is that it took three times as long as IDM, seemed to use a lot of resources, and was hard on the system hard drive, presumably from writing temp files or whatever.
And most of all, there was no way to repeat the process, checking for and downloading only new photos while skipping the ones already downloaded. At least not without cross-checking manually.

I just tried FlashGot with the browser's built-in downloader, and it got the files faster than anything!
There is still the issue(?) of it not checking automatically for new files, but since it was that quick, I might as well trust the downloader's speed and just repeat the process manually at whatever pace.
I didn't imagine FlashGot being that fast and good; I've only used it now and then since I first got it, some years ago...
(Perhaps this also means I can finally ditch IDM (Internet Download Manager), which is a terrible program, full of crappy code, and not even free.)

We haven't figured out how to get wget to do this on its own, but maybe that isn't necessary.
I will mark this as solved then, I guess.

teckk 02-01-2018 07:55 AM

For anyone following this thread: that page could be parsed with a little Python and something that can follow the scripts, to build a list of the image URLs.
Then download them.

Simple example with a few URLs:

Code:

agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"

list="
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_1_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_2_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_3_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_4_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_5_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_6_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_7_.jpg
"

for i in $list; do
    #curl -A "$agent" "$i" -o "${i##*/}"
    wget -U "$agent" "$i" -O "${i##*/}"
done
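Not from the thread so far, just a possible refinement for the "only download new photos" requirement: wget's -nc (--no-clobber) flag skips any file that already exists locally, so the same loop can be re-run and will fetch only what's new. A sketch with made-up example.com URLs, printing the commands instead of running them:

```shell
agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"

# Illustrative only: example.com links stand in for the real album URLs.
list="
http://example.com/images/001_1_.jpg
http://example.com/images/001_2_.jpg
"

# -nc: skip files that already exist locally, so re-runs grab only new
# photos. wget's default output name is the URL's last path component,
# so -O isn't needed here (and -O doesn't combine cleanly with -nc).
# echo shows the commands; drop it to actually download.
for i in $list; do
    echo wget -nc -U "$agent" "$i"
done
```
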

Glad the OP found something that they liked.

Edit: syntax

pan64 02-01-2018 08:26 AM

Code:

list=(
line1
line2
...
)
for i in "${list[@]}"
do
....
done

is the correct syntax.

teckk 02-01-2018 08:58 AM

Sorry @pan64, went too fast. No quotes should be around $list

Code:

for i in $list; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done

curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_1_.jpg -o 001_1_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_2_.jpg -o 001_2_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_3_.jpg -o 001_3_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_4_.jpg -o 001_4_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_5_.jpg -o 001_5_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_6_.jpg -o 001_6_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_7_.jpg -o 001_7_.jpg

Or like you said

Code:

list=(
one.com
two.com
three.com
)

for i in ${list[@]}; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 one.com -o one.com
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 two.com -o two.com
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 three.com -o three.com


ondoho 02-01-2018 02:10 PM

Quote:

Originally Posted by teckk (Post 5814215)
that page could be parsed with a little python

can python parse javascript?
or did you mean to first download the page with your browser (with javascript enabled), then parse it with python?

Shadow_7 02-01-2018 03:42 PM

Quote:

Originally Posted by Time4Linux (Post 5813965)
(Sounds time consuming and complicated for that reason. I won't have time to save a page over and over just to get wget or whatever to work with the links. It needs to A. do it automatically and B. check continuously and rapidly.)

There's this thing called programming. If you find yourself doing the same thing over and over, you write a program to do it for you. And then find a new employer or other things to consume your time.

pan64 02-02-2018 12:38 AM

Quote:

Originally Posted by teckk (Post 5814250)
Sorry @pan64, went too fast. No quotes should be around $list

Code:

for i in $list; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
....


Yes, you are right, but this implementation is unsafe. It probably works now, but it probably won't work correctly in another case.


Quote:

Originally Posted by teckk (Post 5814250)
Code:

list=(
one.com
two.com
three.com
)
for i in ${list[@]}; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
...


you missed the " which is important here, without that it will be unsafe again.

teckk 02-02-2018 08:25 AM

@ondoho
Quote:

can python parse javascript?
Yes. The easiest way is to use a browser engine: Beautiful Soup with Selenium (which I think is dead slow), or PyQt5 with QtWebEngine or QtWebKit, for example. Those are in the repos of most distros, I think. WebEngine is quite fast, but a little buggy. If you are using QtWebEngine on the nouveau driver, run the script with --disable-gpu; you can read up on why that is.

Then dump the page requests, or specific requests to terminal or file, parse it etc.

For example this python3 code snippet
Code:

import sys, os, subprocess, re, time, calendar
from datetime import datetime
from functools import partial
from PyQt5.QtCore import Qt, QTimer, QDateTime, QUrl, pyqtSignal, pyqtSlot
from PyQt5.QtNetwork import QNetworkCookie
from PyQt5.QtWidgets import QWidget, QHBoxLayout, QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import QWebEnginePage, QWebEngineView

class NetWorkManager(QWebEngineUrlRequestInterceptor):
    netS = pyqtSignal(str)
   
    def __init__(self,parent,url,print_request,block_request,default_block,
        select_request,get_link,req_file):
        super(NetWorkManager, self).__init__(parent)
       
        self.url = url
        self.print_request = print_request
        if block_request:
            self.block_request = block_request.split(',')
        else:
            self.block_request = []
           
        self.default_block = default_block
        self.select_request = select_request
        self.get_link = get_link
        self.req_file = req_file
       
    def interceptRequest(self,info):
        t = info.requestUrl()
        urlLnk = t.url()
        if self.get_link:
            if self.get_link in urlLnk:
                self.netS.emit(urlLnk)
               
        block_url = ''
        lower_case = urlLnk.lower()
       
        #Ads, banners, popup blocker
        lst = []
        if self.default_block:
            lst = [
            "doubleclick.net","ads",'.jpg','.gif','.css','facebook','.aspx',
            r"||youtube-nocookie.com/gen_204?", r"youtube.com###watch-branded-actions",
            "imagemapurl","b.scorecardresearch.com","rightstuff.com","scarywater.net",
            "popup.js","banner.htm","_tribalfusion","||n4403ad.doubleclick.net^$third-party",
            ".googlesyndication.com","graphics.js","fonts.googleapis.com/css",
            "s0.2mdn.net","server.cpmstar.com","||banzai/banner.$subdocument",
            "@@||anime-source.com^$document","/pagead2.","frugal.gif",
            "jriver_banner.png","show_ads.js",'##a[href^="http://billing.frugalusenet.com/"]',
            "||contextweb.com^$third-party",".gutter",".iab",'revcontent',
            "z-na.amazon-adsystem.com", "s.tribalfusion.com",
            "tags.expo9.exponential.com", "pagead2.googlesyndication.com"
            ]
           
        if self.block_request:
            lst = lst + self.block_request
        block = False
        for l in lst:
            if lower_case.find(l) != -1:
                block = True
                break
        if block:
            info.block(True)
           
        #Print page requests to term, spaced 
        if (self.select_request and self.select_request in urlLnk) or self.print_request:
            print('\n' + (urlLnk))
           
        #Save page requests to file, spaced
        rlist = []
        if self.req_file:
            rlist.append(urlLnk)
            for i in rlist:
                with open(self.req_file, 'a') as f:
                    f.write(i + '\n\n')

That's just the printing part of the script. Feed it with what you want.
Quote:

or did you mean to first download the page with your browser (with javascript enabled), then parse it with python?
One could do that too. Get all the links you want from a page with Python (a browser engine, like Firefox driven by Selenium), write them to a text file, then
Code:

wget -i file.txt
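For a fully static page (no JavaScript involved), even a plain shell pipeline can build that file. This is only a sketch: the HTML is inlined, with made-up example.com links standing in for real ones.

```shell
# Inline HTML for illustration; normally it would come from a saved page
# or from: curl -s "$url"
html='<a href="http://example.com/images/001_1_.jpg">one</a>
<img src="http://example.com/images/001_2_.jpg">'

# Pull out every .jpg URL, de-duplicate, and write the list for wget -i
printf '%s\n' "$html" | grep -Eo 'https?://[^"]+\.jpg' | sort -u > file.txt
cat file.txt
```

That only works when the image links appear literally in the HTML; for script-generated links you still need a browser engine as described above.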
Why did I answer by pointing to scripting in the first place? To encourage new members to script in bash, python, ruby, perl, etc.

There have been quite a few posts on LQ lately on this topic, and there seems to be a lack of interest among new members in learning how to script anything. If I can't click a button in Ubuntu, then well...I'm just lost.
So I posted as much as possible without actually posting a script that will hack a website.

Scripting is Linux; it's one of the most powerful aspects of *nix. I can see that some don't like it, or are unable to. Well then, find a GUI program, like the OP did. And of course, the OPs aren't the only ones reading these threads. New members read these threads without participating, but they pick things up, just like I did.

@pan64
Quote:

you missed the " which is important here, without that it will be unsafe again.
What do you mean unsafe? The script will fail, or something else?

Thanks.

pan64 02-02-2018 09:59 AM

Quote:

Originally Posted by teckk (Post 5814772)
@pan64

What do you mean unsafe? The script will fail, or something else?

Unsafe means the construct you use may work in some cases but fail in others. So even if it currently looks like it did what you expected, it is incorrect.
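To make "unsafe" concrete, here is a small bash sketch (with made-up URLs, one hypothetically containing a space, as scraped data sometimes does) showing the difference between the unquoted and quoted loops:

```shell
#!/usr/bin/env bash
# Hypothetical list: the first entry contains a space.
list=(
"http://example.com/photo 1.jpg"
"http://example.com/photo2.jpg"
)

# Unquoted: word splitting breaks the 2 entries into 3 words
unquoted=0
for i in ${list[@]}; do unquoted=$((unquoted + 1)); done

# Quoted: each array element is passed through intact
quoted=0
for i in "${list[@]}"; do quoted=$((quoted + 1)); done

echo "$unquoted vs $quoted"   # prints: 3 vs 2
```

With the unquoted loop, curl would be handed "http://example.com/photo" and "1.jpg" as two separate URLs.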

Time4Linux 02-03-2018 03:30 PM

I am still here, reading. :)

Nice to see I caught some interest. I'm totally oblivious to writing scripts, however, and wouldn't even know where to start, let alone what type of script, what language, etc.
So I'm guessing this isn't beginner stuff.

I'm quite intrigued by this actually, and I'd love to see a solution to it, so I marked the thread as unsolved.

Go ahead and brainstorm! :D

ondoho 02-03-2018 05:19 PM

Quote:

Originally Posted by teckk (Post 5814772)
@ondoho

Yes. The easiest way is to use a browser engine: Beautiful Soup with Selenium (which I think is dead slow), or PyQt5 with QtWebEngine or QtWebKit, for example. Those are in the repos of most distros, I think. WebEngine is quite fast, but a little buggy. If you are using QtWebEngine on the nouveau driver, run the script with --disable-gpu; you can read up on why that is.

Then dump the page requests, or specific requests to terminal or file, parse it etc.

thanks for the detailed answer.
i should've known that python has a library for everything.
still, I'm impressed.

