LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   wget command line help (https://www.linuxquestions.org/questions/linux-newbie-8/wget-command-line-help-4175622376/)

Time4Linux 01-31-2018 03:54 PM

Quote:

Originally Posted by ondoho (Post 5813670)
i had a look, clicked on some random album with not too many pictures in it.
unfortunately, the image links are hidden behind javascript (but ultimately on the same domain).
this is where wget's abilities end.
i know that tools to parse js on the commandline exist, but have never used them.


i know that similar add-ons exist for firefox.
please investigate.

See my second reply, about iceweasel! :) (I wrote some stuff here before having an epiphany...)

Quote:

Originally Posted by Shadow_7 (Post 5813747)
In iceweasel (unbranded Firefox), Hamburger Menu -> Save Page is what I've used at times to have text to parse with other tools. For small sites with mostly static content, wget -r URL makes a valiant effort.

(Sounds time consuming and complicated for that reason. I won't have time to save a page over and over just to get wget or whatever to work with the links. It needs to A. do it automatically and B. check continuously and rapidly.)

I have tried one thing that gets halfway there: using the add-on FlashGot together with wget.
It downloaded only the photos I wanted.
The problem is that it took three times as long as IDM, seemed to use a lot of resources, and was hard on the system hard drive, presumably from writing temp files or whatever.
And most of all, there was no way to repeat the process, checking for and downloading only new photos while skipping the ones already downloaded. At least not without cross-checking manually.

I just tried FlashGot with the browser's built-in downloader, and it got the files faster than anything!
There is still the issue(?) of it not checking automatically for new files, but since it was that quick, I might as well trust the downloader's speed and just repeat the process manually at whatever pace.
I didn't imagine FlashGot being that fast and good; I've only used it now and then since I first got it, some years ago...
(Perhaps this also means I can finally ditch IDM (Internet Download Manager), which is a terrible program, full of crappy code, and not even free.)

We haven't figured out how to get wget to do this on its own, but maybe that isn't necessary.
I will mark this as solved then, I guess.

teckk 02-01-2018 07:55 AM

For anyone following this thread: that page could be parsed with a little Python and something that can follow the scripts, to build a list of the image URLs.
Then download them.

Simple example with a few URLs:

Code:

agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"

list="
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_1_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_2_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_3_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_4_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_5_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_6_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_7_.jpg
"

for i in $list; do
    #curl -A "$agent" "$i" -o "${i##*/}"
    wget -U "$agent" "$i" -O "${i##*/}"
done
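Not from the thread so far, just a possible refinement for the "only download new photos" requirement: wget's -nc (--no-clobber) flag skips any file that already exists locally, so the same loop can be re-run and will fetch only what's new. A sketch with made-up example.com URLs, printing the commands instead of running them:

```shell
agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"

# Illustrative only: example.com links stand in for the real album URLs.
list="
http://example.com/images/001_1_.jpg
http://example.com/images/001_2_.jpg
"

# -nc: skip files that already exist locally, so re-runs grab only new
# photos. wget's default output name is the URL's last path component,
# so -O isn't needed here (and -O doesn't combine cleanly with -nc).
# echo shows the commands; drop it to actually download.
for i in $list; do
    echo wget -nc -U "$agent" "$i"
done
```
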

Glad the OP found something that they liked.

Edit: syntax

pan64 02-01-2018 08:26 AM

Code:

list=(
line1
line2
...
)
for i in "${list[@]}"
do
....
done

is the correct syntax.

teckk 02-01-2018 08:58 AM

Sorry @pan64, went too fast. No quotes should be around $list

Code:

for i in $list; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done

curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_1_.jpg -o 001_1_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_2_.jpg -o 001_2_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_3_.jpg -o 001_3_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_4_.jpg -o 001_4_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_5_.jpg -o 001_5_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_6_.jpg -o 001_6_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_7_.jpg -o 001_7_.jpg

Or like you said

Code:

list=(
one.com
two.com
three.com
)

for i in ${list[@]}; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 one.com -o one.com
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 two.com -o two.com
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 three.com -o three.com


ondoho 02-01-2018 02:10 PM

Quote:

Originally Posted by teckk (Post 5814215)
that page could be parsed with a little python

can python parse javascript?
or did you mean to first download the page with your browser (with javascript enabled), then parse it with python?

Shadow_7 02-01-2018 03:42 PM

Quote:

Originally Posted by Time4Linux (Post 5813965)
(Sounds time consuming and complicated for that reason. I won't have time to save a page over and over just to get wget or whatever to work with the links. It needs to A. do it automatically and B. check continuously and rapidly.)

There's this thing called programming. If you find yourself doing the same thing over and over, you write a program to do it for you. And then find a new employer or other things to consume your time.

pan64 02-02-2018 12:38 AM

Quote:

Originally Posted by teckk (Post 5814250)
Sorry @pan64, went too fast. No quotes should be around $list

Code:

for i in $list; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
....


Yes, you are right, but this implementation is unsafe. It probably works now, but it probably won't work correctly in another case.


Quote:

Originally Posted by teckk (Post 5814250)
Code:

list=(
one.com
two.com
three.com
)
for i in ${list[@]}; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
...


you missed the " which is important here, without that it will be unsafe again.

teckk 02-02-2018 08:25 AM

@ondoho
Quote:

can python parse javascript?
Yes. The easiest way is to use a browser engine: Beautiful Soup with Selenium (which I think is dead slow), or PyQt5 with QtWebEngine or QtWebKit, for example. Those are in the repos of most distros, I think. WebEngine is quite fast, but a little buggy. If you are using QtWebEngine on the nouveau driver, run the script with --disable-gpu; you can read up on why that is.

Then dump the page requests, or specific requests to terminal or file, parse it etc.

For example this python3 code snippet
Code:

import sys, os, subprocess, re, time, calendar
from datetime import datetime
from functools import partial
from PyQt5.QtCore import Qt, QTimer, QDateTime, QUrl, pyqtSignal, pyqtSlot
from PyQt5.QtNetwork import QNetworkCookie
from PyQt5.QtWidgets import QWidget, QHBoxLayout, QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import QWebEnginePage, QWebEngineView

class NetWorkManager(QWebEngineUrlRequestInterceptor):
    netS = pyqtSignal(str)
   
    def __init__(self,parent,url,print_request,block_request,default_block,
        select_request,get_link,req_file):
        super(NetWorkManager, self).__init__(parent)
       
        self.url = url
        self.print_request = print_request
        if block_request:
            self.block_request = block_request.split(',')
        else:
            self.block_request = []
           
        self.default_block = default_block
        self.select_request = select_request
        self.get_link = get_link
        self.req_file = req_file
       
    def interceptRequest(self,info):
        t = info.requestUrl()
        urlLnk = t.url()
        if self.get_link:
            if self.get_link in urlLnk:
                self.netS.emit(urlLnk)
               
        block_url = ''
        lower_case = urlLnk.lower()
       
        #Ads, banners, popup blocker
        lst = []
        if self.default_block:
            lst = [
            "doubleclick.net","ads",'.jpg','.gif','.css','facebook','.aspx',
            r"||youtube-nocookie.com/gen_204?", r"youtube.com###watch-branded-actions",
            "imagemapurl","b.scorecardresearch.com","rightstuff.com","scarywater.net",
            "popup.js","banner.htm","_tribalfusion","||n4403ad.doubleclick.net^$third-party",
            ".googlesyndication.com","graphics.js","fonts.googleapis.com/css",
            "s0.2mdn.net","server.cpmstar.com","||banzai/banner.$subdocument",
            "@@||anime-source.com^$document","/pagead2.","frugal.gif",
            "jriver_banner.png","show_ads.js",'##a[href^="http://billing.frugalusenet.com/"]',
            "||contextweb.com^$third-party",".gutter",".iab",'revcontent',
            "z-na.amazon-adsystem.com", "s.tribalfusion.com",
            "tags.expo9.exponential.com", "pagead2.googlesyndication.com"
            ]
           
        if self.block_request:
            lst = lst + self.block_request
        block = False
        for l in lst:
            if lower_case.find(l) != -1:
                block = True
                break
        if block:
            info.block(True)
           
        #Print page requests to term, spaced 
        if (self.select_request and self.select_request in urlLnk) or self.print_request:
            print('\n' + (urlLnk))
           
        #Save page requests to file, spaced
        rlist = []
        if self.req_file:
            rlist.append(urlLnk)
            for i in rlist:
                with open(self.req_file, 'a') as f:
                    f.write(i + '\n\n')

That's just the printing part of the script. Feed it with what you want.
Quote:

or did you mean to first download the page with your browser (with javascript enabled), then parse it with python?
One could do that too. Get all the links you want from a page with Python (a browser engine, like Firefox driven by Selenium), write them to a text file, then
Code:

wget -i file.txt
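For a fully static page (no JavaScript involved), even a plain shell pipeline can build that file. This is only a sketch: the HTML is inlined, with made-up example.com links standing in for real ones.

```shell
# Inline HTML for illustration; normally it would come from a saved page
# or from: curl -s "$url"
html='<a href="http://example.com/images/001_1_.jpg">one</a>
<img src="http://example.com/images/001_2_.jpg">'

# Pull out every .jpg URL, de-duplicate, and write the list for wget -i
printf '%s\n' "$html" | grep -Eo 'https?://[^"]+\.jpg' | sort -u > file.txt
cat file.txt
```

That only works when the image links appear literally in the HTML; for script-generated links you still need a browser engine as described above.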
Why did I answer by pointing to scripting in the first place? To encourage new members to script in bash, python, ruby, perl, etc.

There have been quite a few posts on LQ lately on this topic, and there seems to be a lack of interest among new members in learning how to script anything. If I can't click a button in Ubuntu, then well...I'm just lost.
So I posted as much as possible without actually posting a script that will hack a website.

Scripting is Linux; it's one of the most powerful aspects of *nix. I can see that some don't like it, or are unable to. Well then, find a GUI program, like the OP did. And of course, the OPs aren't the only ones reading these threads. New members read these threads without participating, but they pick things up, just like I did.

@pan64
Quote:

you missed the " which is important here, without that it will be unsafe again.
What do you mean unsafe? The script will fail, or something else?

Thanks.

pan64 02-02-2018 09:59 AM

Quote:

Originally Posted by teckk (Post 5814772)
@pan64

What do you mean unsafe? The script will fail, or something else?

Unsafe means the construct you use may work in some cases but fail in others. So even if it currently looks like it did what you expected, it is incorrect.
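To make "unsafe" concrete, here is a small bash sketch (with made-up URLs, one hypothetically containing a space, as scraped data sometimes does) showing the difference between the unquoted and quoted loops:

```shell
#!/usr/bin/env bash
# Hypothetical list: the first entry contains a space.
list=(
"http://example.com/photo 1.jpg"
"http://example.com/photo2.jpg"
)

# Unquoted: word splitting breaks the 2 entries into 3 words
unquoted=0
for i in ${list[@]}; do unquoted=$((unquoted + 1)); done

# Quoted: each array element is passed through intact
quoted=0
for i in "${list[@]}"; do quoted=$((quoted + 1)); done

echo "$unquoted vs $quoted"   # prints: 3 vs 2
```

With the unquoted loop, curl would be handed "http://example.com/photo" and "1.jpg" as two separate URLs.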

Time4Linux 02-03-2018 03:30 PM

I am still here, reading. :)

Nice to see I caught some interest. I'm totally oblivious to writing scripts, however, and wouldn't even know where to start, let alone what type of script, what language, etc.
So I'm guessing this isn't beginner stuff.

I'm quite intrigued by this actually, and I'd love to see a solution to it, so I marked the thread as unsolved.

Go ahead and brainstorm! :D

ondoho 02-03-2018 05:19 PM

Quote:

Originally Posted by teckk (Post 5814772)
@ondoho

Yes. The easiest way is to use a browser engine: Beautiful Soup with Selenium (which I think is dead slow), or PyQt5 with QtWebEngine or QtWebKit, for example. Those are in the repos of most distros, I think. WebEngine is quite fast, but a little buggy. If you are using QtWebEngine on the nouveau driver, run the script with --disable-gpu; you can read up on why that is.

Then dump the page requests, or specific requests to terminal or file, parse it etc.

thanks for the detailed answer.
i should've known that python has a library for everything.
still, I'm impressed.

