LinuxQuestions.org
Old 01-31-2018, 03:54 PM   #31
Time4Linux
LQ Newbie
 
Registered: Jan 2018
Posts: 14

Original Poster
Rep: Reputation: Disabled

Quote:
Originally Posted by ondoho View Post
i had a look, clicked on some random album with not too many pictures in it.
unfortunately, the image links are hidden behind javascript (but ultimately on the same domain).
this is where wget's abilities end.
i know that tools to parse js on the commandline exist, but have never used them.


i know that similar add-ons exist for firefox.
please investigate.
See my second reply below, about iceweasel! (I wrote some stuff here before an epiphany...)

Quote:
Originally Posted by Shadow_7 View Post
In iceweasel (unbranded firefox) Hamburger Menu -> Save Page is what I've used at times to have text stuff to parse with other tools. For small sites with mostly static content, wget -r URL does a valiant effort.
(Sounds time consuming and complicated for that reason. I won't have time to save a page over and over just to get wget or whatever to work with the links. It needs to A. do it automatically and B. check continuously and rapidly.)

I have tried one thing which works halfway: using the add-on FlashGot together with wget.
It downloaded only the photos I wanted.
The problem is that it took three times as long as IDM, seemed to use a lot of resources, and was hard on the system drive, because of writing temp files or whatever.
And most of all, there was no way to repeat the process, checking for and downloading only new photos while skipping the ones already downloaded. At least not without cross-checking manually.

I just tried FlashGot with the browser built-in downloader and it got the files faster than anything!
There is still the issue(?) of not checking automatically for new files, but since it was that quick, I could just as well trust the speed of the downloader and manually repeat the process at whatever pace.
I did not imagine FlashGot was that fast and good; I had only used it now and then since I first got it, some years ago...
(Perhaps this also means that I can finally ditch IDM (Internet Download Manager), which is a terrible program, full of crappy code and it's not even free.)

We haven't figured out how to get wget to do this on its own, but maybe that isn't necessary.
I will mark this as solved then, I guess.
 
Old 02-01-2018, 07:55 AM   #32
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 2,263

Rep: Reputation: 499
For anyone following this thread: that page could be parsed with a little Python and something that follows the page's scripts, to build a list of the image URLs.
Then download them.

Simple example with a few URLs:

Code:
agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"

list="
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_1_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_2_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_3_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_4_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_5_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_6_.jpg
http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_7_.jpg
"

for i in $list; do
    #curl -A "$agent" "$i" -o "${i##*/}"
    wget -U "$agent" "$i" -O "${i##*/}"
done
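A side note for readers following along: the ${i##*/} expansion in the loop above strips everything up to and including the last "/", leaving just the filename to pass to wget's -O (or curl's -o) option. A quick illustration, using one of the URLs from the list:

```shell
# ${i##*/} removes the longest prefix matching "*/",
# i.e. everything through the last slash.
i="http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_1_.jpg"
echo "${i##*/}"    # 001_1_.jpg
```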
Glad the OP found something that they liked.

Edit: syntax

Last edited by teckk; 02-01-2018 at 08:51 AM.
 
Old 02-01-2018, 08:26 AM   #33
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 12,996

Rep: Reputation: 4097
Code:
list=(
line1
line2
...
)
for i in "${list[@]}"
do
....
done
is the correct syntax.
 
Old 02-01-2018, 08:58 AM   #34
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 2,263

Rep: Reputation: 499
Sorry @pan64, went too fast. No quotes should be around $list

Code:
for i in $list; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done

curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_1_.jpg -o 001_1_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_2_.jpg -o 001_2_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_3_.jpg -o 001_3_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_4_.jpg -o 001_4_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_5_.jpg -o 001_5_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_6_.jpg -o 001_6_.jpg
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 http://img24.rajce.idnes.cz/d2402/14/14895/14895753_1e7961b1abc41bbfa514b52860c141c6/images/001_7_.jpg -o 001_7_.jpg
Or like you said

Code:
list=(
one.com
two.com
three.com
)

for i in ${list[@]}; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 one.com -o one.com
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 two.com -o two.com
curl -A Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 three.com -o three.com
 
Old 02-01-2018, 02:10 PM   #35
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 12,307
Blog Entries: 9

Rep: Reputation: 3310
Quote:
Originally Posted by teckk View Post
that page could be parsed with a little python
can python parse javascript?
or did you mean to first download the page with your browser (with javascript enabled), then parse it with python?
 
Old 02-01-2018, 03:42 PM   #36
Shadow_7
Senior Member
 
Registered: Feb 2003
Distribution: debian
Posts: 3,911
Blog Entries: 1

Rep: Reputation: 829
Quote:
Originally Posted by Time4Linux View Post
(Sounds time consuming and complicated for that reason. I won't have time to save a page over and over just to get wget or whatever to work with the links. It needs to A. do it automatically and B. check continuously and rapidly.)
There's this thing called programming. If you find yourself doing the same thing over and over, you write a program to do it for you. And then find a new employer or other things to consume your time.
 
Old 02-02-2018, 12:38 AM   #37
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 12,996

Rep: Reputation: 4097
Quote:
Originally Posted by teckk View Post
Sorry @pan64, went too fast. No quotes should be around $list

Code:
for i in $list; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
....
Yes, you are right, but this implementation is unsafe. It probably works now, but it may not work correctly in another case.


Quote:
Originally Posted by teckk View Post
Code:
list=(
one.com
two.com
three.com
)
for i in "${list[@]}"; do echo "curl -A "$agent" "$i" -o "${i##*/}""; done
...
You missed the quotes ("), which are important here; without them it will be unsafe again.
 
Old 02-02-2018, 08:25 AM   #38
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 2,263

Rep: Reputation: 499
@ondoho
Quote:
can python parse javascript?
Yes. The easiest way is to use a browser engine: Beautiful Soup plus Selenium (which I think is dead slow), or PyQt5 with QtWebEngine or QtWebkit, for example. Those are in the repos of most distros, I think. WebEngine is quite fast, but a little buggy. If you are using the nouveau driver with WebEngine, run the script with --disable-gpu. You can read up on why that is.

Then dump the page requests, or specific requests to terminal or file, parse it etc.

For example this python3 code snippet
Code:
import sys, os, subprocess, re, time, calendar
from datetime import datetime
from functools import partial
from PyQt5.QtCore import Qt, QTimer, QDateTime, QUrl, pyqtSignal, pyqtSlot 
from PyQt5.QtNetwork import QNetworkCookie
from PyQt5.QtWidgets import QWidget, QHBoxLayout, QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import QWebEnginePage, QWebEngineView

class NetWorkManager(QWebEngineUrlRequestInterceptor):
    netS = pyqtSignal(str)
    
    def __init__(self,parent,url,print_request,block_request,default_block,
        select_request,get_link,req_file):
        super(NetWorkManager, self).__init__(parent)
        
        self.url = url
        self.print_request = print_request
        if block_request:
            self.block_request = block_request.split(',')
        else:
            self.block_request = []
            
        self.default_block = default_block
        self.select_request = select_request
        self.get_link = get_link
        self.req_file = req_file
        
    def interceptRequest(self,info):
        t = info.requestUrl()
        urlLnk = t.url()
        if self.get_link:
            if self.get_link in urlLnk:
                self.netS.emit(urlLnk)
                
        block_url = ''
        lower_case = urlLnk.lower()
        
        #Ads, banners, popup blocker
        lst = []
        if self.default_block:
            lst = [
            "doubleclick.net","ads",'.jpg','.gif','.css','facebook','.aspx',
            r"||youtube-nocookie.com/gen_204?", r"youtube.com###watch-branded-actions",
            "imagemapurl","b.scorecardresearch.com","rightstuff.com","scarywater.net",
            "popup.js","banner.htm","_tribalfusion","||n4403ad.doubleclick.net^$third-party",
            ".googlesyndication.com","graphics.js","fonts.googleapis.com/css",
            "s0.2mdn.net","server.cpmstar.com","||banzai/banner.$subdocument",
            "@@||anime-source.com^$document","/pagead2.","frugal.gif",
            "jriver_banner.png","show_ads.js",'##a[href^="http://billing.frugalusenet.com/"]',
            "||contextweb.com^$third-party",".gutter",".iab",'revcontent',
            "z-na.amazon-adsystem.com", "s.tribalfusion.com",
            "tags.expo9.exponential.com", "pagead2.googlesyndication.com"
            ]
            
        if self.block_request:
            lst = lst + self.block_request
        block = False
        for l in lst:
            if lower_case.find(l) != -1:
                block = True
                break
        if block:
            info.block(True)
            
        #Print page requests to term, spaced   
        if (self.select_request and self.select_request in urlLnk) or self.print_request:
            print('\n' + (urlLnk))
            
        #Save page requests to file, spaced 
        rlist = []
        if self.req_file:
            rlist.append(urlLnk)
            for i in rlist:
                with open(self.req_file, 'a') as f:
                    f.write(i + '\n\n')
That's just the printing part of the script. Feed it with what you want.
Quote:
or did you mean to first download the page with your browser (with javascript enabled), then parse it with python?
One could do that too. Get all the links from a page that you desire with python - a browser engine like firefox in selenium - make a txt of links, then
Code:
wget -i file.txt
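For instance, once the page has been saved by the browser (with javascript already run), a one-liner can pull the image links out of the saved HTML into that txt file. A small sketch; saved.html and the markup in it are just made-up examples:

```shell
# Hypothetical page source, as the browser's "Save Page" might write it:
cat > saved.html <<'EOF'
<a href="http://img24.rajce.idnes.cz/images/001_1_.jpg">one</a>
<img src="http://img24.rajce.idnes.cz/images/001_2_.jpg">
EOF

# Extract the unique .jpg URLs, one per line, ready for wget -i:
grep -oE 'http[^"]+\.jpg' saved.html | sort -u > file.txt
cat file.txt
```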
Why did I answer pointing to scripting to start with? To encourage new members to script in bash, python, ruby, perl etc.

There have been quite a few posts on LQ lately on this topic, and there seems to be little interest among new members in learning to script anything. If I can't click a button in Ubuntu, then well... I'm just lost.
So I posted as much as possible without actually posting a script that will hack a website.

Scripting is Linux, it's one of the most powerful aspects of *nix. I can see that some of them don't like it, or are unable. Well then find a gui program like the OP did. And of course, the OP's aren't the only ones reading these threads. New members read these threads and don't participate, but pick up things, just like I did.

@pan64
Quote:
you missed the " which is important here, without that it will be unsafe again.
What do you mean unsafe? The script will fail, or something else?

Thanks.
 
Old 02-02-2018, 09:59 AM   #39
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 12,996

Rep: Reputation: 4097
Quote:
Originally Posted by teckk View Post
@pan64

What do you mean unsafe? The script will fail, or something else?
Unsafe means the construct you use may work in some cases but fail in others. Therefore, even if it currently looks like it does what you expected, it is incorrect.
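To make that concrete for readers, here is a small sketch (with made-up entries) of how the unquoted form breaks as soon as an entry contains whitespace, while the quoted array form does not:

```shell
# One made-up entry contains a space.
list='
http://example.com/a b.jpg
http://example.com/c.jpg
'

# Unquoted $list undergoes word splitting: the entry with
# the space is seen as two separate items.
n=0
for i in $list; do n=$((n + 1)); done
echo "$n"    # 3

# A quoted array expansion keeps each entry whole.
arr=(
    "http://example.com/a b.jpg"
    "http://example.com/c.jpg"
)
m=0
for i in "${arr[@]}"; do m=$((m + 1)); done
echo "$m"    # 2
```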
 
Old 02-03-2018, 03:30 PM   #40
Time4Linux
LQ Newbie
 
Registered: Jan 2018
Posts: 14

Original Poster
Rep: Reputation: Disabled
I am still here, reading.

Nice to see I caught some interest. I'm totally oblivious to making scripts, however, and wouldn't even know where to start, let alone what type of script, what language, etc.
So I'm guessing this isn't beginner stuff.

I'm quite intrigued by this actually, and I'd love to see a solution to it, so I marked the thread as unsolved.

Go ahead and brainstorm!
 
Old 02-03-2018, 05:19 PM   #41
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 12,307
Blog Entries: 9

Rep: Reputation: 3310
Quote:
Originally Posted by teckk View Post
@ondoho

Yes. The easiest way is to use a browser engine: Beautiful Soup plus Selenium (which I think is dead slow), or PyQt5 with QtWebEngine or QtWebkit, for example. Those are in the repos of most distros, I think. WebEngine is quite fast, but a little buggy. If you are using the nouveau driver with WebEngine, run the script with --disable-gpu. You can read up on why that is.

Then dump the page requests, or specific requests to terminal or file, parse it etc.
thanks for the detailed answer.
i should've known that python has a library for everything.
still, I'm impressed.
 
  

