LinuxQuestions.org
Forums > Linux Forums > Linux - Newbie
Old 03-20-2018, 07:38 PM   #1
PeterUK
Member
 
Registered: May 2009
Posts: 281

Wget on PHP page


Hi, I would like to download the images from a website.

The website is structured as in the following example:

Main subpage:
http://Example.com/........../650/

The "650" directory is organized into the following pages:
http://Example.com/........../650/9601_00001.jpg.html
to:
http://Example.com/........../650/9601_00060.jpg.html
If you save the image at this stage, it is only KB in size.

If I click on the image, there is a link to the magnified image (magnifier):

http://Example.com/gallery/main.php?...serialNumber=2

If I click on the magnifier, the link to the full image is:

http://Example.com/gallery/main.php?...emId=220253358

(this one is MB in size)

I tried to use wget like:

wget -r -A jpg http://Example.com/........../650/

I can see it finding the pages, but it rejects them because they are not .jpg files.

I also tried

wget -r -A jpg http://Example.com/........../650/9601_00001.jpg.html

But it does not download the image either, not even the KB one.

I went through wget --help and tried many variants such as -m, --execute robots=off, --page-requisites, etc.; nothing works.

Could you please help?

Also, there is no direct access to

http://Example.com/gallery/

Last edited by PeterUK; 03-20-2018 at 07:39 PM.
 
Old 03-20-2018, 09:47 PM   #2
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,524

The thing with PHP is that it can conceal the actual path. So you have to surf with a browser, look at the page info, copy the actual path, and then use wget with that path. Note that 'robots' on its own won't work; the switch is -e robots=off (short for --execute robots=off). You can also use wget as a web spider to simply make a list of every object and its path; then you can look for what you want instead of all this trial and error.
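A rough sketch of that spider approach follows; the crawl itself is shown only in a comment (it needs the real site), and the grep step is demonstrated on a sample log line with a made-up itemId:

```shell
# Crawl without downloading anything, logging every URL wget discovers:
#   wget --spider -r -e robots=off -o spider.log 'http://Example.com/........../650/'
# Then pick the full-image links out of the log. Demonstrated here on one
# sample log line (the g2_itemId value is a hypothetical stand-in):
logline='--2018-03-21--  http://Example.com/gallery/main.php?g2_view=core.DownloadItem&g2_itemId=22025'
links=$(printf '%s\n' "$logline" | grep -o 'http://[^ ]*main\.php[^ ]*')
echo "$links"
```

From the resulting list you can then wget only the URLs you actually want.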
 
Old 03-21-2018, 01:46 AM   #3
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

You can try to wget from the root of the site:
Code:
wget -r -A jpg http://Example.com/
It has worked for me on at least one occasion (where I was denied otherwise).
If that doesn't work, wget's spider option could be a first step.
There are also other tools that crawl pages and download things; httrack, maybe.
But in the end, the site can always choose to deny non-direct access somehow.
 
Old 03-21-2018, 10:02 AM   #4
PeterUK
Member
 
Registered: May 2009
Posts: 281

Original Poster
I have managed to download the pages manually with:

wget -A jpg 'http://Example.com/gallery/main.php?...serialNumber=2'

The shell was breaking the URL if I didn't quote it with ''.

I also managed to download the big one in the same way.

That was done manually, since I knew the PHP paths to those pages.

Now if I fetch the page:

wget http://Example.com/........../650/9601_00001.jpg.html -O - | grep "main.php"

I get:

<link rel="stylesheet" type="text/css" href="/gallery/main.php?g2_view=imageframe.CSS&amp;g2_frames=none"/>
<a href='Example.com/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=NUMBER' class='cloud-zoom' id='zoomA' rel="adjustX: 10, adjustY:-4"><img src="/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=NUMBER&amp;g2_serialNumber=2" width="671" height="960" id="IFid1" class="ImageFrame_none" alt="Immagine 1"/></a>

How do I go from:

href='Example.com/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=NUMBER'
src="/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=NUMBER&amp;g2_serialNumber=2"

to:

http://Example.com/gallery/main.php?..._itemId=NUMBER
http://Example.com/gallery2/main.php...serialNumber=2

Is it simply text manipulation, such as replacing &amp; with &?

Any suggestions?
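A minimal sketch of that text manipulation in the shell, based on the src attribute shown above (the g2_itemId value here is a made-up stand-in):

```shell
# Take the src value as grep returned it, decode the HTML-encoded
# ampersands (&amp; -> &), and prepend the scheme and host.
src='/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=12345&amp;g2_serialNumber=2'
url="http://Example.com$(printf '%s' "$src" | sed 's/&amp;/\&/g')"
echo "$url"
```

In the sed replacement, \& stands for a literal ampersand (a bare & would re-insert the whole match).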
 
Old 03-21-2018, 12:45 PM   #5
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,725

&amp; is an encoded ampersand (&), so yes, it's text replacement: replace &amp; with & (or, equivalently, just remove the 'amp;' part).

You may also encounter other encoded symbols; they are listed in any HTML character-entity reference.
 
Old 03-21-2018, 04:21 PM   #6
PeterUK
Member
 
Registered: May 2009
Posts: 281

Original Poster
I have done it now.

I used the following sequence:

- used wget and grep to save the relevant raw HTML data into a file

- grep -o '[0-9]\{5\}' to extract the numbers

- opened the file of numbers with while IFS=: read f1 in a do loop, and used wget (with the number + PHP link)

I wonder: do I need to save the data into files, or can those commands read from variables?

Should I mask those wget calls, or can the site still tell I am using a script?
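The temporary file can indeed be skipped; the steps above can be sketched as one pipeline reading from variables. This is a sketch under assumptions: the HTML line and the 5-digit itemId are stand-ins, and the echo would be replaced by the real wget call:

```shell
# Sample HTML line, as the earlier grep "main.php" would return it
# (the itemId here is a hypothetical 5-digit stand-in).
html='<a href="/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=22025">'
# Extract the 5-digit numbers straight from the variable -- no temporary file.
ids=$(printf '%s\n' "$html" | grep -o '[0-9]\{5\}')
for id in $ids; do
  url="http://Example.com/gallery/main.php?g2_view=core.DownloadItem&g2_itemId=$id"
  echo "$url"   # replace echo with: wget -O "$id.jpg" "$url"
done
```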
 
Old 03-21-2018, 04:31 PM   #7
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,524

You might consider using --user-agent= with nothing after the = sign, so the site doesn't know what you are using, and use the --random-wait switch to stagger the timing of each download. Also, reading the entire man page for wget is of infinite value when using it.
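A hedged example combining those switches with the recursive fetch from earlier in the thread (the elided URL is kept as in the original; the --wait value is an added assumption, since --random-wait varies the delay around whatever --wait specifies):

```shell
# Send an empty User-Agent header and randomize the delay between downloads.
wget --user-agent="" --random-wait --wait=2 -r -A jpg 'http://Example.com/........../650/'
```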
 
  

