LinuxQuestions.org
Forums > Linux Forums > Linux - Newbie
Old 03-20-2018, 07:38 PM   #1
PeterUK
Member
 
Registered: May 2009
Posts: 281

Wget on PHP page


Hi, I would like to download the images from a website.

The website is structured as in the following example:

Main subpage:
http://Example.com/........../650/

The "650" directory is organized into the following pages:
http://Example.com/........../650/9601_00001.jpg.html
to:
http://Example.com/........../650/9601_00060.jpg.html
If you save the image at this stage, it is only KB in size.

If I click on the image, there is a link to the magnified image (magnifier):

http://Example.com/gallery/main.php?...serialNumber=2

If I click on the magnifier, the link to the full image is:

http://Example.com/gallery/main.php?...emId=220253358

(this one is MB in size)

I tried to use wget like:

wget -r -A jpg http://Example.com/........../650/

I can see it finding the pages, but it rejects them because they are not .jpg files.

I also tried

wget -r -A jpg http://Example.com/........../650/9601_00001.jpg.html

But it does not download the image either, not even the KB one.

I went through wget --help and tried many variants such as -m, --execute robots=off, --page-requisites, etc.; nothing works.

Could you please help?

Also, there is no direct access to

http://Example.com/gallery/

Last edited by PeterUK; 03-20-2018 at 07:39 PM.
 
Old 03-20-2018, 09:47 PM   #2
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,524

The thing with PHP is that it can conceal the actual path. So you have to surf with a browser, look at the page info, copy the actual path, and then use wget with that path. Note that 'robots' on its own won't work; the switch is -e robots=off (short for --execute robots=off). You can also use wget as a web spider to simply make a list of every object and its path; then you can look for what you want instead of all this trial and error.
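A rough sketch of that spider approach follows; the crawl itself is shown only in a comment (it needs the real site), and the grep step is demonstrated on a sample log line with a made-up itemId:

```shell
# Crawl without downloading anything, logging every URL wget discovers:
#   wget --spider -r -e robots=off -o spider.log 'http://Example.com/........../650/'
# Then pick the full-image links out of the log. Demonstrated here on one
# sample log line (the g2_itemId value is a hypothetical stand-in):
logline='--2018-03-21--  http://Example.com/gallery/main.php?g2_view=core.DownloadItem&g2_itemId=22025'
links=$(printf '%s\n' "$logline" | grep -o 'http://[^ ]*main\.php[^ ]*')
echo "$links"
```

From the resulting list you can then wget only the URLs you actually want.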
 
Old 03-21-2018, 01:46 AM   #3
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

You can try to wget from the root of the site:
Code:
wget -r -A jpg http://Example.com/
It has worked for me on at least one occasion (where I was denied otherwise).
If that doesn't work, wget's spider option could be a first step.
There are also other tools that crawl pages and download things; httrack, maybe.
But in the end, the site can always choose to deny non-direct access somehow.
 
Old 03-21-2018, 10:02 AM   #4
PeterUK
Member
 
Registered: May 2009
Posts: 281

Original Poster
I have managed to download the pages manually with:

wget -A jpg 'http://Example.com/gallery/main.php?...serialNumber=2'

The shell was breaking the URL if I didn't quote it with ''.

I also managed to download the big one in the same way.

That was done manually, since I knew the PHP paths to those pages.

Now if I fetch the page:

wget http://Example.com/........../650/9601_00001.jpg.html -O - | grep "main.php"

I get:

<link rel="stylesheet" type="text/css" href="/gallery/main.php?g2_view=imageframe.CSS&amp;g2_frames=none"/>
<a href='Example.com/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=NUMBER' class='cloud-zoom' id='zoomA' rel="adjustX: 10, adjustY:-4"><img src="/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=NUMBER&amp;g2_serialNumber=2" width="671" height="960" id="IFid1" class="ImageFrame_none" alt="Immagine 1"/></a>

How do I go from:

href='Example.com/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=NUMBER'
src="/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=NUMBER&amp;g2_serialNumber=2"

to:

http://Example.com/gallery/main.php?..._itemId=NUMBER
http://Example.com/gallery2/main.php...serialNumber=2

Is it simply text manipulation, such as replacing &amp; with &?

Any suggestions?
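A minimal sketch of that text manipulation in the shell, based on the src attribute shown above (the g2_itemId value here is a made-up stand-in):

```shell
# Take the src value as grep returned it, decode the HTML-encoded
# ampersands (&amp; -> &), and prepend the scheme and host.
src='/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=12345&amp;g2_serialNumber=2'
url="http://Example.com$(printf '%s' "$src" | sed 's/&amp;/\&/g')"
echo "$url"
```

In the sed replacement, \& stands for a literal ampersand (a bare & would re-insert the whole match).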
 
Old 03-21-2018, 12:45 PM   #5
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,725

&amp; is an encoded ampersand (&), so yes, it's text replacement: replace &amp; with & (or, equivalently, just remove the 'amp;' part).

You may also encounter other encoded symbols; they are listed in any HTML character-entity reference.
 
Old 03-21-2018, 04:21 PM   #6
PeterUK
Member
 
Registered: May 2009
Posts: 281

Original Poster
I have done it now.

I used the following sequence:

- used wget and grep to save the relevant raw HTML data into a file

- grep -o '[0-9]\{5\}' to extract the numbers

- opened the file of numbers with while IFS=: read f1 in a do loop, and used wget (with the number + PHP link)

I wonder: do I need to save the data into files, or can those commands read from variables?

Should I mask those wget calls, or can the site still tell I am using a script?
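The temporary file can indeed be skipped; the steps above can be sketched as one pipeline reading from variables. This is a sketch under assumptions: the HTML line and the 5-digit itemId are stand-ins, and the echo would be replaced by the real wget call:

```shell
# Sample HTML line, as the earlier grep "main.php" would return it
# (the itemId here is a hypothetical 5-digit stand-in).
html='<a href="/gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=22025">'
# Extract the 5-digit numbers straight from the variable -- no temporary file.
ids=$(printf '%s\n' "$html" | grep -o '[0-9]\{5\}')
for id in $ids; do
  url="http://Example.com/gallery/main.php?g2_view=core.DownloadItem&g2_itemId=$id"
  echo "$url"   # replace echo with: wget -O "$id.jpg" "$url"
done
```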
 
Old 03-21-2018, 04:31 PM   #7
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,524

You might consider using --user-agent= with nothing after the = sign, so the site doesn't know what you are using, and use the --random-wait switch to stagger the timing of each download. Also, reading the entire man page for wget is of infinite value when using it.
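A hedged example combining those switches with the recursive fetch from earlier in the thread (the elided URL is kept as in the original; the --wait value is an added assumption, since --random-wait varies the delay around whatever --wait specifies):

```shell
# Send an empty User-Agent header and randomize the delay between downloads.
wget --user-agent="" --random-wait --wait=2 -r -A jpg 'http://Example.com/........../650/'
```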
 
  

