Old 01-03-2019, 12:22 PM   #16
peter7089
Member
 
Registered: May 2016
Distribution: MX Linux
Posts: 249

Original Poster
Rep: Reputation: Disabled

Quote:
Originally Posted by l0f4r0 View Post
Ok, you can get rid of the directory structure with the --no-directories switch, as some members told you before.
However, my command should work nonetheless (on my side it works fine and produces a links.txt file full of links). So please provide the following outputs:
Code:
wget --version
wget -r -A "*.jpg,*.jpeg" --ignore-case --spider --no-directories http://www.slackware.com/ 2>&1
wget -r -A "*.jpg,*.jpeg" --ignore-case --spider --no-directories http://website.com/dir1/ 2>&1
curl http://website.com/dir1/
NB: regarding the last 2 commands, you might want to anonymize the outputs and give us bogus URLs instead. Actually, I'm only interested in seeing the overall look of the output and whether there are any URLs pointing to .jpg/.jpeg resources...
The wget version is 1.18, but I am not sure how to post the output of the two wget commands. I tried redirecting it to a text file, but it didn't work.
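What I tried was something along these lines (an approximation, so the exact command is a guess):
Code:
# approximate attempt: output.txt ends up nearly empty
wget -r -A "*.jpg,*.jpeg" --ignore-case --spider --no-directories http://www.slackware.com/ > output.txt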
 
Old 01-04-2019, 01:09 AM   #17
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
^ Redirections can be tricky sometimes.
Try this:
Code:
wget -r -A "*.jpg,*.jpeg" --ignore-case --spider --no-directories http://www.slackware.com/ &>wgetSlackware.txt
wget -r -A "*.jpg,*.jpeg" --ignore-case --spider --no-directories http://website.com/dir1/ &>wgetWebsite.txt
curl http://website.com/dir1/ >curlWebsite.txt
Then attach those 3 files in a new post (anonymize their content if need be).
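NB: wget writes its log messages to stderr rather than stdout, which is why a plain > redirect leaves the file empty. The &> form above is bash shorthand for capturing both streams; if your shell is not bash (I'm assuming it is), a portable equivalent for the first command would be:
Code:
# POSIX-portable form: send stdout to the file, then point stderr at stdout
wget -r -A "*.jpg,*.jpeg" --ignore-case --spider --no-directories http://www.slackware.com/ >wgetSlackware.txt 2>&1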
 
Old 01-04-2019, 03:18 AM   #18
peter7089
Member
 
Registered: May 2016
Distribution: MX Linux
Posts: 249

Original Poster
Rep: Reputation: Disabled
This time it worked. But I found that wget goes outside the directory it is scraping. If the URL is http://website.com/dir1/, it moves on to http://website.com/ after it finishes parsing the links in http://website.com/dir1/. I don't know if this is normal behavior, though.

These are the files:

wgetSlackware.txt

curlWebsite.txt

wgetWebsite.txt
 
Old 01-04-2019, 06:57 AM   #19
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
^ Ok, the commands are processing fine now. It's just hard for me to analyze their outputs since you have anonymized them quite a lot!
Yes, it's normal that wget downloads resources from outside the dir1 folder: that is its recursive mode at work. You can add the --level=1 and/or --no-parent options if you want to disable that behavior. Is it better now?
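For example, a sketch of the earlier command confined to dir1 (same bogus URL as before, untested on your actual site):
Code:
# --no-parent prevents wget from ascending above /dir1/ when following links
wget -r --no-parent -A "*.jpg,*.jpeg" --ignore-case --spider --no-directories http://website.com/dir1/ &>wgetWebsite.txt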

Last edited by l0f4r0; 01-04-2019 at 06:58 AM.
 
Old 01-04-2019, 12:34 PM   #20
peter7089
Member
 
Registered: May 2016
Distribution: MX Linux
Posts: 249

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by l0f4r0 View Post
^ Ok, the commands are processing fine now. It's just hard for me to analyze their outputs since you have anonymized them quite a lot!
Ok, no problem. I still learned some new things.
 
Old 01-05-2019, 04:04 AM   #21
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
If your problem has been resolved, please mark your thread as such (see HOWTO in my sig).
 
  

