Old 03-25-2021, 11:23 AM   #1
n00b_noob
Member
 
Registered: Sep 2020
Posts: 436

Rep: Reputation: Disabled
Can I use "cURL" or "wget" to click on the links on a page?


Hello,
On a web page, I want to "click" on all the links. Can I use "cURL" or "wget" for this task?
I saw https://askubuntu.com/questions/6390...tiple-webpages and also found this "wget" command:
Code:
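# -r recurses through linked pages, -p fetches page requisites (images, CSS), -k converts links for local viewing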
$ wget -r -p -k http://website
But it downloads the whole website, and I just want to follow the links on a single page. For example, consider the https://www.amazon.com/s?k=linux&i=s...ref=nb_sb_noss URL: it shows a list of books, and I want to use cURL or wget to "click" on each book on that page.

Thank you.
 
Old 03-25-2021, 12:28 PM   #2
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
No, you cannot 'click' with wget or curl. You may be able to simulate a 'click' by transmitting the data that the click would have transmitted, but for that you would first need to analyze the traffic that results from such a click.

Besides, some websites are implemented almost entirely in JavaScript, and clicks may be processed entirely by JavaScript. Neither of those CLI tools can execute JavaScript.
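For example, if analyzing the traffic showed that a click sends a POST request, you could replay it with curl along these lines. The endpoint and form fields below are made up; the real ones have to come from the captured traffic:
Code:
# Hypothetical replay of the request a 'click' would send.
# The endpoint, fields, and values must come from real captured traffic.
curl -s 'https://example.com/click-handler' \
     -H 'Referer: https://example.com/list' \
     -d 'item_id=12345' \
     -d 'action=open'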

Last edited by crts; 03-25-2021 at 12:32 PM.
 
Old 03-25-2021, 02:00 PM   #3
dc.901
Senior Member
 
Registered: Aug 2018
Location: Atlanta, GA - USA
Distribution: CentOS/RHEL, openSuSE/SLES, Ubuntu
Posts: 1,005

Rep: Reputation: 370
So, I am assuming you want to "crawl" and "scrape" website(s). You can do that, but I doubt it will be a one-liner...

Here are some references (and there are many more if you search):
https://www.petergroom.com/index.php...ape-a-web-page
https://data36.com/web-scraping-tuto...age-with-bash/
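As a rough sketch of the idea (example.com is a placeholder, and real pages usually need more careful parsing than a grep):
Code:
# Minimal crawl sketch: fetch a page, extract its absolute links,
# then fetch each linked page in turn.
url="https://example.com/"

curl -s "$url" |
grep -oE 'href="http[^"#]+"' |
cut -d'"' -f2 |
while read -r link; do
    wget -q "$link"
done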
 
Old 03-25-2021, 02:51 PM   #4
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827
I don't know why you would want to do that, but the first step would be to get the hyperlinks.
Code:
url="https://www.amazon.com/s?k=linux&i=stripbooks-intl-ship&ref=nb_sb_noss"

agent="Mozilla/5.0 (Windows NT 10.1; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"

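# Fetch the page with a browser-like User-Agent; -k rewrites links in the saved file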
wget -k -U "$agent" "$url" -O myfile.html

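# Pull every absolute http/https URL out of the saved page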
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" myfile.html
And that is not everything, just what is written into the page source. Look at myfile.html: there are lots of server-side scripts on that page that render other pages.

You could intercept the requests that a web browser's engine makes and print them.

Scraping the page for some specific content would be easier and doable.
 
Old 03-27-2021, 01:45 AM   #5
n00b_noob
Member
 
Registered: Sep 2020
Posts: 436

Original Poster
Rep: Reputation: Disabled
Thank you.
I found this command:
Code:
$ curl -s URL | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2
It extracted all the links, but I need to change it: I only want the links that start with the string "/text/". How can I change the above command to show only those links?
 
Old 03-27-2021, 06:49 AM   #6
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
The easiest solution would be to append
Code:
grep '^/text/'
to the pipeline. That displays only the lines that start with /text/.
 
1 member found this post helpful.
Old 03-29-2021, 10:03 AM   #7
n00b_noob
Member
 
Registered: Sep 2020
Posts: 436

Original Poster
Rep: Reputation: Disabled
Thanks.
I used this command:
Code:
$ curl -s https://www.URL.com | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | grep '^/text/' > out.txt
The out.txt file includes the links that I wanted. How can I add "https://www.URL.com" to the beginning of each line in the out.txt file?
 
Old 03-29-2021, 05:46 PM   #8
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
Quote:
Originally Posted by n00b_noob View Post
The out.txt file includes the links that I wanted. How can I add "https://www.URL.com" to the beginning of each line in the out.txt file?
Add this to the pipeline:
Code:
sed 's|^|https://www.url.com|'
Or better, you can incorporate the grep into the sed:
Code:
sed -n 's|^/text/|https://www.url.com/text/|p'
With -n, sed prints only the lines where the substitution succeeded (that is what the trailing p flag does), so the separate grep is no longer needed. The | delimiter just avoids having to escape the slashes in the URL.

I suggest you read the sed guide. The link is in my signature.

Last edited by berndbausch; 03-29-2021 at 05:47 PM.
 
1 member found this post helpful.
Old 03-31-2021, 07:01 AM   #9
n00b_noob
Member
 
Registered: Sep 2020
Posts: 436

Original Poster
Rep: Reputation: Disabled
Thank you.

I did:
Code:
$ curl -s https://www.URL.com | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | grep '^/text/' | sed 's|^|https://www.URL.com|' > out.txt
$ wget -i out.txt -O /dev/null
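The wget step fetches every URL in out.txt and discards the responses (-O /dev/null), which is effectively the "click on every link" behaviour from the original question. The temporary file can also be skipped by piping straight into wget, a minimal variation of the same pipeline:
Code:
$ curl -s https://www.URL.com | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sed -n 's|^/text/|https://www.URL.com/text/|p' | wget -i - -O /dev/null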
 
  

