LinuxQuestions.org


n00b_noob 03-25-2021 11:23 AM

Can I use "cURL" or "wget" to click on the links on a page?
 
Hello,
On a web page, I want to click on all the links. Can I use "cURL" or "wget" for this task?
I saw https://askubuntu.com/questions/6390...tiple-webpages and also found the "wget" command below:
Code:

$ wget -r -p -k http://website
But it downloads the whole website, and I just want to "click" all the links on a single page. For example, consider the https://www.amazon.com/s?k=linux&i=s...ref=nb_sb_noss URL: you can see a list of books on that page, and I want to use the cURL or wget tool to "click" on every book on that page.

Thank you.

crts 03-25-2021 12:28 PM

No, you cannot 'click' with wget or curl. You may be able to simulate a 'click' by transmitting the same data that the 'click' would have transmitted, but for that you would first need to analyze the traffic that an actual 'click' produces.

Besides, some websites are implemented almost entirely in JavaScript, and clicks may be processed entirely by JavaScript. Neither of those CLI tools can execute JavaScript.
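For illustration only, here is an untested sketch of what "replaying a click" with curl could look like. The endpoint and form fields are made up; you would get the real ones from your browser's network inspector:
Code:

# Hypothetical example: replay the POST request that a 'click' would have sent.
# The URL and the form fields below are placeholders, not a real API.
curl -s 'https://example.com/do-something' \
     -H 'Referer: https://example.com/list' \
     --data 'item=123&action=view'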

dc.901 03-25-2021 02:00 PM

So I am assuming you want to "crawl" and "scrape" website(s). You can do that, but I doubt it will be a one-liner...

Here are some references (and there are many more if you search):
https://www.petergroom.com/index.php...ape-a-web-page
https://data36.com/web-scraping-tuto...age-with-bash/

teckk 03-25-2021 02:51 PM

I don't know why you would want to do that, but the first step would be to get the hyperlinks.
Code:

url="https://www.amazon.com/s?k=linux&i=stripbooks-intl-ship&ref=nb_sb_noss"

agent="Mozilla/5.0 (Windows NT 10.1; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"

wget -k -U "$agent" "$url" -O myfile.html

grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" myfile.html

And that is not everything, just what is specified on the page. Look at myfile.html: there are lots of server-side scripts on that page that render other pages.

You could intercept the requests that a web browser's engine makes and print them.

Scraping for some content would be easier and do-able.
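If the goal really is to fetch ("click") every one of those links, an untested sketch along these lines might do it, reusing the $agent variable and myfile.html from the block above and assuming the extracted links are absolute URLs:
Code:

# Untested sketch: request each extracted link once and discard the output.
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" myfile.html | sort -u |
while read -r link; do
    wget -q -U "$agent" "$link" -O /dev/null
done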

n00b_noob 03-27-2021 01:45 AM

Thank you.
I found the command below:
Code:

$ curl URL 2>&1 | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2
It extracted all the links, but I need to change it. I just want the links that start with the string "/text/". How can I change the above command to show only the links that start with "/text/"?

berndbausch 03-27-2021 06:49 AM

The easiest solution would be to append
Code:

grep ^/text/
to the pipeline. That only displays those lines that start with /text/.
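That is, something like this (untested):
Code:

$ curl URL 2>&1 | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | grep ^/text/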

n00b_noob 03-29-2021 10:03 AM

Thanks.
I used the command below:
Code:

$ curl https://www.URL.com 2>&1 | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | grep ^/text/ > out.txt
The out.txt file includes the links that I wanted. How can I add "https://www.URL.com" to the beginning of each line in the out.txt file?

berndbausch 03-29-2021 05:46 PM

Quote:

Originally Posted by n00b_noob (Post 6235239)
The out.txt file includes the links that I wanted. How can I add "https://www.URL.com" to the beginning of each line in the out.txt file?

Add this to the pipeline:
Code:

sed 's|^|https://www.url.com|'
Or better, you can incorporate the grep into the sed:
Code:

sed -n 's|^/text/|https://www.url.com/text/|p'
I suggest you read the sed guide. Link is in my signature.
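To see why the -n/p combination replaces the grep, here is a small illustration with made-up input lines:
Code:

$ printf '/text/page1\n/other/skip\n/text/page2\n' | sed -n 's|^/text/|https://www.url.com/text/|p'
https://www.url.com/text/page1
https://www.url.com/text/page2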

n00b_noob 03-31-2021 07:01 AM

Thank you.

I did:
Code:

$ curl https://www.URL.com 2>&1 | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | grep ^/text/ | sed 's|^|https://www.URL.com|' > out.txt
$ wget -i out.txt -O /dev/null
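A possible refinement, if you do not need the intermediate file: wget can read URLs from standard input with "-i -", so something like this (untested) should do the same in one pipeline:
Code:

$ curl -s https://www.URL.com | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sed -n 's|^/text/|https://www.URL.com/text/|p' | wget -q -i - -O /dev/null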


