LinuxQuestions.org


pedropt 09-05-2021 01:02 PM

Open url and save page using terminal without x11 - Firefox or Chrome
 
Hello guys, is there any way, from the terminal without X11 loaded, to open a url and save it as a file, without opening the gui app in an X11 session?

I want to do it with Firefox or Chrome. I tried httpie, but it does not load the full webpage, only about 50% of it.

GentleThotSeaMonkey 09-05-2021 01:11 PM

wget or curl, or lynx maybe?
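For example, minimal sketches of each ("$url" standing in for the link, which is posted later in the thread):
Code:

# plain fetches; each writes the raw response to a file
wget -O page.html "$url"
curl -L -o page.html "$url"
lynx -source "$url" > page.html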

pedropt 09-05-2021 01:33 PM

wget and curl cannot get it; it's from dropbox.

Edit: curl can download the links from dropbox, but I want the full webpage only.

astrogeek 09-05-2021 01:46 PM

Curl or wget should be able to do it; with suitable options they can make the same request and receive the same response as the browser would.
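For example, a sketch of such "suitable options" (the useragent value here is illustrative):
Code:

# send a browser-like useragent so the server returns the same
# response it would give a real browser
curl -L -A 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0' -o page.html "$url"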

pedropt 09-05-2021 02:14 PM

Try it: https://www.dropbox.com/sh/erv1tyczt...HqxmYz_5a?dl=0

and if you can, then get me all the links to the wav files inside that folder, with whatever tool you think will work, over the terminal only.

Note: I don't need the links parsed from the webpage; I just want the webpage fully downloaded with all the wav file links inside. I can only get 30 of the 50, and that is with httpie, because with curl or wget it's impossible. I don't want to download the files; I just want the webpage downloaded as html, and complete.


Here is a simple script that does all of this. I used httpie as the tool that fetches the html from dropbox, but you can use any tool you want; just change the http call to whatever tool you prefer.
Code:

#!/bin/bash
# Clean up leftovers from a previous run.
rm -f out.file tmp.file
echo -n "Enter dropbox url : "
read -r url
# Reject direct-download links (?dl=1); require a shared-folder url.
if echo "$url" | grep -q '?dl=1'
then
    echo "Invalid Dropbox url"
    exit 1
elif ! echo "$url" | grep -q 'https://www.dropbox.com/sh/'
then
    echo "Invalid Dropbox url"
    exit 1
fi
echo -n "Retrieving links ...."
# httpie fetches the page; swap this call for any other fetch tool.
http "$url" -o tmp.file
# Pull every ...dl=0 share link out of the page and de-duplicate.
grep -Eo 'https?://[^"[:space:]]+dl=0' tmp.file | sort -u > out.file
a1=$(wc -l < out.file)
clear
echo "Got $a1 Links"
echo "--------------------------------------------------------------------------"
cat out.file
echo "--------------------------------------------------------------------------"
exit 0
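A sample run would look something like this (the script name is illustrative; 31 is the link count httpie actually returns for this folder, as reported later in the thread):
Code:

$ ./get-dropbox-links.sh
Enter dropbox url : https://www.dropbox.com/sh/erv1tyczt...HqxmYz_5a?dl=0
Got 31 Links
--------------------------------------------------------------------------
https://www.dropbox.com/sh/.../...file.wav?dl=0
...
--------------------------------------------------------------------------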


pan64 09-05-2021 02:18 PM

https://www.dropbox.com/install?os=lnx
https://superuser.com/questions/4706...g-wget-command

pedropt 09-05-2021 02:39 PM

Quote:

pan64 wrote:
https://www.dropbox.com/install?os=lnx
https://superuser.com/questions/4706...g-wget-command
I don't want to download the links; I just want to get the html where the links are. I think I was clear when I started the thread.

astrogeek 09-05-2021 04:39 PM

That page is delivered as, and loaded by, javascript bearing Micro$oft copyright notices, but apparently it checks the useragent string to decide whether to deliver just the page or to helpfully push the whole archive down the user's pipe...

Use wget with suitable options, in this case the -U option and your browser's useragent string, and it will give you just the page without the files. Note that the page you get will also be obnoxiously delivered as javascript, from which you will have to extract the links, but the links are there.

Anticipating your next question, "How can I get just the links as HTML without the javascript?"... probably use a different storage platform. I think you will need some post processing of the "page" (i.e. downloaded script) to extract those links.
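For example, a sketch of that post-processing (useragent abbreviated; the grep pattern assumes the ?dl=0 share-link format used above):
Code:

# fetch the page as a browser would, then pull the share links
# out of the javascript it delivers
wget -U 'Mozilla/5.0 ...your browser ua string...' "$url" -O page.html
grep -Eo 'https?://[^"[:space:]]+dl=0' page.html | sort -u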

pedropt 09-05-2021 05:00 PM

Quote:

That page is loaded with javascript bearing Micro$oft copyright notices, but apparently it checks the useragent string to decide whether to deliver just the page or to helpfully push the whole archive down the user's pipe...

Use wget with suitable options, in this case the -U option and your browser's useragent string, and it will give you just the page without the files. Note that the page you get will also be obnoxiously delivered as javascript, from which you will have to extract the links, but the links are there.
Did you actually try it, or is it just a guess?

I just tested it here and got the same output as with httpie:
Quote:

wget -U "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36" "$url" -O tmp.file

astrogeek 09-05-2021 05:07 PM

I just tried it as a guess (after looking at the page source in my browser) and I got the page.

Code:

wget 'https://www.dropbox.com/sh/erv1tycztizfvyd/AADeXwemV9sK37MSHqxmYz_5a?dl=0' -O page.html -U 'Mozilla...my ua string here'
The page as delivered is 482471 bytes of script from which the page is to be rendered via javascript.

Without the -U option I get approx 11MB of zipped archive, all those wav files.

ondoho 09-06-2021 01:19 AM

The problem is javascript - wget, curl & co. do not handle it - only browsers and dedicated tools (e.g. phantomjs) do.

You can try using your browser in headless mode (with firefox, simply add --headless to the command). If that doesn't work, you'll have to resort to phantomjs or some python modules (beautifulsoup, if memory serves) etc.; some rudimentary coding, and some web searching, will be required.
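For instance, headless chrome/chromium can execute the page's javascript and print the rendered DOM with no X11 session at all (a sketch; the binary may be named chromium, chromium-browser or google-chrome depending on the distro):
Code:

# render the page headlessly and dump the javascript-built DOM
chromium --headless --disable-gpu --dump-dom \
    'https://www.dropbox.com/sh/erv1tycztizfvyd/AADeXwemV9sK37MSHqxmYz_5a?dl=0' > page.html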

pedropt 09-06-2021 12:26 PM

Running it from a shell, I got this:
Quote:

qt.qpa.xcb: could not connect to display
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
But after starting an X11 session and running it in a terminal, I got the same output as with wget and httpie, which is 31 links instead of 50.

ondoho 09-07-2021 12:21 AM

Quote:

Originally Posted by pedropt (Post 6282164)
Running it from a shell, I got this:


But after starting an X11 session and running it in a terminal, I got the same output as with wget and httpie, which is 31 links instead of 50.

What "it"???

pedropt 09-07-2021 12:27 PM

IT =
However, this javascript route is out of the question because it requires an X11 session open, and I want everything to work over a shell.
I opened a ticket with the httpie developers on github; maybe they will find a way to work around this issue.

