LinuxQuestions.org
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 07-17-2020, 08:10 AM   #1
orangepeel190
Member
 
Registered: Aug 2016
Posts: 69

Rep: Reputation: Disabled
Find if file available on Webpage


Hi There,

I am trying to find out whether a file (typically an .mp3) is available on a particular website.

If it is available, I want to run a command; if it is not available, either exit or run a different command.

I have had a go, but I'm not sure where I am going wrong. I am hoping to write this in bash, as the other files I am running are in bash - keeping it simple (if possible)...

Code:
file="music.mp3"
url="http://www.some.site/here"


curl -s $url/$file | grep 404

  
if [ -f $file ]; then
    echo " File -> $file <- FOUND!"
    Run_download_script_here
else
    echo " File -> $file <- Not found!"
fi
exit 0
The above is not displaying the desired result. I simply want to check that the file is available on the website (for downloading).

I am getting lost with the grep section, and possibly with the if statement checking whether the file shows up on the website. Am I required to download the file, or is there a command to make sure the file is there without downloading it?

Hope that makes sense...?

Thank you very much
Cheers
 
Old 07-17-2020, 08:16 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,123

Rep: Reputation: 7371
You probably need to add -w '%{http_code}' to curl.
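A minimal sketch of that idea (the URL and file name are the thread's placeholders, not a real site): ask curl for only the numeric status code, discard the body, and branch on the result. The reporting is factored into a small function so the branching logic can be seen without a live server.

```shell
#!/bin/sh
# Sketch of the -w '%{http_code}' approach; URL/file names are placeholders.

report_status() {
    # $1 = HTTP status code, $2 = file name
    if [ "$1" = "200" ]; then
        echo "File -> $2 <- FOUND!"
    else
        echo "File -> $2 <- Not found! (HTTP $1)"
    fi
}

url="http://www.some.site/here"
file="music.mp3"

# Live check (requires network): -o /dev/null discards the body,
# -w '%{http_code}' prints just the status code on stdout.
# status=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url/$file")
# report_status "$status" "$file"
```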
 
1 members found this post helpful.
Old 07-17-2020, 09:29 AM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,382
Blog Entries: 3

Rep: Reputation: 3773
Also, curl sends the fetched data to stdout, so the output from -w will be appended there unless you use -w '%{stderr} %{http_code}' to redirect it to stderr. However, I don't know how to juggle that so that the file is saved from stdout at the same time stderr gets piped to grep.

Another option would be to use wget which will return an error code in the event of a 404 HTTP status code or similar failure.

Code:
#!/bin/sh

file="music.mp3"
url="http://www.some.site/here"

if wget $url/$file; then
     echo " File -> $file <- FOUND!"
     Run_download_script_here
else
     echo " File -> $file <- Not found!" 
fi

exit 0
 
2 members found this post helpful.
Old 07-17-2020, 10:07 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,123

Rep: Reputation: 7371
curl -O <file> can be useful too
 
1 members found this post helpful.
Old 07-17-2020, 10:24 AM   #5
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,688

Rep: Reputation: Disabled
Quote:
Originally Posted by Turbocapitalist View Post
I don't know how to juggle that so that the file is saved from stdout at the same time stderr gets piped to grep.
It's sure possible, but rather complicated:
Code:
#!/bin/sh
{
  {
    (echo file; echo 404 >&2) |
      sed 's/^/stdout: /'
  } 2>&1 >&3 3>&- |
    sed 's/^/stderr: /'
} 3>&1

Last edited by shruggy; 07-17-2020 at 01:18 PM.
 
2 members found this post helpful.
Old 07-17-2020, 04:35 PM   #6
orangepeel190
Member
 
Registered: Aug 2016
Posts: 69

Original Poster
Rep: Reputation: Disabled
I am aiming to run the script to see if the file is present on the webpage, not necessarily to download the file in order to trigger a command.

With the stdout, I assume that would go after the curl line (with no output file)?

Let's say the file (the mp3) is found - will curl download the file, or simply output some data we can use to say Yes/No in the if statement?

I don't understand what the code in #5 is doing, or where it should go to make this happen. It looks like it's doing something when it sees a 404 error?
 
Old 07-17-2020, 05:17 PM   #7
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,765

Rep: Reputation: 2225
In the OP:
Code:
if [ -f $file ]
^^ Isn't this testing whether a file by the name contained in the variable (music.mp3) exists on the local disk, in the directory in which the script is running? It has nothing to do with whether a file by that name is found by curl...
 
1 members found this post helpful.
Old 07-17-2020, 05:35 PM   #8
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,688

Rep: Reputation: Disabled
Quote:
Originally Posted by orangepeel190 View Post
I am aiming to run the script to see if the file is present on the webpage, not necessarily downloading the file to trigger a command.
Then perhaps curl -f would suffice?
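A hedged sketch of how that could look (URL and file name are again the thread's placeholders): with --fail, curl exits non-zero on an HTTP error such as 404 or 509 instead of printing the error page, and --head requests only the response headers, so nothing is downloaded either way.

```shell
#!/bin/sh
# Sketch of curl --fail as an availability check; names are placeholders.

check_remote() {
    # $1 = full URL; thanks to --fail, curl exits non-zero on HTTP errors,
    # and --head means only headers are requested (no download).
    curl --silent --fail --head --max-time 10 "$1" >/dev/null
}

url="http://www.some.site/here"
file="music.mp3"

if check_remote "$url/$file"; then
    echo "File -> $file <- FOUND!"
else
    echo "File -> $file <- Not found!"
fi
```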

Quote:
Originally Posted by orangepeel190 View Post
I don’t understand what the code in #5 is doing and where it should go to try and make it happen?
It looks like it’s doing something when it sees a 404 error?
It separately evaluates standard output and standard error. Using your code from #1, it would be something like this:
Code:
#!/bin/sh
{
  {
    curl -w '%{stderr} %{http_code}' -s $url/$file >$file
  } 2>&1 >&3 3>&- | grep -q 404 && echo not found
} 3>&1
 
Old 07-17-2020, 05:36 PM   #9
orangepeel190
Member
 
Registered: Aug 2016
Posts: 69

Original Poster
Rep: Reputation: Disabled
Thanks scasey,

I was simply having an attempt at some scripting, rather than be “one of those people” that simply asks someone else to do all the work for them.

Yes, I am aware that -f checks whether the file is available on the local disk; maybe it was not the best thing to use, given that the aim is to see whether the file is available on the website. I was interested to see where the scripting in #5 would best go so I could give it a go.

The issue I am seeing is that the script could download an error message posing as $file, which the system will then see as a pass.

A classic example was this morning: I ran a script thinking it had downloaded the file, yet when I dug deeper, the file was named like the mp3 file but contained the message below (cat $file):

Quote:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>509 Bandwidth Limit Exceeded</TITLE>
</HEAD><BODY>
<H1>Bandwidth Limit Exceeded</H1>


The server is temporarily unable to service your
request due to the site owner reaching his/her
bandwidth limit. Please try again later.
</BODY></HTML>
The -f test resulted in "Yes, the file was downloaded", when it clearly was not the audio file. I am now having to add an extra conditional to the script to ensure the file is larger than, say, 1 MB: if the file is downloaded and larger than 1 MB, then "Success, the file was downloaded"; if not, delete the small file and error out - try again later.

It would be good not to have to download the file at all if it is not the correct file, or not even available... this has become a somewhat bigger problem than a simple download script.
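The size guard described above could be sketched like this (file name and 1 MB threshold are illustrative, not from a tested script): succeed only if the downloaded file exists and exceeds the threshold, otherwise delete it and report failure.

```shell
#!/bin/sh
# Sketch of the "is the download big enough to be the real mp3" guard.
# File name and threshold are illustrative placeholders.

min_bytes=1048576   # 1 MiB

check_size() {
    # $1 = path; succeed only if the file exists and is bigger than $min_bytes
    [ -f "$1" ] && [ "$(wc -c < "$1")" -gt "$min_bytes" ]
}

file="music.mp3"
if check_size "$file"; then
    echo "Success, the file was downloaded"
else
    rm -f "$file"   # discard the masquerading error page, if any
    echo "Error: file missing or too small - try again later"
fi
```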

Happy to try as many options as possible to get a functioning script and learn in the process.
Appreciate the feedback and assistance to my steep learning curve....

Last edited by orangepeel190; 07-17-2020 at 09:16 PM.
 
Old 07-17-2020, 05:47 PM   #10
orangepeel190
Member
 
Registered: Aug 2016
Posts: 69

Original Poster
Rep: Reputation: Disabled
Thanks shruggy,

I'll try to pop that into the script, as well as develop an IF/THEN statement on the size of the file, to try to filter for the correct file.

The curve ball presented itself this morning with the bandwidth message masquerading as the .mp3 file. I am thinking a size-comparison filter would best help tighten the check on the downloaded file, as typically the error files are smaller than 1 MB and the audio files on this site are larger than 3 MB.

I was hoping not to have to go through the download process and filter locally, but rather to have some funky script that looks at the size of the remote file to make sure it "appears" to be the correct file (by availability, name and size) and, if so, runs a command (either download or send an email) saying the correct remote file appears to be available.

I was thinking of dumping the HTML and using grep to pull out the file name or the 404 error, but that technique appears flawed after this morning's file masqueraded as the audio file while containing the "This server is unavailable...." message - hence now also having to compare file sizes.

I am hoping something like this is possible with some crafty but robust scripting and commands in Linux? I'm running it on a Raspberry Pi.
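The "look at the size of the remote file first" wish is doable with a HEAD request, since servers typically report the file size in a Content-Length header. A hedged sketch (placeholder URL/file/threshold), with the header parsing split into a function so it also works on saved header text:

```shell
#!/bin/sh
# Sketch: read Content-Length from a HEAD request instead of downloading.
# URL, file name and threshold are placeholders.

content_length() {
    # Reads HTTP headers on stdin, prints the Content-Length value (or 0).
    # n+0 forces a numeric result, which also drops any trailing CR.
    awk 'tolower($1) == "content-length:" { n = $2 } END { print n + 0 }'
}

min_bytes=1048576
url="http://www.some.site/here"
file="music.mp3"

# Live use (requires network):
# size=$(curl --silent --head --max-time 10 "$url/$file" | content_length)
# if [ "$size" -gt "$min_bytes" ]; then
#     echo "Remote file looks right ($size bytes) - safe to download"
# else
#     echo "Remote file missing or too small ($size bytes)"
# fi
```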
 
Old 07-18-2020, 01:05 AM   #11
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,382
Blog Entries: 3

Rep: Reputation: 3773
Quote:
Originally Posted by orangepeel190 View Post
I was thinking of dumping the HTML and using grep to pull out the file name or the 404 error, but that technique appears flawed after this morning's file masqueraded as the audio file while containing the "This server is unavailable...." message - hence now also having to compare file sizes.
Is that information visible in the response headers?

Code:
curl --silent --head ${url}/${file}
wget --quiet --server-response --output-document=/dev/null ${url}/${file}
 
1 members found this post helpful.
Old 07-18-2020, 01:16 AM   #12
orangepeel190
Member
 
Registered: Aug 2016
Posts: 69

Original Poster
Rep: Reputation: Disabled
Thanks for your response....
The website issue has since been resolved - it was a hosting issue this morning - so I cannot answer your question about what the server reported while it was having problems. I don't know why there was a bandwidth issue, but I sent an email, they confirmed they were having issues, and it appears to be resolved now (more bandwidth?).

I've checked the curl output, and it would appear best to run a check based on the header returning 404 or 200:

Bash Script

Code:
user:~/$ curl --silent —head $url/$bogus_file
HTTP/1.1 404 Not Found
Date: Sat, 18 Jul 2020 06:07:05 GMT
Server: Apache
Content-Type: text/html; charset=iso-8859-1

user:~/$ curl --silent --head $url/$file
HTTP/1.1 200 OK
Date: Sat, 18 Jul 2020 06:07:52 GMT
Server: Apache
Last-Modified: Fri, 17 Jul 2020 07:17:15 GMT
Accept-Ranges: bytes
Content-Length: 5281011
Content-Type: audio/mpeg
How can I grep the header (for either 200 or 404) to give me an IF/THEN/ELSE/FI option: run the download if 200 is returned, or error out if 404 is returned?

It seems like the webserver had a fit this morning, which prompted my additional question about file size.

Would something like this potentially work?

Code:
enquiry=$(curl -sLo --head /dev/null -w "%{$url/$file}\n" ${1})
if [[ $enquiry != 200 ]]; then
    echo "Success  ${enquiry} on ${1}"
    echo "Sending notification email the file is available...."
else
    echo "File is not available.... try again later"
fi
Would it be —head or -I (capital i (eye))

Or alternatively, enquiry != 404 (equating to error / not available)?

Close for a bash script?

Last edited by orangepeel190; 07-18-2020 at 01:34 AM.
 
Old 07-18-2020, 03:26 AM   #13
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,382
Blog Entries: 3

Rep: Reputation: 3773
Quote:
Originally Posted by orangepeel190 View Post
Would it be —head or -I (capital i (eye))

Or alternate for enquiry != 404 (equating to error or not available)

Close for a bash script?
Usually there are long options and short options, so with curl it is a matter of style, not substance, whether you use -I or --head. Mind the type of dashes, though: you need two plain hyphens (--), not one em-dash (—), in this particular case.

The conditional statements in shell scripting work on the exit codes of programs, or of piped chains of programs. So you could do,

Code:
if /usr/bin/true; then
        echo OK
else
        echo Not OK
fi

if /usr/bin/false; then
        echo OK
else
        echo Not OK
fi
And then

Code:
if curl --silent --head $url/$file | grep -q -P '^HTTP/[\d.]+\s200'; then
        echo OK
else
        echo Try later
fi
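An alternative sketch (not from the thread): instead of grepping the status line, have curl print the numeric code with -w '%{http_code}' and branch on that, which sidesteps differences in the status line such as "HTTP/1.1 200 OK" versus "HTTP/2 200". The decision logic is factored into a function; the URL and file name are the thread's placeholders.

```shell
#!/bin/sh
# Sketch of branching on the numeric status code; names are placeholders.

decide() {
    # $1 = HTTP status code as printed by curl -w '%{http_code}'
    case "$1" in
        200) echo "OK" ;;
        404) echo "Not found" ;;
        *)   echo "Try later (HTTP $1)" ;;
    esac
}

url="http://www.some.site/here"
file="music.mp3"

# Live check (requires network): the headers go to -o /dev/null,
# so only the status code from -w reaches stdout.
# decide "$(curl --silent --head -o /dev/null --max-time 10 -w '%{http_code}' "$url/$file")"
```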
 
1 members found this post helpful.
Old 07-18-2020, 04:08 AM   #14
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,123

Rep: Reputation: 7371
what about this?
Code:
curl -s --fail -w '%{http_code}\n' $url
 
Old 07-18-2020, 04:29 AM   #15
orangepeel190
Member
 
Registered: Aug 2016
Posts: 69

Original Poster
Rep: Reputation: Disabled
pan64 - that appears to download the file rather than just checking whether it's available... is there an alternative that checks via a curl command or result (potentially in the header) without downloading?

Last edited by orangepeel190; 07-18-2020 at 05:05 AM.
 
  

