
LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Prevent wget to download index.html of 404 page (https://www.linuxquestions.org/questions/linux-newbie-8/prevent-wget-to-download-index-html-of-404-page-4175482479/)

unclesamcrazy 10-28-2013 10:12 AM

Prevent wget from downloading index.html of a 404 page
 
I use wget to download files. It works fine, but if a wrong path is entered, it lands on a 404 page and an index.html is saved.

I do not want it to create an index.html when a wrong path is entered.

Example: here is the mp4 video file link on the local server:
192.168.1.28/client_demo.mp4

But if someone enters
wget 192.168.1.28/client-demo.mp4
it saves an index.html file in his/her home directory, because there is no file client-demo.mp4; the actual file name is client_demo.mp4.

A script does this: the user enters only the path of a video and the script downloads the video file for them. But if the user enters a wrong path, it saves index.html in the user's home directory, which is embarrassing for us.

Please help: if the user enters a wrong path, it should not download index.html; instead, a message like "path not available" would be ideal.

Thanks.

pan64 10-29-2013 03:00 AM

I think this is decided on the server side: requests for invalid pages are automatically redirected. wget by itself will not recognize such invalid URLs.
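
One way to see what the server actually does with a bad path (a quick check using the example URL from the question; adjust it to your own server) is to ask wget for the response headers without saving anything:
Code:


# Print the HTTP response headers for the bad path without downloading.
# --spider: check only, do not save; -S: show the server response headers.
# wget writes this to stderr, so redirect with 2>&1 before grepping.
wget -S --spider 192.168.1.28/client-demo.mp4 2>&1 | grep 'HTTP/'

If the final status line is a 200 rather than a 404, the server is redirecting bad paths to an index page, and the client script has to detect that itself.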

zarglink 04-19-2018 10:52 PM

prevent download of 404 index
 
wget doesn't have a built-in feature that downloads hits and skips 404s, but you can use bash's string-manipulation features to read wget's output and decide which files to download and which ones to skip.

wget has a --spider option which simply tells you whether the file is present and does not download anything. In the code below we first use wget with --spider to check whether a 404 is returned; if it is, the script skips the download, and if it isn't, the script downloads the file by invoking wget without --spider.


Note that wget writes its output to stderr instead of stdout, so we redirect it with 2>&1.
After that we search the output for a relevant string; "404 Not" is part of what appears when you get a 404 error, and including the word "Not" prevents unrelated file names containing 404 from fooling the script.
We use grep with -q to check whether it finds 404: if it does, the script prints your "file not found" notice, and if it doesn't, it falls through to 'else' and downloads the video from the link.
Note that awk prints the sixth word of the matched wget output line, which happens to be '404', which is what we are looking for. This works on my system, though a newer version of wget may not format its output the same way, in which case the code would need a small adjustment.

I use a snippet exactly like this in a web scraper to fetch files with numerically incremented names, embedded in a while loop that increments the file name I'm looking for on each pass; my file names are set to variables. This may look complicated if you know nothing of bash, but it is pretty standard once you start using it.
If you are going to put this in a while loop and download incrementing file names, please add a sleep pause so you do not overload the server; it is also in your own interest, since a server that notices a flood of requests might otherwise block you. A sketch of such a loop follows the main snippet below.

code follows:
--------------------------------------

#!/bin/bash
# Only let wget download a file if it is actually present (no 404).

# --spider checks without downloading; wget writes to stderr, hence 2>&1.
RETURN=$(wget --spider 192.168.1.28/client_demo.mp4 2>&1 | grep '404 Not' | awk '{ print $6 }')

if echo "$RETURN" | grep -q 404
then
    echo "FILE NOT FOUND"
else
    wget 192.168.1.28/client_demo.mp4
fi
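
And just to sketch the while loop mentioned above (the base URL, the client_demo1.mp4/client_demo2.mp4 naming pattern, and the limit of 20 are made-up values for the example, not anything from the original question):
Code:


#!/bin/bash
# Sketch of the incrementing-download loop described above.
# Stops at the first missing file and pauses between requests.

BASE="192.168.1.28"
N=1
while [ "$N" -le 20 ]        # arbitrary upper limit for the example
do
    FILE="client_demo${N}.mp4"
    if wget --spider "$BASE/$FILE" 2>&1 | grep -q '404 Not'
    then
        echo "$FILE not found, stopping."
        break
    else
        wget "$BASE/$FILE"
    fi
    sleep 5                  # be polite to the server between downloads
    N=$((N + 1))
done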

AwesomeMachine 04-19-2018 11:56 PM

Try this:
Code:


#!/bin/bash

# Usage: if [[ $(check_link "$url") ]]; then wget 192.168.1.28/client_demo.mp4; else echo "does not exist"; fi


# Echoes "true" when the server answers the --spider request with "200 OK".
function check_link(){
  if [[ $(wget -S --spider "$1" 2>&1 | grep 'HTTP/1.1 200 OK') ]]; then echo "true"; fi
}

Just figure out how you want to present $url.
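
For example, dropped into the kind of script described in the question, it might look like this (the read prompt and the $url variable are just one way to wire it up, assuming the server answers with an "HTTP/1.1 200 OK" status line as the function expects):
Code:


#!/bin/bash

# Echoes "true" when the URL answers the --spider request with 200 OK.
function check_link(){
  if [[ $(wget -S --spider "$1" 2>&1 | grep 'HTTP/1.1 200 OK') ]]; then echo "true"; fi
}

read -p "Enter the video URL: " url

if [[ $(check_link "$url") ]]; then
    wget "$url"
else
    echo "path not available"
fi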

