LinuxQuestions.org


jmgibson1981 04-23-2020 12:40 AM

Wget script for overnight downloads.
 
I just finished hacking this together. I tested as I built, and it seems to be working. I've got it running on 2 directories with 2 different input files, and I'll see tomorrow whether all the files downloaded. I'm sharing it both to get suggestions on how to improve it and in case anyone finds it useful. I'm sure there is a way to simplify this.

Code:

until [[ $(date +%H) == 04 ]] ; do
                while read line ; do
                        if ! echo "$line" | grep '#' > /dev/null ; then
                                VAR1=$(echo "$line" | awk '{print $1}')
                                if echo "$line" | grep '"' > /dev/null ; then
                                        VAR2=$(echo "$line" | cut -d\" -f 2 | cut -d \" -f 1)
                                else
                                        VAR2=$(echo "$line" | awk '{print $2}')
                                fi
                                if [[ ! "$VAR2" ]] ; then
                                        exit 0
                                fi
                                cd "$1" && wget -O "$VAR2" "$VAR1"
                                sed -i "s/${VAR2}/${VAR2} #/" "$2"
                        fi
                done < "$2"
        done

The reason for this approach is that I'm still figuring out how to scrape an HTML5 website. Combine that with needing to name the files a certain way, and I ended up with this.

Input file layout

Code:

www.download.link "name of file.whatever"
www.download.link2 file.name
done


pan64 04-23-2020 12:54 AM

It would be nice to see an input file (how is it related to an HTML5 page?).
Use shellcheck to check your script (it will show you some interesting comments).

Your
Code:

echo $line | grep char
# can be replaced by
[[ $line =~ char ]]
# in most cases

which is much faster.

Also, you can use read -r host url,
which gives you host and url immediately, so there is no need for awk/cut.

But there is room for other improvements too...
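
A minimal sketch of the two suggestions above, assuming the "url name" input layout from the first post. The function name parse_line is made up for this illustration and is not part of anyone's script. It uses a [[ ]] pattern test instead of echo | grep, lets read -r do the splitting, and strips optional surrounding quotes with parameter expansion:

```shell
#!/bin/bash
# Hypothetical helper: split one input line into URL and output name.
# Prints the URL and the name on separate lines; returns 1 for lines
# that already carry a '#' marker.
parse_line() {
    local line=$1 url name
    # Pattern match inside [[ ]] replaces: echo "$line" | grep '#'
    [[ $line == *'#'* ]] && return 1
    # First word goes to url, the rest of the line goes to name
    read -r url name <<< "$line"
    # Drop one leading and one trailing double quote, if present
    name=${name#\"}
    name=${name%\"}
    printf '%s\n%s\n' "$url" "$name"
}

parse_line 'www.download.link "name of file.whatever"'
```

Because read assigns everything after the first word to the last variable, a name with spaces survives without awk or cut.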

jmgibson1981 04-23-2020 01:14 AM

The input isn't related to HTML5 at all; it's created manually. I still need to figure out how to scrape the site. I'll redo the $line handling to fit your suggestion.

ondoho 04-23-2020 06:31 AM

VAR1 and VAR2 are not very descriptive.
Neither are $1 and $2.
There are no comments in the code and I don't understand what most of it does.
Starting with the first line: 'until [[ $(date +%H) == 04 ]]' ????

jmgibson1981 04-23-2020 10:31 AM

Code:

#!/bin/bash
# tadaen sylvermane | jason gibson

# initial build auto downloader for overnight run on crontab
set -x

if [[ "$2" ]] ; then
        while read line ; do
                # if '#' exists in line then skips to next line
                if [[ ! "$line" =~ '#' ]] ; then
                        # complete download link via copy paste from webpage
                        DL_LINK=$(echo "$line" | awk '{print $1}')
                        # desired output name of file. if file has spaces must be encapsulated
                        # with " marks same as in a regular terminal usage
                        if [[ "$line" =~ '"' ]] ; then
                                OPFILENAME=$(echo "$line" | cut -d "\"" -f 2 | cut -d "\"" -f 1)
                        else
                                OPFILENAME=$(echo "$line" | awk '{print $2}')
                        fi
                        # end of file should have single word. i'm using 'done' when reaching
                        # final line, exit clean
                        if [[ ! "$OPFILENAME" ]] ; then
                                exit 0
                        fi
                        # makes directory as needed
                        if [[ ! -d "$1" ]] ; then
                                mkdir -p "$1"
                        fi
                        # download current line to specified directory with proper name
                        wget -O "$1"/"$OPFILENAME" "$DL_LINK"
                        # add marker to determine if download of given line has been completed
                        # or not yet. idea here is if the list doesn't complete in a given night
                        # it will pick up where it left off without re-downloading the whole list
                        if [[ "$line" =~ '"' ]] ; then
                                sed -i "s/\"${OPFILENAME}\"/\"${OPFILENAME}\"#/" "$2"
                        else
                                sed -i "s/${OPFILENAME}/${OPFILENAME}#/" "$2"
                        fi
                fi
                case $(date +%H) in
                        04|05)
                                exit 0
                                ;;
                esac
        done < "$2"
else
        echo "usage ${0} (/download/target/path | filename)"
fi

OK, I thought it was working with my stop time (the until loop), but a full test revealed no dice. Above is what I ended up with: commented, and with more descriptive variable names. The case is there so that it has a 2 hour window. Hopefully anything I download will finish within that timeframe.
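
A likely reason the until loop never stopped anything is that its condition is only evaluated between complete passes of the inner while loop, so it cannot interrupt a list that is still being worked through. A minimal sketch of a per-line check follows. The names in_stop_window and process_list are hypothetical, and the printf line stands in for the real wget call:

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the given hour (default: the current
# hour) falls inside the stop window used in the script above (04:00-05:59).
in_stop_window() {
    local hour=${1:-$(date +%H)}
    case $hour in
        04|05) return 0 ;;
        *)     return 1 ;;
    esac
}

# Reads entries from stdin and checks the clock once per line, so the loop
# can stop between downloads instead of only after the whole list.
process_list() {
    local line
    while read -r line; do
        in_stop_window && break
        printf 'would download: %s\n' "$line"   # placeholder for wget
    done
}
```

A download that is already running is not interrupted; the check only prevents the next one from starting.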

pan64 04-23-2020 11:58 AM

It looks like you did not check it with shellcheck. I suggested additional improvements that you probably missed; they are not really important, though.
You can also put the wget commands into the background; in that case they will run in parallel and need not wait for each other.
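
A rough sketch of the background idea, assuming the simple "url name" layout without quoted names. The functions download and fetch_all are invented for this example. Note that it starts every download at once, which may not be what you want on a slow link:

```shell
#!/bin/bash
# Hypothetical wrapper around wget: $1 = URL, $2 = output path.
download() {
    wget -q -O "$2" "$1"
}

# Start one background download per "url name" line, then wait for all.
# $1 = target directory, $2 = input file.
fetch_all() {
    local dir=$1 list=$2 url name
    while read -r url name; do
        # Skip the final single-word line and lines already marked with '#'
        [[ -z $name || $name == *'#'* ]] && continue
        download "$url" "$dir/$name" &
    done < "$list"
    wait    # block until every background job has finished
}
```

Called as fetch_all /download/target/path /input/file, it returns only after the last background job has exited.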

jmgibson1981 04-23-2020 12:08 PM

I got the [[ "$line" =~ ]] one added. I'm not sure how to use it on the others. More research needed.

michaelk 04-23-2020 01:03 PM

Code:

www.download.link,name of file.whatever
www.download.link2,file.name
done

Code:

#!/bin/bash

while IFS=',' read -r var1 var2
do
  now=$(date +%H)
  if [[ "$now" == "04" ]]; then
      echo "Times up"
      break
  fi
  if [[ "$var1" == "done" ]]; then
      echo "All Done"
      break
  fi

  echo "var1=$var1"
  echo "var2=$var2"
 
done < "$2"

If I understand the basics of your program, the above is a quick skeleton example, together with the comma-separated input layout it expects.

jmgibson1981 04-23-2020 05:46 PM

I've seen IFS but never used it. Working with it now. Thank you.

jmgibson1981 04-23-2020 10:00 PM

Code:

#!/bin/bash
# tadaen sylvermane | jason gibson
# v 2.0
# automated downloader started by crontab in evening
# additional suggestions and help provided by,
# pan64, ondoho, michaelk from linuxquestions.org

if [[ "$2" ]] ; then
        # uses ` as field separator in input file
        while IFS='`' read -r dl_link op_name ; do
                # exits script if reaches final single field line
                if [[ ! "$op_name" ]] ; then
                        exit 0
                fi
                # if line isn't commented runs everything else to make the download
                if [[ ! "$op_name" =~ '#' ]] ; then
                        # this whole multi variable thing is the only way I could reason this out.
                        # if there is an easier way I will try to make adjustments.
                        # gets file name extension from the download link
                        dl_link_ext=$(basename "$dl_link" | cut -d. -f 2 )
                        # removes all punctuation, single spacing between words
                        # linux may not require file extensions. kodi does
                        op_name_edit=$(echo "$op_name" | sed -e 's/[[:punct:]]/ /g' -e 's/ \+/ /g' -e 's/\b\(.\)/\u\1/g' -e 's/[ \t]+$//')
                        # gets final word in expected file name to sed extension on the end
                        end_of_name=$(echo "$op_name_edit" | awk '{print $NF}')
                        # result of all above operations is file name formatted exactly how I like it
                        op_name_full=$(echo "$op_name_edit" | sed -e "s/${end_of_name}/${end_of_name}.${dl_link_ext}/" -e 's/ *$//')
                        # if required directory doesn't exist, creates it
                        if [[ ! -d "$1" ]] ; then
                                mkdir -p "$1"
                        fi
                        # download and mark files with # as downloaded in input file
                        wget -O "$1"/"$op_name_full" "$dl_link"
                        sed -i "s/${op_name}/${op_name}#/g" "$2"
                        # 2 hour window starting at 3am and ending at 4:59am where it will finish current download
                        # will not start another download during this window
                        # expected result is script will stop and not run again till fired again via crontab in the evening
                        # when run again it will iterate through lines until reaching line not having the # mark and starting
                        # there
                        case $(date +%H) in
                                03|04)
                                        exit 0
                                        ;;
                        esac
                fi
        done < "$2"
else
        echo "usage ${0} (/download/target/path | /input/file)"
fi

This should employ your suggestions, and it also does some regex editing of the file name so I can copy and paste directly from the website and move on.

Inputfile example

Code:

https://download-a.akamaihd.net/files/media_publication/a8/ebtv_E_01_r720P.mp4`Was the Universe Created?
done

After file downloaded

Code:

https://download-a.akamaihd.net/files/media_publication/a8/ebtv_E_01_r720P.mp4`Was the Universe Created?#
done

End filename result in folder

Code:

-rw-rw-r-- 1 jason jason 38401322 Apr 13 08:30 'Was The Universe Created.mp4'

I am grateful for the help. I'm sure this can be optimized and streamlined, but for now it's a huge step up from the first attempt I posted. The next step is to research how to parse HTML5 for what I want and to script scraping all of it into a file in the above format. Then I'll combine the two.
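
For comparison, here is a sketch of the same file name clean-up done with bash parameter expansion only, assuming bash 4 or newer for the ${word^} capitalisation. The function name build_name is hypothetical. It takes the extension as everything after the last dot of the link, which would not be right for a URL that ends in a query string:

```shell
#!/bin/bash
# Hypothetical helper: $1 = download link, $2 = title as pasted from the site.
# Prints the cleaned-up file name.
build_name() {
    local url=$1 title=$2 ext word out=()
    ext=${url##*.}                   # text after the last dot of the link
    title=${title//[[:punct:]]/ }    # replace punctuation with spaces
    for word in $title; do           # unquoted on purpose: split on whitespace
        out+=("${word^}")            # capitalise the first letter of each word
    done
    # Joining with "${out[*]}" leaves exactly one space between words
    printf '%s.%s\n' "${out[*]}" "$ext"
}

build_name 'https://download-a.akamaihd.net/files/media_publication/a8/ebtv_E_01_r720P.mp4' 'Was the Universe Created?'
# prints: Was The Universe Created.mp4
```

Leaving $title unquoted in the for loop is safe here only because all punctuation, including the glob characters, has already been replaced with spaces.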

pan64 04-24-2020 01:27 AM

Code:

#!/bin/bash

# check input parameters
[[ $2 ]] || { echo "usage ${0} (/download/target/path | /input/file)" >&2; exit 1; }

# process input file
while IFS='`' read -r dl_link op_name ; do
    [[ "$op_name" ]] || break  # end of while loop
    [[ "$op_name" =~ '#' ]] && continue # skip lines already marked with #

    dlink_ext="${dl_link##*.}"  # extension comes from the link, not the title

    ........

    # it is definitely wrong
    # you must not modify the file
    # which is currently read by this while loop
    sed -i "s/${op_name}/${op_name}#/g" "$2"
    # instead you need to make a copy (before the while)
    # and run sed on that file
    # and when the while has completed replace the original

    ....
 
    case $(date +%H) in
        03|04) break;;
    esac

done

# here comes the replacement


michaelk 04-24-2020 08:23 AM

I agree with pan64: you should not edit a file while the loop is still reading it. To elaborate on the temp file suggestion:
Code:

tmp=$(mktemp)
cp "$2" "$tmp" # copy list to temporary file
while IFS='`' read -r dl_link op_name
do
  [[ "$op_name" ]] || break  # end of while loop

  [[ "$op_name" == *"#" ]] && continue # skip lines already marked with a trailing #

  dlink_ext="${dl_link##*.}"  # extension comes from the link, not the title
 
  .........

  sed -i "s/${op_name}/${op_name}#/g" "$2"

  now=$(date +%H)
  if [[ "$now" == "05" ]]; then
      echo "Times up"
      break
  fi
done < "$tmp"
rm "$tmp" # clean up temporary file
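
One caveat about the sed line in the loop: ${op_name} is used there as a regular expression, so a title containing / or other characters special to sed can break the substitution or match the wrong text. A minimal sketch of one alternative is to append the # by line number instead. The names mark_done and lineno are made up for this illustration, and it assumes the loop counts lines as it reads them:

```shell
#!/bin/bash
# Hypothetical helper: append '#' to line number $1 of file $2.
# The title is never interpreted as a pattern, so any characters are safe.
mark_done() {
    local lineno=$1 file=$2 tmp
    tmp=$(mktemp) || return 1
    # Rewrite the list into a temporary file, changing only the chosen line,
    # then copy the result back over the original
    awk -v n="$lineno" 'NR == n { $0 = $0 "#" } { print }' "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

In the loop this would look like lineno=$((lineno + 1)) as the first statement of every iteration (before any continue), followed by mark_done "$lineno" "$2" after a successful download.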


jmgibson1981 04-24-2020 02:07 PM

OK, changes made. I'll run it tonight and see what comes out. Thank you all again.

jmgibson1981 04-24-2020 11:15 PM

OK, I did some more testing today and had to redo my sed syntax to get it all right. I managed to greatly simplify it in the process, and incorporated the syntax suggestions above in almost every place mentioned.

Code:

#!/bin/bash
# tadaen sylvermane | jason gibson
# v 2.1
# automated downloader started by crontab in evening
# additional suggestions and help provided by,
# pan64, ondoho, michaelk from linuxquestions.org
# most of sed sourced from various google searches

# begin script #

if [[ "$2" ]] ; then
        # create temp file to read from so i can edit source as needed to insert #
        # marks
        # mktemp avoids a broken path when $2 contains directories
        TEMPINPUT=$(mktemp)
        cp "$2" "$TEMPINPUT"
        while IFS='`' read -r dl_link op_name ; do
                # exit while loop upon reaching last line in file which should be single
                # word or field
                [[ "$op_name" ]] || break
                # if $op_name has # mark anywhere it will move to next line in sequence
                [[ "$op_name" =~ '#' ]] && continue
                # gets filename extension from download link
                dl_link_ext=$(basename "$dl_link" | cut -d. -f 2 )
                # strips punctuation, capitalizes first letter all words, and changes all
                # spacing to single spaces. the first sed field is for the occasional
                # character as one of the many speakers on the videos has a foreign accent
                # last name. unsure of nationality. as i go i may need to add more of these
                # depending on speakers on individual video files. this is greatly
                # simplified from v 2.0
                op_name_edit=$(echo "$op_name" \
                | sed -e 's/ö/o/g' -e 's/—/ /g' -e 's/[[:punct:]]//g' -e 's/\b\(.\)/\u\1/g'\
                | tr -s ' ')
                # adds filename extension and removes trailing spaces
                op_name_full=$(echo "$op_name_edit"."$dl_link_ext" | sed -e 's/ *$//')
                # mkdir as needed if not exist yet
                [[ -d "$1" ]] || mkdir -p "$1"
                wget -O "$1"/"$op_name_full" "$dl_link"
                # add hash mark to line in source file to prevent re-doing the same line
                # on next run
                sed -i "s/${op_name}/${op_name}#/g" "$2"
                # i opted for a case so I can give myself a bigger window depending on
                # internet speed at a given time. this will give 1 hour and 59 minutes to
                # finish the current download and run of script at the appropriate time
                case $(date +%H) in
                        03|04)
                                break
                                ;;
                esac
        done < "$TEMPINPUT"
        # remove temp input file copy as no longer needed
        rm "$TEMPINPUT"
else
        echo "usage ${0} (/download/target/path | /input/file)"
fi

# end script #


pan64 04-25-2020 02:33 AM

ok, so do you need any help now?


All times are GMT -5.