LinuxQuestions.org - Wget script for overnight downloads.

Wget script for overnight downloads.

I just finished hacking this together, testing as I built, and it seems to be working. I've got it running on 2 directories with 2 different input files; I'll see tomorrow whether it got all the files. I'm sharing it both to get suggestions on how to improve it and in case anyone finds it useful. I'm sure there is a way to simplify this.

Code:

until [[ $(date +%H) == 04 ]] ; do

                while read line ; do

                        if ! echo "$line" | grep '#' > /dev/null ; then

                                VAR1=$(echo "$line" | awk '{print $1}')

                                if echo "$line" | grep '"' > /dev/null ; then

                                        VAR2=$(echo "$line" | cut -d\" -f 2 | cut -d \" -f 1)

                                else

                                        VAR2=$(echo "$line" | awk '{print $2}')

                                fi

                                if [[ ! "$VAR2" ]] ; then

                                        exit 0

                                fi

                                cd "$1" && wget -O "$VAR2" "$VAR1"

                                sed -i "s/${VAR2}/${VAR2} #/" "$2"

                        fi

                done < "$2"

        done

The idea came about because I'm still figuring out how to scrape an html5 website. Combine that with needing to name the files a certain way and I ended up with this.

Input file layout

Code:

www.download.link "name of file.whatever"

www.download.link2 file.name

done

It would be nice to see an input file (how is it related to an html5 page?).
Use shellcheck to check your script (it will show you interesting comments).

your

Code:

echo $line | grep char

# can be replaced by

[[ $line =~ char ]]

# in most cases

which is much faster.

You can also use read -r host url, which gives you host and url immediately, so there is no need for awk/cut...

But there is room for other improvements too...
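
Both suggestions can be combined in a few lines. This is only a sketch, assuming the two-field input layout from the first post; the function name parse_line and the sample lines are made up for illustration.

```shell
#!/bin/bash
# parse_line is a hypothetical helper: it splits one input line into the
# download link and the output name without calling grep, awk, or cut.
parse_line() {
    local line=$1 link name
    # Skip lines that contain a '#' anywhere (the "already downloaded" mark).
    [[ $line =~ '#' ]] && return 1
    # read -r splits on whitespace: the first word goes to link, the rest
    # of the line (inner spaces included) goes to name.
    read -r link name <<< "$line"
    # Strip one pair of surrounding double quotes, if present.
    name=${name#\"}
    name=${name%\"}
    printf '%s|%s\n' "$link" "$name"
}

parse_line 'www.download.link "name of file.whatever"'
parse_line 'www.download.link2 file.name'
parse_line 'www.download.link3 file.name #' || echo "skipped"
```

The quoted '#' on the right-hand side of =~ is matched as a literal string, so no regex escaping is needed for it.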

The input isn't related to html5 at all; it's created manually. I still need to figure out how to scrape the site. I'll redo the script to fit your suggestion on $line.

VAR1 and VAR2 are not very descriptive.
Neither are $1 and $2.
There are no comments in the code and I don't understand what most of it does.
Starting with the first line: 'until [[ $(date +%H) == 04 ]]' ????

Code:

#!/bin/bash

# tadaen sylvermane | jason gibson



# initial build auto downloader for overnight run on crontab

set -x



if [[ "$2" ]] ; then

        while read line ; do

                # if '#' exists in line then skips to next line

                if [[ ! "$line" =~ '#' ]] ; then

                        # complete download link via copy paste from webpage

                        DL_LINK=$(echo "$line" | awk '{print $1}')

                        # desired output name of file. if file has spaces must be encapsulated

                        # with " marks same as in a regular terminal usage

                        if [[ "$line" =~ '"' ]] ; then

                                OPFILENAME=$(echo "$line" | cut -d "\"" -f 2 | cut -d "\"" -f 1)

                        else

                                OPFILENAME=$(echo "$line" | awk '{print $2}')

                        fi

                        # end of file should have single word. i'm using 'done' when reaching

                        # final line, exit clean

                        if [[ ! "$OPFILENAME" ]] ; then

                                exit 0

                        fi

                        # makes directory as needed

                        if [[ ! -d "$1" ]] ; then

                                mkdir -p "$1"

                        fi

                        # download current line to specified directory with proper name

                        wget -O "$1"/"$OPFILENAME" "$DL_LINK"

                        # add marker to determine if download of given line has been completed

                        # or not yet. idea here is if the list doesn't complete in a given night

                        # it will pick up where it left off without re-downloading the whole list

                        if [[ "$line" =~ '"' ]] ; then

                                sed -i "s/\"${OPFILENAME}\"/\"${OPFILENAME}\"#/" "$2"

                        else

                                sed -i "s/${OPFILENAME}/${OPFILENAME}#/" "$2"

                        fi

                fi

                case $(date +%H) in

                        04|05)

                                exit 0

                                ;;

                esac

        done < "$2"

else

        echo "usage ${0} (/download/target/path | filename)"

fi

OK, I thought it was working with my stop time (the until loop), but a full test revealed no dice. Above is what I ended up with: commented, and with more descriptive variable names. The case is there so that it has a 2-hour window; hopefully anything I download will finish within that timeframe.
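
The stop check has to sit inside the read loop, because a condition on an outer loop is only re-evaluated after a full pass over the input file. A minimal sketch of the window test on its own, assuming the 04|05 window from the script above; the function name in_stop_window is made up for illustration. Matching the hour as a string in a case also sidesteps bash arithmetic, where a value like 08 would be rejected as invalid octal.

```shell
#!/bin/bash
# in_stop_window is a hypothetical helper. It takes the hour as a two-digit
# string (the format printed by `date +%H`) and succeeds inside the window.
in_stop_window() {
    case $1 in
        04|05) return 0 ;;   # 04:00 through 05:59
        *)     return 1 ;;
    esac
}

# Typical use between two downloads:
if in_stop_window "$(date +%H)"; then
    echo "inside stop window, not starting another download"
fi
```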

It looks like you did not check it with shellcheck. I mentioned additional improvements that you probably missed; they're not really important.
You can also put the wget commands into the background; they will then run in parallel and need not wait for each other.
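
A rough sketch of the background idea. The fetch function is a stand-in so it runs without a network, and download_all is a made-up name; neither is part of the script above.

```shell
#!/bin/bash
# fetch is a stand-in so the sketch runs offline; in the real script this
# would be the wget call, e.g. wget -O "$dir/$name" "$link".
fetch() {
    sleep 1
    echo "fetched $1"
}

# download_all is a hypothetical helper: start every download in the
# background, then wait for all of them to finish.
download_all() {
    local link
    for link in "$@"; do
        fetch "$link" &    # & puts the job in the background
    done
    wait                   # returns once every background job has exited
}

download_all link1 link2 link3
```

The three lines can come out in any order. Starting everything at once may also saturate a slow connection, and the per-line # marker and the stop window would have to cope with downloads finishing out of order.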

I got the [[ "$line" =~ ]] one added. Not sure how to use it on the others; more research needed.

Code:

www.download.link,name of file.whatever

www.download.link2,file.name

done

Code:

#!/bin/bash



while IFS=',' read -r var1 var2

do

  now=$(date +%H)

  if [[ "$now" == "04" ]]; then

      echo "Times up"

      break

  fi

  if [[ "$var1" == "done" ]]; then

      echo "All Done"

      break

  fi



  echo "var1=$var1" 

  echo "var2=$var2"

  

done < "$2"

If I understand the basics of your program, here is a quick skeleton example.

I've seen IFS but never used it. Working with it now. Thank you.

Code:

#!/bin/bash

# tadaen sylvermane | jason gibson

# v 2.0

# automated downloader started by crontab in evening

# additional suggestions and help provided by,

# pan64, ondoho, michaelk from linuxquestions.org



if [[ "$2" ]] ; then

        # uses ` as field separator in input file

        while IFS='`' read -r dl_link op_name ; do

                # exits script if reaches final single field line

                if [[ ! "$op_name" ]] ; then

                        exit 0

                fi

                # if line isn't commented runs everything else to make the download

                if [[ ! "$op_name" =~ '#' ]] ; then

                        # this whole multi variable thing is the only way I could reason this out.

                        # if there is an easier way I will try to make adjustments.

                        # gets file name extension from the download link

                        dl_link_ext=$(basename "$dl_link" | cut -d. -f 2 )

                        # removes all punctuation, single spacing between words

                        # linux may not require file extensions. kodi does

                        op_name_edit=$(echo "$op_name" | sed -e 's/[[:punct:]]/ /g' -e 's/ \+/ /g' -e 's/\b\(.\)/\u\1/g' -e 's/[ \t]+$//')

                        # gets final word in expected file name to sed extension on the end

                        end_of_name=$(echo "$op_name_edit" | awk '{print $NF}')

                        # result of all above operations is file name formatted exactly how I like it

                        op_name_full=$(echo "$op_name_edit" | sed -e "s/${end_of_name}/${end_of_name}.${dl_link_ext}/" -e 's/ *$//')

                        # if required directory doesn't exist, creates it

                        if [[ ! -d "$1" ]] ; then

                                mkdir -p "$1"

                        fi

                        # download and mark files with # as downloaded in input file

                        wget -O "$1"/"$op_name_full" "$dl_link"

                        sed -i "s/${op_name}/${op_name}#/g" "$2"

                        # 2 hour window starting at 3am and ending at 4:59am where it will finish current download

                        # will not start another download during this window

                        # expected result is script will stop and not run again till fired again via crontab in the evening

                        # when run again it will iterate through lines until reaching line not having the # mark and starting

                        # there

                        case $(date +%H) in

                                03|04)

                                        exit 0

                                        ;;

                        esac

                fi

        done < "$2"

else

        echo "usage ${0} (/download/target/path | /input/file)"

fi

This should employ your suggestions and also give me some regex editing of the file name, so I can copy and paste directly from the website and move on.

Inputfile example

Code:

https://download-a.akamaihd.net/files/media_publication/a8/ebtv_E_01_r720P.mp4`Was the Universe Created?

done

After file downloaded

Code:

https://download-a.akamaihd.net/files/media_publication/a8/ebtv_E_01_r720P.mp4`Was the Universe Created?#

done

End filename result in folder

Code:

-rw-rw-r-- 1 jason jason 38401322 Apr 13 08:30 'Was The Universe Created.mp4'

I am grateful for the help. I'm sure this can be optimized and streamlined, but for now it's a huge step up from my first attempt at the top of the thread. The next step is to research how to parse html5 for what I want and script scraping all of it into a file in the above format, then combine the two.
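
The name cleanup can also be done with bash parameter expansion alone, without the sed/awk chain. This is only a sketch, assuming bash 4 or newer (for ${word^}) and a link that ends in the file extension with no query string; format_name is a made-up name.

```shell
#!/bin/bash
# format_name is a hypothetical helper: it builds the output file name from
# the title in the input file plus the extension of the download link.
format_name() {
    local title=$1 link=$2 ext words word out=
    ext=${link##*.}                      # text after the last dot in the link
    title=${title//[[:punct:]]/ }        # every punctuation char becomes a space
    read -r -a words <<< "$title"        # word splitting drops repeated spaces
    for word in "${words[@]}"; do
        out+="${word^} "                 # capitalize the first letter (bash 4+)
    done
    printf '%s.%s\n' "${out% }" "$ext"   # trim the trailing space, add extension
}

format_name 'Was the Universe Created?' \
    'https://download-a.akamaihd.net/files/media_publication/a8/ebtv_E_01_r720P.mp4'
# prints: Was The Universe Created.mp4
```

Unlike cut -d. -f 2 on the basename, ${link##*.} takes whatever follows the last dot, so a file name with extra dots in it still yields the real extension.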

Code:

#!/bin/bash



# check input parameters

[[ $2 ]] || { echo "usage ${0} (/download/target/path | /input/file)" >&2; exit 1; }



# process input file

while IFS='`' read -r dl_link op_name ; do

    [[ "$op_name" ]] || break  # end of while loop

    [[ "$op_name" =~ '#' ]] && continue # already marked, next line



    dlink_ext="${dl_link##*.}"



    ........



    # it is definitely wrong

    # you must not modify the file

    # which is currently read by this while loop

    sed -i "s/${op_name}/${op_name}#/g" "$2" 

    # instead you need to make a copy (before the while)

    # and run sed on that file

    # and when the while has completed replace the original



    ....

 

    case $(date +%H) in

        03|04) break;;

    esac



done



# here comes the replacement

I agree with pan64, you cannot edit a file that is already open, and to elaborate on the temp file suggestion:

Code:

tmp=$(mktemp) 

cp "$2" "$tmp" # copy list to temporary file

while IFS='`' read -r dl_link op_name

do

  [[ "$op_name" ]] || break  # end of while loop



  [[ "$op_name" == *"#" ]] && continue # skip lines already marked with a trailing #



  dlink_ext="${dl_link##*.}"

  

  .........



  sed -i "s/${op_name}/${op_name}#/g" "$2"



  now=$(date +%H)

  if [[ "$now" == "05" ]]; then

      echo "Times up"

      break

  fi

done < "$tmp"

rm "$tmp" # clean up temporary file
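
One caveat with the sed marker used in these examples: the title is pasted into the sed expression as a regular expression, so a / in it breaks the s/// command and characters like . or * match more than intended. A sketch of marking the line by exact string comparison instead, assuming the whole line (link, backtick, title) is known; mark_done and the sample line are made up for illustration.

```shell
#!/bin/bash
# mark_done is a hypothetical helper: append '#' to the line of list file $1
# whose full text equals $2. The comparison is a plain string match, so
# characters such as / . * & in the title need no escaping.
mark_done() {
    local list=$1 target=$2 tmp
    tmp=$(mktemp) || return 1
    # The target goes through the environment because awk -v would
    # interpret backslash escapes in the value.
    TARGET=$target awk '$0 == ENVIRON["TARGET"] { $0 = $0 "#" } { print }' \
        "$list" > "$tmp" && cat "$tmp" > "$list"
    rm -f "$tmp"
}

list=$(mktemp)
printf '%s\n' 'http://example.com/a.mp4`AC/DC: 50% off & more' 'done' > "$list"
mark_done "$list" 'http://example.com/a.mp4`AC/DC: 50% off & more'
cat "$list"
rm -f "$list"
```

Inside the read loop the target would be "${dl_link}\`${op_name}", i.e. the line exactly as it was read.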

OK, changes made. I'll run it tonight and see what comes out. Thank you all again.

OK, I did some more testing today and had to redo my sed syntax to get it all right. I managed to greatly simplify it in the process, and incorporated the syntax suggestions above in almost every place mentioned.

Code:

#!/bin/bash

# tadaen sylvermane | jason gibson

# v 2.1

# automated downloader started by crontab in evening

# additional suggestions and help provided by,

# pan64, ondoho, michaelk from linuxquestions.org

# most of sed sourced from various google searches



# begin script #



if [[ "$2" ]] ; then

        # create temp file to read from so i can edit source as needed to insert #

        # marks

        TEMPINPUT=$(mktemp) # /tmp/"$2".temp fails when $2 is a path with slashes

        cp "$2" "$TEMPINPUT"

        while IFS='`' read -r dl_link op_name ; do

                # exit while loop upon reaching last line in file which should be single

                # word or field

                [[ "$op_name" ]] || break

                # if $op_name has # mark anywhere it will move to next line in sequence

                [[ "$op_name" =~ '#' ]] && continue

                # gets filename extension from download link

                dl_link_ext=$(basename "$dl_link" | cut -d. -f 2 )

                # strips punctuation, capitalizes first letter all words, and changes all

                # spacing to single spaces. the first sed field is for the occasional

                # character as one of the many speakers on the videos has a foreign accent

                # last name. unsure of nationality. as i go i may need to add more of these

                # depending on speakers on individual video files. this is greatly 

                # simplified from v 2.0

                op_name_edit=$(echo "$op_name" \

                | sed -e 's/ö/o/g' -e 's/—/ /g' -e 's/[[:punct:]]//g' -e 's/\b\(.\)/\u\1/g'\

                | tr -s ' ')

                # adds filename extension and removes trailing spaces

                op_name_full=$(echo "$op_name_edit"."$dl_link_ext" | sed -e 's/ *$//')

                # mkdir as needed if not exist yet

                [[ -d "$1" ]] || mkdir -p "$1"

                wget -O "$1"/"$op_name_full" "$dl_link"

                # add hash mark to line in source file to prevent re-doing the same line

                # on next run

                sed -i "s/${op_name}/${op_name}#/g" "$2"

                # i opted for a case so I can give myself a bigger window depending on

                # internet speed at a given time. this will give 1 hour and 59 minutes to

                # finish the current download and run of script at the appropriate time

                case $(date +%H) in

                        03|04)

                                break

                                ;;

                esac

        done < "$TEMPINPUT"

        # remove temp input file copy as no longer needed

        rm "$TEMPINPUT"

else

        echo "usage ${0} (/download/target/path | /input/file)"

fi



# end script #

ok, so do you need any help now?