LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   How can I purge files logarithmically by modification date (BASH)? (https://www.linuxquestions.org/questions/programming-9/how-can-i-purge-files-logarithmically-by-modification-date-bash-923914/)

gunnarflax 01-15-2012 12:58 PM

How can I purge files logarithmically by modification date (BASH)?
 
Hi!

Today I do backups regularly but purge backups older than a specific date. What I would like is to keep all files from the last two days, one file per day from the last week, one file per week for the last month, one file per month for the last year, and one file for every year.

I don't fully understand what logic I should implement to achieve something like this. Can anyone give me pointers on how to implement it, and maybe suggest packages that could help?

What I have achieved so far is this:

Code:

smart_rm ()
{
        #If no target directory has been specified, exit
        if [ -z "$1" ]; then
                echo "$ISO_DATETIME [ERROR]: You must specify a directory to clean."
                return 1
        fi

        local TRGT_DIR=$1

        #Target must be a directory
        if [ ! -d "$TRGT_DIR" ]; then
                echo "$ISO_DATETIME [ERROR]: The target must exist and be a directory."
                return 1
        fi

        #Make sure that the path ends with /
        if [ "${TRGT_DIR#${TRGT_DIR%?}}" != "/" ]; then
                TRGT_DIR="${TRGT_DIR}/"
        fi

        #Select and sort all files
        local FILES

        for i in $(ls -t $TRGT_DIR)
        do
                FILES=(${FILES[@]} "${TRGT_DIR}${i}")
        done

        #Delete files
        local FILES_TO_KEEP
        local FILES_FROM_LAST_WEEK

        for i in "${FILES[@]}"
        do
                local MOD_DATE=$(stat -c %y "$i")
                MOD_DATE=$(date -d "${MOD_DATE:0:10}" +%s)

                #If file has been modified within two days we save it
                if [ "$MOD_DATE" -gt "$(date -d "2 days ago" +%s)" ]; then
                        FILES_TO_KEEP=("${FILES_TO_KEEP[@]}" "$i")
                fi

#WHAT NOW?!?!

        done
}

Thanks!

rigor 01-15-2012 11:27 PM

Have you looked at the -exec, -mtime and similar options of the find command?

CollieJim 01-16-2012 12:16 AM

Some possible logic:
Code:

BASENAME = SomeFileName


if day of month == 1
  BASENAME = $BASENAME + "_MO"
if day of month == 1  and  month == 6  then
  BASENAME = $BASENAME + "_AN"
if day of week == MONDAY
  BASENAME = $BASENAME + "_WK"

do backups

for each backup file
  if *AN*
      continue
  else if *MO*
      if older than 1 year
          delete
      fi
      continue
  else if *WK*
      if older than 30 days
          delete
      fi
      continue
  else
      if older than 7 days
          delete
      fi


devUnix 01-16-2012 12:40 AM

CollieJim's pseudo code looks good. You have to decide on Day of Week, Day of Month, Month of Year values, etc.

gunnarflax 01-16-2012 03:38 AM

CollieJim's pseudo code looks promising, though I don't fully understand how it would be implemented. In the first part, are you suggesting I should modify the filenames of the files?

Code:

if day of month == 1
  BASENAME = $BASENAME + "_MO"
if day of month == 1  and  month == 6  then
  BASENAME = $BASENAME + "_AN"
if day of week == MONDAY
  BASENAME = $BASENAME + "_WK"

Since I can't set a filename before the backup has run (the script needs to purge backups that already exist), I don't understand how I would be able to select the files. My problem is that I don't understand how to select files within a certain time period and keep only one of them.

With this snippet I can sort the files on their modification date:

Code:

#Select and sort all files
local FILES

for i in $(ls -t $TRGT_DIR)
do
        FILES=(${FILES[@]} "${TRGT_DIR}${i}")
done

But I still get all files when I just want one for every day, one for every week, etc. So my first attempt was to try to filter these afterwards:

Code:

#Delete files
local FILES_TO_KEEP
local FILES_FROM_LAST_WEEK

for i in "${FILES[@]}"
do
        local MOD_DATE=$(stat -c %y "$i")
        MOD_DATE=$(date -d "${MOD_DATE:0:10}" +%s)

        #If file has been modified within two days we save it
        if [ "$MOD_DATE" -gt "$(date -d "2 days ago" +%s)" ]; then
                FILES_TO_KEEP=("${FILES_TO_KEEP[@]}" "$i")
        fi

#WHAT NOW?!?!

done

...But I have no idea what to do with it.

I have also thought about the approach with find and using the -mtime option:

Code:

find /path/to/files* -mtime +5 -exec rm {} \;
At the moment it seems like the most reasonable option. I guess I would still need to compare it to a modification date on the file to get a date range. And I would also like to sort them by modification date so that I keep the newest copy from the week, etc.

Any suggestions on how I should proceed? If I have misunderstood CollieJim's code then please help me understand what he means :)

catkin 01-16-2012 05:39 AM

Could you use something like the Towers of Hanoi backup rotation scheme? There is a shell script that says it implements it here.

gunnarflax 01-16-2012 06:46 AM

Quote:

Originally Posted by catkin (Post 4575751)
Could you use something like the Towers of Hanoi backup rotation scheme? There is a shell script that says it implements it here.

That looks interesting! Thanks, I'll look into that!

I also came across information about Rsnapshot. It's a utility that does what I want automatically, so I might base the whole backup system on that instead. Suggestions?

CollieJim 01-16-2012 07:26 AM

I expected the basename to be derived from a hostname or username plus a timestamp, among other possibilities. That way each name is unique but grouped by tag (AN, WK, MO).

gunnarflax 01-16-2012 03:48 PM

rsnapshot couldn't be used in the way I needed it to, so I've kept trying to find a solution myself. This is what I've come up with:

Code:

#!/bin/bash

smart_rm ()
{
        #If no target directory has been specified, exit
        if [ -z "$1" ]; then
                echo "$ISO_DATETIME [ERROR]: You must specify a directory to clean."
                return 1
        fi

        local TRGT_DIR=$1

        #Target must be a directory
        if [ ! -d "$TRGT_DIR" ]; then
                echo "$ISO_DATETIME [ERROR]: The target must exist and be a directory."
                return 1
        fi

        #Make sure that the path ends with /
        if [ "${TRGT_DIR#${TRGT_DIR%?}}" != "/" ]; then
                TRGT_DIR="${TRGT_DIR}/"
        fi

        #Files to delete
        local FILES_TO_DELETE
        #Set a minimum age for files to be deleted
        local DATE_RM_THRESHOLD=2
        #Create the controller for found files
        local FOUND_ONE=1

        COUNTER=0

        #Loop as long as there are files to examine
        for FILE in $(ls -t $TRGT_DIR)
        do
                #Get the file's modification date
                MTIME=$(date -d "$(stat -c %y $TRGT_DIR$FILE)" +%s)

                #Find one to save for every day the last 7 days
                if [ $DATE_RM_THRESHOLD -le 7 ]; then

                        #Get date range
                        DAY_END=$(date -d "$DATE_RM_THRESHOLD days ago" +%s)
                        DAY_START=$(($DAY_END-60*60*24))

                        #If the file's modification time is earlier than our threshold we push it back one day
                        if [ $MTIME -lt $DAY_END ]; then
                                DATE_RM_THRESHOLD=$(($DATE_RM_THRESHOLD+1))
                                FOUND_ONE=1
                        fi

                        #Have we found one to keep for this day?
                        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $DAY_START ] && [ $MTIME -lt $DAY_END ]; then
                                FOUND_ONE=0
                                echo "DAY"
                                echo "$FILE"
                        else
                                FILES_TO_DELETE=(${FILES_TO_DELETE[@]} "$TRGT_DIR$FILE")
                        fi
                fi
               
                #Find one to save for every week the last 4 weeks
                if [ $DATE_RM_THRESHOLD -gt 7 ] && [ $DATE_RM_THRESHOLD -le $((7*4)) ]; then
                       
                        #Get date range
                        WEEK_END=$(date -d "$DATE_RM_THRESHOLD days ago" +%s)
                        WEEK_START=$(($WEEK_END-60*60*24*7))

                        #If the file's modification time is earlier than our threshold we push it back one week
                        if [ $MTIME -lt $WEEK_END ]; then
                                DATE_RM_THRESHOLD=$(($DATE_RM_THRESHOLD+7))
                                FOUND_ONE=1
                        fi

                        #Have we found one to keep for this week?
                        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $WEEK_START ] && [ $MTIME -lt $WEEK_END ]; then
                                FOUND_ONE=0
                                echo "WEEK"
                                echo "$FILE"
                        else
                                FILES_TO_DELETE=(${FILES_TO_DELETE[@]} "$TRGT_DIR$FILE")
                        fi       
                fi

                #Find one to save for every month the last 12 months
                if [ $DATE_RM_THRESHOLD -gt $((7*4)) ] && [ $DATE_RM_THRESHOLD -le $((30*12)) ]; then

                        #Get date range
                        MONTH_END=$(date -d "$DATE_RM_THRESHOLD days ago" +%s)
                        MONTH_START=$(($MONTH_END-60*60*24*30))

                        #If the file's modification time is earlier than our threshold we push it back one month
                        if [ $MTIME -lt $MONTH_END ]; then
                                DATE_RM_THRESHOLD=$(($DATE_RM_THRESHOLD+30))
                                FOUND_ONE=1
                        fi

                        #Have we found one to keep for this month?
                        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $MONTH_START ] && [ $MTIME -lt $MONTH_END ]; then
                                FOUND_ONE=0
                                echo "MONTH"
                                echo "$FILE"
                        else
                                FILES_TO_DELETE=(${FILES_TO_DELETE[@]} "$TRGT_DIR$FILE")
                        fi       
                fi

                #Find one to save for every year
                if [ $DATE_RM_THRESHOLD -gt $((30*12)) ]; then
                       
                        #Get date range
                        YEAR_END=$(date -d "$DATE_RM_THRESHOLD days ago" +%s)
                        YEAR_START=$(($YEAR_END-60*60*24*365))

                        #If the file's modification time is earlier than our threshold we push it back one year
                        if [ $MTIME -lt $YEAR_END ]; then
                                DATE_RM_THRESHOLD=$(($DATE_RM_THRESHOLD+365))
                                FOUND_ONE=1
                        fi

                        #Have we found one to keep for this year?
                        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $YEAR_START ] && [ $MTIME -lt $YEAR_END ]; then
                                FOUND_ONE=0
                                echo "YEAR"
                                echo "$FILE"
                        else
                                FILES_TO_DELETE=(${FILES_TO_DELETE[@]} "$TRGT_DIR$FILE")
                        fi
                fi
        done

        #Show result
        #for FILE in ${FILES_TO_DELETE[@]}
        #do
        #        echo $FILE
        #done

        #Delete the selected files
        for FILE in ${FILES_TO_DELETE[@]}
        do
                echo $FILE
                rm -R $FILE
        done
}

It "almost" works! On the first run everything goes as it should, but on every subsequent run it deletes one more file even though there should be nothing left to delete.

I've used this script to generate files to test with:

Code:

#!/bin/bash

DAYS=0
DATE=$(date -d "$DAYS days ago" +%Y-%m-%d)

while [ $DAYS -le 1200 ]
do
        DATE=$(date -d "$DAYS days ago" +%Y-%m-%d)

        touch "/home/niklas/test/$DATE.txt"
        touch -d "$DATE" "/home/niklas/test/$DATE.txt"

        DAYS=$(($DAYS+1))
done

echo "You've just created a whole lot of files!"

Any suggestions or improvements to my code?

rigor 01-16-2012 07:37 PM

Rather than running the stat command on each individual file, I might be tempted to do something like this:

Code:

ls -ltd --time-style full-iso | ( read  modes links owner group size date time utc_offset file_name
while [ $? -eq 0 ]
    do
        date -d "$date $time $utc_offset" +%s
        read  modes links owner group size date time utc_offset file_name
    done
)


padeen 01-16-2012 08:13 PM

Just use find, this is what it is for.

Code:

# all files between 25 and 35 days old to maximum depth of 2.
FILES="$(find .  \( -mtime +25 -a -mtime -35 \) -maxdepth 2  -type f -exec /bin/ls -1 {} \+)"
for i in "$FILES" ; do /bin/ls -l "$i" ; done

An old saying in the software field is "good enough is good enough". IOW, it is easy to obsess on getting This done exactly right, and That done exactly right, etc. Really, good enough is ok. If you have some file(s) that is about 30 days old, that is good enough. Rinse and repeat for 7 days, 90 days, 180 days, etc.
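Building on that, here is one hedged sketch of a single age "bucket" pass with find: select everything in a window, keep the newest file, remove the rest. (prune_bucket is a made-up name, and -printf is GNU find; adjust the bounds to taste.)

```shell
#!/bin/bash
# Keep the newest file in an age window and delete the others.
# $1 = directory, $2 = lower bound (days), $3 = upper bound (days).
prune_bucket () {
    local dir=$1 older=$2 newer=$3
    # -mtime +N -mtime -M matches files between N and M days old.
    # Sort newest-first on the epoch mtime, drop the first line
    # (the keeper), and remove whatever is left.
    find "$dir" -maxdepth 1 -type f -mtime +"$older" -mtime -"$newer" \
            -printf '%T@ %p\n' |
        sort -rn | cut -d' ' -f2- | tail -n +2 |
        while IFS= read -r f; do rm -- "$f"; done
}
```

Called as `prune_bucket /path/to/backups 25 35` it leaves one file in the 25-35 day window; repeat with the other windows (7 days, 30 days, 90 days, ...).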

gunnarflax 01-17-2012 02:45 AM

Quote:

Originally Posted by padeen (Post 4576384)
Just use find, this is what it is for.

Code:

# all files between 25 and 35 days old to maximum depth of 2.
FILES="$(find .  \( -mtime +25 -a -mtime -35 \) -maxdepth 2  -type f -exec /bin/ls -1 {} \+)"
for i in "$FILES" ; do /bin/ls -l "$i" ; done

An old saying in the software field is "good enough is good enough". IOW, it is easy to obsess on getting This done exactly right, and That done exactly right, etc. Really, good enough is ok. If you have some file(s) that is about 30 days old, that is good enough. Rinse and repeat for 7 days, 90 days, 180 days, etc.

I tried find first, but I didn't know it could match files modified within a date range larger than one day (find -mtime x). I never thought you could combine two of the same test :) Thank you! This will make the code much simpler!

Just curious though, will find require more resources?

padeen 01-17-2012 07:13 AM

At some point, whatever tool you use is going to have to walk the filesystem, whether it is the shell doing it through wildcards or whether it is find.

find's role is to do just that and, while I don't have any data to back me up, I would be surprised if it isn't optimised.

BTW, if you want to look for an alternative, stat is useful for its variety of output. You can parse the output quite easily to get attributes you want. But I would use find if it was me.

gunnarflax 01-17-2012 09:13 AM

Quote:

Originally Posted by padeen (Post 4576803)
At some point, whatever tool you use is going to have to walk the filesystem, whether it is the shell doing it through wildcards or whether it is find.

find's role is to do just that and, while I don't have any data to back me up, I would be surprised if it isn't optimised.

BTW, if you want to look for an alternative, stat is useful for its variety of output. You can parse the output quite easily to get attributes you want. But I would use find if it was me.

I'm trying the "find" solution now but I keep getting an error I don't know how to get rid of:

Code:

FILES="$(find $TRGT_DIR* -daystart \( -mtime +$DATE_RM_THRESHOLD -a -mtime -$DATE_RM_LIMIT \) \+)"

output:
find: paths must precede expression: +
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]

How do I proceed? I can't find an answer on google :)

padeen 01-17-2012 09:48 AM

You haven't given find an exec action. -exec some_command {} \+

{} is a placeholder for all the files that find finds. + means pass them all through at once.
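A quick illustration of the difference between the two terminators (the demo directory is just scratch space):

```shell
#!/bin/bash
dir=$(mktemp -d)                  # scratch directory for the demo
touch "$dir/a" "$dir/b" "$dir/c"

# ';' runs the command once per matched file (three echo invocations):
find "$dir" -type f -exec echo {} \;

# '+' batches all matches into one invocation (one rm for all three):
find "$dir" -type f -exec rm -- {} +
```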

gunnarflax 01-17-2012 10:39 AM

Quote:

Originally Posted by padeen (Post 4576936)
You haven't given find an exec action. -exec some_command {} \+

{} is a placeholder for all the files that find finds. + means pass them all through at once.

OK, I implemented it in a similar way to my old script, since what find does in my script is this:

I use find to select all files within the date range, then I delete the first row from the returned string to make sure I keep one file per date range. Can this be done within the find command with some option like "skip first row"? After that I iterate over the files stored in a variable and delete them one by one.
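(For the record, a sketch of the "skip first row" part done outside of find: tail -n +2 starts output at the second line, so piping find's matches through it spares one keeper per run.)

```shell
# tail -n +2 = print from line 2 onward, i.e. skip the first match
printf '%s\n' keeper victim1 victim2 | tail -n +2
```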

Because I need to save one file per date range I don't think I can utilize the -exec option of find. Please take a look at the code I have now and see if I can utilize it better.

Unfortunately I also have a bug which I cannot find. My previous script deleted properly, so that I kept files according to the pattern I wanted. My new implementation with find strangely saves 406 files instead of the 20-30 that my old one did. Please help me spot the bug:

Code:

smart_rm ()
{
        #If no target directory has been specified, exit
        if [ -z "$1" ]; then
                echo "$ISO_DATETIME [ERROR]: You must specify a directory to clean."
                return 1
        fi

        local TRGT_DIR=$1

        #Target must be a directory
        if [ ! -d "$TRGT_DIR" ]; then
                echo "$ISO_DATETIME [ERROR]: The target must exist and be a directory."
                return 1
        fi

        #Make sure that the path ends with /
        if [ "${TRGT_DIR#${TRGT_DIR%?}}" != "/" ]; then
                TRGT_DIR="${TRGT_DIR}/"
        fi

        #Files to delete
        local FILES_TO_DELETE
        #Set a minimum age for files to be deleted
        local DAY_RM_THRESHOLD=2
        local DAY_SPAN=1
        local DAY_RM_LIMIT=
        local FILES=
        local FILE_COUNT=

        #Loop as long as there are older files
        while [ $(find "$TRGT_DIR"* -daystart -mtime +$DAY_RM_THRESHOLD | wc -l) -gt 0 ]
        do
                if [ $DAY_RM_THRESHOLD -le 7 ]; then
                        FILES=$(find "$TRGT_DIR"* -daystart -mtime $DAY_RM_THRESHOLD)
                else
                        DAY_RM_LIMIT=$(($DAY_RM_THRESHOLD+$DAY_SPAN))
                        FILES=$(find "$TRGT_DIR"* -daystart \( -mtime +$DAY_RM_THRESHOLD -a -mtime -$DAY_RM_LIMIT \) )
                fi

                #Select files to delete
                FILE_COUNT=$(echo "$FILES" | wc -l )

                #Add all except the first to the delete array
                for FILE in $(echo "$FILES" | sed -n 2,"$FILE_COUNT"p)
                do
                        FILES_TO_DELETE=(${FILES_TO_DELETE[@]} "$FILE")
                done

                #Increase the day span accordingly
                if [ $DAY_RM_THRESHOLD -lt 7 ]; then
                        DAY_SPAN=1
                        echo "INCREASE DAY"
                elif [ $DAY_RM_THRESHOLD -ge 7 ] && [ $DAY_RM_THRESHOLD -lt 28 ]; then
                        DAY_SPAN=7
                        echo "INCREASE WEEK"
                elif [ $DAY_RM_THRESHOLD -ge 28 ] && [ $DAY_RM_THRESHOLD -lt 365 ]; then
                        DAY_SPAN=30
                        echo "INCREASE MONTH"
                else
                        DAY_SPAN=365
                        echo "INCREASE YEAR"
                fi

                #Increase the age threshold
                DAY_RM_THRESHOLD=$(($DAY_RM_THRESHOLD+$DAY_SPAN))
        done

        #Show result
        #for FILE in ${FILES_TO_DELETE[@]}
        #do
        #        echo $FILE
        #done

        echo $(ls "$TRGT_DIR" | wc -l)
        echo ${#FILES_TO_DELETE[@]}

        #Delete the selected files
        for FILE in ${FILES_TO_DELETE[@]}
        do
                rm -R $FILE
        done

        echo $(ls "$TRGT_DIR" | wc -l)
}


gunnarflax 01-17-2012 06:40 PM

Now I've got a fully functioning version with the "ls" method! It is as follows:
Code:

        #Files to delete
        local FILES_TO_DELETE
        #Set a minimum age for files to be deleted
        local DAY_RM_THRESHOLD=2
        local DAY_SPAN=1
        #Create the controller for found files
        local FOUND_ONE=1

        #Loop as long as there are files to examine
        for FILE in $(ls -t $TRGT_DIR)
        do
                #Get the file's modification date
                FILE="$TRGT_DIR$FILE"
                MTIME=$(date -d "$(stat -c %y $FILE)" +%s)

                #Increase the day span accordingly
                if [ $DAY_RM_THRESHOLD -lt 7 ]; then
                        DAY_SPAN=1
                elif [ $DAY_RM_THRESHOLD -ge 7 ] && [ $DAY_RM_THRESHOLD -lt 28 ]; then
                        DAY_SPAN=7
                elif [ $DAY_RM_THRESHOLD -ge 28 ] && [ $DAY_RM_THRESHOLD -lt $((28+30*11)) ]; then
                        DAY_SPAN=30
                else
                        DAY_SPAN=365
                fi

                #If the file's modification time is earlier than our date range we push it back one $DAY_SPAN
                if [ $MTIME -lt $(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s) ]; then
                        DAY_RM_THRESHOLD=$(($DAY_RM_THRESHOLD+$DAY_SPAN))
                        FOUND_ONE=1
                fi

                #Get date range
                DATE_END=$(date -d "$DAY_RM_THRESHOLD days ago" +%s)
                DATE_START=$(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s)

                #Have we found one to keep for this day?
                if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $DATE_START ] && [ $MTIME -lt $DATE_END ]; then
                        FOUND_ONE=0
                else
                        FILES_TO_DELETE=(${FILES_TO_DELETE[@]} "$FILE")
                fi       
        done

        #Delete the selected files
        for FILE in ${FILES_TO_DELETE[@]}
        do
                rm -R $FILE
        done

Though I would very much like to know why my "find" version doesn't select all the files. The find version removes about 400 files too few. The ls version is basically the same as the find one, so I don't understand what could cause this strange behaviour. Can I get some feedback on these two scripts and why find behaves so differently?

find:
Code:

        #Files to delete
        local FILES_TO_DELETE
        #Set a minimum age for files to be deleted
        local DAY_RM_THRESHOLD=2
        local DAY_SPAN=1
        local FILES=
        local LINE_COUNT=

        #Loop as long as there are older files
        while [ $(find "$TRGT_DIR"* -daystart -mtime +$DAY_RM_THRESHOLD | wc -l) -gt 0 ]
        do
                if [ $DAY_RM_THRESHOLD -le 7 ]; then
                        FILES=$(ls -t $(find "$TRGT_DIR"* -daystart -mtime $DAY_RM_THRESHOLD))
                else
                        FILES=$(ls -t $(find "$TRGT_DIR"* -daystart \( -mtime +$(($DAY_RM_THRESHOLD)) -a -mtime -$(($DAY_RM_THRESHOLD+$DAY_SPAN)) \) ))
                fi

                #Select files to delete
                LINE_COUNT=$(echo "$FILES" | wc -l )
                #echo $LINE_COUNT

                #Add all files except the first to the delete array
                for FILE in $(echo "$FILES" | tail -n $(($LINE_COUNT)))
                do
                        #echo $FILE
                        FILES_TO_DELETE=(${FILES_TO_DELETE[@]} "$FILE")
                done

                #Increase the day span accordingly
                if [ $DAY_RM_THRESHOLD -lt 7 ]; then
                        DAY_SPAN=1
                        echo "INCREASE DAY"
                elif [ $DAY_RM_THRESHOLD -ge 7 ] && [ $DAY_RM_THRESHOLD -lt 28 ]; then
                        DAY_SPAN=7
                        echo "INCREASE WEEK"
                elif [ $DAY_RM_THRESHOLD -ge 28 ] && [ $DAY_RM_THRESHOLD -lt $((28+30*11)) ]; then
                        DAY_SPAN=30
                        echo "INCREASE MONTH"
                else
                        DAY_SPAN=365
                        echo "INCREASE YEAR"
                fi

                #echo $DAY_SPAN
                #echo $DAY_RM_THRESHOLD

                #Increase the age threshold
                DAY_RM_THRESHOLD=$(($DAY_RM_THRESHOLD+$DAY_SPAN))
        done

        #Delete the selected files
        for FILE in ${FILES_TO_DELETE[@]}
        do
                rm -R $FILE
        done


rigor 01-17-2012 08:13 PM

find operates recursively unless you tell it not to; ls operates recursively only if you tell it to.

So in most situations you wouldn't put the "star"/"asterisk" pattern-matching character after the directory name with find.

Typically find would be used:

Code:

find "$TRGT_DIR"
not

Code:

find "$TRGT_DIR"*
In some situations, using the asterisk might effectively cause duplicate file names in the list of files.
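So to scan just the top level of the target directory without the asterisk, something along these lines should do (a sketch using a scratch directory as a stand-in):

```shell
#!/bin/bash
TRGT_DIR=$(mktemp -d)             # stand-in for the backup directory
mkdir "$TRGT_DIR/sub"
touch -d "10 days ago" "$TRGT_DIR/top.txt" "$TRGT_DIR/sub/nested.txt"

# -maxdepth 1 stops find from descending into subdirectories,
# which is what "$TRGT_DIR"* was (accidentally) trying to achieve:
find "$TRGT_DIR" -maxdepth 1 -type f -mtime +7
```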

rigor 01-17-2012 08:40 PM

Another possible issue is what is sometimes called the "command buffer", or "argument length". That's why, although it may not have seemed "elegant", I illustrated the output of an ls command being read into variables, acting on a single file name per loop iteration, rather than building a single "long" command with a list of file names. The list of file names may grow to be too long, depending on your exact situation.

That's also why using the -exec option of find, or using find with the xargs command, can be so nice, since it passes the list of files through a pipe, not in a command buffer, which may be implemented with length limitations.
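For instance, streaming the names through a pipe with null terminators sidesteps both the argument-length limit and the whitespace problem (sketched with a scratch directory):

```shell
#!/bin/bash
dir=$(mktemp -d)
touch -d "10 days ago" "$dir/stale one" "$dir/stale two"
touch "$dir/fresh"

# -print0 / -0 delimit names with NUL bytes, so spaces in file
# names survive; xargs feeds rm as many names per call as fit.
find "$dir" -maxdepth 1 -type f -mtime +7 -print0 | xargs -0 rm --
```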

gunnarflax 01-18-2012 06:39 AM

I would like to get it working with find but I can't find a working way to implement it. My ls version works rather well right now. The only issue I have, which isn't a problem for me right now but could be if the script were used on another directory, is that I can't process file names with spaces in them. I loop over the result I get from ls, and the loop apparently doesn't process it line by line but word by word. Is there some way to solve this? Can I do the loop some other way? Maybe pipe the ls result into something else?

Here is my script as it is right now with ls:

Code:

smart_rm_backups ()
{
        local TRGT_DIR=""
        local DAY_RM_THRESHOLD=2
        local DAY_SPAN=1
        local DIRECTORIES=1

        while getopts ":p:t:d" opt; do
                case $opt in
                        p)
                                TRGT_DIR=$OPTARG
                        ;;
                        t)
                                if [ $OPTARG -lt 7 ]; then
                                        DAY_RM_THRESHOLD=$OPTARG
                                else
                                        DAY_RM_THRESHOLD=7
                                        DAY_SPAN=7
                                fi
                        ;;
                        d)
                                DIRECTORIES=0
                        ;;
                esac
        done

        #Target must be a directory
        if [ ! -d "$TRGT_DIR" ]; then
                echo "$ISO_DATETIME [ERROR]: The target must exist and be a directory."
                return 1
        fi

        #Make sure that the path ends with /
        if [ "${TRGT_DIR#${TRGT_DIR%?}}" != "/" ]; then
                TRGT_DIR="${TRGT_DIR}/"
        fi

        #Files to delete
        local FILES_TO_DELETE
        local FOUND_ONE=1

        #Loop as long as there are files to examine
        for FILE in $(ls -1 -t $TRGT_DIR -I "*~")
        do
                #Get the file's modification date
                FILE="$TRGT_DIR$FILE"
                MTIME=$(date -d "$(stat -c %y $FILE)" +%s)

                #Check if we should skip directories
                if [ $DIRECTORIES -eq 1 ] && [ -d "$FILE" ]; then
                        continue
                fi

                #Increase the day span accordingly
                if [ $DAY_RM_THRESHOLD -lt 7 ]; then
                        DAY_SPAN=1
                elif [ $DAY_RM_THRESHOLD -ge 7 ] && [ $DAY_RM_THRESHOLD -lt 28 ]; then
                        DAY_SPAN=7
                elif [ $DAY_RM_THRESHOLD -ge 28 ] && [ $DAY_RM_THRESHOLD -lt $((28+30*11)) ]; then
                        DAY_SPAN=30
                else
                        DAY_SPAN=365
                fi

                #If the file's modification time is earlier than our date range we push it back one $DAY_SPAN
                if [ $MTIME -lt $(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s) ]; then
                        DAY_RM_THRESHOLD=$(($DAY_RM_THRESHOLD+$DAY_SPAN))
                        FOUND_ONE=1
                fi

                #Get date range
                DATE_END=$(date -d "$DAY_RM_THRESHOLD days ago" +%s)
                DATE_START=$(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s)

                #Have we found one to keep for this day?
                if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $DATE_START ] && [ $MTIME -lt $DATE_END ]; then
                        FOUND_ONE=0
                else
                        FILES_TO_DELETE=(${FILES_TO_DELETE[@]} "$FILE")
                fi       
        done

        #Delete the selected files
        for FILE in ${FILES_TO_DELETE[@]}
        do
                rm -R $FILE
        done
}


Cedrik 01-18-2012 07:21 AM

Maybe use read ?
Code:

ls -1 -t $TRGT_DIR -I "*~" | while read FILE; do
...

Also don't forget to quote $FILE everywhere
Code:

MTIME=$(date -d "$(stat -c %y "$FILE")" +%s)
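One caveat with the pipe: the while loop then runs in a subshell, so variables it sets are lost afterwards. In bash, feeding the loop from process substitution keeps it in the current shell:

```shell
#!/bin/bash
COUNT=0
while IFS= read -r FILE; do
        COUNT=$((COUNT+1))
done < <(printf '%s\n' one two "three with spaces")

# COUNT is still visible here because no subshell was involved
echo "$COUNT"    # 3
```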

gunnarflax 01-18-2012 09:21 AM

Quote:

Originally Posted by Cedrik (Post 4577791)
Maybe use read ?
Code:

ls -1 -t $TRGT_DIR -I "*~" | while read FILE; do
...

Also don't forget to quote $FILE everywhere
Code:

MTIME=$(date -d "$(stat -c %y "$FILE")" +%s)

Thank you! That solved it, though I had some trouble working out that the read loop runs in a subshell and can't set variables, so I had to write everything out to a temporary file.

This isn't a very elegant solution, so if someone knows how to get this done with find, please let me know :) This is what I've got now:

Code:

smart_rm_backups ()
{
        local TRGT_DIR=""
        local DAY_RM_THRESHOLD=2
        local DAY_SPAN=1
        local DIRECTORIES=1
        local FOUND_ONE=1

        while getopts ":p:t:d" opt; do
                case $opt in
                        p)
                                TRGT_DIR=$OPTARG
                        ;;
                        t)
                                if [ $OPTARG -lt 7 ]; then
                                        DAY_RM_THRESHOLD=$OPTARG
                                else
                                        DAY_RM_THRESHOLD=7
                                        DAY_SPAN=7
                                fi
                        ;;
                        d)
                                DIRECTORIES=0
                        ;;
                esac
        done

        #Target must be a directory
        if [ ! -d "$TRGT_DIR" ]; then
                echo "$ISO_DATETIME [ERROR]: The target must exist and be a directory."
                return 1
        fi

        #Make sure that the path ends with /
        if [ "${TRGT_DIR#${TRGT_DIR%?}}" != "/" ]; then
                TRGT_DIR="${TRGT_DIR}/"
        fi

        #Find files to remove and put them in "files_to_remove.tmp"
        ls -1 -t $TRGT_DIR -I "*~" | while read FILE
        do
                #Get the file's modification date
                FILE="$TRGT_DIR$FILE"
                MTIME=$(date -d "$(stat -c %y "$FILE")" +%s)

                #Check if we should skip directories
                if [ $DIRECTORIES -eq 1 ] && [ -d "$FILE" ]; then
                        continue
                fi

                #Increase the day span accordingly
                if [ $DAY_RM_THRESHOLD -lt 7 ]; then
                        DAY_SPAN=1
                elif [ $DAY_RM_THRESHOLD -ge 7 ] && [ $DAY_RM_THRESHOLD -lt 28 ]; then
                        DAY_SPAN=7
                elif [ $DAY_RM_THRESHOLD -ge 28 ] && [ $DAY_RM_THRESHOLD -lt $((28+30*11)) ]; then
                        DAY_SPAN=30
                else
                        DAY_SPAN=365
                fi

                #If the file's modification time is earlier than our date range we push it back one $DAY_SPAN
                if [ $MTIME -lt $(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s) ]; then
                        DAY_RM_THRESHOLD=$(($DAY_RM_THRESHOLD+$DAY_SPAN))
                        FOUND_ONE=1
                fi

                #Get date range
                DATE_END=$(date -d "$DAY_RM_THRESHOLD days ago" +%s)
                DATE_START=$(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s)

                #Have we found one to keep for this day?
                if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $DATE_START ] && [ $MTIME -lt $DATE_END ]; then
                        FOUND_ONE=0
                else
                        echo "$FILE"
                fi       
        done > "files_to_delete.tmp"

        #Delete the files (read line by line so names with spaces survive)
        while IFS= read -r FILE
        do
                rm -R "$FILE"
        done < "files_to_delete.tmp"

        #Remove the temporary file containing what to remove
        rm "files_to_delete.tmp"
}


rigor 01-18-2012 03:16 PM

Could you just put the "rm" in place of the "echo", eliminate the other loop and the temporary file?

Code:


                .  .  .

                #Have we found one to keep for this day?
                if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $DATE_START ] && [ $MTIME -lt $DATE_END ]; then
                        FOUND_ONE=0
                else
                    rm -R "$FILE"
                fi
        done
}

As has effectively already been mentioned, if you've got spaces within the file names, you should quote virtually any use of the FILE variable value.
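To make the quoting point concrete, here is a tiny sketch (the file name is hypothetical): an unquoted expansion of a name containing a space is split into two words, which is exactly how an unquoted `rm $FILE` goes wrong.

```shell
# A file name containing a space becomes two words when expanded unquoted.
f="daily backup.tar.gz"

set -- $f            # unquoted: word-split into two arguments
echo "unquoted: $# arguments"    # prints 2

set -- "$f"          # quoted: stays one argument
echo "quoted: $# arguments"      # prints 1
```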

Another thought: there are different kinds of elegance. Someone could have a program with 100 lines of code that uses "brute force" approaches, which might not seem very elegant at the code level. Perhaps the program could be changed to use more "sophisticated" approaches, resulting in only 10 lines of code. The 10-line program might seem to have the more elegant code. But if the program that is more elegant at the code level consumes ten times as much of the computer's horsepower as the brute-force approach, then the "elegant" code is not elegant in its use of the computer's resources.

With a shell script, typically, using something built into the shell to do the same thing as an external program takes less of the computer's horsepower. Reading values into variables may seem rather brute-force and not elegant, but if it saves running programs external to the shell, it may be elegant in its use of the computer's horsepower.
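As one example of that trade-off (assuming bash 4.2 or newer): the epoch time can be read with the printf builtin's `%(datefmt)T` format instead of forking the external date command on every loop iteration.

```shell
# External program: one fork+exec per call
now_ext=$(date +%s)

# Bash builtin (printf '%(fmt)T', bash >= 4.2): no extra process
printf -v now_builtin '%(%s)T' -1   # -1 means "current time"

echo "$now_ext $now_builtin"
```

In a loop over thousands of backup files, avoiding one fork per file adds up quickly.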

gunnarflax 01-18-2012 04:06 PM

I guess I can. It just doesn't feel right to remove a file before knowing exactly what to delete :) I can also pipe into xargs at the end of the loop. Would it be better to run rm once for every file or just once for all files?

rigor 01-18-2012 06:13 PM

Quote:

Originally Posted by gunnarflax (Post 4578279)
I guess I can. It just doesn't feel right to remove a file before knowing exactly what to delete :) I can also pipe into xargs at the end of the loop. Would it be better to run rm once for every file or just once for all files?

Maybe I'm in too much of a hurry and misinterpreting the shell script code you've shown us, but it appears that by the time you're ready to echo the file name, you do know which file you wish to delete. By echoing the file name, you're placing it into the temporary file that holds the list of names to be deleted, yes? And don't you already run "rm" in a loop, a separate loop after the main one? If you don't know how many characters' worth of file names you might have, that's where the command buffer size or argument length limitation comes into play. If you try to remove all the files with a single command, the script might fail with an error because you exceed such limit(s).
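One way to sidestep that limit is to let xargs do the batching: GNU xargs packs as many names as fit into each rm invocation and starts another when the limit would be exceeded. A sketch, assuming one name per line (the file names here are made up for the demo):

```shell
# Demo in a throwaway directory (names are hypothetical).
tmpdir=$(mktemp -d)
touch "$tmpdir/old backup 1.tar" "$tmpdir/old backup 2.tar"

# One name per line; GNU xargs -d '\n' takes each line as one argument,
# batching as many as fit per rm invocation, so ARG_MAX is never exceeded.
printf '%s\n' "$tmpdir"/*.tar > "$tmpdir/files_to_delete.tmp"
xargs -d '\n' rm -R < "$tmpdir/files_to_delete.tmp"
```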

If you could assure that file names on your system had at most one space in sequence, not two, and that your script would not be running when "midnight" occurs, you could use code such as this to get the "dates" for all backup files all at once:

Code:

#!/bin/bash

declare -a file_info

# Get all file modification dates as seconds since Unix/Linux Epoch,
# with a single command.
# Then eliminate output columns apart from date and file name.
# Handle file name as portion of array during read, to account
# for possible space in file name.
ls  -1lt -I "*~"  --time-style +%s  "$TRGT_DIR"  |  tail -n +2  |  cut -c36-  |  while read -a file_info
        do
                file_date=${file_info[0]}
                # Remove file date from array.
                unset file_info[0]
                # Concatenate array elements to form file name, handles single spaces in file names, not two spaces in sequence.
                file_name="${file_info[@]}"
                echo "file_name: '$file_name', date as secs. since Epoch: $file_date"
        done

For each date threshold, store the current number of seconds since the Epoch in some variables, to avoid the use of the external date and stat commands inside the loop.
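A sketch of that idea: compute every threshold once, up front, with plain shell arithmetic (86400 seconds per day), so the loop body needs no external commands at all. The threshold names and the sample mtime below are illustrative, not from the script above.

```shell
now=$(date +%s)                     # one external call, before the loop

keep_all_after=$((now - 2*86400))   # newer than this: always keep
daily_until=$((now - 7*86400))      # one per day back to this point

# Inside the loop the comparisons are then pure shell arithmetic.
# Hypothetical mtime: a file modified three days ago.
mtime=$((now - 3*86400))
if [ "$mtime" -ge "$keep_all_after" ]; then
        echo "keep unconditionally"
elif [ "$mtime" -ge "$daily_until" ]; then
        echo "daily candidate"
fi
```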

padeen 02-01-2012 09:17 PM

Quote:

Originally Posted by kakaka (Post 4578347)
command buffer size or argument length limitation

Just FYI, it's pretty unlikely nowadays, since kernel 2.6.23. Most implementations have it around the 2MB size, which is a lot of file names! man execve(2) and http://stackoverflow.com/questions/1...variable-value
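The actual limit on a given system can be inspected directly:

```shell
# POSIX: per-exec limit on combined argument + environment size, in bytes
getconf ARG_MAX

# GNU xargs reports the buffer sizes it will actually use
xargs --show-limits </dev/null 2>&1 | head -n 3
```

On recent Linux kernels `getconf ARG_MAX` typically reports around 2 MB, matching the figure above.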

rigor 02-01-2012 11:05 PM

Well, I'm using kernel 2.6.37, and I get the error often enough on average to remind me that it's there.

After all, if you process plenty of full path names, as output by commands such as locate, or find, a recursive grep with plenty of result file names, the total number of characters can add up fairly quickly.

gunnarflax 02-02-2012 03:41 AM

I've finished the script and it works quite well, I'll post it here as soon as I can. Thanks for all the help!

gunnarflax 02-02-2012 11:30 AM

I've attached the final code to this post. I've never put something under a license before but I thought it'd be nice to do it. I simply followed the instructions here: http://www.gnu.org/licenses/gpl-howto.html

I couldn't attach an archive including the license but I think that's ok. Please let me know if I did something wrong :)

Code:

#!/bin/bash

#----------------------------------------------#
#    Copyright (C) Niklas Rosenqvist, 2012
#----------------------------------------------#
#
#    This program is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <http://www.gnu.org/licenses/>.

#----------------------------#
# Smart rm old files
#----------------------------#

# -t: = path to directory to clean
# -a: = amount of days to save all files (default = 2, max 7)
# -d = delete directories as well

show_error ()
{
        echo "${PROG_NAME}: ${1:-"Unknown error"}"
}

PROG_NAME=$(basename "$0")
TRGT_DIR=""
DAY_RM_THRESHOLD=2
DAY_SPAN=1
DAYS_TO_SAVE=$(date -d "2 days ago" +%s)
DIRECTORIES=1
FOUND_ONE=1

while getopts ":t:a:d" opt; do
        case $opt in
                t)
                        TRGT_DIR=$OPTARG
                ;;
                a)
                        #Check if age is an integer
                        if ! [[ "$OPTARG" =~ ^[0-9]+$ ]]; then
                                show_error "[ERROR]: Age (-a) must be an integer with the value 1-7"
                                exit 1
                        fi

                        if [ $OPTARG -lt 7 ]; then
                                DAY_RM_THRESHOLD=$OPTARG
                                DAYS_TO_SAVE=$(date -d "$(($OPTARG)) days ago" +%s)
                        else
                                DAY_RM_THRESHOLD=7
                                DAYS_TO_SAVE=$(date -d "7 days ago" +%s)
                                DAY_SPAN=7
                        fi
                ;;
                d)
                        DIRECTORIES=0
                ;;
        esac
done

#Reset $OPTIND
OPTIND=1

#Target must be a directory
if [ ! -d "$TRGT_DIR" ]; then
        show_error "[ERROR]: The target must exist and be a directory."
        exit 1
fi

echo "[INFO]: Starting the logarithmic backup cleaning."

#Walk the files newest first and decide which to keep
ls -1 -t "$TRGT_DIR/" -I "*~" | while IFS= read -r FILE
do
        #Get the file's modification date
        FILE="$TRGT_DIR/$FILE"
        MTIME=$(date -d "$(stat -c %y "$FILE")" +%s)

        #If it's within the range to save all files we skip this one
        if [ $MTIME -ge $DAYS_TO_SAVE ]; then
                continue
        fi

        #Check if we should skip directories
        if [ $DIRECTORIES -eq 1 ] && [ -d "$FILE" ]; then
                continue
        fi

        #Increase the day span accordingly
        if [ $DAY_RM_THRESHOLD -lt 7 ]; then
                DAY_SPAN=1
        elif [ $DAY_RM_THRESHOLD -ge 7 ] && [ $DAY_RM_THRESHOLD -lt 28 ]; then
                DAY_SPAN=7
        elif [ $DAY_RM_THRESHOLD -ge 28 ] && [ $DAY_RM_THRESHOLD -lt $((28+30*11)) ]; then
                DAY_SPAN=30
        else
                DAY_SPAN=365
        fi

        #If the file's modification time is earlier than our date range we push it back one $DAY_SPAN
        if [ $MTIME -lt $(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s) ]; then
                DAY_RM_THRESHOLD=$(($DAY_RM_THRESHOLD+$DAY_SPAN))
                FOUND_ONE=1
        fi

        #Get date range
        DATE_END=$(date -d "$DAY_RM_THRESHOLD days ago" +%s)
        DATE_START=$(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s)

        #Have we found one to keep for this day?
        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $DATE_START ] && [ $MTIME -lt $DATE_END ]; then
                FOUND_ONE=0
        else
                rm -R "$FILE"
        fi
done
#done | xargs -d '\n' rm -R

echo "[INFO]: Cleaning of old files complete!"

exit 0

Thanks for all the help!

Reuti 02-02-2012 04:48 PM

It looks like it would be nice if find had an option, besides -depth, to sort the entries by time, either within each directory or overall.

NB: stat -c %Y "$FILE" (uppercase y) outputs the seconds directly.
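Both tips combined in a sketch: GNU find has no sort of its own, but `-printf '%T@ %p\n'` emits the epoch mtime next to each path, and sort(1) then orders the entries newest first. The demo directory and file names below are hypothetical.

```shell
tmpdir=$(mktemp -d)                 # throwaway demo directory
touch -d '3 days ago' "$tmpdir/old.tar"
touch "$tmpdir/new.tar"

# %T@ = mtime in seconds since the Epoch, %p = path; sort newest first.
# The ! -name '*~' filter mirrors the ls -I "*~" used in the script.
newest=$(find "$tmpdir" -maxdepth 1 -type f ! -name '*~' -printf '%T@ %p\n' \
                | sort -rn | head -n 1)
echo "newest: ${newest#* }"

# And stat -c %Y (uppercase Y) gives the epoch seconds directly:
stat -c %Y "$tmpdir/new.tar"
```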

gunnarflax 02-03-2012 02:34 AM

Quote:

Originally Posted by Reuti (Post 4592197)
Looks like it would be nice if find has an option to specify besides -depth something to sort the entries by time, either in each directory or overall.

NB: stat -c %Y "$FILE" (uppercase y) outputs the seconds directly.

I don't use find in the script :S

Thanks for the tip about stat!

caco3 03-10-2013 02:45 PM

Thank you for this great script!
I improved it a bit, so one can also define how many weeks it should keep the backups.
Also I added a bit of documentation and a dry run option:


Code:

#!/bin/bash

#----------------------------------------------#
#    Copyright (C) Niklas Rosenqvist, 2012
#    Improved by George Ruinelli, 2013
#    For updates see http://www.linuxquestions.org/questions/programming-9/how-can-i-purge-files-logarithmically-by-modification-date-bash-923914/page3.html
#----------------------------------------------#
#
#    This program is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <http://www.gnu.org/licenses/>.

#----------------------------#
# Smart rm old files
#----------------------------#

# This script will remove all files except the ones that match the following:
# - All files younger than -a days
# - For all files older than -a days, it keeps one per week, but only for -w weeks
#

# -t: = path to directory to clean
# -a: = amount of days to save all files (default = 2, max 7)
# -w: = amount of weeks to save one file per week (default = 8)
# -d = delete directories as well
# -D = Dry run, do not actually do anything

show_error ()
{
        echo "${PROG_NAME}: ${1:-"Unknown error"}"

        echo "Valid parameters:"
        echo "-t: = path to directory to clean"
        echo "-a: = amount of days to save all files (default = 2, max 7)"
        echo "-w: = amount of weeks to save one file per week (default = 8)"
        echo "-d = delete directories as well"
        echo "-D = Dry run, not actually deleting any file"
}

PROG_NAME=$(basename "$0")
TRGT_DIR=""
DAY_RM_THRESHOLD=2
WEEK_RM_THRESHOLD=8
DAY_SPAN=1
DAYS_TO_SAVE=$(date -d "2 days ago" +%s)
DIRECTORIES=1
FOUND_ONE=1
DRY_RUN=0

while getopts ":t:a:w:d:D" opt; do
        case $opt in
                t)
                        TRGT_DIR=$OPTARG
                ;;
                a)
                        #Check if age is an integer
                        if ! [[ "$OPTARG" =~ ^[0-9]+$ ]]; then
                                show_error "[ERROR]: Age (-a) must be an integer with the value 1-7"
                                exit 1
                        fi

                        if [ $OPTARG -lt 7 ]; then
                                DAY_RM_THRESHOLD=$OPTARG
                                DAYS_TO_SAVE=$(date -d "$(($OPTARG)) days ago" +%s)
                        else
                                DAY_RM_THRESHOLD=7
#                                show_error "[WARNING]: Limiting age (-a) to 7"
                                DAYS_TO_SAVE=$(date -d "7 days ago" +%s)
                                DAY_SPAN=7
                        fi
                ;;
                w)
                        #Set weeks
                        if ! [[ "$OPTARG" =~ ^[0-9]+$ ]]; then
                                show_error "[ERROR]: Week (-w) must be an integer"
                                exit 1
                        fi

                        WEEK_RM_THRESHOLD=$OPTARG
                ;;
                d)
                        DIRECTORIES=0
                ;;
                D)
                        DRY_RUN=1
                ;;
        esac
done

#Reset $OPTIND
OPTIND=1

#Target must be a directory
if [ ! -d "$TRGT_DIR" ]; then
        show_error "[ERROR]: The target must exist and be a directory."
        exit 1
fi

echo "[INFO]: Starting the logarithmic backup cleaning."

if [ $DRY_RUN -eq 1 ]; then
    echo "[INFO]: We are in dry run mode, not actually deleting any file!"
fi

echo "DAY_RM_THRESHOLD=$DAY_RM_THRESHOLD"
echo "WEEK_RM_THRESHOLD=$WEEK_RM_THRESHOLD"

#Walk the files newest first and decide which to keep
ls -1 -t "$TRGT_DIR/" -I "*~" | while IFS= read -r FILE
do
        #Get the file's modification date
        FILE="$TRGT_DIR/$FILE"
        MTIME=$(date -d "$(stat -c %y "$FILE")" +%s)

        #If it's within the range to save all files we skip this one
        if [ $MTIME -ge $DAYS_TO_SAVE ]; then
                echo "Keep  $FILE"
                continue
        fi

        #Check if we should skip directories
        if [ $DIRECTORIES -eq 1 ] && [ -d "$FILE" ]; then
                continue
        fi

        #Increase the day span accordingly
        if [ $DAY_RM_THRESHOLD -lt 7 ]; then
                DAY_SPAN=1
        elif [ $DAY_RM_THRESHOLD -ge 7 ] && [ $DAY_RM_THRESHOLD -lt $((WEEK_RM_THRESHOLD*7)) ]; then
                DAY_SPAN=7
        elif [ $DAY_RM_THRESHOLD -ge $((WEEK_RM_THRESHOLD*7)) ] && [ $DAY_RM_THRESHOLD -lt $((28+30*11)) ]; then
                DAY_SPAN=30
        else
                DAY_SPAN=365
        fi

        #If the file's modification time is earlier than our date range we push it back one $DAY_SPAN
        if [ $MTIME -lt $(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s) ]; then
                DAY_RM_THRESHOLD=$(($DAY_RM_THRESHOLD+$DAY_SPAN))
                FOUND_ONE=1
        fi

        #Get date range
        DATE_END=$(date -d "$DAY_RM_THRESHOLD days ago" +%s)
        DATE_START=$(date -d "$(($DAY_RM_THRESHOLD+$DAY_SPAN)) days ago" +%s)

        #Have we found one to keep for this day?
        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $DATE_START ] && [ $MTIME -lt $DATE_END ]; then
                FOUND_ONE=0
                echo "Keep  $FILE"
        else
                echo "Remove $FILE"

                if [ $DRY_RUN -eq 0 ]; then
                      rm -R "$FILE"
                fi
        fi
done
#done | xargs -d '\n' rm -R

echo "[INFO]: Cleaning of old files complete!"

exit 0


