LinuxQuestions.org


gunnarflax 01-15-2012 12:58 PM

How can I purge files logarithmically by modification date (BASH)?
 
Hi!

Today I run backups regularly and purge backups older than a specific date. What I would like instead is to keep all files from the last two days, one file per day for the last week, one file per week for the last month, one file per month for the last year, and one file per year beyond that.

I don't fully understand what logic I should implement to achieve something like this. Can anyone give me pointers on how to implement it, and maybe suggest packages that could be of help?

What I have achieved so far is this:

Code:

smart_rm ()
{
        #If wrong number of parameters been specified exit
        if [ -z "$1" ]; then
                echo "$ISO_DATETIME [ERROR]: You must specify a directory to clean."
                return 1
        fi

        local TRGT_DIR=$1

        #Target must be a directory
        if [ ! -d "$TRGT_DIR" ]; then
                echo "$ISO_DATETIME [ERROR]: The target must exist and be a directory."
                return 1
        fi

        #Make sure that the path ends with /
        if [ "${TRGT_DIR#${TRGT_DIR%?}}" != "/" ]; then
                TRGT_DIR="${TRGT_DIR}/"
        fi

        #Select and sort all files
        local FILES

        for i in $(ls -t $TRGT_DIR)
        do
                FILES=("${FILES[@]}" "${TRGT_DIR}${i}")
        done

        #Delete files
        local FILES_TO_KEEP
        local FILES_FROM_LAST_WEEK

        for i in "${FILES[@]}"
        do
                local MOD_DATE=$(stat -c %y "$i")
                MOD_DATE=$(date -d "${MOD_DATE:0:10}" +%s)

                #If the file has been modified within the last two days we keep it
                if [ "$MOD_DATE" -gt "$(date -d "2 days ago" +%s)" ]; then
                        FILES_TO_KEEP=("${FILES_TO_KEEP[@]}" "$i")
                fi

#WHAT NOW?!?!

        done
}

Thanks!

rigor 01-15-2012 11:27 PM

Have you looked at the -exec and -mtime options (and similar) of the find command?

CollieJim 01-16-2012 12:16 AM

Some possible logic:
Code:

BASENAME = SomeFileName


if day of month == 1
  BASENAME = $BASENAME + "_MO"
if day of month == 1  and  month == 6  then
  BASENAME = $BASENAME + "_AN"
if day of week == MONDAY
  BASENAME = $BASENAME + "_WK"

do backups

for each backup file
  if *AN*
      continue
  else if *MO*
      if older than 1 year
          delete
      fi
      continue
  else if *WK*
      if older than 30 days
          delete
      fi
      continue
  else
      if older than 7 days
          delete
      fi
  fi
done
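
In real bash, the purge pass could then key off those name tags. A minimal sketch, assuming GNU find's -delete action and a placeholder $BACKUP_DIR:

Code:

#keep _AN files forever; delete _MO files older than a year,
#_WK files older than 30 days, and untagged dailies older than 7 days
find "$BACKUP_DIR" -type f -name '*_MO*' ! -name '*_AN*' -mtime +365 -delete
find "$BACKUP_DIR" -type f -name '*_WK*' ! -name '*_AN*' ! -name '*_MO*' -mtime +30 -delete
find "$BACKUP_DIR" -type f ! -name '*_AN*' ! -name '*_MO*' ! -name '*_WK*' -mtime +7 -delete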


devUnix 01-16-2012 12:40 AM

CollieJim's pseudocode looks good. You have to decide on the day-of-week, day-of-month, month-of-year values, etc.
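
For example, GNU date can produce each of those values directly:

Code:

date +%u    #day of week (1 = Monday ... 7 = Sunday)
date +%d    #day of month (01..31)
date +%m    #month (01..12)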

gunnarflax 01-16-2012 03:38 AM

CollieJim's pseudocode looks promising, though I don't fully understand how it would be implemented. In the first part, are you suggesting that I should modify the filenames?

Code:

if day of month == 1
  BASENAME = $BASENAME + "_MO"
if day of month == 1  and  month == 6  then
  BASENAME = $BASENAME + "_AN"
if day of week == MONDAY
  BASENAME = $BASENAME + "_WK"

Since I can't set a filename before the backup has run (the script needs to purge old backups that already exist), I don't understand how I would select the files. My problem is that I don't know how to select the files within a certain time period and keep only one of them.

With this snippet I can sort the files on their modification date:

Code:

#Select and sort all files
local FILES

for i in $(ls -t $TRGT_DIR)
do
        FILES=("${FILES[@]}" "${TRGT_DIR}${i}")
done

But I still get all the files when I just want one for every day, one for every week, etc. So my first attempt was to filter them afterwards:

Code:

#Delete files
local FILES_TO_KEEP
local FILES_FROM_LAST_WEEK

for i in "${FILES[@]}"
do
        local MOD_DATE=$(stat -c %y "$i")
        MOD_DATE=$(date -d "${MOD_DATE:0:10}" +%s)

        #If the file has been modified within the last two days we keep it
        if [ "$MOD_DATE" -gt "$(date -d "2 days ago" +%s)" ]; then
                FILES_TO_KEEP=("${FILES_TO_KEEP[@]}" "$i")
        fi

#WHAT NOW?!?!

done

...But I have no idea what to do with it.

I have also thought about the approach with find and using the -mtime option:

Code:

find /path/to/files* -mtime +5 -exec rm {} \;

At the moment it seems like the most reasonable option. I guess I would still need to compare each file's modification date against a date range, and I would also like to sort the files by modification date so that I keep the newest copy from each week, etc.
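
One way to pick just the newest file in a range might be to let find print each file's modification time in epoch seconds and sort on that. A sketch assuming GNU find's -printf (the 7/14-day range is only an example):

Code:

#newest regular file modified between 14 and 7 days ago
find "$TRGT_DIR" -type f -mtime +7 -mtime -14 -printf '%T@ %p\n' |
        sort -rn | head -n 1 | cut -d' ' -f2-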

Any suggestions on how I should proceed? If I have misunderstood CollieJim's code then please help me understand what he means :)

catkin 01-16-2012 05:39 AM

Could you use something like the Towers of Hanoi backup rotation scheme? There is a shell script that says it implements it here.
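
For reference, the idea is that each backup slot is reused at exponentially growing intervals: slot 0 every 2nd run, slot 1 every 4th, slot 2 every 8th, and so on. A rough sketch of picking the slot for run N (my own illustration, not taken from the linked script):

Code:

#slot = number of trailing zero bits in the run number N (N >= 1),
#so slot k comes up once every 2^(k+1) runs
hanoi_slot ()
{
        local n=$1 slot=0
        while [ $((n % 2)) -eq 0 ]; do
                slot=$((slot + 1))
                n=$((n / 2))
        done
        echo $slot
}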

gunnarflax 01-16-2012 06:46 AM

Quote:

Originally Posted by catkin (Post 4575751)
Could you use something like the Towers of Hanoi backup rotation scheme? There is a shell script that says it implements it here.

That looks interesting! Thanks, I'll look into that!

I also came across information about Rsnapshot. It's a utility that does what I want automatically, so I might base the whole backup system on that instead. Suggestions?

CollieJim 01-16-2012 07:26 AM

I expected basename to be derived from a hostname or username and timestamp, among other possibilities. That way each is unique but grouped by tag (AN, WK, MO).
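
For instance (a sketch only; the exact format is up to you):

Code:

#unique per run, grouped by tag; tags stack as in the pseudocode
BASENAME="$(hostname)_$(date +%Y-%m-%d_%H%M%S)"
[ "$(date +%d)" = "01" ] && BASENAME="${BASENAME}_MO"
[ "$(date +%d%m)" = "0106" ] && BASENAME="${BASENAME}_AN"
[ "$(date +%u)" = "1" ] && BASENAME="${BASENAME}_WK"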

gunnarflax 01-16-2012 03:48 PM

rsnapshot couldn't be used in the way I needed it to, so I've kept trying to find a solution myself. This is what I've come up with:

Code:

#!/bin/bash

smart_rm ()
{
        #If wrong number of parameters been specified exit
        if [ -z "$1" ]; then
                echo "$ISO_DATETIME [ERROR]: You must specify a directory to clean."
                return 1
        fi

        local TRGT_DIR=$1

        #Target must be a directory
        if [ ! -d "$TRGT_DIR" ]; then
                echo "$ISO_DATETIME [ERROR]: The target must exist and be a directory."
                return 1
        fi

        #Make sure that the path ends with /
        if [ "${TRGT_DIR#${TRGT_DIR%?}}" != "/" ]; then
                TRGT_DIR="${TRGT_DIR}/"
        fi

        #Files to delete
        local FILES_TO_DELETE
        #Set a minimum age for files to be deleted
        local DATE_RM_THRESHOLD=2
        #Create the controller for found files
        local FOUND_ONE=1

        #Loop as long as there are files to examine
        for FILE in $(ls -t $TRGT_DIR)
        do
                #Get the file's modification date
                MTIME=$(date -d "$(stat -c %y "$TRGT_DIR$FILE")" +%s)

                #Find one to save for every day the last 7 days
                if [ $DATE_RM_THRESHOLD -le 7 ]; then

                        #Get date range
                        DAY_END=$(date -d "$DATE_RM_THRESHOLD days ago" +%s)
                        DAY_START=$(($DAY_END-60*60*24))

                        #If the file's modification time is earlier than our threshold we push it back one day
                        if [ $MTIME -lt $DAY_END ]; then
                                DATE_RM_THRESHOLD=$(($DATE_RM_THRESHOLD+1))
                                FOUND_ONE=1
                        fi

                        #Have we found one to keep for this day?
                        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $DAY_START ] && [ $MTIME -lt $DAY_END ]; then
                                FOUND_ONE=0
                                echo "DAY"
                                echo "$FILE"
                        else
                                FILES_TO_DELETE=("${FILES_TO_DELETE[@]}" "$TRGT_DIR$FILE")
                        fi
                fi
               
                #Find one to save for every week the last 4 weeks
                if [ $DATE_RM_THRESHOLD -gt 7 ] && [ $DATE_RM_THRESHOLD -le $((7*4)) ]; then
                       
                        #Get date range
                        WEEK_END=$(date -d "$DATE_RM_THRESHOLD days ago" +%s)
                        WEEK_START=$(($WEEK_END-60*60*24*7))

                        #If the file's modification time is earlier than our threshold we push it back one week
                        if [ $MTIME -lt $WEEK_END ]; then
                                DATE_RM_THRESHOLD=$(($DATE_RM_THRESHOLD+7))
                                FOUND_ONE=1
                        fi

                        #Have we found one to keep for this week?
                        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $WEEK_START ] && [ $MTIME -lt $WEEK_END ]; then
                                FOUND_ONE=0
                                echo "WEEK"
                                echo "$FILE"
                        else
                                FILES_TO_DELETE=("${FILES_TO_DELETE[@]}" "$TRGT_DIR$FILE")
                        fi       
                fi

                #Find one to save for every month the last 12 months
                if [ $DATE_RM_THRESHOLD -gt $((7*4)) ] && [ $DATE_RM_THRESHOLD -le $((30*12)) ]; then

                        #Get date range
                        MONTH_END=$(date -d "$DATE_RM_THRESHOLD days ago" +%s)
                        MONTH_START=$(($MONTH_END-60*60*24*30))

                        #If the file's modification time is earlier than our threshold we push it back one month
                        if [ $MTIME -lt $MONTH_END ]; then
                                DATE_RM_THRESHOLD=$(($DATE_RM_THRESHOLD+30))
                                FOUND_ONE=1
                        fi

                        #Have we found one to keep for this month?
                        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $MONTH_START ] && [ $MTIME -lt $MONTH_END ]; then
                                FOUND_ONE=0
                                echo "MONTH"
                                echo "$FILE"
                        else
                                FILES_TO_DELETE=("${FILES_TO_DELETE[@]}" "$TRGT_DIR$FILE")
                        fi       
                fi

                #Find one to save for every year
                if [ $DATE_RM_THRESHOLD -gt $((30*12)) ]; then
                       
                        #Get date range
                        YEAR_END=$(date -d "$DATE_RM_THRESHOLD days ago" +%s)
                        YEAR_START=$(($YEAR_END-60*60*24*365))

                        #If the file's modification time is earlier than our threshold we push it back one year
                        if [ $MTIME -lt $YEAR_END ]; then
                                DATE_RM_THRESHOLD=$(($DATE_RM_THRESHOLD+365))
                                FOUND_ONE=1
                        fi

                        #Have we found one to keep for this year?
                        if [ $FOUND_ONE -eq 1 ] && [ $MTIME -ge $YEAR_START ] && [ $MTIME -lt $YEAR_END ]; then
                                FOUND_ONE=0
                                echo "YEAR"
                                echo "$FILE"
                        else
                                FILES_TO_DELETE=("${FILES_TO_DELETE[@]}" "$TRGT_DIR$FILE")
                        fi
                fi
        done

        #Show result
        #for FILE in ${FILES_TO_DELETE[@]}
        #do
        #        echo $FILE
        #done

        #Delete the selected files
        for FILE in "${FILES_TO_DELETE[@]}"
        do
                echo "$FILE"
                rm -R "$FILE"
        done
}

I "almost" works! The first run everything goes as it should but on every subsequent run it deletes one more file though there should be no more files to delete.

I've used this script to generate files to test with:

Code:

#!/bin/bash

DAYS=0

while [ $DAYS -le 1200 ]
do
        DATE=$(date -d "$DAYS days ago" +%Y-%m-%d)

        touch "/home/niklas/test/$DATE.txt"
        touch -d "$DATE" "/home/niklas/test/$DATE.txt"

        DAYS=$(($DAYS+1))
done

echo "You've just created a whole lot of files!"

Any suggestions or improvements to my code?

rigor 01-16-2012 07:37 PM

Rather than running the stat command on each individual file, I might be tempted to do something like this:

Code:

#one ls of the target directory's entries instead of one stat per file
ls -ltd --time-style=full-iso "$TRGT_DIR"* |
while read -r modes links owner group size date time utc_offset file_name
do
        date -d "$date $time $utc_offset" +%s
done


padeen 01-16-2012 08:13 PM

Just use find; this is what it's for.

Code:

# all files between 25 and 35 days old, to a maximum depth of 2.
FILES="$(find . -maxdepth 2 -type f \( -mtime +25 -a -mtime -35 \))"
for i in $FILES ; do /bin/ls -l "$i" ; done

An old saying in the software field is "good enough is good enough". IOW, it is easy to obsess over getting this done exactly right and that done exactly right, etc. Really, good enough is OK. If you have some file(s) that are about 30 days old, that is good enough. Rinse and repeat for 7 days, 90 days, 180 days, etc.
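
In other words, you could loop over the bucket boundaries and keep one file per bucket. A rough sketch of that idea (the bucket edges are illustrative, and rm is echoed for a dry run):

Code:

#for each (older, newer) day-boundary pair, keep the newest
#match and delete the rest
for RANGE in "35 25" "90 80" "180 170"; do
    set -- $RANGE
    KEEP=$(find . -maxdepth 2 -type f -mtime +$2 -mtime -$1 \
            -printf '%T@ %p\n' | sort -rn | head -n 1 | cut -d' ' -f2-)
    find . -maxdepth 2 -type f -mtime +$2 -mtime -$1 |
    while read -r f; do
        [ "$f" = "$KEEP" ] || echo rm "$f"
    done
done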

gunnarflax 01-17-2012 02:45 AM

Quote:

Originally Posted by padeen (Post 4576384)
Just use find; this is what it's for.

Code:

# all files between 25 and 35 days old, to a maximum depth of 2.
FILES="$(find . -maxdepth 2 -type f \( -mtime +25 -a -mtime -35 \))"
for i in $FILES ; do /bin/ls -l "$i" ; done

An old saying in the software field is "good enough is good enough". IOW, it is easy to obsess over getting this done exactly right and that done exactly right, etc. Really, good enough is OK. If you have some file(s) that are about 30 days old, that is good enough. Rinse and repeat for 7 days, 90 days, 180 days, etc.

I tried find first, but I didn't know it could find files modified within a date range larger than one day (find -mtime x). I never thought you could combine two instances of the same test :) Thank you! This will make the code much simpler!

Just curious though, will find require more resources?

padeen 01-17-2012 07:13 AM

At some point, whatever tool you use is going to have to walk the filesystem, whether it is the shell doing it through wildcards or whether it is find.

find's role is to do just that and, while I don't have any data to back me up, I would be surprised if it isn't optimised.

BTW, if you want to look for an alternative, stat is useful for its variety of output formats. You can parse its output quite easily to get the attributes you want. But I would use find if it were me.
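
For example, GNU stat can print the epoch mtime and name in one go (using the thread's $TRGT_DIR):

Code:

#epoch mtime and name for every entry, newest first
stat -c '%Y %n' "$TRGT_DIR"/* | sort -rn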

gunnarflax 01-17-2012 09:13 AM

Quote:

Originally Posted by padeen (Post 4576803)
At some point, whatever tool you use is going to have to walk the filesystem, whether it is the shell doing it through wildcards or whether it is find.

find's role is to do just that and, while I don't have any data to back me up, I would be surprised if it isn't optimised.

BTW, if you want to look for an alternative, stat is useful for its variety of output formats. You can parse its output quite easily to get the attributes you want. But I would use find if it were me.

I'm trying "find" solution now but I keep getting an error I don't know how to get rid off:

Code:

FILES="$(find $TRGT_DIR* -daystart \( -mtime +$DATE_RM_THRESHOLD -a -mtime -$DATE_RM_LIMIT \) \+)"

output:
find: paths must precede expression: +
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]

How do I proceed? I can't find an answer on google :)

padeen 01-17-2012 09:48 AM

You haven't given find an -exec action: -exec some_command {} \+

{} is a placeholder for all the files that find finds; + means pass them all to a single invocation of the command.
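
With an action added (or the stray \+ simply removed, since find prints matches by default), your command might become:

Code:

#let find print the matches itself...
FILES="$(find $TRGT_DIR* -daystart \( -mtime +$DATE_RM_THRESHOLD -a -mtime -$DATE_RM_LIMIT \))"

#...or hand them all to one ls invocation with -exec ... +
find $TRGT_DIR* -daystart \( -mtime +$DATE_RM_THRESHOLD -a -mtime -$DATE_RM_LIMIT \) -exec ls -l {} +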

