How can I purge files logarithmically by modification date (BASH)?
Hi!
Today I do backups regularly and purge backups that are older than a specific date. What I would like is to save all files from the last two days, one file per day from the last week, one file per week for the last month, one file per month for the last year, and one file for every year. I don't fully understand what logic I should implement to achieve something like this. Can anyone help me with pointers on how to implement this, and maybe suggestions on packages that could help? What I have achieved so far is this: Code:
smart_rm () |
Have you looked at the -exec, -mtime and similar options of the find command?
|
Some possible logic:
Code:
BASENAME=SomeFileName |
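A minimal sketch of how such basename/tag logic might look in bash. The tags (AN, WK, MO) are taken from a later post in this thread, but the date tests, variable names and paths here are illustrative guesses, not CollieJim's actual code: Code:
#!/bin/bash
# Hypothetical illustration only: pick a tag for today's backup from the date,
# so older files can later be purged per tag with different retention periods.
BASENAME=backup-$(hostname)-$(date +%Y%m%d)

if [ "$(date +%j)" = "001" ]; then
    TAG=AN        # 1 January -> keep as an annual copy
elif [ "$(date +%d)" = "01" ]; then
    TAG=MO        # first of the month -> keep as a monthly copy
elif [ "$(date +%u)" = "7" ]; then
    TAG=WK        # Sunday -> keep as a weekly copy
else
    TAG=DAY       # ordinary daily copy
fi

cp /path/to/todays/backup "${BASENAME}-${TAG}"   # placeholder source path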
CollieJim's pseudo code looks good. You have to decide on Day of Week, Day of Month, Month of Year values, etc.
|
CollieJim's pseudocode looks promising, though I don't fully understand how it would be implemented. In the first part, are you suggesting I should modify the filenames of the files?
Code:
if day of month == 1
With this snippet I can sort the files on their modification date: Code:
#Select and sort all files
Code:
#Delete files
I have also thought about the approach with find and using the -mtime option: Code:
find /path/to/files* -mtime +5 -exec rm {} \;
Any suggestions on how I should proceed? If I've misunderstood CollieJim's code then please help me understand what he means :) |
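One way the "select and sort" step might be done (a sketch only; the actual snippet isn't shown above, and $TRGT_DIR is an assumed variable pointing at the backup directory): Code:
#Select and sort all files by modification time, newest first
ls -1 -t "$TRGT_DIR"

#The same information with epoch timestamps, sortable numerically
stat -c '%Y %n' "$TRGT_DIR"/* | sort -rn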
Could you use something like the Towers of Hanoi backup rotation scheme? There is a shell script that says it implements it here.
|
Quote:
I also came across information about Rsnapshot. It's a utility that does what I want automatically, so I might base the whole backup system on that instead. Suggestions? |
I expected basename to be derived from a hostname or username and timestamp, among other possibilities. That way each is unique but grouped by tag (AN, WK, MO).
|
rsnapshot couldn't be used in the way I needed it to, so I've kept trying to find a solution myself. This is what I've come up with:
Code:
#!/bin/bash
I've used this script to generate files to test with: Code:
#!/bin/bash |
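A possible stand-in for such a test-file generator (a sketch; the directory, file names and two-year range are made up): Code:
#!/bin/bash
# Hypothetical test-data generator: one empty file per day over the last two
# years, with the mtime set back accordingly, so the purge logic can be tested.
TEST_DIR=./testfiles          # made-up location
mkdir -p "$TEST_DIR"
for i in $(seq 0 730); do
    touch -d "-$i days" "$TEST_DIR/backup_$(date -d "-$i days" +%Y%m%d)"
done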
Rather than running the stat command on each individual file, I might be tempted to do something like this:
Code:
ls -ltd --time-style full-iso | ( read modes links owner group size date time utc_offset file_name |
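A possible completion of that read loop (my guess at the intended shape, with $TRGT_DIR assumed to be the backup directory): Code:
ls -ltd --time-style=full-iso "$TRGT_DIR"/* | \
while read -r modes links owner group size date time utc_offset file_name; do
    # file_name is the last field read, so internal spaces are preserved
    echo "$file_name was last modified on $date $time"
done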
Just use find; this is what it is for.
Code:
# all files between 25 and 35 days old to maximum depth of 2. |
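Roughly what that comment describes might look like this (a sketch, with $TRGT_DIR assumed): Code:
# all files between 25 and 35 days old, to a maximum depth of 2
find "$TRGT_DIR" -maxdepth 2 -type f -mtime +25 -mtime -35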
Quote:
Just curious though, will find require more resources? |
At some point, whatever tool you use is going to have to walk the filesystem, whether it is the shell doing it through wildcards or whether it is find.
find's role is to do just that and, while I don't have any data to back me up, I would be surprised if it isn't optimised. BTW, if you want to look for an alternative, stat is useful for its variety of output. You can parse the output quite easily to get the attributes you want. But I would use find if it were me. |
Quote:
Code:
FILES="$(find $TRGT_DIR* -daystart \( -mtime +$DATE_RM_THRESHOLD -a -mtime -$DATE_RM_LIMIT \) \+)" |
You haven't given find an exec action. -exec some_command {} \+
{} is a placeholder for all the files that find finds. + means pass them all through at once. |
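With an action added, the quoted command might look something like this (a sketch reusing the quoted variable names; echo is used instead of rm so it can be tested safely first): Code:
find "$TRGT_DIR" -daystart \( -mtime +"$DATE_RM_THRESHOLD" -a -mtime -"$DATE_RM_LIMIT" \) -exec echo {} +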
Quote:
I use find to find all files within the date range, then I remove the first line from the returned string to make sure that I keep one file for the date range. Can this be done within the find command with some option like "skip first line" or something? After that I iterate through all the files stored in a variable and delete them one by one. Because I need to save one file per date range, I don't think I can use the -exec action of find. Please take a look at the code I have now and see if I can use it better. Unfortunately I also have a bug which I cannot find. My previous script deleted properly, so that I kept files according to the pattern I wanted. My new implementation with find strangely saves 406 files instead of the 20-30 that my other one did. Please help me spot the bug: Code:
smart_rm () |
Now I have a fully functioning version with the "ls" method! It is as follows:
Code:
#Files to delete
find: Code:
#Files to delete |
find operates recursively unless you tell it not to, ls operates recursively if you tell it to.
So in most situations you wouldn't put the "star"/"asterisk" pattern-matching character after the directory name with find. Typically find would be used like this: Code:
find "$TRGT_DIR"
rather than like this: Code:
find "$TRGT_DIR"* |
Another possible issue is what is sometimes called the "command buffer", or "argument length". That's why, although it may not have seemed "elegant", I illustrated the output of an ls command being read as variables, acting on a single file name per loop iteration, rather than building a single "long" command with a list of file names. The list of file names may grow to be too long, depending on your exact situation.
That's also why using the -exec option of find, or using find with the xargs command, can be so nice, since the list of files goes through a pipe rather than into a command buffer, which may be implemented with length limitations. |
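For example, something along these lines (a sketch, not the poster's script, reusing the $TRGT_DIR and $DATE_RM_THRESHOLD variables from earlier posts; the -print0/-0 pairing also keeps file names with spaces intact): Code:
find "$TRGT_DIR" -maxdepth 1 -type f -mtime +"$DATE_RM_THRESHOLD" -print0 | xargs -0 rm --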
I would like to get it working with find but I can't find a working way to implement it. My ls version works rather well right now. The only issue I have, which isn't a problem for me right now but could be if the script were used on another directory, is that I can't process file names with spaces in them. I loop over the result I get from ls, and the loop apparently doesn't process it line by line but word by word instead. Is there some way to solve this? Can I do the loop some other way? Maybe pipe the ls result into something else?
Here is my script as it is right now with ls: Code:
smart_rm_backups () |
Maybe use read ?
Code:
ls -1 -t "$TRGT_DIR" -I "*~" | while read -r FILE; do Code:
MTIME=$(date -d "$(stat -c %y "$FILE")" +%s) |
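Put together, the two snippets might form a loop like this (my combination, not the original post; note that the names from ls are relative to $TRGT_DIR): Code:
ls -1 -t "$TRGT_DIR" -I "*~" | while read -r FILE; do
    # prefix the directory, since ls printed bare file names
    MTIME=$(date -d "$(stat -c %y "$TRGT_DIR/$FILE")" +%s)
    echo "$FILE -> $MTIME"
done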
Quote:
This isn't a very elegant solution, so if someone knows how to get this done with find, please let me know :) This is what I have now: Code:
smart_rm_backups () |
Could you just put the "rm" in place of the "echo", eliminate the other loop and the temporary file?
Code:
Another thought: there are different types of elegance. Someone could have a program with 100 lines of code that uses "brute force" approaches, which might not seem very elegant at a code level. Perhaps the program could be changed to use more "sophisticated" approaches, resulting in a program with only 10 lines of code. The 10-line program might seem to have more elegant code. But if the program that is more elegant at a code level consumes 10 times as much of the computer's horsepower as the approach that is brute force at a code level, then the "elegant" code is not elegant in its use of the computer's resources. With a shell script, typically, using something built into the shell to do the same thing as an external program takes less of the computer's horsepower. Reading values into variables may seem rather brute-force and not elegant, but if it saves running programs external to the shell, it may be elegant in its use of the computer's horsepower. |
I guess I can. It just doesn't feel right to remove a file before knowing exactly what to delete :) I can also pipe into xargs at the end of the loop. Would it be better to run rm once for every file or just once for all files?
|
Quote:
If you could ensure that file names on your system had at most one space in sequence, not two, and that your script would not be running when "midnight" occurs, you could use code such as this to get the "dates" for all backup files all at once: Code:
#!/bin/bash |
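A guess at the shape of that approach (not the attached script; $TRGT_DIR is assumed): one ls call supplies the modification date for every file, avoiding a stat call per file: Code:
#!/bin/bash
# Guessed shape only: one ls invocation yields the modification date of
# every file in the directory, instead of running stat once per file.
ls -l --time-style=+%Y-%m-%d "$TRGT_DIR" | \
while read -r modes links owner group size fdate fname; do
    [ -z "$fname" ] && continue    # skips the leading "total" line
    echo "date=$fdate  file=$fname"
done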
Quote:
|
Well, I'm using kernel 2.6.37, and I get the error often enough on average to remind me that it's there.
After all, if you process plenty of full path names, as output by commands such as locate or find, or by a recursive grep with plenty of result file names, the total number of characters can add up fairly quickly. |
I've finished the script and it works quite well, I'll post it here as soon as I can. Thanks for all the help!
|
I've attached the final code to this post. I've never put something under a license before but I thought it'd be nice to do it. I simply followed the instructions here: http://www.gnu.org/licenses/gpl-howto.html
I couldn't attach an archive including the license but I think that's ok. Please let me know if I did something wrong :) Code:
#!/bin/bash |
Looks like it would be nice if find had an option, besides -depth, to sort the entries by time, either in each directory or overall.
NB: stat -c %Y "$FILE" (uppercase y) outputs the seconds directly. |
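In other words, taking the MTIME line from the earlier post: Code:
# instead of
MTIME=$(date -d "$(stat -c %y "$FILE")" +%s)
# the same epoch value comes from a single call
MTIME=$(stat -c %Y "$FILE")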
Quote:
Thanks for the tip about stat! |
Thank you for this great script!
I improved it a bit, so one can also define for how many weeks it should keep the backups. I also added a bit of documentation and a dry-run option: Code:
#!/bin/bash |
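For reference, a dry-run switch is often just a small wrapper like this (a generic sketch, not the attached script; the option name and function are illustrative): Code:
# Generic shape of a dry-run switch
DRY_RUN=false
[ "$1" = "--dry-run" ] && DRY_RUN=true

remove_backup () {
    if $DRY_RUN; then
        echo "would remove: $1"
    else
        rm -- "$1"
    fi
}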