File Name Comparison --> Delete
This is a bit of a confusing task, at least for me, since I'm new to all this.
I have a computer loaded with a few thousand backup files:
Code:
test_21312145_201208.txt
test_21312145_201209.txt
test_21312145_201210.txt
test_21312145_201211.txt
test_56343434_201208.txt
test_56343434_201209.txt
Essentially, I want to create a script that finds all files with the same key (the middle number) and deletes all of them except the two most recent. Any idea on how I should go about this? |
Are the files all in the same directory?
|
Yes, they're all in the same directory. Any help would definitely be appreciated! I'm essentially completely new to anything Unix/Linux related, but I have done some projects on Windows with PowerShell.
|
You need to write a script.
In the script, first change to the directory where the files are. (The example script below takes the directory as a parameter. Remember, the current directory is . )

List all files (better to use find for this if you have lots of them), and give the list to awk for processing. In the awk script, you can split each file name into components. Using an associative array (an array whose indexes can be anything, not just numbers), generate a list of date numbers for each set of files. Fortunately, you have sane timestamps: if you treat the timestamp parts as integers, you wish to delete all but the two largest ones, right?

After the awk script has generated the list for each set of files, and the list has more than two items in it, find the two largest numbers in the list. (GNU awk does have a sort function you could use, but the linear search I used is both faster and more portable.) Then go through the list again, and print the file names for all entries smaller than the (smaller) maximum you found.

The result is a list of files to be deleted. You can feed it to xargs -r rm -f , which will then call rm -f on those files. (xargs also splits the list into as many invocations as needed, so it will work even if you had gazillions of files.)

Here is the entire script: Code:
#!/bin/bash

If you are satisfied that it would delete the correct files, replace the final command (chmod go-r) with rm -f.

Hope this helps, |
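For illustration, here is a minimal sketch along the lines described above. It is not the original listing; the find pattern and script usage line are assumptions, and it assumes exactly two underscores per file name (as in the question), keeping the safe chmod go-r as the final command:
Code:
#!/bin/bash
# Usage: prune.sh /directory/with/backups   (script name is illustrative)
# Enter the directory given as the first parameter, or abort.
cd "$1" || exit $?

# List the files and let awk group them by the middle (key) number.
# Assumes names like test_KEY_YYYYMM.txt, i.e. exactly two underscores.
find . -maxdepth 1 -type f -name '*_*_*.*' |
awk -F '_' '
    {
        key = $2                      # the middle number
        ts  = $3 + 0                  # timestamp part, treated as an integer
        n   = ++count[key]
        name[key, n]  = $0
        stamp[key, n] = ts
    }
    END {
        for (key in count) {
            if (count[key] <= 2)
                continue              # two or fewer files: nothing to delete
            # Linear search for the two largest timestamps.
            max1 = max2 = -1
            for (i = 1; i <= count[key]; i++) {
                t = stamp[key, i]
                if (t > max1)      { max2 = max1; max1 = t }
                else if (t > max2) { max2 = t }
            }
            # Print every file older than the second-largest timestamp.
            for (i = 1; i <= count[key]; i++)
                if (stamp[key, i] < max2)
                    print name[key, i]
        }
    }' |
xargs -r chmod go-r     # once verified, replace with: xargs -r rm -f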
Hmm, when you say change to the directory where the files are located, do you mean add a
Code:
cd /home/chicken/test
to the top? |
The $1 means the first parameter, and the || exit $? means that if the command on the left side fails, the script will abort. If cd cannot enter the directory, it will output an error message. Therefore, cd some-directory || exit $? will either change to the directory, or print an error message and abort the script.
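For instance, the top of the script might be just:
Code:
#!/bin/bash
# Enter the directory given as the first parameter; abort if that fails.
# (cd prints its own error message, so no extra output is needed.)
cd "$1" || exit $?
|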
Here is a slightly different approach. This one only looks for _YYYYmmdd_HHMMSS. in the file name, and ignores all file names that do not have it. Everything around it is assumed to be exactly the same for each set of files (both before and after the timestamp).

Because of the simpler file name handling, you can modify the find command to consider subdirectories too, if you want. (Because the directory is included in the file name, files in each subdirectory are treated as separate sets, even if the file name part does not differ.)

This one requires GNU find and GNU awk, because it uses the ASCII NUL as the file name separator, and also because it uses the gawk-only asort(). It will therefore work for all possible file names, as long as they have the above time stamp format, and you can pick any number of latest files to be kept. Code:
#!/bin/bash
|
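For illustration, a sketch of this variant as well. Again, this is not the original listing; the script name, parameter handling, and find options are assumptions, but it uses GNU find's -print0 and the gawk-only asort() as described:
Code:
#!/bin/bash
# Usage: prune2.sh /directory/with/backups [number-to-keep]
dir="${1:-.}"
keep="${2:-2}"

# Drop -maxdepth 1 to consider subdirectories as well.
find "$dir" -maxdepth 1 -type f -print0 |
gawk -v keep="$keep" '
    BEGIN { RS = ORS = "\0" }   # NUL-separated names, as from find -print0
    # Only names containing _YYYYmmdd_HHMMSS. are considered; the rest of
    # the name (directory part included) identifies the set it belongs to.
    match($0, /_[0-9]{8}_[0-9]{6}\./) {
        stamp = substr($0, RSTART + 1, 15)   # YYYYmmdd_HHMMSS, sortable
        set   = substr($0, 1, RSTART) substr($0, RSTART + RLENGTH)
        file[set, stamp] = $0
        stamps[set] = stamps[set] SUBSEP stamp
    }
    END {
        for (set in stamps) {
            n = split(substr(stamps[set], 2), list, SUBSEP)
            if (n <= keep)
                continue                 # nothing to delete in this set
            asort(list)                  # oldest time stamps sort first
            for (i = 1; i <= n - keep; i++)
                print file[set, list[i]] # NUL-terminated via ORS
        }
    }' |
xargs -0 -r rm -f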
If the file names all follow the fixed format, a one-liner could also do it:
Code:
$ find . | sort -r | awk '{ number=substr($1,index($1,"_")+1,8); if (old_number == number) { counter++; if (counter > 2) { print $1 }} else { old_number=number; counter=1 }}' | xargs -n 10 rm |
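Note that the one-liner relies on the stated assumptions: every file in the directory matches the fixed name format, the key is exactly eight digits after the first underscore, and no name contains whitespace (plain xargs splits on it). Because sort -r puts the newest time stamps first within each key, the counter lets the first two names through and hands everything after them to rm.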