Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
So essentially I want to create a script that would find all the duplicates of the same key (the middle number) and delete all of them except the two most recent ones.
Yes, they're all in the same directory. Any help would definitely be appreciated! I'm essentially completely new to anything Unix/Linux related, but have done some projects in Windows through PowerShell.
In the script, first change to the directory where the files are. (The example script below takes the directory as a parameter. Remember, the current directory is . )
List all the files (better to use find for this if you have lots of them), and give the list to awk for processing.
In the awk script, you can split each file name into components. Using an associative array (an array whose indexes can be anything, not just numbers), generate a list of date numbers for each set of files.
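As a minimal illustration of that grouping idea (hypothetical key_value names, not the actual script below):

```shell
# Group values by key with an awk associative array.
# Input lines look like key_value; each value is appended under its key.
printf 'a_1\na_2\nb_7\n' | awk -F_ '
    { seen[$1] = seen[$1] " " $2 }              # append value to its key
    END { for (k in seen) print k ":" seen[k] }'
```

Note that the for-in iteration order over an awk array is unspecified, so pipe through sort if you need a stable order.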
Fortunately, you have sane timestamps: if you treat the timestamp parts as integers, you wish to delete all but the two largest ones, right?
After the awk script has generated the list for each set of files, and the list has more than two items in it, find the two largest numbers in the list. (GNU awk does have a sort function you could use, but the linear search I used is both faster and more portable.) Then, go through the list again, and print the file names for all entries smaller than the (smaller) maximum you found.
The result is a list of files to be deleted. You can feed it to xargs -r rm -f , which will then call rm -f for those files. (xargs also splits the files into as many sets as are needed, so it will work even if you have gazillions of files.)
Here is the entire script:
Code:
#!/bin/bash
# If the user just runs this script, show the usage instead.
if [ $# -ne 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    echo "Usage: $0 directory" >&2
    exit 1
fi

# Move to the directory specified on the command line.
cd "$1" || exit $?

# List all files here, and supply them to the awk script.
find . -mindepth 1 -maxdepth 1 -type f -printf '%f\n' | awk '
    BEGIN {
        # Accept any form of newlines, and remove leading and trailing whitespace.
        RS = "[\t\n\v\f\r ]*[\r\n][\t\n\v\f\r ]*"
        # Fields are separated by whitespace, underscores, and/or dots.
        FS = "[\t\v\f _.]"
        files = 0
        split("", known)
        split("", prefix)
        split("", suffix)
        split("", table)
    }

    # We only consider foo_bar_<number>.baz file names.
    ($0 ~ /^[0-9A-Za-z]+_[0-9A-Za-z]+_[0-9]+\.[0-9A-Za-z]+$/) {
        p = $1 "_" $2 "_"   # Prefix bit
        i = $3              # Index (the timestamp number)
        s = "." $4          # Suffix bit
        k = p s             # Prefix and suffix identify the file set
        if (!(k in known)) {
            files++
            known[k] = files
            prefix[files] = p
            suffix[files] = s
        }
        file = known[k]
        table[file] = table[file] " " i
    }

    END {
        for (file = 1; file <= files; file++) {
            # Remove the leading space.
            sub(/^ /, "", table[file])
            # Split the table entry into a list.
            n = split(table[file], list, " ")
            # If no more than two, we keep all.
            if (n <= 2)
                continue
            # Find the two largest values.
            max1 = -1
            max2 = -1
            for (i = 1; i <= n; i++)
                if (list[i] > max1) {
                    max2 = max1
                    max1 = list[i]
                } else if (list[i] > max2) {
                    max2 = list[i]
                }
            # List the file names smaller than the second-largest value.
            for (i = 1; i <= n; i++)
                if (list[i] < max2)
                    printf("%s%s%s\n", prefix[file], list[i], suffix[file])
        }
    }' | xargs -r chmod go-r
Now, if you run the above script as-is, it will only remove read access from the group and others. This should make it easier for you to verify it would remove the correct files.
If you are satisfied it would delete the correct files, replace the final command (chmod go-r) with rm -f.
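While testing, you can also keep the removal harmless by prefixing rm with echo. This is a generic xargs trick, not specific to this script; the file names here are made up:

```shell
# Preview what would be run, without deleting anything.
printf 'old_file_1.txt\nold_file_2.txt\n' | xargs -r echo rm -f
# prints: rm -f old_file_1.txt old_file_2.txt
# Once satisfied, drop the "echo" so rm -f actually runs.
```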
Hope this helps,
Last edited by Nominal Animal; 01-26-2012 at 11:19 AM.
Hmm, when you say change to the directory where the files are located, do you mean add a cd /home/chicken/test to the top?
No, I was describing what the script does. You supply the directory as a command line parameter (./script /home/chicken/test), and the cd "$1" || exit $? line in the script does the deed.
The $1 means the first parameter, and the || exit $? means that if the command on the left side fails, the script will abort.
If cd cannot enter a directory, it will output an error message. Therefore cd some-directory || exit $? will either change to the directory, or print an error message and abort the script.
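A standalone sketch of the idiom (the directory name is made up; in a real script the echo would be exit $?):

```shell
# The || runs the right-hand side only when the left-hand command fails.
cd /no/such/directory-here 2>/dev/null || echo "cd failed, would abort here"
```

In the script itself, exit $? takes the place of the echo, so the failure status of cd becomes the exit status of the whole script.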
Here is a slightly different approach. This one only looks for _YYYYmmdd_HHMMSS. in the file name, and ignores all file names that do not have it. Everything around it is assumed to be exactly the same for each set of files (both before and after the timestamp).
Because of the simpler file name handling, you can modify the find command to consider subdirectories too if you want to. (Because the directory is included in the file name, files in each subdirectory are considered as separate sets, even if the file name part did not differ.)
This one requires GNU find and GNU awk, because it uses the ASCII NUL as the file name separator, and also because it uses the gawk-only asort(). It will therefore work for all possible file names, as long as they have the above-format time stamp, and you can pick any number of latest files to be kept.
Code:
#!/bin/bash
# Usage.
if [ $# -lt 2 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 KEEP DIRECTORY..."
    echo ""
    echo "Helper script for determining which backup files to remove."
    echo "To remove all except the KEEP latest files in each set, use"
    echo ""
    echo "    $0 KEEP DIRECTORY... | xargs -r0 rm -f"
    echo ""
    echo "This script will output an ASCII NUL -delimited list of files,"
    echo "omitting the KEEP latest ones, based on the name."
    echo "(This script ignores the filesystem timestamps.)"
    echo ""
    echo "First, the specified directories are scanned for files containing a"
    echo "    _YYYYmmdd_HHMMSS."
    echo "format timestamp in their pathname. All files that only differ by"
    echo "the timestamp in the same directory are considered a file set."
    echo "The script will not descend into any subdirectories."
    echo ""
    echo "The timestamps in each file set are checked,"
    echo "then the names of all files with older timestamps than"
    echo "the KEEP latest ones will be emitted."
    echo ""
    exit 1
fi
if [ -n "${1//[0-9]/}" ]; then
    echo "$1: Invalid number of files to keep (not a number)." >&2
    exit 1
fi
KEEP=$(($1)) || exit $?
shift 1

find "$@" -maxdepth 1 -type f -printf '%f\0' | gawk -v nmax="$KEEP" '
    BEGIN {
        # File names are separated by ASCII NULs; no field splitting.
        RS = "\0"
        FS = "\0"
        files = 0
        split("", lookup)
        split("", prefix)
        split("", suffix)
        split("", copies)
    }

    {
        # Locate the timestamp in the file name.
        i = match($0, /_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9]\./)
        if (i < 1)
            next
        head = substr($0, 1, i)
        when = substr($0, i + 1, 15)
        tail = substr($0, i + 16)
        # Everything else but the timestamp defines the file set.
        uniq = head tail
        # Find which file set this file belongs to.
        file = lookup[uniq]
        if (file < 1) {
            # New file set.
            file = ++files
            lookup[uniq] = file
            prefix[file] = head
            suffix[file] = tail
        }
        # Add the timestamp to the file set.
        copies[file] = copies[file] " " when
    }

    END {
        for (file = 1; file <= files; file++) {
            # Remove the extra leading space from the timestamp list,
            sub(/^ +/, "", copies[file])
            # and change underscores to dots (so the timestamps compare numerically).
            gsub(/_/, ".", copies[file])
            # Convert the string to an array.
            n = split(copies[file], list, " ")
            # If no more than nmax files in the set, list none.
            if (n <= nmax)
                continue
            # Sort the timestamp array.
            asort(list)
            max = list[n - nmax + 1]
            # Display all file names with older timestamps.
            for (i = 1; i <= n; i++)
                if (list[i] < max) {
                    when = list[i]
                    sub(/\./, "_", when)
                    printf("%s%s%s\0", prefix[file], when, suffix[file])
                }
        }
    }'
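To illustrate the subdirectory point: with GNU find, switching from %f to %p makes the directory part of the emitted name, so identically named files in different subdirectories stay in separate sets. A small self-contained demo (the temporary tree and file names are made up):

```shell
# Create a throwaway tree with the same basename in two subdirectories.
dir=$(mktemp -d)
mkdir -p "$dir/a" "$dir/b"
touch "$dir/a/log_20120101_000000.txt" "$dir/b/log_20120101_000000.txt"

# %p prints the full path, so the two files remain distinguishable.
find "$dir" -type f -printf '%p\n' | sort

rm -rf "$dir"
```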
The process is: use find to get the list of all files in the directory, sort them in reverse order, keep the first two and output all the rest, then feed them in bunches of ten to rm (this avoids the command line getting too long).
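That pipeline could be sketched like this, assuming the timestamped names sort correctly as plain text (the backup_*.tar pattern is illustrative):

```shell
# Newest first; skip the first two; remove the rest, ten at a time.
find . -maxdepth 1 -type f -name 'backup_*.tar' -printf '%f\n' \
    | sort -r \
    | tail -n +3 \
    | xargs -r -n 10 rm -f
```

The -n 10 makes xargs invoke rm with at most ten file names per call, and -r skips running rm entirely when there is nothing to delete.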