Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
So essentially I want to create a script that would find all the duplicates of the same key (the middle number) and delete all of them except the two most recent ones.
Yes, they're all in the same directory. Any help would definitely be appreciated! I'm essentially completely new to anything Unix/Linux related, but have done some projects in Windows through PowerShell.
In the script, first change to the directory where the files are. (The example script below takes the directory as a parameter. Remember, the current directory is . )
List all the files (better to use find for this if you have lots of them), and give the list to awk for processing.
In the awk script, you can split each file name into components. Using an associative array (an array whose indexes can be anything, not just numbers), generate a list of date numbers for each set of files.
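As a minimal illustration of that grouping idea (hypothetical key_value names, not the actual script below):

```shell
# Group values by key with an awk associative array.
# Input lines look like key_value; each value is appended under its key.
printf 'a_1\na_2\nb_7\n' | awk -F_ '
    { seen[$1] = seen[$1] " " $2 }              # append value to its key
    END { for (k in seen) print k ":" seen[k] }'
```

Note that the for-in iteration order over an awk array is unspecified, so pipe through sort if you need a stable order.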
Fortunately, you have sane timestamps: if you treat the timestamp parts as integers, you wish to delete all but the two largest ones, right?
After the awk script has generated the list for each set of files, and the list has more than two items in it, find the two largest numbers in the list. (GNU awk does have a sort function you could use, but the linear search I used is both faster and more portable.) Then, go through the list again, and print the file names for all entries smaller than the (smaller) maximum you found.
The result is a list of files to be deleted. You can feed it to xargs -r rm -f , which will then call rm -f for those files. (xargs also splits the files into as many sets as are needed, so it will work even if you have gazillions of files.)
Here is the entire script:
Code:
#!/bin/bash
# If the user just runs this script, show the usage instead.
if [ $# -ne 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    echo "Usage: $0 directory" >&2
    exit 1
fi

# Move to the directory specified on the command line.
cd "$1" || exit $?

# List all files here, and supply them to the awk script.
find . -mindepth 1 -maxdepth 1 -type f -printf '%f\n' | awk '
    BEGIN {
        # Accept any form of newlines, and remove leading and trailing whitespace.
        RS = "[\t\n\v\f\r ]*[\r\n][\t\n\v\f\r ]*"
        # Fields are separated by whitespace, underscores, and/or dots.
        FS = "[\t\v\f _.]"
        files = 0
        split("", known)
        split("", prefix)
        split("", suffix)
        split("", table)
    }

    # We only consider foo_bar_<number>.baz file names.
    ($0 ~ /^[0-9A-Za-z]+_[0-9A-Za-z]+_[0-9]+\.[0-9A-Za-z]+$/) {
        p = $1 "_" $2 "_"   # Prefix bit
        i = $3              # Index (the timestamp number)
        s = "." $4          # Suffix bit
        k = p s             # Prefix and suffix identify the file set
        if (!(k in known)) {
            files++
            known[k] = files
            prefix[files] = p
            suffix[files] = s
        }
        file = known[k]
        table[file] = table[file] " " i
    }

    END {
        for (file = 1; file <= files; file++) {
            # Remove the leading space.
            sub(/^ /, "", table[file])
            # Split the table entry into a list.
            n = split(table[file], list, " ")
            # If no more than two, we keep all.
            if (n <= 2)
                continue
            # Find the two largest values.
            max1 = -1
            max2 = -1
            for (i = 1; i <= n; i++)
                if (list[i] > max1) {
                    max2 = max1
                    max1 = list[i]
                } else if (list[i] > max2) {
                    max2 = list[i]
                }
            # List the file names smaller than the second-largest value.
            for (i = 1; i <= n; i++)
                if (list[i] < max2)
                    printf("%s%s%s\n", prefix[file], list[i], suffix[file])
        }
    }' | xargs -r chmod go-r
Now, if you run the above script as-is, it will only remove read access from the group and others. This should make it easier for you to verify it would remove the correct files.
If you are satisfied it would delete the correct files, replace the final command (chmod go-r) with rm -f.
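While testing, you can also keep the removal harmless by prefixing rm with echo. This is a generic xargs trick, not specific to this script; the file names here are made up:

```shell
# Preview what would be run, without deleting anything.
printf 'old_file_1.txt\nold_file_2.txt\n' | xargs -r echo rm -f
# prints: rm -f old_file_1.txt old_file_2.txt
# Once satisfied, drop the "echo" so rm -f actually runs.
```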
Hope this helps,
Last edited by Nominal Animal; 01-26-2012 at 11:19 AM.
Hmm, when you say change to the directory where the files are located, do you mean add a cd /home/chicken/test to the top?
No, I was describing what the script does. You supply the directory as a command line parameter (./script /home/chicken/test), and the cd "$1" || exit $? line in the script does the deed.
The $1 means the first parameter, and the || exit $? means that if the command on the left side fails, the script will abort.
If cd cannot enter a directory, it will output an error message. Therefore cd some-directory || exit $? will either change to the directory, or print an error message and abort the script.
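A standalone sketch of the idiom (the directory name is made up; in a real script the echo would be exit $?):

```shell
# The || runs the right-hand side only when the left-hand command fails.
cd /no/such/directory-here 2>/dev/null || echo "cd failed, would abort here"
```

In the script itself, exit $? takes the place of the echo, so the failure status of cd becomes the exit status of the whole script.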
Here is a slightly different approach. This one only looks for _YYYYmmdd_HHMMSS. in the file name, and ignores all file names that do not have it. Everything around it is assumed to be exactly the same for each set of files (both before and after the timestamp).
Because of the simpler file name handling, you can modify the find command to consider subdirectories too if you want to. (Because the directory is included in the file name, files in each subdirectory are considered as separate sets, even if the file name part did not differ.)
This one requires GNU find and GNU awk, because it uses the ASCII NUL as the file name separator, and also because it uses the gawk-only asort(). It will therefore work for all possible file names, as long as they have the above-format time stamp, and you can pick any number of latest files to be kept.
Code:
#!/bin/bash
# Usage.
if [ $# -lt 2 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 KEEP DIRECTORY..."
    echo ""
    echo "Helper script for determining which backup files to remove."
    echo "To remove all except the KEEP latest files in each set, use"
    echo ""
    echo "    $0 KEEP DIRECTORY... | xargs -r0 rm -f"
    echo ""
    echo "This script will output an ASCII NUL -delimited list of files,"
    echo "omitting the KEEP latest ones, based on the name."
    echo "(This script ignores the filesystem timestamps.)"
    echo ""
    echo "First, the specified directories are scanned for files containing a"
    echo "    _YYYYmmdd_HHMMSS."
    echo "format timestamp in their pathname. All files that only differ by"
    echo "the timestamp in the same directory are considered a file set."
    echo "The script will not descend into any subdirectories."
    echo ""
    echo "The timestamps in each file set are checked,"
    echo "then the names of all files with older timestamps than"
    echo "the KEEP latest ones will be emitted."
    echo ""
    exit 1
fi
if [ -n "${1//[0-9]/}" ]; then
    echo "$1: Invalid number of files to keep (not a number)." >&2
    exit 1
fi
KEEP=$(($1)) || exit $?
shift 1

find "$@" -maxdepth 1 -type f -printf '%f\0' | gawk -v nmax="$KEEP" '
    BEGIN {
        # File names are separated by ASCII NULs; no field splitting.
        RS = "\0"
        FS = "\0"
        files = 0
        split("", lookup)
        split("", prefix)
        split("", suffix)
        split("", copies)
    }

    {
        # Locate the timestamp in the file name.
        i = match($0, /_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9]\./)
        if (i < 1)
            next
        head = substr($0, 1, i)
        when = substr($0, i + 1, 15)
        tail = substr($0, i + 16)
        # Everything else but the timestamp defines the file set.
        uniq = head tail
        # Find which file set this file belongs to.
        file = lookup[uniq]
        if (file < 1) {
            # New file set.
            file = ++files
            lookup[uniq] = file
            prefix[file] = head
            suffix[file] = tail
        }
        # Add the timestamp to the file set.
        copies[file] = copies[file] " " when
    }

    END {
        for (file = 1; file <= files; file++) {
            # Remove the extra leading space from the timestamp list,
            sub(/^ +/, "", copies[file])
            # and change underscores to dots (so the timestamps compare numerically).
            gsub(/_/, ".", copies[file])
            # Convert the string to an array.
            n = split(copies[file], list, " ")
            # If no more than nmax files in the set, list none.
            if (n <= nmax)
                continue
            # Sort the timestamp array.
            asort(list)
            max = list[n - nmax + 1]
            # Display all file names with older timestamps.
            for (i = 1; i <= n; i++)
                if (list[i] < max) {
                    when = list[i]
                    sub(/\./, "_", when)
                    printf("%s%s%s\0", prefix[file], when, suffix[file])
                }
        }
    }'
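To illustrate the subdirectory point: with GNU find, switching from %f to %p makes the directory part of the emitted name, so identically named files in different subdirectories stay in separate sets. A small self-contained demo (the temporary tree and file names are made up):

```shell
# Create a throwaway tree with the same basename in two subdirectories.
dir=$(mktemp -d)
mkdir -p "$dir/a" "$dir/b"
touch "$dir/a/log_20120101_000000.txt" "$dir/b/log_20120101_000000.txt"

# %p prints the full path, so the two files remain distinguishable.
find "$dir" -type f -printf '%p\n' | sort

rm -rf "$dir"
```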
The process is: use find to get the list of all files in the directory, sort them in reverse order, keep the first two and output all the rest, then feed them in bunches of ten to rm (this avoids the command line getting too long).
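That pipeline could be sketched like this, assuming the timestamped names sort correctly as plain text (the backup_*.tar pattern is illustrative):

```shell
# Newest first; skip the first two; remove the rest, ten at a time.
find . -maxdepth 1 -type f -name 'backup_*.tar' -printf '%f\n' \
    | sort -r \
    | tail -n +3 \
    | xargs -r -n 10 rm -f
```

The -n 10 makes xargs invoke rm with at most ten file names per call, and -r skips running rm entirely when there is nothing to delete.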