LinuxQuestions.org
Old 01-26-2012, 10:54 AM   #1
chickens
LQ Newbie
 
Registered: Jan 2012
Posts: 3

Rep: Reputation: Disabled
File Name Comparison --> Delete


This is a bit of a confusing task, at least for me being new to all this.

I have a computer loaded with a few thousand backups.

test_21312145_201208.txt
test_21312145_201209.txt
test_21312145_201210.txt
test_21312145_201211.txt
test_56343434_201208.txt
test_56343434_201209.txt

So essentially I want to create a script that finds all the files sharing the same key (the middle number) and deletes all of them except the two most recent ones.

Any idea on how I should go about this?
 
Old 01-26-2012, 12:05 PM   #2
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 243
Are the files all in the same directory?
 
Old 01-26-2012, 12:07 PM   #3
chickens
LQ Newbie
 
Registered: Jan 2012
Posts: 3

Original Poster
Rep: Reputation: Disabled
Yes, they're all in the same directory. Any help would definitely be appreciated! I'm essentially completely new to anything Unix/Linux related, but I have done some projects in Windows through PowerShell.

Last edited by chickens; 01-26-2012 at 12:09 PM.
 
Old 01-26-2012, 12:16 PM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 947
You need to write a script.

In the script, first change to the directory where the files are. (The example script below takes the directory as a parameter. Remember, the current directory is . )

List all files (better use find for this if you have lots of them), and give the list to awk for processing.

In the awk script, you can split each file name into components. Using an associative array (an array where the indexes can be anything, not just numbers), generate a list of date numbers for each set of files.
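
As a rough illustration of that grouping step (a minimal sketch, not the full script; the helper name group_by_key is made up here, and the names are assumed to follow the test_KEY_DATE.txt pattern from the first post):

```shell
# Hypothetical helper: read file names on stdin and collect the date
# parts of each key in an associative array.
group_by_key() {
    awk -F '[_.]' '
        # With FS set to "_" or ".": $1 = "test", $2 = key, $3 = date, $4 = extension
        NF == 4 { dates[$2] = dates[$2] " " $3 }
        END     { for (key in dates) print key ":" dates[key] }
    ' | sort
}

printf '%s\n' test_21312145_201208.txt test_21312145_201209.txt \
              test_56343434_201208.txt | group_by_key
# prints:
# 21312145: 201208 201209
# 56343434: 201208
```

The trailing sort is only there because awk's for (key in array) loop visits the keys in no particular order.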

Fortunately, you have sane timestamps: if you treat the timestamp parts as integers, you wish to delete all but the two largest ones, right?

Once the awk script has generated the list for each set of files, then for every list with more than two items, find the two largest numbers in it. (GNU awk does have a sort function you could use, but the linear search I used is both faster and more portable.) Then, go through the list again, and print the file names for all entries smaller than the second-largest value you found.
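
The linear search can be tried on its own. This is a sketch under the assumption that the values are non-negative integers, one per line (the helper name two_largest is made up for the illustration):

```shell
# Hypothetical helper: print the largest and second-largest number
# found on stdin, using a single pass and no sorting.
two_largest() {
    awk '
        BEGIN { max1 = -1; max2 = -1 }
        {
            v = $1 + 0                     # force a numeric comparison
            if (v > max1)      { max2 = max1; max1 = v }
            else if (v > max2) { max2 = v }
        }
        END { print max1, max2 }
    '
}

printf '%s\n' 201208 201211 201209 201210 | two_largest
# prints: 201211 201210
```

Every entry smaller than the second number printed would then be a candidate for deletion.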

The result is a list of files to be deleted. You can feed it to xargs -r rm -f , which will then call rm -f for those files. (xargs also splits the files into as many sets as are needed, so it will work even if you had gazillions of files.)
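
One cautious way to see what xargs would do is to put echo in front of the command, so the rm command line is only printed. A small sketch, assuming GNU xargs (for the -r option); preview_delete is a made-up name:

```shell
# Hypothetical helper: show the rm command xargs would build from the
# names on stdin, without running it.
preview_delete() {
    xargs -r echo rm -f
}

printf '%s\n' test_21312145_201208.txt test_21312145_201209.txt | preview_delete
# prints: rm -f test_21312145_201208.txt test_21312145_201209.txt
```

With empty input, -r makes xargs run nothing at all, so no stray rm -f is attempted.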

Here is the entire script:
Code:
#!/bin/bash

# If the user just runs this script, show the usage instead.
if [ $# -ne 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    echo "Usage: $0 directory" >&2
    exit 1
fi

# Move to the directory specified on the command line.
cd "$1" || exit $?

# List all files here, and supply them to the awk script.    
find . -mindepth 1 -maxdepth 1 -type f -printf '%f\n' | awk '
    BEGIN {
        # Accept any form of newlines, and remove leading and trailing whitespace.
        RS = "[\t\n\v\f\r ]*[\r\n][\t\n\v\f\r ]*"

        # Fields are separated by whitespace, underscores, and/or dots.
        FS = "[\t\v\f _.]"

        files = 0
        split("", known)
        split("", prefix)
        split("", suffix)
        split("", table)
    }

    # We only consider foo_bar_<number>.baz file names.
    ($0 ~ /^[0-9A-Za-z]+_[0-9A-Za-z]+_[0-9]+\.[0-9A-Za-z]+$/) {
        p = $1 "_" $2 "_"  # Prefix bit
        i = $3             # Index
        s = "." $4         # Suffix bit
        k = p s            # Prefix and suffix identify the file

        if (!(k in known)) {
           files++
           known[k] = files
           prefix[files] = p
           suffix[files] = s
        }

        file = known[k]
        table[file] = table[file] " " i
    }

    END {
        for (file = 1; file <= files; file++) {
            # Remove the leading space.
            sub(/^ /, "", table[file])

            # Split the table into a list.
            n = split(table[file], list, " ")

            # If no more than two, we keep all.
            if (n <= 2)
                continue

            # Find the two largest values.
            max1 = -1
            max2 = -1

            for (i = 1; i <= n; i++)
                if (list[i] > max1) {
                    max2 = max1
                    max1 = list[i]
                } else
                if (list[i] > max2) {
                    max2 = list[i]
                }

            # Print the files older than the two newest ones.
            for (i = 1; i <= n; i++)
                if (list[i] < max2)
                    printf("%s%s%s\n", prefix[file], list[i], suffix[file])
        }
    }' | xargs -r chmod go-r
Now, if you run the above script as-is, it will only remove read access from the group and others. This should make it easier for you to verify it would remove the correct files.

If you are satisfied it would delete the correct files, replace the final command (chmod go-r) with rm -f

Hope this helps,

Last edited by Nominal Animal; 01-26-2012 at 12:19 PM.
 
Old 01-26-2012, 02:00 PM   #5
chickens
LQ Newbie
 
Registered: Jan 2012
Posts: 3

Original Poster
Rep: Reputation: Disabled
Hmm, when you say change to the directory where the files are located, do you mean add a
cd /home/chicken/test to the top?
 
Old 01-26-2012, 03:37 PM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 947
Quote:
Originally Posted by chickens
Hmm, when you say change to the directory where the files are located, do you mean add a
cd /home/chicken/test to the top?
No, I was describing what the script does. You supply the directory as a command line parameter (./script /home/chicken/test), and the cd "$1" || exit $? line in the script does the deed.

The $1 means the first parameter, and the || exit $? means that if the command on the left side fails, the script will abort.

If cd cannot enter a directory, it will output an error message. Therefore cd some-directory || exit $? will either change to the directory, or print an error message and abort the script.
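
To see the pattern in isolation, here is a small sketch. The function try_cd is hypothetical, and it runs in a subshell so that exit only ends the subshell rather than your terminal session:

```shell
# Hypothetical helper: enter directory $1 or give up, mirroring the
# cd "$1" || exit $? line of the script.
try_cd() {
    (
        cd "$1" || exit $?      # on failure cd prints its own error message
        echo "now in $(pwd)"
    )
}

try_cd /                                  # prints: now in /
try_cd /no/such/dir || echo "aborted"     # cd complains, then: aborted
```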
 
Old 01-26-2012, 08:30 PM   #7
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 947
Here is a slightly different approach. This one only looks for _YYYYmmdd_HHMMSS. in the file name, and ignores all file names that do not have it. Everything around it is assumed to be exactly the same for each set of files (both before and after the timestamp).

Because of the simpler file name handling, you can modify the find command to consider subdirectories too if you want to. (Because the directory is included in the file name, files in each subdirectory are considered as separate sets, even if the file name part did not differ.)

This one requires GNU find and GNU awk, because it uses the ASCII NUL as the file name separator, and also because it uses the gawk-only asort(). It will therefore work for all possible file names, as long as they have the above-format time stamp, and you can pick any number of latest files to be kept.
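
Why the NUL separator matters can be seen with a file name that contains a space. The following is only a sketch, assuming GNU find, sort, and xargs; list_nul and the demo file names are made up:

```shell
# Hypothetical helper: print each regular file in directory $1 in brackets,
# passing the names NUL-delimited so whitespace cannot split them.
list_nul() {
    find "$1" -maxdepth 1 -type f -printf '%f\0' | sort -z |
        xargs -r0 -n 1 printf '[%s]\n'
}

demo=$(mktemp -d)
touch "$demo/plain.txt" "$demo/with space.txt"
list_nul "$demo"
# prints:
# [plain.txt]
# [with space.txt]
rm -rf "$demo"
```

Without the NUL delimiter and xargs -0, plain xargs would split "with space.txt" into two separate arguments.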

Code:
#!/bin/bash

# Usage.
if [ $# -lt 2 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 KEEP DIRECTORY..."
    echo ""
    echo "Helper script for determining which backup files to remove."
    echo "To remove all except the KEEP latest files in each set, use"
    echo ""
    echo "       $0 KEEP DIRECTORY... | xargs -r0 rm -f"
    echo ""
    echo "This script will output an ASCII NUL -delimited list of files,"
    echo "omitting the KEEP latest ones, based on the name."
    echo "(This script ignores the filesystem timestamps.)"
    echo ""
    echo "First, the specified directories are scanned for files containing a"
    echo "       _YYYYmmdd_HHMMSS."
    echo "format timestamp in their pathname. All files that only differ by"
    echo "the timestamp in the same directory are considered a file set."
    echo "The script will not descend into any subdirectories."
    echo ""
    echo "The timestamps in each file set are checked,"
    echo "then the names of all files with older timestamps than"
    echo "the KEEP latest ones will be emitted."
    echo ""
    exit 1
fi
if [ -z "$1" ] || [ -n "${1//[0-9]/}" ]; then
    echo "$1: Invalid number of files to keep." >&2
    exit 1
fi

# Force base 10 so a leading zero (such as 08) is not parsed as octal.
KEEP=$((10#$1))
shift 1

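# Print the full path (%p), so that files in different directories form
# separate sets and the output can be passed straight to rm.
find "$@" -maxdepth 1 -type f -printf '%p\0' | gawk -v nmax="$KEEP" '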
    BEGIN {
        # Each record is one NUL-terminated file name, kept as a single field.
        RS = "\0"   # ASCII NUL separators
        FS = "\0"   # No field splitting

        files = 0
        split("", lookup)
        split("", prefix)
        split("", suffix)
        split("", copies)
    }

    {
        # Locate the timestamp in the file name.
        i = match($0, /_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9]\./)
        if (i < 1)
            next

        head = substr($0, 1, i)
        when = substr($0, i + 1, 15)
        tail = substr($0, i + 16)

        # Everything else but the timestamp defines the fileset.
        uniq = head tail

        # Find which fileset this file belongs to.
        file = lookup[uniq]
        if (file < 1) {
            # New fileset.
            file = ++files
            lookup[uniq] = file
            prefix[file] = head
            suffix[file] = tail
        }

        # Add timestamp to fileset.
        copies[file] = copies[file] " " when
    }

    END {
        for (file = 1; file <= files; file++) {
            # Remove extra leading space from fileset timestamp list,
            sub(/^ +/, "", copies[file])
            # and change underscores to dots.
            gsub(/_/, ".", copies[file])

            # Convert string to array.
            n = split(copies[file], list, " ")

            # If no more than nmax files in set, list none.
            if (n <= nmax)
                continue

            # Sort the timestamp array.
            asort(list)
            max = list[n - nmax + 1]

            # Display all filenames with older timestamps.
            for (i = 1; i <= n; i++)
                if (list[i] < max) {
                    when = list[i]
                    sub(/\./, "_", when)
                    printf("%s%s%s\0", prefix[file], when, suffix[file])
                }
        }
    }'
 
Old 01-27-2012, 07:36 AM   #8
Reuti
Senior Member
 
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 13.1
Posts: 1,330

Rep: Reputation: 254
If the file names follow a fixed format, a one-liner could also do:
Code:
$ find . | sort -r | awk '{ number=substr($1,index($1,"_")+1,8); if (old_number == number) { counter++; if (counter > 2) { print $1 }} else { old_number=number; counter=1 }}' | xargs -n 10 rm
The process is: use find to get the list of all files in the directory, sort them in reverse order, keep the first two of each key and output all additional ones, then feed those to rm in batches of ten (which keeps the command line from getting too long).
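
The sort and awk stages can be tried without touching any files by feeding them the sample names from the first post. This is a sketch only; the wrapper name surplus is made up, and the awk body is the one-liner's, just spread over several lines:

```shell
# Hypothetical helper: read file names on stdin and print those beyond
# the two newest of each 8-digit key.
surplus() {
    sort -r | awk '{
        number = substr($1, index($1, "_") + 1, 8)   # the middle number
        if (old_number == number) {
            counter++
            if (counter > 2) print $1
        } else {
            old_number = number
            counter = 1
        }
    }'
}

printf '%s\n' test_21312145_201208.txt test_21312145_201209.txt \
              test_21312145_201210.txt test_21312145_201211.txt \
              test_56343434_201208.txt test_56343434_201209.txt | surplus
# prints:
# test_21312145_201209.txt
# test_21312145_201208.txt
```

Only once that list looks right would it make sense to pipe it on to xargs and rm.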
 