Old 06-26-2012, 10:04 AM   #1
yensidetlaw
LQ Newbie
 
Registered: Dec 2005
Location: Northern VA, USA
Distribution: Kubuntu 5.10
Posts: 5

Rep: Reputation: 0
BASH: Deleting nearly duplicate files based on file name


Hey everyone,

I have a problem that is beyond my ability, and I'm hoping someone here can help. Before I explain: I have spent a fair amount of time reading and searching the internet, and I get pieces of the puzzle, but I think my inexperience is preventing me from seeing the full solution.

Problem: I have 3460 files in a directory that an automated system (that I have inherited) kicks out. These files have a common naming structure that is:

System|Account Num|Report Dt|Gen Dt|Unique Num
STARK_XYZ04502093_06132012_06182012_08530739.csv

System: Box that Generates the Report
Account Number: The general Account Number
Report Date: Last date of reporting period
Gen Date: Report Generation Date
Unique Number: A unique number
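
To make the layout concrete, here's one of the names pulled apart field by field (just an illustration -- the variable names are labels I made up):

Code:
#!/bin/bash
# Illustration only: split a sample report filename on "_" into its five fields.
f="STARK_XYZ04502093_06132012_06182012_08530739.csv"
IFS=_ read -r system acct reportdt gendt uniq <<< "${f%.csv}"
echo "System: $system  Account: $acct  ReportDt: $reportdt  GenDt: $gendt  Unique: $uniq"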

I have noticed that the system in some cases is generating 2 reports in the same month. These files are nearly identical in size, but always have some small difference (a single character in most cases). Here is an example from May and June:
STARK_XYZ04502093_05132012_05142012_14260407.csv
STARK_XYZ04502093_06132012_06142012_05270652.csv
STARK_XYZ04502093_06132012_06182012_08530739.csv

I need a BASH script that will evaluate the entire set and move the earlier generated (duplicate) file to another directory.

I have done some testing and came up with the following to compare two files and figure out which has the earlier and which has the later generation date.

Code:
#!/bin/bash

Var1=STARK_XYZ04502093_06132012_06142012_05270652.xls.csv
Var2=STARK_XYZ04502093_06132012_06182012_08530739.xls.csv

# ${Var:9:8} = account number digits, ${Var:18:8} = report date,
# ${Var:27:8} = generation date (MMDDYYYY)
if [ "${Var1:9:8}" != "${Var2:9:8}" ]
then
        echo "Failure! ${Var1:9:8} and ${Var2:9:8} are not equal!"
elif [ "${Var1:18:8}" != "${Var2:18:8}" ]
then
        echo "Failure! ${Var1:18:8} and ${Var2:18:8} are not equal!"
elif [ "${Var1:27:8}" = "${Var2:27:8}" ]
then
        echo "Success! ${Var1:27:8} is ${Var2:27:8}"
else
        test "${Var1:27:8}" -gt "${Var2:27:8}" && echo "True! ${Var1:27:8} is greater than ${Var2:27:8}" || echo "False! ${Var1:27:8} is less than ${Var2:27:8}"
fi
This works when hard-setting the variables for 2 files. Where I'm getting stuck is expanding this beyond my 2 test files to have it look at all files in the directory.

I think I need a for loop that looks at the list of files and compares each file against the rest of the list to see if the System|AcctNum|ReportDt fields are the same, and then determines which generation date is earlier, but this jumps beyond my BASH experience.

So I'm stuck. In the above example, I want to move the files with the Generation Date of 06142012 to a directory (old/).
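
Something along these lines is what I'm picturing, though I haven't been able to make it work myself -- the globs, the field positions and the old/ directory are just my guesses based on the naming pattern above:

Code:
#!/bin/bash
# Rough sketch only (not tested against the full set): for every
# System_Account_ReportDate group, keep the file with the newest generation
# date and move the rest into old/.  Assumes the filenames follow the
# pattern above and contain no spaces.
mkdir -p old
ls *_*_*_*_*.csv | cut -d_ -f1-3 | sort -u | while read -r key; do
    # Files in this group, sorted by generation date (year first, then month/day)
    files=$(ls "${key}"_*.csv | sort -t_ -k4.5,4.8n -k4.1,4.4n)
    # Everything except the last line (the newest report) goes to old/
    echo "$files" | sed '$d' | while read -r f; do
        mv "$f" old/
    done
done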

Any ideas? Thanks in advance! I've spent the better part of 2 work days trying to come up with something that works and I'm striking out.

Last edited by yensidetlaw; 06-26-2012 at 10:06 AM.
 
Old 06-26-2012, 10:44 AM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743
I would think you could simply sort by modification date---for example, this gives a listing with the most recent file first:
Code:
ls -t
You could then pass this through code to eliminate duplicates based on some criteria (e.g. the same date). The following is only a rough first guess:
Code:
ls -t *.csv > filelist            ## list of filenames sorted by modification date -- most recent first (run in the report directory)
lastdate=""
while read -r filename; do
    date=$(echo "$filename" | cut -d_ -f3)   ## pull the report date out of the name (sed or grep would also work)
    if [ "$date" = "$lastdate" ]; then
        mv "$filename" old/       ## same report date as the previous (newer) file -- assumes old/ exists
    fi
    lastdate=$date                ## remember this date; the first time through, lastdate is empty, so the test fails (as desired)
done < filelist
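
Since the same report date could turn up under more than one account, the key probably needs to include the account number as well. A variation on the loop above -- still only a sketch, and it assumes that files sharing a key land next to each other in the mtime-sorted list:

Code:
ls -t *.csv > filelist                        ## most recent first
lastkey=""
while read -r filename; do
    key=$(echo "$filename" | cut -d_ -f1-3)   ## System_AccountNum_ReportDate
    if [ "$key" = "$lastkey" ]; then
        mv "$filename" old/                   ## older duplicate of the previous file
    fi
    lastkey=$key
done < filelist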
 
  

