LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 06-05-2006, 11:57 PM   #1
rickh
Senior Member
 
Registered: May 2004
Location: Albuquerque, NM USA
Distribution: Debian-Lenny/Sid 32/64 Desktop: Generic AMD64-EVGA 680i Laptop: Generic Intel SIS-AC97
Posts: 4,250

Rep: Reputation: 62
Identify Duplicate Files by File Name


I have used 'mp3report' to make a text file listing all my MP3s, suitable for import into OO-Calc. It looks like the excerpt shown below. Although this excerpt happens to look sorted, the file as a whole is not sorted by Artist/Title (fields 2 and 3, if you will). Obviously, if the actual files were in one directory, there would be no duplicates, but they're spread across about 30 different directories.

Code:
0001|April Stevens|Teach Me Tiger|01|2.19 MB|128 kbps|44.1 kHz|02:23
0002|Billy Joe Shaver|I'm Going Crazy in 3-4 Time|01|3.23 MB|128 kbps|44.1 kHz|03:31
0003|Billy Joe Shaver|Old Chunk of Coal|01|2.34 MB|128 kbps|44.1 kHz|02:33
0004|Billy Joe Shaver|Serious Souls|01|2.00 MB|128 kbps|44.1 kHz|02:10
0005|Blue Velvet Band|Hitch Hiker|01|3.32 MB|160 kbps|44.1 kHz|02:53
0006|Blue Velvet Band|Ramblin' Man|01|3.25 MB|160 kbps|44.1 kHz|02:50
0007|Blue Velvet Band|Sittin' on Top of the World|01|3.89 MB|160 kbps|44.1 kHz|03:23
0008|Blue Velvet Band|Somebody Else You've Known|01|2.83 MB|160 kbps|44.1 kHz|02:28
0009|Blue Velvet Band|Sweet Moments|01|2.86 MB|160 kbps|44.1 kHz|02:29
0010|Blue Velvet Band|The Knight Upon the Road|01|4.21 MB|160 kbps|44.1 kHz|03:40
0011|Blue Velvet Band|Weary Blues From Waitin'|01|3.51 MB|160 kbps|44.1 kHz|03:03
0012|Blue Velvet Band|You'll Find Her Name Written There|01|3.16 MB|160 kbps|44.1 kHz|02:45
0013|Bonnie Raitt|Let me In|01|3.36 MB|128 kbps|44.1 kHz|03:40
0014|Burl Ives|Time|01|2.70 MB|128 kbps|44.1 kHz|02:57
.
.
.
I would like to have a program or script that will scan the entire file and identify any Artist/Title duplicates.

If necessary, I could of course create another file that contains only that (Artist/Title) data... but if I could scan the file as is, that would be even better.

I've looked through Google and these forums, but the suggestions I've seen for identifying duplicate files are based on file size or some sort of hashing scheme rather than file names. I could easily have the same song twice with different file sizes, so that won't work.

Up to now, I've moved such a file to Windows, and used MS Access to identify the duplicates. For obvious reasons, I'd like to stop doing that. Suggestions would be appreciated.

Last edited by rickh; 06-06-2006 at 12:13 AM.
 
Old 06-06-2006, 01:12 AM   #2
CroMagnon
Member
 
Registered: Sep 2004
Location: New Zealand
Distribution: Debian
Posts: 900

Rep: Reputation: 33
I'm pretty sure you're going to need human intervention to positively ID all the dupes, but if the artist and titles are identical for many of them, you could do this:
Code:
cut -d '|' -f 2,3 mp3listfile | sort > list1.txt
cut -d '|' -f 2,3 mp3listfile | sort | uniq > list2.txt
then "diff list1.txt list2.txt" would show you the duplicates (if there were three of a given mp3, you would see two lines).
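For anyone who wants to try this on a small sample before running it on the real list, here is a minimal sketch of the same idea wrapped in a shell function. The function name surplus_copies and the temporary-directory handling are assumptions made for illustration, not part of the commands above. It reads the listing on standard input and prints one Artist|Title line per surplus copy.

```shell
#!/bin/sh
# surplus_copies (hypothetical name): read an mp3report-style listing on stdin
# and print Artist|Title once for every copy beyond the first.
surplus_copies() {
    tmp=$(mktemp -d) || return 1
    # Keep only fields 2 and 3; sorting makes identical pairs adjacent.
    cut -d '|' -f 2,3 | sort > "$tmp/list1.txt"
    # Same list with each run of identical lines collapsed to one line.
    uniq "$tmp/list1.txt" > "$tmp/list2.txt"
    # diff marks lines found only in list1.txt with "< "; those are the extras.
    diff "$tmp/list1.txt" "$tmp/list2.txt" | sed -n 's/^< //p'
    rm -rf "$tmp"
}
```

Usage would be something like "surplus_copies < mp3listfile". A song present three times shows up twice in the output, matching the note above.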
 
Old 06-06-2006, 04:06 AM   #3
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
I've got a Perl script at home that positively identifies duplicate files in directory trees using checksums. It will work for this too.

I'll dig it out later if you like.
 
Old 06-06-2006, 06:52 AM   #4
rickh
Senior Member
 
Registered: May 2004
Location: Albuquerque, NM USA
Distribution: Debian-Lenny/Sid 32/64 Desktop: Generic AMD64-EVGA 680i Laptop: Generic Intel SIS-AC97
Posts: 4,250

Original Poster
Rep: Reputation: 62
Quote:
...if the artist and titles are identical for many of them, you could do this:

cut -d '|' -f 2,3 mp3listfile | sort > list1.txt
cut -d '|' -f 2,3 mp3listfile | sort | uniq > list2.txt

then "diff list1.txt list2.txt" would show you the duplicates
There are not many. Maybe 2 or 3 at the most out of 5000+ files. This technique may work tho ... I'm studying 'man cut' now.

Quote:
identifies duplicate files in directory trees using checksums.
This will definitely not work for reasons described in the original post.
 
Old 06-06-2006, 07:01 AM   #5
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
You can easily do something like this, which will find duplicate lines:
Code:
sort | uniq -d
You'll have to strip out the fields you don't want first.
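If it helps to see how many copies of each song exist, the same sort-then-uniq idea can report counts. This is only a sketch: the function name dupe_counts is made up for illustration, and it assumes the pipe-delimited layout from the first post with Artist and Title in fields 2 and 3.

```shell
#!/bin/sh
# dupe_counts (hypothetical name): read the listing on stdin and print
# "<count><TAB>Artist|Title" for every pair that occurs more than once.
dupe_counts() {
    cut -d '|' -f 2,3 |   # strip everything except Artist|Title
        sort |            # bring identical pairs together
        uniq -c |         # prefix each distinct pair with its count
        awk '$1 > 1 { n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}
```

Usage would be along the lines of "dupe_counts < files.lis".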
 
Old 06-06-2006, 07:13 AM   #6
rickh
Senior Member
 
Registered: May 2004
Location: Albuquerque, NM USA
Distribution: Debian-Lenny/Sid 32/64 Desktop: Generic AMD64-EVGA 680i Laptop: Generic Intel SIS-AC97
Posts: 4,250

Original Poster
Rep: Reputation: 62
Hmmm! ...and, according to the man page, sort has a --unique option. May even be able to skip the '| uniq'

I'll have to rewrite the file to include only Artist/Title, but it looks like I can use the 'cut' command described above to do that.

Thanks.
 
Old 06-06-2006, 07:15 AM   #7
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
No, you are trying to find the duplicates, aren't you?
sort -u will remove the duplicates, keeping one copy of each line, so it can't tell you which ones were repeated.

Something like this:
Code:
cut -f2,3 -d\| files.lis | sort | uniq -d > dupes
grep -f dupes files.lis
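One caveat on the grep step: grep -f treats each line of the dupes file as a regular expression, so a title containing characters such as "." or "*" could in principle match more than intended, and a short Artist|Title could match inside a longer one. A hedged variant is sketched below. The function name list_dupes is hypothetical; it uses grep -F for literal matching and wraps each pattern in "|" so it lines up with whole fields.

```shell
#!/bin/sh
# list_dupes FILE (hypothetical name): print every full line of FILE whose
# Artist|Title pair (fields 2 and 3) occurs more than once.
list_dupes() {
    listing=$1
    patterns=$(mktemp) || return 1
    # Build one literal pattern per duplicated pair, e.g. "|Artist|Title|".
    cut -d '|' -f 2,3 "$listing" | sort | uniq -d | sed 's/.*/|&|/' > "$patterns"
    # -F: patterns are fixed strings, not regexes. Skip grep if none were found.
    if [ -s "$patterns" ]; then
        grep -F -f "$patterns" "$listing"
    fi
    rm -f "$patterns"
}
```

Usage would be "list_dupes files.lis"; matching lines come out in the order they appear in the file.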

Last edited by bigearsbilly; 06-06-2006 at 07:16 AM.
 
Old 06-06-2006, 07:25 AM   #8
rickh
Senior Member
 
Registered: May 2004
Location: Albuquerque, NM USA
Distribution: Debian-Lenny/Sid 32/64 Desktop: Generic AMD64-EVGA 680i Laptop: Generic Intel SIS-AC97
Posts: 4,250

Original Poster
Rep: Reputation: 62
All right! Those commands, exactly as you gave them, work perfectly.
Code:
debian:~$ grep -f dupes mp3.txt
3612|June Carter Cash|Meeting in the Air|18|1.88 MB|128 kbps|44.1 kHz|02:02
3769|June Carter Cash|Meeting in the Air|19|1.88 MB|128 kbps|44.1 kHz|02:03
4684|Cluster Pluckers|Keep on the Sunny Side|25|2.64 MB|128 kbps|44.1 kHz|02:52
4981|Cluster Pluckers|Keep on the Sunny Side|27|2.65 MB|128 kbps|44.1 kHz|02:53
debian:~$
Thanks again.

Last edited by rickh; 06-06-2006 at 07:31 AM.
 
Old 06-21-2006, 03:04 PM   #9
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234
They sure do.

Here is a possibly simpler way, no stripping fields, no intermediate files:
Code:
sort -t\| -k2,3 files.lis  | uniq -Dt\| -f1 -W2  | less
And here is my test corpus:
Code:
0001|April Stevens|Teach Me Tiger|01|2.19 MB|128 kbps|44.1 kHz|02:23
0002|Billy Joe Shaver|I'm Going Crazy in 3-4 Time|01|3.23 MB|128 kbps|44.1 kHz|03:31
0003|Billy Joe Shaver|Old Chunk of Coal|01|2.34 MB|128 kbps|44.1 kHz|02:33
0004|Billy Joe Shaver|Serious Souls|01|2.00 MB|128 kbps|44.1 kHz|02:10
0013|Bonnie Raitt|Let me In|01|3.36 MB|128 kbps|44.1 kHz|03:40
0014|Burl Ives|Time|01|2.70 MB|128 kbps|44.1 kHz|02:57
3612|June Carter Cash|Meeting in the Air|18|1.88 MB|128 kbps|44.1 kHz|02:02
3769|June Carter Cash|Meeting in the Air|19|1.88 MB|128 kbps|44.1 kHz|02:03
xxxx|June Carter Cash|Keep on the Sunny Side|25|2.64 MB|128 kbps|44.1 kHz|02:52
4684|Cluster Pluckers|Keep on the Sunny Side|25|2.64 MB|128 kbps|44.1 kHz|02:52
4981|Cluster Pluckers|Keep on the Sunny Side|27|2.65 MB|128 kbps|44.1 kHz|02:53
0011|Blue Velvet Band|Weary Blues From Waitin'|01|3.51 MB|160 kbps|44.1 kHz|03:03
0012|Blue Velvet Band|You'll Find Her Name Written There|01|3.16 MB|160 kbps|44.1 kHz|02:45
Note the made-up "xxxx" line, which is there to check that two artists performing the same title are not reported as a duplicate.
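As far as I can tell, the -t and -W options are not part of upstream GNU uniq (they may come from a distribution-patched build), so the one-liner above may fail elsewhere with an "invalid option" error. A possible alternative that relies only on awk and sort is sketched here. The function name dupes_by_artist_title is an assumption for illustration; it also needs no intermediate files and keeps the full lines.

```shell
#!/bin/sh
# dupes_by_artist_title FILE (hypothetical name): print the full lines of FILE
# whose fields 2 and 3 together occur more than once, grouped by those fields.
dupes_by_artist_title() {
    # awk reads FILE twice: the first pass counts each Artist|Title key,
    # the second pass prints only lines whose key was seen more than once.
    awk -F'|' '
        NR == FNR { count[$2 FS $3]++; next }
        count[$2 FS $3] > 1
    ' "$1" "$1" | sort -t'|' -k2,3
}
```

Usage would be "dupes_by_artist_title files.lis | less". Because the key is Artist and Title together, a line like the "xxxx" one in the test corpus is not reported.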
 
  


All times are GMT -5. The time now is 06:32 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration