duplicate file identification

rblampain · 01-12-2016, 01:28 AM

As I understand it, if I copy "file_1" to "file_2", the OS does not create a second file but gives two references (names) to the same file, so that "/maindir/dir1/this_file" and "~other_dir/that_file" point to one and the same file. Only when one of the files becomes different from the other does the OS create a second file.

Is there a Linux command to find when that is the case?

Thank you for your help.

dugan · 01-12-2016, 01:35 AM

Quote:

Originally Posted by rblampain

As I understand it, if I copy "file_1" to "file_2", the OS does not create a second file but gives two references (names) to the same file, so that "/maindir/dir1/this_file" and "~other_dir/that_file" point to one and the same file. Only when one of the files becomes different from the other does the OS create a second file.

Well, that's obviously not the case. But...

Remember that *nix has a technical term for "pointing to the same file", and that term is "hard link".

How to tell if two files are hardlinked? I googled it, and this was a prominent hit:

http://unix.stackexchange.com/a/24139

bigearsbilly · 01-12-2016, 02:55 AM

If you do

Code:

ln this that

as opposed to

Code:

ln -s this that

you create another link (name) in the directory for that file. This is pretty rarely used.
You can find files that have more than one hard link with

Code:

find . -type f ! -links 1

and find every name that refers to a given file with:

Code:

find . -samefile this
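
To try this without touching real data, here is a minimal sketch in a scratch directory. The file names and the variables `demo` and `matches` are made up for illustration, and `stat -c` is the GNU coreutils form, so it may need adjusting on other systems.

```shell
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1

echo "hello" > this
ln this that          # second name for the same inode
cp this copy          # independent file with the same content

# Inode number, link count and name for each entry (GNU stat).
stat -c '%i %h %n' this that copy

# Every name under . that refers to the same file as "this".
matches=$(find . -samefile this | sort)
echo "$matches"

# The shell's -ef test answers the same question for two given paths.
[ this -ef that ] && echo "this and that are the same file"
[ this -ef copy ] || echo "this and copy are different files"
```

"this" and "that" should show the same inode number with a link count of 2, while "copy" gets its own inode.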

syg00 · 01-12-2016, 03:18 AM

Let's not forget copy-on-write and de-dup filesystems. Technically those are multiple files that share the same data. Probably not what the OP is asking about, but possibly still relevant enough to muddy the waters.
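
As a rough illustration of that distinction: GNU `cp` has a `--reflink` option that asks for a copy-on-write clone on filesystems that support it, and with `=auto` it falls back to an ordinary copy elsewhere. Either way the result is a separate file with its own inode, so a hard-link check would not report it. The file names and variables below are made up, and this assumes GNU coreutils.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

echo "some data" > original

# Ask for a copy-on-write clone where supported; plain copy otherwise.
cp --reflink=auto original clone

# Two different inode numbers: two files, not two names for one file.
ino_original=$(stat -c %i original)
ino_clone=$(stat -c %i clone)
echo "original: $ino_original  clone: $ino_clone"

# Changing one does not change the other.
echo "changed" > clone
cat original
```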

pan64 · 01-12-2016, 04:57 AM

Also consider FAT32 and similar filesystems, where files cannot be hard-linked to each other at all.

Ramurd · 01-12-2016, 07:42 AM

Let me break down the OP and let's see where it leads...

Quote:

As I understand it, if I copy "file_1" to "file_2", the OS does not create a second file but gives two references (names) to the same file, so that "/maindir/dir1/this_file" and "~other_dir/that_file" point to one and the same file. Only when one of the files becomes different from the other does the OS create a second file.

No; when you copy a file, a new file is created with the same content. If you mean linking, as the replies above indicate: if file1 changes, file2 changes along with it, because they are the same physical file. And hard linking (as opposed to soft linking) is not possible across different filesystems at all.

A filesystem that implements what you describe may exist, but I'm not aware of one. Even then, both files would still have to reside on the same filesystem.
The question is whether such a thing would actually be desirable. Given a multi-gigabyte file, if you made such a copy and then changed the first file only slightly, the change would take a long time, because the new file would then have to be written to disk. Something would also have to keep track of all the files on that filesystem, see what changes are made, and decide how to act on them.

For example, what should happen if you change file1 after your 'copy' and then revert the change?

As for the difference between a hard and a soft link:
Each file name can be considered (for ease of understanding) a single hard link to a physical file. The link is stored in a special file called a directory, which holds the name together with the file's location on disk (its inode). Creating a new hard link to the file adds a new filename with the same inode to a directory. The file is only physically removed from disk when no hard links point to that inode any more (actually, the inode is marked as 'free' so new files can be written there).

A soft link is a new physical (special) file (= new inode) containing a path to an existing file. If that existing file is removed, the link remains but is broken. Since a soft link contains the path to the file, it can be made across different filesystems. Since a hard link points to an inode, it has to be on the same filesystem (on another filesystem the same inode number will either be occupied by another file or still be free).
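
The difference described above can be seen in a small experiment. This is only a sketch in a scratch directory; the file names and the variables `demo`, `hard_content` and `soft_state` are invented for the example.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

echo "v1" > target
ln target hard        # new directory entry, same inode
ln -s target soft     # new inode whose content is the path "target"

# A write through one hard-linked name is visible through the other.
echo "v2" >> hard
cat target

# Remove the original name: the data survives through "hard",
# while "soft" now points at a path that no longer exists.
rm target
hard_content=$(cat hard)
echo "$hard_content"

if [ -L soft ] && [ ! -e soft ]; then
    soft_state="broken"
else
    soft_state="ok"
fi
echo "soft link is $soft_state"
```

After `rm target` the inode still has one hard link left ("hard"), so the content is kept; only the name "target" is gone, which is exactly what the soft link depended on.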

Quote:

Is there a Linux command to find when that is the case?

Thank you for your help.

So this would not be done with a Linux command; it would have to be implemented in the filesystem, which stores the files and can keep track of the changes.