I have two hard drives with data. There are some unique files on disk 1 and some unique files on disk 2, but most of the data is present on both disks.
The problem however is that not all the files have been placed in the equivalent directories.
Now I want to make sure that all files from disk 2 are stored on disk 1, without wasting space on files that are already present on that disk.
Checking all the files and directories manually is impossible. There are too many files and directories.
I am therefore looking for a way to automate the laborious part of this task.
If the directory structure and names on both disks were identical, then I could have used rsync to copy only the new and updated files to disk 1.
Unfortunately that is not the case, because some directories have a different name and some of the files are placed in different directories.
Example:
Code:
Disk 1/beverages/black_coffee.txt
Disk 1/cars/4x4/jeep.txt
Disk 1/cars/sports/porsche.txt
Disk 1/food/onion.txt
Disk 1/food/carrot.txt
Disk 2/drinks/black_coffee.txt
Disk 2/car/jeep.txt
Disk 2/porsche.txt
Disk 2/food/onion.txt
Disk 2/food/potatoes.txt
Disk 2/vegetables/broccoli.txt
In this example, the file black_coffee.txt already exists on both disks, but in a different directory. The same goes for jeep.txt and porsche.txt, which have been placed in different directories at a different level.
Only potatoes.txt and broccoli.txt exist solely on disk 2. Therefore the name and absolute path of the potatoes.txt and broccoli.txt files should be added to a list, which can be used to copy these files to a fitting directory on disk 1.
I need some way to check which files on disk 2 do not yet exist on disk 1.
For now I assume that the filenames are unique identifiers, which would make it easier to compare files. (Perhaps md5 hashing would be an alternative, but I fear that might be too complex and too heavy for this many files.)
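To make the hashing alternative concrete, here is a rough Python sketch of how it could be kept affordable. It is only one possible approach, not a tested solution: it assumes both disks are mounted and readable, and the function names (`file_digest`, `walk_files`, `missing_by_content`) are made up for illustration. The idea is to compare file sizes first, so that md5 is only computed for files on disk 2 that have a same-sized candidate on disk 1.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def walk_files(root):
    """Yield the path of every file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def missing_by_content(disk1_root, disk2_root):
    """Return sorted paths on disk 2 whose content matches no file on disk 1."""
    # Group disk 1 files by size: only same-sized files can have equal content.
    by_size = {}
    for path in walk_files(disk1_root):
        by_size.setdefault(os.path.getsize(path), []).append(path)

    digests_by_size = {}  # size -> set of disk 1 digests, filled on demand
    missing = []
    for path in walk_files(disk2_root):
        size = os.path.getsize(path)
        if size not in by_size:
            missing.append(path)  # no same-sized candidate, so no hashing needed
            continue
        if size not in digests_by_size:
            digests_by_size[size] = {file_digest(p) for p in by_size[size]}
        if file_digest(path) not in digests_by_size[size]:
            missing.append(path)
    return sorted(missing)
```

With this approach a renamed or moved copy is still recognised as a duplicate, at the cost of reading every file that has a size match on the other disk.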
The intersection of both sets (the files present on both disks) covers about 80% of the data.
I am, however, interested in the relative complement of the first set in the second, i.e. disk 2 minus disk 1.
That is exactly the set of files that exist only on disk 2 and not yet on disk 1.
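Under the assumption that filenames are unique identifiers, this set difference could be sketched in Python roughly as follows. The function names (`index_by_name`, `only_on_second`) and the mount points in the usage note are hypothetical, and directory names are deliberately ignored so that moved files still count as present.

```python
import os


def index_by_name(root):
    """Map each file name below root to the list of full paths carrying that name."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            index.setdefault(name, []).append(os.path.join(dirpath, name))
    return index


def only_on_second(disk1_root, disk2_root):
    """Return sorted absolute paths of disk 2 files whose name appears nowhere on disk 1."""
    names_on_disk1 = index_by_name(disk1_root).keys()
    index2 = index_by_name(disk2_root)
    missing = []
    # Set difference on the file names: disk 2 minus disk 1.
    for name in index2.keys() - names_on_disk1:
        missing.extend(index2[name])
    return sorted(os.path.abspath(path) for path in missing)
```

Calling something like `only_on_second("/mnt/disk1", "/mnt/disk2")` (example mount points) on the listing above should report only the paths of potatoes.txt and broccoli.txt. Two different files that happen to share a name would be treated as the same file, which is the weak spot of the filename assumption.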
What would be the best approach to tackle this problem?