How to Sync and omit duplicates between directories
Context:
I have a URL with 8 directories. Directory #1 contains 80 files and Directory #2 contains 90 files; 80 of the files in Directory #2 are exactly the same as the 80 files in Directory #1, and the same files appear again in Directories #4, #6 and #8. There are further duplicates from other directories at the URL as well. I only want one copy of each uniquely named file: once a file has been downloaded the first time from any directory, no other file with that name should be downloaded again.
This will not work:
Code:
rclone sync URL dest:
Another user mentioned this:
Quote:
This script serves a different purpose, but you can use some of the same logic and commands to achieve what you want to do. Also, jump up one level and look at the original diff list with the rclone check command (a minimal example of rclone check follows the script below).
#!/usr/bin/env bash
# Requires installation of moreutils for the combine command:
# sudo apt install moreutils
# The default commands below compare directories, not files.
# Adjust --level to control how deeply you recurse into the tree.
rclone tree -di --level 2 "$1" | sort > tmp1
rclone tree -di --level 2 "$2" | sort > tmp2
# Use these commands instead if you want to compare files, not dirs:
#rclone tree -i --full-path "$1" | sort > tmp1
#rclone tree -i --full-path "$2" | sort > tmp2
combine tmp1 not tmp2 > not_in_2
combine tmp2 not tmp1 > not_in_1
rm tmp1 tmp2
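For reference, rclone check compares the files in a source and a destination and can write out lists of what is missing on either side or what differs. This is only a sketch, with remote:path and /local/path standing in for whatever source and destination you actually use:
Code:
# Compare source against destination and write the results to text files.
rclone check remote:path /local/path \
    --missing-on-dst missing_on_destination.txt \
    --missing-on-src missing_on_source.txt \
    --differ differ.txt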
I have no idea how to use a script. But I wanted to at least make an attempt, so I came up with this:
Code:
#!/usr/bin/env bash
rclone tree -i URL --level 2 $1 | sort >tmp1
rclone tree -i HDD_destination --level 2 $2 | sort >tmp2
combine tmp1 not tmp2>not_in_2
combine tmp2 not tmp1>not_in_1
rm tmp1
rm tmp2
Is that correct or is it, at least, on the right track to fix this issue?
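As a rough illustration (not a tested fix): $1 and $2 in that script are placeholders that bash fills in with the first and second arguments given on the command line, so they take the place of URL and HDD_destination rather than sitting next to them. A corrected sketch of the file-comparison version might look like this:
Code:
#!/usr/bin/env bash
# Sketch only: compare the file listing of the source against the destination.
# Usage: ./compare.sh <source> <destination>
# Requires moreutils for the combine command.
rclone tree -i --full-path "$1" | sort > tmp1
rclone tree -i --full-path "$2" | sort > tmp2
combine tmp1 not tmp2 > not_in_2   # files at the source that are not at the destination
combine tmp2 not tmp1 > not_in_1   # files at the destination that are not at the source
rm tmp1 tmp2
You would save that as, say, compare.sh, make it executable with chmod +x compare.sh, and run it as ./compare.sh URL HDD_destination with your real source and destination. Note that it only reports differences; it does not do the duplicate-skipping download by itself.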
Alternative:
Someone also mentioned using symlinks with rclone, or using 'wget -nc', but I'm not sure how either of those would prevent duplicates between directories.
Using "wget -nc" will simply download if the files don't exist locally, otherwise skip it. And if you add -nd or --no-directories, it will not make any sub-directories, so all files will end up in the current directory. Try this:
I see how the command you wrote works: it places everything in a single directory and prevents duplicates on that basis. Is there any way to keep the directory structure from the URL and still prevent duplicates, though?
I doubt you will find a ready-made solution, because what you want is very unusual. It is completely normal to have lots of files with the same name in different directories that are not duplicates - index.html, README and so on. That's a major reason we have directories in the first place.
So I guess you can either download everything and then check whether the files really are duplicates and delete the extras, or write some kind of script.
A third option is to make a text file with a URL for each file, one per line. Then use a spreadsheet, text editor or something similar to filter out the unwanted duplicates. Then you could use wget with --force-directories (or -x) on that list. Something like this:
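This is only a sketch; urls.txt is just an example name for the filtered list:
Code:
# urls.txt holds one URL per line, with the unwanted duplicates already removed
wget --force-directories --input-file=urls.txt
# or the same thing with the short options
wget -x -i urls.txt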
As I'm sure the duplicates really are duplicates, this is the only option, I think. With that said, how could I modify the script I posted in the OP above to accomplish this?
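One possible way to adapt the idea (a sketch only, not tested): list every file on the source with rclone lsf, keep just the first path for each filename, then hand that list to rclone copy with --files-from so only those files are transferred and the directory layout is kept. Here remote: and HDD_destination are placeholders for your actual rclone source and local destination:
Code:
#!/usr/bin/env bash
# Sketch: download each uniquely named file only once, keeping its directory.
rclone lsf -R --files-only remote: > all_files.txt
# Keep the first occurrence of each basename; later duplicates are dropped.
awk -F/ '!seen[$NF]++' all_files.txt > unique_files.txt
# Copy only the files named in the list (paths are relative to the source root).
rclone copy remote: HDD_destination --files-from unique_files.txt
rm all_files.txt unique_files.txt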