How to Sync and omit duplicates between directories
Context:
I have a URL with 8 directories. Directory #1 contains 80 files and Directory #2 contains 90 files; 80 of the files in Directory #2 are exactly the same as the 80 files in Directory #1, and the same files appear again in Directories #4, #6 and #8. There are further duplicates from other directories at the URL as well. I only want one copy of each uniquely named file: once a file has been downloaded the first time from any directory, no other file with that name should be downloaded again.
This will not work:
Code:
rclone sync URL dest:
Another user mentioned this:
Quote:
This script serves a different purpose, but you can use some of the same logic and commands to achieve what you want to do. Also, jump up one level and look at the original diff list with the rclone check command (a minimal example of rclone check follows the script below).
#!/usr/bin/env bash
# Requires installation of moreutils for the combine command:
# sudo apt install moreutils
# The default commands below compare directories, not files.
# Adjust --level to control how deeply you recurse into the tree.
rclone tree -di --level 2 "$1" | sort > tmp1
rclone tree -di --level 2 "$2" | sort > tmp2
# Use these commands instead if you want to compare files, not dirs:
#rclone tree -i --full-path "$1" | sort > tmp1
#rclone tree -i --full-path "$2" | sort > tmp2
combine tmp1 not tmp2 > not_in_2
combine tmp2 not tmp1 > not_in_1
rm tmp1 tmp2
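For reference, rclone check compares the files in a source and a destination and can write out lists of what is missing on either side or what differs. This is only a sketch, with remote:path and /local/path standing in for whatever source and destination you actually use:
Code:
# Compare source against destination and write the results to text files.
rclone check remote:path /local/path \
    --missing-on-dst missing_on_destination.txt \
    --missing-on-src missing_on_source.txt \
    --differ differ.txt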
I have no idea how to use a script. But I wanted to at least make an attempt, so I came up with this:
Code:
#!/usr/bin/env bash
rclone tree -i URL --level 2 $1 | sort >tmp1
rclone tree -i HDD_destination --level 2 $2 | sort >tmp2
combine tmp1 not tmp2>not_in_2
combine tmp2 not tmp1>not_in_1
rm tmp1
rm tmp2
Is that correct or is it, at least, on the right track to fix this issue?
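As a rough illustration (not a tested fix): $1 and $2 in that script are placeholders that bash fills in with the first and second arguments given on the command line, so they take the place of URL and HDD_destination rather than sitting next to them. A corrected sketch of the file-comparison version might look like this:
Code:
#!/usr/bin/env bash
# Sketch only: compare the file listing of the source against the destination.
# Usage: ./compare.sh <source> <destination>
# Requires moreutils for the combine command.
rclone tree -i --full-path "$1" | sort > tmp1
rclone tree -i --full-path "$2" | sort > tmp2
combine tmp1 not tmp2 > not_in_2   # files at the source that are not at the destination
combine tmp2 not tmp1 > not_in_1   # files at the destination that are not at the source
rm tmp1 tmp2
You would save that as, say, compare.sh, make it executable with chmod +x compare.sh, and run it as ./compare.sh URL HDD_destination with your real source and destination. Note that it only reports differences; it does not do the duplicate-skipping download by itself.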
Alternative:
Someone also mentioned using symlinks with rclone, or using 'wget -nc', but I'm not sure how either of those would prevent duplicates between directories.
Using "wget -nc" will simply download if the files don't exist locally, otherwise skip it. And if you add -nd or --no-directories, it will not make any sub-directories, so all files will end up in the current directory. Try this:
I see how the command you wrote works: it places everything in a single directory and prevents duplicates on that basis. Is there any way to keep the directory structure from the URL and still prevent duplicates, though?
I doubt you will find a ready-made solution, because what you want is very unusual. It is completely normal to have lots of files with the same name in different directories that are not duplicates - index.html, README and so on. That's a major reason we have directories in the first place.
So I guess you can either download everything and then check whether the files really are duplicates and delete the extras, or write some kind of script.
A third option is to make a text file with a URL for each file, one per line. Then use a spreadsheet, text editor or something similar to filter out the unwanted duplicates. Then you could use wget with --force-directories (or -x) on that list. Something like this:
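This is only a sketch; urls.txt is just an example name for the filtered list:
Code:
# urls.txt holds one URL per line, with the unwanted duplicates already removed
wget --force-directories --input-file=urls.txt
# or the same thing with the short options
wget -x -i urls.txt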
As I'm sure the duplicates really are duplicates, this is the only option, I think. With that said, how could I modify the script I posted in the OP above to accomplish this?
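One possible way to adapt the idea (a sketch only, not tested): list every file on the source with rclone lsf, keep just the first path for each filename, then hand that list to rclone copy with --files-from so only those files are transferred and the directory layout is kept. Here remote: and HDD_destination are placeholders for your actual rclone source and local destination:
Code:
#!/usr/bin/env bash
# Sketch: download each uniquely named file only once, keeping its directory.
rclone lsf -R --files-only remote: > all_files.txt
# Keep the first occurrence of each basename; later duplicates are dropped.
awk -F/ '!seen[$NF]++' all_files.txt > unique_files.txt
# Copy only the files named in the list (paths are relative to the source root).
rclone copy remote: HDD_destination --files-from unique_files.txt
rm all_files.txt unique_files.txt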