Old 03-28-2020, 07:03 PM   #1
leoio2
LQ Newbie
 
Registered: Mar 2011
Posts: 21

Rep: Reputation: 0
How to Sync and omit duplicates between directories


Context:
I have a URL that has 8 directories. In Directory #1, there are 80 files. In Directory #2, there are 90 files. 80 files in Directory #2 are the exact same as 80 files in Directory #1. But also in Directory #4, 6 and 8. There are also duplicates present from other Directories in the URL. I only want one copy of a file that has a unique name. So, after it is downloaded the first time in any particular directory, any file of the same name should not be downloaded again.

This will not work:
Code:
rclone sync URL dest:
Another user mentioned this:
Quote:
This script serves a different purpose, but you can reuse some of the same logic and commands to achieve what you want. Also, go up one level in that repo and look at the original difflist script, which uses the rclone check command.

https://github.com/88lex/diffmove/blob/master/difflist2
Code:
#!/usr/bin/env bash
# Requires installation of moreutils to run combine.
# sudo apt install moreutils
# The default command  below compares directories, not files
# Adjust --level to control how deeply you recurse into the tree

rclone tree -di --level 2 $1 | sort >tmp1
rclone tree -di --level 2 $2 | sort >tmp2

# use these commands below if you want to compare files, not dirs
#rclone tree -i --full-path $1 | sort >tmp1
#rclone tree -i --full-path $2 | sort >tmp2

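# combine (from moreutils): "combine A not B" prints the lines that are in A but not in B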
combine tmp1 not tmp2>not_in_2
combine tmp2 not tmp1>not_in_1

rm tmp1
rm tmp2
I have no idea how to use a script, but I wanted to at least make an attempt, so I came up with this:

Code:
#!/usr/bin/env bash

rclone tree -i URL --level 2 $1 | sort >tmp1
rclone tree -i HDD_destination --level 2 $2 | sort >tmp2

combine tmp1 not tmp2>not_in_2
combine tmp2 not tmp1>not_in_1

rm tmp1
rm tmp2
Is that correct, or at least on the right track to fix this issue?

Alternative:
Someone also mentioned using symlinks with rclone, or using 'wget -nc', but I'm not sure how either would prevent duplicates between directories.

Thank you.

Last edited by leoio2; 03-28-2020 at 07:06 PM.
 
Old 03-30-2020, 03:57 AM   #2
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 448
Hi

Using "wget -nc" will simply download if the files don't exist locally, otherwise skip it. And if you add -nd or --no-directories, it will not make any sub-directories, so all files will end up in the current directory. Try this:

Code:
wget -nc -nd URL
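
If the URL has to be crawled to reach the sub-directories, you will probably also want -r (recursive) and -np (stay below the starting URL). A rough sketch, assuming the server exposes plain directory listings:

Code:
wget -r -np -nc -nd URL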
 
Old 03-30-2020, 10:56 AM   #3
leoio2
LQ Newbie
 
Registered: Mar 2011
Posts: 21

Original Poster
Rep: Reputation: 0
Hello Guttorm,

I see how the command you wrote works: it places everything in a single directory and, based on that, prevents duplicates. Is there any way to keep the directory structure of the URL and still prevent duplicates, though?
 
Old 03-31-2020, 03:03 AM   #4
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 448
Hi

I doubt you will find a ready-made solution, because what you want is very unusual. It is perfectly normal to have lots of files with the same name in different directories that are not duplicates - index.html, README and so on. That's a major reason we have directories in the first place.

So I guess one option is to download everything, then check whether the files really are duplicates and delete them.

Another way is to write some kind of script.
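
For example, a rough and untested sketch with rclone, assuming the source is set up as an rclone remote (the remote name src: and the destination /local/dest are just placeholders): list every file, keep only the first path for each filename, then copy just those.

Code:
# assumes "src:" is an rclone remote pointing at the URL
rclone lsf -R --files-only src: | awk -F/ '!seen[$NF]++' > keep.txt
# copy only the first occurrence of each filename, keeping the directory layout
rclone copy src: /local/dest --files-from keep.txt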

A third option is to make a text file with a URL for each file, one per line. Then use a spreadsheet, a text editor or something similar to filter out the unwanted duplicates. Then you could use wget with --force-directories (or -x). Something like this:

Code:
wget -x $(cat urls.txt)
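
The filtering step can also be scripted instead of done by hand. A rough sketch, assuming urls.txt contains one URL per line: keep only the first URL for each filename, then let wget read the filtered list with -i.

Code:
# keep the first URL for each filename (the part after the last "/")
awk -F/ '!seen[$NF]++' urls.txt > unique.txt
# -x recreates the directory structure, -nc skips files that already exist
wget -x -nc -i unique.txt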
 
Old 03-31-2020, 04:08 AM   #5
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by Guttorm View Post
download everything, then check whether the files really are duplicates and delete them.
For this, various solutions exist.
Search for fslint or rmlint.
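
For example, a minimal rmlint run (the download directory ~/mirror is just a placeholder) looks roughly like this:

Code:
rmlint ~/mirror     # scans for files with identical content and writes rmlint.sh
less rmlint.sh      # review what it would delete
sh rmlint.sh        # run it (after reviewing) to remove the duplicates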
 
Old 03-31-2020, 04:36 AM   #6
leoio2
LQ Newbie
 
Registered: Mar 2011
Posts: 21

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by Guttorm View Post
Another way is to write some kind of script.
As I am sure the duplicates really are duplicates, I think this is the only option. With that said, how could I modify the script I posted in the OP above to accomplish this?
 
  

