(cloud) storage with git-annex

Posted 07-15-2013 at 01:42 PM by jere21
Updated 12-04-2015 at 10:25 PM by jere21 (Typo)

I've installed the following packages because they sound interesting for this task. So far I only use git and git-annex.

ii bup        0.25-1     amd64 highly efficient file backup system based on git
ii git-annex  5.20140927 amd64 manage files with git, without checking their contents into git
ii graphviz   2.38.0-6   amd64 rich set of graph drawing tools
ii tahoe-lafs 1.10.0-2   all   Secure distributed filesystem

git-annex lets you manage your (big) files across several repositories: a local laptop, an external hard drive, or remote cloud and ssh services, optionally with encryption (not done here).

Every repository has the same folder and file structure. "Files" are only links to objects in .git/annex/objects/. With git-annex you can easily drop these objects to save space and get them back whenever you need them. But even if you dropped an object, you'll still have the now broken link left, so that you know that a file exists in this path with that name (somewhere in your distributed repositories).
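
The difference between a present and a dropped file can be checked from the shell, since a dropped file is simply a symlink whose target is gone. A minimal sketch, assuming indirect mode; the function name annex_state is made up for this example and is not part of git-annex:

```shell
# Report the state of a path in an (indirect mode) git-annex work tree.
# annex_state is a hypothetical helper, not a git-annex command.
annex_state() {
    f=$1
    if [ -L "$f" ] && [ ! -e "$f" ]; then
        echo "dropped"      # symlink exists, but its target object is missing
    elif [ -L "$f" ]; then
        echo "present"      # symlink resolves to an existing object
    elif [ -e "$f" ]; then
        echo "regular"      # not a symlink (plain file or directory)
    else
        echo "missing"      # nothing at this path
    fi
}
```

For example, annex_state ~/annex/foo/movie.mkv prints "dropped" after a "git annex drop" of that file (the path is only an illustration).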

If you switch from the default mode to "direct mode", files that are currently present stay in your working directory as real files (instead of being links). Only when you drop them do they become (broken) links as described above. So "direct mode" is more convenient, but less reliable from a backup perspective, because it is easier to lose content.

git-annex offers a git-annex assistant, which comes with a GUI and does the file adding and syncing automatically. It also handles files in direct mode. I don't use the git-annex assistant.
(I suspect that trying the assistant on a new repository switched my existing repository's config to "direct" mode. So if git-annex suddenly claims that all your data is missing, just check your config.)
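
Checking the config for an unexpected switch to direct mode can be scripted. A minimal sketch, assuming that git-annex 5.x records direct mode as annex.direct=true in the repository's git config; the function name annex_mode is made up for this example:

```shell
# Print "direct" or "indirect" for the repository at the given path.
# Assumption: git-annex 5.x sets annex.direct=true when in direct mode.
# annex_mode is a hypothetical helper, not a git-annex command.
annex_mode() {
    repo=${1:-.}
    if [ "$(git -C "$repo" config --get --bool annex.direct 2>/dev/null)" = "true" ]; then
        echo "direct"
    else
        echo "indirect"
    fi
}
```

If annex_mode ~/annex unexpectedly prints "direct", "git annex indirect" switches the repository back.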

Creating a repository
mkdir ~/annex
cd ~/annex
git init
# Choose name of the repo freely:
git annex init "jens@hope"
# Wanted/group are for the automatic handling (optional, explained later):
git annex wanted . standard
git annex group . manual    # use "manual" or "client"
# Untrust/direct match my daily work habits (optional, explained directly below):
git annex untrust .
git annex direct
mkdir archive

"git annex untrust" and "git annex direct" belong to my direct mode setup. I use direct mode on my laptop (where I work) to avoid the hassle with the links (otherwise I had to use the "--dereference" option with e.g. "ls" and "cp"). It could also come in handy for transferring files via a git-annex external drive to a computer that has no git-annex. I will NOT use "direct" on the external drive that I use as a kind of backup.
Because humans make errors, especially in their daily work, I "untrusted" the direct mode repo.

My directory layout
The folder annex/ contains the same (or similar) media folders that were previously located directly in home.
The folder annex/archive contains content not originating from this folder structure (especially backups, see below).

Files that I know I don't want to keep end up directly in home or its subfolders (but not in annex/).
If I want (or may want) to keep files, I move them to ~/annex/foo
If I don't need them for now, I move them to ~/annex/foo/archive/ and "git annex move|drop" them (see below).

I don't put documents, vcs repositories and other modifiable files (~/Documents and ~/development) in git-annex (sometimes git-annex even refuses to check those in; you're notified about this on "git annex commit"). Instead I make regular tar.xz backups of home (without ~/annex/ and some other stuff), /etc and /usr/local/bin and save those in ~/annex/archive/backup/.

Similarly I archive projects (e.g. ~/development/project_x) that I currently don't work on as tar.xz and move those to ~/annex/archive/development/
(Then I remove project_x/. I use no date in the filename, because this is not a snapshot but the one and only version of this project, to be reconsidered).
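
The regular backup of home could look roughly like this. A minimal sketch, not the exact command used here: the function name backup_dir, the dated file name and the single exclude for annex/ are assumptions (the "other stuff" that is also left out is not listed above):

```shell
# Create a dated tar.xz of a directory, leaving out its annex/ subfolder.
# backup_dir is a hypothetical helper; name pattern and exclude are assumptions.
backup_dir() {
    src=$1          # directory to back up, e.g. "$HOME"
    dest_dir=$2     # target folder, e.g. "$HOME/annex/archive/backup"
    name=$3         # archive name prefix, e.g. "home"
    out="$dest_dir/$name-$(date +%F).tar.xz"
    mkdir -p "$dest_dir" || return 1
    # -C changes into $src first, so the archive holds relative paths (./...).
    tar -cJf "$out" --exclude=./annex -C "$src" . || return 1
    echo "$out"     # print the path of the new archive
}
```

Usage would be something like backup_dir "$HOME" "$HOME/annex/archive/backup" home, followed by "git annex add" and "git annex sync" in ~/annex.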

Remotes (e.g. external USB 3 drives)
  • Label them (max length 16 characters) in gparted.
    VendorSize_Connection (6,5,_,4 characters).
    Later on I use the label as git-annex repository name.
    (Of course this is all optional, and as you can see I (still) have repos that are named differently.)
  • Make the remote writable:
    sudo chown jens:jens /media/jens/Platinum1TB_USB3/
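
The 16-character limit is easy to overshoot, so a label can be checked before it is typed into gparted. A minimal sketch; the function name valid_label is made up, and restricting labels to letters, digits and underscores is an assumption based on the labels used in this post, not a rule of the filesystem:

```shell
# Accept a label only if it is 1-16 characters long and consists of
# letters, digits and underscores (the character set is an assumption).
# valid_label is a hypothetical helper.
valid_label() {
    label=$1
    [ -n "$label" ] && [ "${#label}" -le 16 ] || return 1
    case $label in
        *[!A-Za-z0-9_]*) return 1 ;;    # contains some other character
    esac
    return 0
}
```

For example, valid_label Platinum1TB_USB3 succeeds (exactly 16 characters), while a 17-character label fails.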

Adding a remote (as full archive called "Platinum1TB_USB3")
cd /media/jens/Platinum1TB_USB3/
git clone ~/annex/
cd annex/
git annex init "Platinum1TB_USB3"
git annex wanted . standard
git annex group . archive
# Tell the new repository about existing repos:
git remote add hope ~/annex
# Make your new remote known to every single other repository,
# if you want to exchange data directly between them:
cd ~/annex
git remote add Platinum1TB_USB3 /media/jens/Platinum1TB_USB3/annex

Adding a remote (as full backup called "TOURO_1TB_ext4")
cd /media/jens/TOURO_1TB_ext4/
git clone ~/annex/
cd annex/
git annex init "TOURO_1TB_ext4"
git annex wanted . standard
git annex group . backup
git remote add hope ~/annex
# Make your new remote known to every single other repository,
# if you want to exchange data directly between them:
cd ~/annex
git remote add TOURO_1TB_ext4 /media/jens/TOURO_1TB_ext4/annex
cd /media/jens/6c6dc456-904b-4a13-b9f0-80e6128e3c5c/annex/
git remote add TOURO_1TB_ext4 /media/jens/TOURO_1TB_ext4/annex
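
Since both drives are set up with the same steps, the sequence can be wrapped in a function. A minimal sketch that only repeats the commands shown above; the function name add_drive_repo and its three parameters are made up for this example, and it assumes the laptop repository lives in ~/annex and is called "hope" as above:

```shell
# Clone ~/annex onto a mounted drive and register both repositories with
# each other. add_drive_repo is a hypothetical helper.
#   $1 = mount point, $2 = repository name, $3 = group (archive, backup, ...)
add_drive_repo() {
    mount=$1; name=$2; group=$3
    [ -d "$mount" ] || { echo "not mounted: $mount" >&2; return 1; }
    (
        # Subshell, so the cd does not affect the calling shell.
        cd "$mount" &&
        git clone "$HOME/annex" annex &&
        cd annex &&
        git annex init "$name" &&
        git annex wanted . standard &&
        git annex group . "$group" &&
        git remote add hope "$HOME/annex"
    ) || return 1
    # Make the new repository known to the laptop repository as well.
    git -C "$HOME/annex" remote add "$name" "$mount/annex"
}
```

A call like add_drive_repo /media/jens/TOURO_1TB_ext4 TOURO_1TB_ext4 backup would replace the manual steps; any further repositories still need their own "git remote add".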

Keep the repository updated (manually)
  • Do the following in every repository (or at least in those that are interested in the change) whenever any repository has been modified:
    git annex add
    git annex sync -J4
    "add" adds all new files and catches moved files (fixing their symlinks to the object) and removed files (but not the related objects! Keep in mind, though, that direct mode has no objects, so a file removed there is lost immediately unless it is backed up somewhere else). For file operations, regular git commands like "git rm FILE" are possible but not necessary.

    "sync" commits already added files, so no need for a separate "git commit -a -m added".
    Then it syncs the index (not the content) with other repositories.

    "-J4": 4 jobs in parallel

    With "sync --content" the content is transferred as well. This is preferable once the automatic handling is set up (see below).

    Details: "When you run git annex sync, it merges the synced/master branch into master, receiving anything that's been pushed to it. (If there is a conflict in this merge, automatic conflict resolution is used to resolve it). Then it fetches from each remote, and merges in any changes that have been made to the remotes too. Finally, it updates synced/master to reflect the new state of master, and pushes it out to each of the remotes."
  • Remove content from your local repository to save space:
    git annex drop FILE|DIRECTORY
    ... also available in the file manager nautilus - right click - scripts.

    You can always drop files safely. Git-annex checks that some other repository still has the file before removing it.
    Once dropped, the file will still appear in your work tree as a broken symlink.
  • Get the content:
    git annex get FILE|DIRECTORY
    ... also available in the file manager nautilus - right click - scripts.
  • Move content to remote archive repository, e.g. the complete annex/archive/ folder:
    cd ~/annex
    git annex move archive/ --to Platinum1TB_USB3
    This is like "git annex get" in the remote repository and "git annex drop" in the local, while ignoring any special wanted/group configuration of git-annex.
  • Move unused (i.e. you removed the link pointing to the object) content to remote archive repository:
    git annex move --unused --to Platinum1TB_USB3
  • Getting rid of old files and their content:
    rm FILE
    git annex sync -J4
    This is enough for a repository in direct mode. For indirect mode you'll also need
    git annex unused
    git annex dropunused 1-NNNN
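
Running add and sync "in every repository" gets tedious with several drives, so the first step of the list above can be looped. A minimal sketch; the function name sync_repos is made up, and it uses "git annex add ." where the list above uses plain "git annex add":

```shell
# Run "git annex add" and "git annex sync" in each repository given as an
# argument. Paths without a .git directory (e.g. unmounted drives) are skipped.
# sync_repos is a hypothetical helper.
sync_repos() {
    for repo in "$@"; do
        if [ ! -d "$repo/.git" ]; then
            echo "skipped: $repo"
            continue
        fi
        echo "syncing: $repo"
        # Subshell, so the cd does not leak into the caller.
        ( cd "$repo" && git annex add . && git annex sync -J4 ) || echo "failed: $repo"
    done
}
```

For example: sync_repos ~/annex /media/jens/Platinum1TB_USB3/annex /media/jens/TOURO_1TB_ext4/annex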

Automatic handling of the repositories:
  • See the current settings of a repository:
    git annex wanted .
    git annex group .
  • Add/remove repository to/from a group of preferred content:
    git annex group . manual|client|archive|backup|...
    git annex ungroup . manual|client|archive|backup|...
  • Try to have at least one redundant copy of everything:
    git annex numcopies 2
  • Edit configuration settings:
    git annex vicfg
  • Use after setting preferred content and numcopies:
    git annex get --auto
    git annex drop --auto
    git annex sync --content -J4
    ... or use the git-annex assistant after setting preferred content.

Have a look at the REPOSITORY MAINTENANCE COMMANDS section in "man git-annex".
Note that the assistant takes care of (most?) maintenance automatically.
Most importantly run
git annex fsck
from time to time.

EDIT 2015-12-04:
- Combine add and sync commands, as I use them together usually.
- Add "Getting rid of old files and their content".
- Renamed a remote.