Also see backup hardware and backup cloud. Also see personal knowledge base.
There are overlaps between these points.
Also, temporary files might/should be excluded (e.g. /tmp/*; see the rsync sketch after the following list).
- full Linux system including all files, such that recovering it gives me a fully functional system
- home directories
- archives (which are themselves backups of older computers/discs that I no longer have direct access to)
- projects (programming but also other things) (usually Git repos)
- my pictures
- my music collection
- maybe movies (not so important)
- contact details (e.g. Google contacts)
- Tweets
- movie rankings (score11.de, imdb)
- documents / texts / mails
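For the exclusion of temporary files and the full-system case, a minimal rsync sketch, assuming a backup disk mounted at /mnt/backup (the path and the exclude list are examples; run as root for a full system copy):

```sh
# Plain full-system copy with rsync; pseudo-filesystems and temp files are excluded.
rsync -aAXHv \
    --exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' \
    --exclude='/tmp/*' --exclude='/run/*' --exclude='/mnt/*' \
    --exclude='/media/*' --exclude='/lost+found' \
    / /mnt/backup/full-system/
```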
- Certain files are more important than others (e.g. home directory files, programming projects, mails).
- I want to keep several copies around (on different media, PCs, online services). Not all copies must be complete; some could contain just the important files.
- Some overview of where (media, PCs, online services) I have which backups, which files they contain, and maybe even which versions.
- Some of the projects etc. have their own versioning (e.g. Git repos), but this probably does not matter too much. It is not totally clear whether the backup system should have its own versioning support; it probably should not keep the full history of everything. It should also be possible to permanently delete things from the whole system.
- Copying will take a long time because of the huge amount of data, so there has to be some continuous/incremental update.
- Backups on external cloud storage would be nice as well, but should be encrypted there (see the restic sketch after this list).
- Some projects / pictures are published elsewhere (GitHub, Google Photos or so). It would be good if the system knew that, and maybe provided an easy way to publish further directories.
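For the encrypted-cloud and continuous-update points, a minimal restic sketch, assuming an sftp-reachable host and a local password file (the repository URL, password file, and exclude are examples). restic encrypts client side, and repeated runs only upload new/changed chunks:

```sh
export RESTIC_REPOSITORY=sftp:backup@example.com:/srv/restic/home
export RESTIC_PASSWORD_FILE="$HOME/.restic-password"

restic init                                       # once per repository
restic backup --exclude "$HOME/.cache" "$HOME"    # re-run regularly (cron/systemd timer)
restic snapshots                                  # list what this repository contains
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
```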
In any case, I want to have an index of all the files, and that index should contain meta information, e.g. which backups contain the file, and other things.
The index could be part of the backup solution, or external (but it should know about the multiple copies).
Git-Annex might be an option.
Baloo or others are maybe relevant for a search index.
- Wikipedia software list
- ArchLinux software list
- Ubuntu software list
- rsync
- duplicity. Encrypted tar-format volumes, uploading them to a remote or local file server. No central index.
- Bacula
- Perkeep (previously Camlistore). Also for indexing of pics, etc. Looks close to what I want. Comparison.
- Upspin (HN). Similar to Perkeep, but different focus. Also very relevant.
- SeaweedFS stores blobs, objects, files.
- Syncthing (HN)
- FreeFileSync
- bup
- restic
- Kopia (HN)
- Box Backup
- BorgBackup (HN). Used by rsync.net. Special cheap rsync.net cloud storage support. Remotely encrypted backups. No central index.
- Rclone. "rsync for cloud storage". No central index of stored data.
- BackupPC
- Bareos
- Areca Backup
- Burp
- git-annex
- Datalad. On top of git-annex.
- Dat
- Unison File Synchronizer. project dead.
- Seafile
- albertz/backup_system: incomplete
- Resilio. commercial
- Dropbox, Google Drive, etc. commercial
- imap-backup
A lot of the software can be divided into:
- Standard backup software: choose what files to back up, and where. You are responsible for how many copies there are and for keeping track of which files are backed up where, i.e. there is no global index of all files. These tools might be simpler to use, though (see the BorgBackup sketch after this list).
- Global index based systems, like Perkeep or Upspin. They are not designed to work with lots of small files (e.g. Git object files, whole Linux systems, etc.) but more for media files (images, documents).
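As an illustration of the first category, a minimal BorgBackup sketch (the repository path is an example; an ssh:// repository, e.g. on rsync.net, works the same way). Note that `borg list` only knows about this one repository; tracking copies across several repositories/media stays your job:

```sh
borg init --encryption=repokey /mnt/backup/borg-repo
borg create --stats --exclude '*/.cache' \
    /mnt/backup/borg-repo::home-{now} ~/
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /mnt/backup/borg-repo
borg list /mnt/backup/borg-repo     # archives in this repository only
```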
The stored backup can have its own custom format (e.g. for efficient incremental backups) or the files can be stored as-is. A custom format means that accessing it needs custom tools; it might support access via FUSE, but that is not as efficient as direct access to files stored as-is.
It might make sense to decouple the storage of the files (maybe just as-is) from the index (which keeps track of which backup or remote contains which files, etc.).
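One rough way such a decoupled index could look, using only standard tools (mount points and the index directory are examples; a real index would also carry more metadata per file):

```sh
mkdir -p ~/file-index
( cd /mnt/usb1 && find . -type f -print0 | xargs -0 sha256sum | sort ) \
    > ~/file-index/usb1.sha256
( cd /mnt/usb2 && find . -type f -print0 | xargs -0 sha256sum | sort ) \
    > ~/file-index/usb2.sha256

# Entries (hash + path) present on usb1 but missing or different on usb2:
comm -23 ~/file-index/usb1.sha256 ~/file-index/usb2.sha256
```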
Note that not all of this software seems to be maintained anymore. Check the corresponding Git repo to see whether it is still active.
- albertz/google-contacts-sync: syncs Google contacts
- albertz/memos: collects Tweets. Similar: Timeliner.
- albertz/personal_assistant: personal assistant. A backup, or a knowledge base, is kind of an integrated part of this, i.e. knowing where to find what data.
- albertz/system-tools, albertz/helpers: small tools to sync/download things, or create projects, etc
- albertz/iphone-backup
- Baloo or others for indexing
- Solid project, e.g. Solid Google Takeout importer
- unixsheikh.com: How I store my files and why you should not rely on fancy tools for backup (HN): using ZFS
- stavros.io: I found the Holy Grail of backups: using Borg
What's missing from Perkeep for the outlined use case? What would the workflow look like?
- The index of objects/files:
- Is it easily synchronized, so always up-to-date?
- Reasonable small enough, so every backup instance can have the full index? Or do we need partial index support?
- Does it contain information on what media/PC we have the data? If not, can we add that? (I want to see how many copies of some object (or tree) exist, and have control over that.)
- Good idea to just push all Git object files into it?
- Should we then also push the checked out files into it? We already have all the data from the Git objects.
- Can Perkeep directly read and understand the Git object files? Directly accessible (read-only) via FS?
- Would that work well with once-written/offline backup media (DVD, tape)?
- Automatic backup schedules:
- Some trees (e.g. home dir) should automatically be synced to multiple online media.
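Independent of whether Perkeep (or anything else) ends up doing the storage, the scheduling part could be as simple as a cron job; a sketch, where the sync script and paths are hypothetical:

```sh
# Append a nightly 03:00 job to the user crontab that pushes the home dir
# to the various remotes (script path is hypothetical).
( crontab -l 2>/dev/null; \
  echo '0 3 * * * /home/user/bin/sync-home-backups.sh >>/home/user/backup.log 2>&1' ) | crontab -
```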
What's missing from Bup for the outlined use case? What would the workflow look like (see the sketch after the questions below)? It is simpler than Perkeep (no concept of users or access control), but that might not be a dealbreaker.
- Python 2?
- The index of objects/files:
- Is it easily synchronized, so always up-to-date?
- Index is a single file? Can it be distributed? Partial?
- Does it contain information on what media/PC we have the data? If not, can we add that?
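A minimal sketch of the bup index/save cycle (BUP_DIR and the directory are examples; saves are deduplicated against everything already in the repository):

```sh
export BUP_DIR=/mnt/backup/bup-repo
bup init
bup index ~/Pictures                 # refresh bup's own file index
bup save -n pictures ~/Pictures      # snapshot under the "pictures" branch
bup ls /pictures/latest              # inspect the latest snapshot
```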
If a single repo (maybe more convenient, e.g. with git-annex), how would I link existing files into it? E.g. my current picture collection, which is already distributed, with each copy being partial.
Would I just go into one of the copies and do git init and git annex init?
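A hedged sketch for initializing one existing copy in place (~/Pictures, the "laptop" description, and the file name are examples):

```sh
cd ~/Pictures
git init
git annex init "laptop"
git annex add .                        # move the existing files into the annex
git commit -m "add existing picture collection"

git annex numcopies 2                  # demand at least two copies of each file
git annex whereis some-picture.jpg     # which repositories hold this content?
```

The other, partial copies could presumably be initialized the same way and connected via git remote add plus git annex sync; after that, git annex whereis shows how many copies of each file exist across the repositories.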
E.g. via locality-sensitive hashing (LSH).
Good also in the sense of how easy it is to use.