I suggest you read the borg internals docs to get more information about how borg works, and also the terminology used. It's quite a bit different from how you think it works. What you think is an "archive file" isn't: it is a so-called segment file, and it can contain all sorts of chunks: file content chunks, archive metadata stream chunks, manifest chunks. If you delete such a file, the damage, and what `borg check --repair` can do about it, depends on what was actually in there.

Also, if an old borg version is involved (< 1.2, always compacting) or if a new borg version ran

Not sure what you mean by "non-synchronous repo backup copies". See our FAQ on why we do not recommend making repo copies (e.g. with rsync). But of course, if you do, you need to do it while no borg process is modifying the repo, to get a consistent state.

Redundancy: a borg repo is a content-addressable key/value store; the key (== address) is H(value) (== H(content)), so it can't store the same value multiple times. That's how the deduplication works, and (among other reasons) why you should have multiple backups at different places. Having multiple repos able to repair each other (assuming a chunk got lost and some other repo still has a copy of that chunk): this will be possible in borg2, where the concept of "related repos" exists (making sure chunks get cut the same way, and H does the same computation). Even there, it is not yet implemented as a command.
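The content-addressed model described above can be sketched roughly like this. This is a toy illustration, not borg's actual code: `sha256` merely stands in for whatever keyed hash H borg uses, and `ChunkStore` is a hypothetical name.

```python
import hashlib

class ChunkStore:
    """Toy content-addressable key/value store: the key IS H(value),
    so storing the same content twice is a no-op (deduplication)."""

    def __init__(self):
        self.chunks = {}  # H(content) -> content

    def put(self, content: bytes) -> bytes:
        key = hashlib.sha256(content).digest()
        self.chunks.setdefault(key, content)  # already present? keep the one copy
        return key

    def get(self, key: bytes) -> bytes:
        return self.chunks[key]

store = ChunkStore()
k1 = store.put(b"same chunk data")
k2 = store.put(b"same chunk data")
assert k1 == k2 and len(store.chunks) == 1  # identical content stored only once
```

Because the key is derived from the content, two repos can only deduplicate or repair against each other if they cut chunks at the same boundaries and compute H the same way, which is what the "related repos" concept in borg2 is about.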
This became quite lengthy/dense to follow, so I've posted it here for discussion rather than as an issue, since I also wasn't sure whether I was misunderstanding something in my conclusions. It involves two sets of tests with a small data set, each of which shows slightly different behavior.
It's unclear to what extent (if any) Borg has redundancy for metadata: the history of valid chunk checksums, and the file-path mapping (that is, which files are associated with the data in which chunks). That raised some concerns, noted at the bottom, for the case where one has made their own non-synchronous repo backup copies but needs to repair a primary repo.
I realize Borg doesn't have any redundancy for the data per se, but that's expected.
Test 1:
Result: only the latest backup contained any files. All other mounted backup directories were entirely empty (not even any nested sub-directory paths from the source data, such as `/home/`, etc.). It would seem that, among the 64 archive files in `data`, all info about those five files was contained solely in that single largest ~460KB archive file. At least, judging from the results.

Then I reverted to an earlier valid copy of the repo, from before I deleted the archive file for the test above, and added 6 extra, different PNG files to the source data (totalling 7.6MB of extra filesize).
Test 2:
Result: when using `mount` to check the backups, I found the snapshots from before adding the extra PNG files were entirely empty (no PNG or TXT files), as in the prior test. However, all the snapshots from after the point I added the new PNG files did contain the complete set of files, despite my having deleted what seemed like the equivalent archive file.

I'm not sure what explains this discrepancy: I would have thought that if the original PNGs were 'healed' in this latter test, they would also have appeared in the prior snapshots; or, if they weren't healed (as in the former test), that all snapshots would lack the files.
Either way, what interested me was whether Borg could have a way of adding redundancy for metadata specifically, so that copies of chunk checksums and file-association mappings could be spread across more archive files, giving `--repair` healing more opportunity to work.

From the tests it's unclear how Borg handles this, since the second test was able to heal the effect of the removed archive file for some backups but not for others (the entirely empty ones), while the first test with 64 archive files was un-healable after just one went corrupt/missing.
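One way to picture the partial healing seen in the second test (again a toy model under my own assumptions, not borg internals): snapshots reference chunks only by hash, so if the content of a lost chunk is backed up again later from the source data, it lands under the same key, and earlier references to it resolve again.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

chunks = {}  # the repo's chunk store: H(content) -> content

def backup(content: bytes) -> bytes:
    """Store a chunk; a snapshot keeps only the hash as a reference."""
    key = H(content)
    chunks.setdefault(key, content)
    return key

png = b"PNG file contents"
ref = backup(png)        # snapshot 1 references the chunk by hash

del chunks[ref]          # simulate losing the segment file holding that chunk
assert ref not in chunks # snapshot 1 is now broken for this file

backup(png)              # a later backup of the same (unchanged) source data
assert chunks[ref] == png  # same content -> same key: the old reference works again
```

This would heal snapshots whose missing chunks happen to be re-added by later backups, but not chunks (or metadata) whose content never reappears in the source data, which might account for some snapshots staying empty.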
To clarify, the data loss per se isn't the concern; it's the inability, in some cases, to heal chunks in snapshots using data restored to the repo via new snapshots, after limited whole-archive-file corruption.
Even if one had a secondary, non-synchronous repo copy (either via periodic syncing, or a separate repo ID based on the same source data, as the docs suggest), the issue I'm picturing is that if it's not a perfect sync, the repo archive files wouldn't be interchangeable should something bad happen to either (or am I mistaken?). That would put the onus on the user to run diff checks across multiple snapshots to determine what was absent, rather than being able to identify missing files automatically via `--repair` and restore them easily.

(Apologies for how verbose this ended up being, btw!)