Rust Icechunk learns how to collect garbage #368
Garbage collection is the process that deletes dangling objects from the object store.

We define a dangling object as a chunk, manifest, snapshot, attributes or transaction log file that cannot be reached by navigating the parent relationship starting from all possible refs.

There are currently two mechanisms that create dangling objects:

- Abandoning a session without committing it
- Resetting a branch, leaving snapshots behind

In the future, we'll introduce more mechanisms that "generate" garbage, with the objective of reducing storage costs. One example would be squashing commits when version resolution is not relevant.

Garbage collection is an inherently dangerous process. It's the only time at which Icechunk actually deletes data from the object store. As such, it must be executed carefully.

There is an unavoidable race condition in garbage collection: Icechunk has no way to distinguish a new object from a dangling one if that object was created after the garbage collection process traced the refs.

To work around that issue, the garbage collection process only deletes objects that were created before a given point in time. Users pass a timestamp as configuration to the collection process. This timestamp must be older than the start time of the oldest writing session that could still be open. For example, if the longest writing sessions last 48 hours, a safe timestamp would be `now - 7 days`.
    chunks_age: DateTime<Utc>,
    metadata_age: DateTime<Utc>,
Curious about the rationale for having two different ages here.
I thought that, given that metadata files are far less numerous than chunks, users would be willing to preserve them longer. If we someday support some form of reflog (or even without one), it could be useful to have them around for forensics or understanding. It wouldn't let you recover data, but you could at least recover "structure". For example, a reasonable thing to do could be to GC chunks older than 1 month and metadata older than 1 year.
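To make the "1 month for chunks, 1 year for metadata" example concrete, here is a rough sketch of how two cutoffs could be applied per object kind. The field names `chunks_age` and `metadata_age` come from the diff above; everything else (`GcAges`, `ObjectKind`, `from_retention`, `is_old_enough`) is hypothetical, and `SystemTime` is used instead of `DateTime<Utc>` only to keep the snippet free of external crates.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical classification of the files garbage collection can delete.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ObjectKind {
    Chunk,
    Manifest,
    Snapshot,
    Attributes,
    TransactionLog,
}

/// Two independent cutoffs: one for chunks, one for all metadata files.
#[derive(Debug, Clone, Copy)]
pub struct GcAges {
    pub chunks_age: SystemTime,
    pub metadata_age: SystemTime,
}

impl GcAges {
    /// Builds the cutoffs from retention periods measured back from `now`.
    pub fn from_retention(
        now: SystemTime,
        chunk_retention: Duration,
        metadata_retention: Duration,
    ) -> Self {
        let back = |d: Duration| now.checked_sub(d).unwrap_or(SystemTime::UNIX_EPOCH);
        GcAges {
            chunks_age: back(chunk_retention),
            metadata_age: back(metadata_retention),
        }
    }

    /// Picks the cutoff that applies to a given kind of object.
    pub fn cutoff_for(&self, kind: ObjectKind) -> SystemTime {
        match kind {
            ObjectKind::Chunk => self.chunks_age,
            _ => self.metadata_age,
        }
    }

    /// True when a dangling object of this kind is old enough to delete.
    pub fn is_old_enough(&self, kind: ObjectKind, created_at: SystemTime) -> bool {
        created_at < self.cutoff_for(kind)
    }
}
```

With retention periods of 30 and 365 days, a dangling chunk created 60 days ago is eligible for deletion, while a dangling manifest of the same age is kept.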
icechunk/src/storage/mod.rs
    async fn list_chunks(
        &self,
    ) -> StorageResult<BoxStream<StorageResult<ListInfo<ChunkId>>>>;
Can these impls be defaulted? The impls look the same mostly (except for the nested storage systems)
great point, will try
One tricky thing is that traits cannot be object safe if they have generic methods .... 🤦
Ahhhhh right... not worth it then probably
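For readers unfamiliar with the constraint discussed in this thread, here is a small synchronous sketch. A trait used as `dyn Trait` cannot expose a generic method (unless that method is bounded by `Self: Sized`), so one option is to keep the trait method non-generic, returning strings, and do the typed conversion in a free generic function. All names here (`Storage`, `list_objects`, `list_ids`, `InMemory`, and this simplified `ChunkId`) are hypothetical stand-ins, not the actual Icechunk definitions.

```rust
/// Hypothetical, simplified storage trait that must stay usable as `dyn Storage`.
pub trait Storage {
    /// Non-generic listing: returns raw object keys under a prefix.
    fn list_objects(&self, prefix: &str) -> Vec<String>;

    // Adding a generic method such as the one below would make
    // `Box<dyn Storage>` fail to compile:
    //
    // fn list<T: From<String>>(&self, prefix: &str) -> Vec<T>;
}

/// Generic helper living outside the trait, so the trait stays object safe.
/// Each item goes key -> String -> typed id.
pub fn list_ids<T: From<String>>(storage: &dyn Storage, prefix: &str) -> Vec<T> {
    storage.list_objects(prefix).into_iter().map(T::from).collect()
}

/// Hypothetical typed id wrapping the raw key.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkId(pub String);

impl From<String> for ChunkId {
    fn from(key: String) -> Self {
        ChunkId(key)
    }
}

/// Toy backend holding keys in memory.
pub struct InMemory {
    pub keys: Vec<String>,
}

impl Storage for InMemory {
    fn list_objects(&self, prefix: &str) -> Vec<String> {
        self.keys
            .iter()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect()
    }
}
```

The trade-off is an extra conversion step per item in exchange for a single non-generic method that every backend implements once.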
This is a bit less efficient, because we have two allocations per item: we need to go item -> string -> ObjectId, instead of item -> ObjectId. But it's also less code.