
Rust Icechunk learns how to collect garbage #368

Merged on Nov 4, 2024 (2 commits)

@paraseba commented Nov 3, 2024

Garbage collection is the process of deleting dangling objects from the
object store.

We define a dangling object as a chunk, manifest, snapshot, attributes,
or transaction log file that cannot be reached from any ref by
following the parent relationship.
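
As a rough illustration of that definition, the sketch below marks every snapshot reachable from a set of refs by walking parent links, then treats everything else as dangling. The `Snapshot` struct, the string ids, and the `reachable` and `dangling` helpers are hypothetical names for this example, not Icechunk's actual types. A real collector would also have to follow each reachable snapshot to its manifests, chunks, attributes, and transaction logs.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified snapshot record: just an optional parent link.
struct Snapshot {
    parent: Option<String>,
}

/// Collect the ids of every snapshot reachable from `refs` by following parent links.
fn reachable(snapshots: &HashMap<String, Snapshot>, refs: &[&str]) -> HashSet<String> {
    let mut seen = HashSet::new();
    for tip in refs {
        let mut current = Some(tip.to_string());
        while let Some(id) = current {
            // Stop at an ancestor that was already visited from another ref.
            if !seen.insert(id.clone()) {
                break;
            }
            current = snapshots.get(&id).and_then(|s| s.parent.clone());
        }
    }
    seen
}

/// Every stored snapshot that is not reachable from any ref is dangling.
fn dangling(snapshots: &HashMap<String, Snapshot>, refs: &[&str]) -> HashSet<String> {
    let live = reachable(snapshots, refs);
    snapshots
        .keys()
        .filter(|id| !live.contains(*id))
        .cloned()
        .collect()
}
```

In this toy model, resetting a branch from snapshot `c` to a sibling `b` leaves `c` in the map with no ref pointing at it, so `dangling` reports it.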

There are currently two mechanisms that create dangling objects:

  • Abandoning a session without committing it
  • Resetting a branch leaving snapshots behind

In the future, we'll introduce more mechanisms that "generate" garbage,
with the objective of reducing storage costs. One example would be
squashing commits when version resolution is not relevant.

Garbage collection is an inherently dangerous process. It's the only
time at which Icechunk actually deletes data from object store. As such,
it must be executed carefully.

There is an unavoidable race condition in garbage collection: Icechunk
has no way to distinguish a new object from a dangling one, if that
object was created after the garbage collection process has traced the refs.

To work around that issue, the garbage collection process only deletes
objects that were created before a given timestamp. Users pass this
timestamp as configuration to the collection process. It must be older
than the start time of the oldest writing session that could still be
open. For example, if the longest writing sessions last 48 hours, a safe
timestamp would be now - 7 days.
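
A minimal sketch of that rule, assuming the collector already knows each object's creation time and the set of ids reachable from refs. `StoredObject`, `deletable`, and `safe_cutoff` are hypothetical names, and `std::time::SystemTime` is used only to keep the example dependency-free.

```rust
use std::collections::HashSet;
use std::time::{Duration, SystemTime};

/// Hypothetical listing entry: an object id plus its creation time.
struct StoredObject {
    id: String,
    created_at: SystemTime,
}

/// Return the ids that are safe to delete: not reachable from any ref AND
/// created strictly before `cutoff`. Anything newer is kept, because it may
/// belong to a writing session that has not committed yet.
fn deletable(
    objects: &[StoredObject],
    live: &HashSet<String>,
    cutoff: SystemTime,
) -> Vec<String> {
    objects
        .iter()
        .filter(|o| !live.contains(&o.id) && o.created_at < cutoff)
        .map(|o| o.id.clone())
        .collect()
}

/// Cutoff of `now - margin`; returns None if the subtraction is not representable.
fn safe_cutoff(now: SystemTime, margin: Duration) -> Option<SystemTime> {
    now.checked_sub(margin)
}
```

With a 7-day margin, a dangling object created a day ago is left alone even though nothing references it yet, which is the behavior that protects in-flight sessions.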

@paraseba force-pushed the push-lzpznloopqsw branch 2 times, most recently from ff1e718 to 5f59eeb, November 3, 2024 19:58
Comment on lines +50 to +51
chunks_age: DateTime<Utc>,
metadata_age: DateTime<Utc>,
Contributor

Curious about the rationale for having two different ages here.

Contributor Author

I thought that, given that metadata files are far less numerous than chunks, users would be willing to preserve them longer. If we someday support some form of reflog (or even if we don't), it could be useful to have them around for forensics or understanding. It wouldn't allow recovering data, but you could at least recover "structure". For example, a reasonable policy could be to GC chunks older than 1 month and metadata older than 1 year.
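
To make the two-age idea concrete, here is a hedged sketch in which chunks and metadata files are checked against separate cutoffs. Only the field names `chunks_age` and `metadata_age` come from the diff. The `GcConfig` and `ObjectKind` names, the `with_retention` and `is_expired` helpers, and the use of `std::time::SystemTime` in place of `DateTime<Utc>` are assumptions made to keep the example dependency-free.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical object categories; everything except chunks counts as metadata here.
#[derive(Clone, Copy)]
enum ObjectKind {
    Chunk,
    Metadata, // manifests, snapshots, attributes, transaction logs
}

/// Hypothetical config: dangling objects created before these instants may be deleted.
struct GcConfig {
    chunks_age: SystemTime,
    metadata_age: SystemTime,
}

impl GcConfig {
    /// Build cutoffs from retention periods measured back from `now`.
    /// Returns None if either cutoff is not representable.
    fn with_retention(
        now: SystemTime,
        chunk_retention: Duration,
        metadata_retention: Duration,
    ) -> Option<Self> {
        Some(GcConfig {
            chunks_age: now.checked_sub(chunk_retention)?,
            metadata_age: now.checked_sub(metadata_retention)?,
        })
    }

    /// True if a dangling object of this kind is old enough to delete.
    fn is_expired(&self, kind: ObjectKind, created_at: SystemTime) -> bool {
        let cutoff = match kind {
            ObjectKind::Chunk => self.chunks_age,
            ObjectKind::Metadata => self.metadata_age,
        };
        created_at < cutoff
    }
}
```

With 30 days of chunk retention and 365 days of metadata retention, a 90-day-old dangling chunk is eligible for deletion while a 90-day-old dangling snapshot is kept.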


async fn list_chunks(
    &self,
) -> StorageResult<BoxStream<StorageResult<ListInfo<ChunkId>>>>;
Contributor

Can these impls be defaulted? The impls look the same mostly (except for the nested storage systems)

Contributor Author

great point, will try

Contributor Author

One tricky thing is that traits cannot be object safe if they have generic methods .... 🤦

Contributor

Ahhhhh right... not worth it then probably

This is a bit less efficient, because we have two allocations per item:
we need to go item -> string -> ObjectId, instead of item -> ObjectId.
But it's also less code.
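
The trade-off described here can be sketched roughly as follows. This is a simplified, synchronous stand-in under stated assumptions, not the PR's code: the `Storage` trait shape, the `list_objects` method, the prefix strings, the `InMemory` backend, and the `u64`-backed `ChunkId` are all hypothetical. The one required method is non-generic and returns strings, so the trait stays usable as `dyn Storage`. The typed listing is a default method that parses each string into an id, which is where the extra per-item step comes from.

```rust
use std::str::FromStr;

/// Hypothetical id type, parsed from the object's name in the store.
#[derive(Debug, PartialEq)]
struct ChunkId(u64);

impl FromStr for ChunkId {
    type Err = std::num::ParseIntError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        s.parse().map(ChunkId)
    }
}

trait Storage {
    /// The only method each backend must implement: list object names under a prefix.
    /// It is not generic, so `dyn Storage` remains valid.
    fn list_objects(&self, prefix: &str) -> Vec<String>;

    /// Defaulted for every backend: item -> String -> ChunkId.
    /// Names that fail to parse are skipped in this sketch.
    fn list_chunks(&self) -> Vec<ChunkId> {
        self.list_objects("chunks/")
            .iter()
            .filter_map(|name| name.parse::<ChunkId>().ok())
            .collect()
    }
}

/// Hypothetical backend holding full object keys in memory.
struct InMemory {
    keys: Vec<String>,
}

impl Storage for InMemory {
    fn list_objects(&self, prefix: &str) -> Vec<String> {
        self.keys
            .iter()
            .filter_map(|k| k.strip_prefix(prefix))
            .map(|name| name.to_string())
            .collect()
    }
}
```

Each backend only writes `list_objects`. A generic `list::<Id>()` method would avoid the intermediate `String`, but the trait could then no longer be called through a trait object.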
@paraseba requested a review from mpiannucci November 4, 2024 17:18
@paraseba merged commit fc62650 into main Nov 4, 2024
3 checks passed
@paraseba deleted the push-lzpznloopqsw branch November 4, 2024 18:04

3 participants