Exactly once support for clickhouse sink #147
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Closes #146
Solution
The
ExactlyOnceSink
is a wrapper layer that provides exactly-once delivery semantics for a data pipeline. This ensures that each batch of data is processed and delivered to the destination exactly once, even in the presence of retries or failures.The core components of this implementation are:
Key assumptions to support for exactly once in Clickhouse:
KeeperMap
table engine is enabled at target database.PartID
-metadata, to split logical table into sequence of related changes_offset
-column, each offset should be consistently growing.The diagrams below explain the flow, structure, and interactions within the system.
1. Sequence Diagram
Purpose: The sequence diagram illustrates the interaction between components (
deduper
,store
,sink
, andbatch
) over time. It focuses on the sequence of operations performed to achieve exactly-once semantics.Key Highlights:
deduper
retrieves the last processed block and evaluates the current batch to determine actions.store
occur before and after data is pushed to the sink.2. Class Diagram
Purpose: This diagram shows the relationships and dependencies between the key components (
deduper
,sink
,store
, andbatch
).Key Highlights:
Deduper
is the central orchestrator, responsible for processing batches and coordinating actions.InsertBlockStore
is used to store the state of data blocks to support deduplication.Sink
is the final destination for processed data.3. Component Diagram
Purpose: The component diagram highlights the interaction between the
ExactlyOnceSink
wrapper and the existing system components.Key Highlights:
ExactlyOnceSink
manages the deduplication and state tracking for incoming batches.deduper
andInsertBlockStore
to ensure exactly-once delivery.deduper
performs the actual processing for each data partition.4. Data Flow Diagram
Purpose: The data flow diagram visualizes how data (e.g., batches and blocks) flows through the system and is processed.
Key Highlights:
before
andafter
) are stored in theInsertBlockStore
to track progress.5. Decision Tree Diagram
Purpose: The decision tree diagram outlines the branching logic of the
splitBatch
function, which determines how items in the batch are categorized.Key Highlights:
Let me know if further refinements are needed!