Kafka tiered storage concept content

aiven · Sep 12, 2023 · a40aec9
1 parent fe2b8fa
commit a40aec9

Showing 5 changed files with 118 additions and 0 deletions.
diff --git a/_toc.yml b/_toc.yml
@@ -318,6 +318,17 @@ entries:
           - file: docs/products/kafka/concepts/monitor-consumer-group
           - file: docs/products/kafka/concepts/kafka-quotas
             title: Quotas
+          - file: docs/products/kafka/concepts/kafka-tiered-storage
+            title: Tiered storage
+            entries: 
+              - file: docs/products/kafka/concepts/tiered-storage-how-it-works
+                title: How it works
+              - file: docs/products/kafka/concepts/tiered-storage-guarantees
+                title: Guarantees
+              - file: docs/products/kafka/concepts/tiered-storage-limitations
+                title: Limitations
+
+
       - file: docs/products/kafka/howto
         title: HowTo
         entries:

diff --git a/docs/products/kafka/concepts/kafka-tiered-storage.rst b/docs/products/kafka/concepts/kafka-tiered-storage.rst
@@ -0,0 +1,38 @@
+Tiered storage in Aiven for Apache Kafka®
+===========================================
+
+Learn how tiered storage in Aiven for Apache Kafka® works, when to use it, and what benefits it provides.
+
+Overview
+---------
+
+Tiered storage lets you store data on multiple storage types, such as local disk and cloud object storage, based on how frequently the data is accessed. With Aiven for Apache Kafka, you can use tiered storage to keep some of your data on high-speed local disks and move the rest to more cost-efficient remote storage options like AWS S3, Google Cloud Storage, or Azure Blob Storage.
+
+Tiered storage offers multiple benefits, including:
+
+* **Scalability**: Tiered storage allows Aiven for Apache Kafka instances to scale almost infinitely with cloud solutions, eliminating concerns about storage limitations.
+* **Cost efficiency**: By moving less frequently accessed data to cost-effective storage tiers, you can realize significant financial savings.
+* **Operational speed**: With the bulk of data offloaded to remote storage, service rebalancing in Aiven for Apache Kafka becomes faster, making for a smoother operational experience.
+* **Virtually unlimited data retention**: Because cloud storage scales with demand, you can retain data for as long as you need, which is valuable for analytics and compliance.
+* **Flexibility**: Data can be easily moved between storage tiers depending on usage and requirements, offering more flexibility.
+
+When and why to use it
+------------------------
+
+Understanding when and why to use tiered storage in Aiven for Apache Kafka will help you maximize its benefits, particularly around cost savings and system performance. 
+
+**Scenarios for use:**
+
+* **Long-term data retention**: Many organizations require large-scale data storage for extended periods, either for regulatory compliance or historical data analysis. Cloud services provide an almost limitless storage capacity, making it possible to keep data accessible for as long as required at a reasonable cost. This is where tiered storage becomes especially valuable.
+* **High-speed data ingestion**: Tiered storage can offer a solution when dealing with unpredictable or sudden influxes of data. By supplementing the local disks with cloud storage, sudden increases in incoming data can be managed, ensuring optimum system performance. 
+
+
+Security
+--------
+Segments are encrypted with 256-bit AES encryption before being uploaded to the remote storage. The encryption keys are not shared with the cloud storage provider and generally do not leave Aiven machines.
+
+Pricing
+-------
+Tiered storage users are billed for remote storage usage in GB/hour, based on the highest usage recorded in each hour.
+
+
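The pricing rule in `kafka-tiered-storage.rst` above (billing in GB/hour on the highest usage in each hour) can be sketched as a small calculation. This is only an illustration: the function name `billable_gb_hours` and the idea of timestamped usage samples are assumptions, since the text does not describe how usage is measured, and no price per GB-hour is stated or assumed here.

```python
from collections import defaultdict
from datetime import datetime


def billable_gb_hours(samples):
    """Sum the peak remote storage usage of every hour.

    `samples` is an iterable of (datetime, gb_used) pairs. The sampling
    itself is a hypothetical input, not something the documentation defines.
    """
    peaks = defaultdict(float)
    for timestamp, gb_used in samples:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        # Keep only the highest usage seen within that hour.
        peaks[hour] = max(peaks[hour], gb_used)
    return sum(peaks.values())


samples = [
    (datetime(2023, 9, 12, 10, 5), 100.0),
    (datetime(2023, 9, 12, 10, 40), 120.0),  # peak of the 10:00 hour
    (datetime(2023, 9, 12, 11, 10), 90.0),   # peak of the 11:00 hour
]
total = billable_gb_hours(samples)
```

In this sample data the 10:00 hour is counted at its peak of 120 GB rather than at the earlier 100 GB reading, so a short spike within an hour sets the billed amount for that hour.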
diff --git a/docs/products/kafka/concepts/tiered-storage-guarantees.rst b/docs/products/kafka/concepts/tiered-storage-guarantees.rst
@@ -0,0 +1,18 @@
+Guarantees
+============
+With tiered storage in Aiven for Apache Kafka®, there are two primary types of data retention guarantees: total retention and local retention.
+
+**Total retention**: Your data remains available until it reaches the total retention threshold, regardless of whether it is stored locally or remotely. Data is not deleted from either tier before that threshold is reached.
+
+**Local retention**: Log segments are removed from local storage only after they have been successfully uploaded to remote storage, even if the data exceeds the local retention threshold.
+
+
+Example
+--------
+
+Let's say you have a topic with a **total retention threshold** of **1000 bytes** and a **local retention threshold** of **200 bytes**. This means that:
+
+* All data for the topic will be retained, regardless of whether it is stored locally or remotely, as long as the total size of the data does not exceed 1000 bytes.
+* If the total size of the data exceeds 1000 bytes, Aiven for Apache Kafka will begin deleting the oldest data from remote storage.
+* No data will be deleted from local storage until it has been safely transferred to remote storage.
+
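The two guarantees in `tiered-storage-guarantees.rst` above can be illustrated with a toy model. This is a simplified sketch, not the broker's actual retention logic: the `Segment` class, the `apply_retention` function, and the 300-byte segment sizes are all hypothetical, and the model assumes segments are uploaded oldest first and that the newest segment is the active one.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    size: int       # bytes
    uploaded: bool  # True once a copy exists in remote storage


def apply_retention(segments, total_limit, local_limit):
    """Toy model; `segments` is ordered oldest first, the last one is active.

    Returns (deleted, remote_only, local) as lists of segment names.
    """
    remaining = list(segments)
    deleted = []
    # Total retention: drop the oldest segments while the topic is over the limit.
    while len(remaining) > 1 and sum(s.size for s in remaining) > total_limit:
        deleted.append(remaining.pop(0).name)

    local = list(remaining)
    remote_only = []
    # Local retention: a local copy may be removed only after a successful
    # upload, and the active (last) segment always stays local.
    while len(local) > 1 and sum(s.size for s in local) > local_limit:
        if not local[0].uploaded:
            break  # not yet safe in remote storage, so it stays local
        remote_only.append(local.pop(0).name)
    return deleted, remote_only, [s.name for s in local]


# Thresholds from the example: 1000 bytes total, 200 bytes local.
segments = [
    Segment("s0", 300, True),
    Segment("s1", 300, True),
    Segment("s2", 300, True),
    Segment("s3", 300, False),  # active segment, not uploaded yet
]
result = apply_retention(segments, total_limit=1000, local_limit=200)
```

With these inputs the topic holds 1200 bytes, so the oldest segment is deleted. The two uploaded segments that remain are kept only in remote storage, and the active segment stays local even though its 300 bytes exceed the 200-byte local threshold.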
diff --git a/docs/products/kafka/concepts/tiered-storage-how-it-works.rst b/docs/products/kafka/concepts/tiered-storage-how-it-works.rst
@@ -0,0 +1,38 @@
+How tiered storage works in Aiven for Apache Kafka®
+===================================================
+
+Aiven for Apache Kafka® tiered storage is a feature that optimizes data management across two distinct storage tiers:
+
+* **Local tier**: Primarily consists of faster and typically more expensive storage solutions like solid-state drives (SSDs).
+* **Remote tier**: Relies on slower, cost-effective options like cloud object storage.
+
+In Aiven for Apache Kafka's tiered storage architecture, **remote storage** refers to storage options external to the Kafka broker's local disk. This typically includes cloud-based or self-hosted object storage solutions like AWS S3, Google Cloud Storage, and Azure Blob Storage. Although network-attached block storage solutions like AWS EBS are technically external to the broker machine, Apache Kafka considers them local storage within its tiered storage architecture.
+
+Tiered storage operates in a way that is seamless for both Apache Kafka producers and consumers. This means that producers and consumers interact with Apache Kafka in the same way, regardless of whether tiered storage is enabled or not. 
+
+Administrators can configure tiered storage per topic by setting time-based and size-based retention thresholds, which specify how much data is kept on the local disk as opposed to remote storage.
+
+
+Local vs. remote data retention
+---------------------------------
+
+When tiered storage is enabled, data is initially stored on the local disk of the Kafka broker. Data is then asynchronously transferred to remote storage based on the pre-defined local retention threshold. During periods of high data ingestion or transient errors, such as network connectivity issues, the local storage might temporarily hold more data than specified by the local retention threshold.
+
+Segment management
+-------------------
+Data is organized into segments, which are uploaded to remote storage individually. The active (newest) segment remains in local storage, which means that the segment size can also influence local data retention. For instance, if the local retention threshold is 1 GB, but the segment size is 2 GB, the local storage will exceed the 1 GB limit until the active segment is rolled over and uploaded to remote storage.
+
+
+Asynchronous uploads and replication
+--------------------------------------
+Data is transferred to remote storage asynchronously and does not interfere with producer activity. While the broker aims to move data as swiftly as possible, certain conditions, such as high throughput or connectivity issues, may cause local storage to hold more data than the local retention threshold specifies.
+Any data exceeding the local retention threshold is not deleted from local storage until it has been successfully uploaded to remote storage.
+The replication factor is not considered during the upload process, and only one copy of each segment is uploaded to the remote storage. Most remote storage options have their own measures, including data replication, to ensure data durability.
+
+Data retrieval
+-----------------
+When consumers fetch records stored in remote storage, the broker downloads and caches these records locally. This allows for quicker access in subsequent retrieval operations.
+The retention time and the maximum size of the cache can be configured.
+
+
+
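The per-topic configuration and the segment-size effect described in `tiered-storage-how-it-works.rst` above can be sketched as follows. The keys are topic-level setting names from Apache Kafka's tiered storage feature; whether Aiven exposes each of them under the same name is an assumption, and the helper functions `tiered_topic_config` and `active_segment_exceeds_local` are hypothetical. The sketch also assumes positive, finite values for every threshold.

```python
GB = 1024 ** 3
DAY_MS = 24 * 60 * 60 * 1000


def tiered_topic_config(retention_bytes, local_retention_bytes,
                        retention_ms, local_retention_ms, segment_bytes):
    """Build topic-level settings for a tiered topic, with values as strings."""
    # The local thresholds describe a subset of the total retention,
    # so this sketch rejects local values larger than the totals.
    if local_retention_bytes > retention_bytes:
        raise ValueError("local retention bytes exceed total retention bytes")
    if local_retention_ms > retention_ms:
        raise ValueError("local retention time exceeds total retention time")
    return {
        "remote.storage.enable": "true",
        "retention.bytes": str(retention_bytes),
        "retention.ms": str(retention_ms),
        "local.retention.bytes": str(local_retention_bytes),
        "local.retention.ms": str(local_retention_ms),
        "segment.bytes": str(segment_bytes),
    }


def active_segment_exceeds_local(config):
    """True if a full active segment alone is larger than the local threshold."""
    return int(config["segment.bytes"]) > int(config["local.retention.bytes"])


# The case from the segment management section: 1 GB local threshold, 2 GB segments.
config = tiered_topic_config(
    retention_bytes=10 * GB,
    local_retention_bytes=1 * GB,
    retention_ms=7 * DAY_MS,
    local_retention_ms=1 * DAY_MS,
    segment_bytes=2 * GB,
)
overshoot_possible = active_segment_exceeds_local(config)
```

A check like `active_segment_exceeds_local` is one way to spot configurations where local disk usage is expected to exceed the local threshold until the active segment rolls over and is uploaded.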
diff --git a/docs/products/kafka/concepts/tiered-storage-limitations.rst b/docs/products/kafka/concepts/tiered-storage-limitations.rst
@@ -0,0 +1,13 @@
+Trade-offs and limitations
+============================
+
+The main trade-off of tiered storage in Aiven for Apache Kafka® is higher latency when reading data from remote storage compared to local disk storage. While local caching can partially mitigate this, it cannot eliminate the latency.
+
+Limitations
+-------------
+
+* Tiered storage currently does not support compacted topics.
+* If you enable tiered storage for a topic, you cannot deactivate it without losing data in the remote storage. To deactivate tiered storage, contact `Aiven support <mailto:[email protected]>`_. 
+* Increasing the local retention threshold won't move segments already uploaded to remote storage back to local storage. This change only affects new data segments.
+* If you enable tiered storage on a service, you can't migrate the service to a different region or cloud, except for moving to a virtual cloud in the same region. For migration to a different region or cloud, contact `Aiven support <mailto:[email protected]>`_.
+