Support S3 sink via FileSink interface

quixio · Nov 27, 2024 · 1882078 · 1882078
1 parent 0c9b399
commit 1882078
Show file tree

Hide file tree

Showing 10 changed files with 454 additions and 138 deletions.
diff --git a/docs/build/build.py b/docs/build/build.py
@@ -123,6 +123,9 @@
             "quixstreams.sinks.community.iceberg",
             "quixstreams.sinks.community.bigquery",
             "quixstreams.sinks.community.file.sink",
+            "quixstreams.sinks.community.file.destinations.base",
+            "quixstreams.sinks.community.file.destinations.local",
+            "quixstreams.sinks.community.file.destinations.s3",
             "quixstreams.sinks.community.file.formats.base",
             "quixstreams.sinks.community.file.formats.json",
             "quixstreams.sinks.community.file.formats.parquet",

diff --git a/docs/connectors/sinks/amazon-s3-sink.md b/docs/connectors/sinks/amazon-s3-sink.md
@@ -0,0 +1,106 @@
+# AmazonS3 Sink
+
+!!! info
+
+    This is a **Community** connector. Test it before using in production.
+
+    To learn more about differences between Core and Community connectors, see the [Community and Core Connectors](../community-and-core.md) page.
+
+This sink writes batches of data to Amazon S3 in various formats.  
+By default, the data will include the kafka message key, value, and timestamp.  
+
+## How To Install
+
+To use the S3 sink, you need to install the required dependencies:
+
+```bash
+pip install quixstreams[s3]
+```
+
+## How It Works
+
+`FileSink` with `S3Destination` is a batching sink that writes data directly to Amazon S3.  
+
+It batches processed records in memory per topic partition and writes them to S3 objects in a specified bucket and prefix structure. Objects are organized by topic and partition, with each batch being written to a separate object named by its starting offset.
+
+!!! note
+
+    The S3 bucket must already exist and be accessible. The sink does not create the bucket automatically. If the bucket does not exist or access is denied, an error will be raised when initializing the sink.
+
+## How To Use
+
+Create an instance of `FileSink` with `S3Destination` and pass it to the `StreamingDataFrame.sink()` method.
+
+```python
+from quixstreams import Application
+from quixstreams.sinks.community.file import FileSink
+from quixstreams.sinks.community.file.destinations import S3Destination
+
+
+# Configure the sink to write JSON files to S3
+file_sink = FileSink(
+    # Optional: defaults to current working directory
+    directory="data",
+    # Optional: defaults to "json"
+    # Available formats: "json", "parquet" or an instance of Format
+    format=JSONFormat(compress=True),
+    destination=S3Destination(
+        bucket="my-bucket",
+        # Optional: AWS credentials
+        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
+        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
+        region_name="eu-west-2",
+        # Optional: Additional keyword arguments are passed to the boto3 client
+        endpoint_url="http://localhost:4566",  # for LocalStack testing
+    )
+)
+
+app = Application(broker_address='localhost:9092', auto_offset_reset="earliest")
+topic = app.topic('sink-topic')
+
+sdf = app.dataframe(topic=topic)
+sdf.sink(file_sink)
+
+if __name__ == "__main__":
+    app.run()
+```
+
+!!! note
+    Instead of passing AWS credentials explicitly, you can set them using environment variables:
+    ```bash
+    export AWS_ACCESS_KEY_ID="your_access_key"
+    export AWS_SECRET_ACCESS_KEY="your_secret_key"
+    export AWS_DEFAULT_REGION="eu-west-2"
+    ```
+    Then you can create the destination with just the bucket name:
+    ```python
+    s3_sink = S3Destination(bucket="my-bucket")
+    ```
+
+## S3 Object Organization
+
+Objects in S3 follow this structure:
+```
+my-bucket/
+└── data/
+    └── sink_topic/
+        ├── 0/
+        │   ├── 0000000000000000000.jsonl
+        │   ├── 0000000000000000123.jsonl
+        │   └── 0000000000000001456.jsonl
+        └── 1/
+            ├── 0000000000000000000.jsonl
+            ├── 0000000000000000789.jsonl
+            └── 0000000000000001012.jsonl
+```
+
+Each object is named using the batch's starting offset (padded to 19 digits) and the appropriate file extension for the chosen format.
+
+## Supported Formats
+
+- **JSON**: Supports appending to existing files
+- **Parquet**: Does not support appending (new file created for each batch)
+
+## Delivery Guarantees
+
+`FileSink` provides at-least-once guarantees, and the results may contain duplicated data if there were errors during processing.
diff --git a/docs/connectors/sinks/file-sink.md → docs/connectors/sinks/local-file-sink.md b/docs/connectors/sinks/file-sink.md → docs/connectors/sinks/local-file-sink.md
@@ -1,78 +1,76 @@
-# File Sink
+# Local File Sink
 
 !!! info
 
     This is a **Community** connector. Test it before using in production.
 
     To learn more about differences between Core and Community connectors, see the [Community and Core Connectors](../community-and-core.md) page.
 
-This sink writes batches of data to files on disk in various formats.  
-By default, the data will include the kafka message key, value, and timestamp.  
-
-Currently, it supports the following formats:
-
-- JSON
-- Parquet
+This sink writes batches of data to local files in various formats.
+By default, the data will include the kafka message key, value, and timestamp.
 
 ## How It Works
 
-`FileSink` is a batching sink.  
+`FileSink` with `LocalDestination` is a batching sink that writes data to your local filesystem.
 
-It batches processed records in memory per topic partition and writes them to files in a specified directory structure. Files are organized by topic and partition, with each batch being written to a separate file named by its starting offset.
+It batches processed records in memory per topic partition and writes them to files in a specified directory structure. Files are organized by topic and partition. When append mode is disabled (default), each batch is written to a separate file named by its starting offset. When append mode is enabled, all records for a partition are written to a single file.
 
 The sink can either create new files for each batch or append to existing files (when using formats that support appending).
 
 ## How To Use
 
 Create an instance of `FileSink` and pass it to the `StreamingDataFrame.sink()` method.
 
-For the full description of expected parameters, see the [File Sink API](../../api-reference/sinks.md#filesink) page.  
-
 ```python
 from quixstreams import Application
 from quixstreams.sinks.community.file import FileSink
+from quixstreams.sinks.community.file.destinations import LocalDestination
+from quixstreams.sinks.community.file.formats import JSONFormat
 
 # Configure the sink to write JSON files
 file_sink = FileSink(
-    output_dir="./output",
-    format="json",
-    append=False  # Set to True to append to existing files when possible
+    # Optional: defaults to current working directory
+    directory="data",
+    # Optional: defaults to "json"
+    # Available formats: "json", "parquet" or an instance of Format
+    format=JSONFormat(compress=True),
+    # Optional: defaults to LocalDestination(append=False)
+    destination=LocalDestination(append=True),
 )
 
 app = Application(broker_address='localhost:9092', auto_offset_reset="earliest")
-topic = app.topic('sink_topic')
-
-# Do some processing here
-sdf = app.dataframe(topic=topic).print(metadata=True)
+topic = app.topic('sink-topic')
 
-# Sink results to the FileSink
+sdf = app.dataframe(topic=topic)
 sdf.sink(file_sink)
 
 if __name__ == "__main__":
-    # Start the application
     app.run()
 ```
 
 ## File Organization
+
 Files are organized in the following directory structure:
 ```
-output_dir/
-├── sink_topic/
-│   ├── 0/
-│   │   ├── 0000000000000000000.json
-│   │   ├── 0000000000000000123.json
-│   │   └── 0000000000000001456.json
-│   └── 1/
-│       ├── 0000000000000000000.json
-│       ├── 0000000000000000789.json
-│       └── 0000000000000001012.json
+data/
+└── sink_topic/
+    ├── 0/
+    │   ├── 0000000000000000000.jsonl
+    │   ├── 0000000000000000123.jsonl
+    │   └── 0000000000000001456.jsonl
+    └── 1/
+        ├── 0000000000000000000.jsonl
+        ├── 0000000000000000789.jsonl
+        └── 0000000000000001012.jsonl
 ```
 
 Each file is named using the batch's starting offset (padded to 19 digits) and the appropriate file extension for the chosen format.
 
 ## Supported Formats
+
 - **JSON**: Supports appending to existing files
 - **Parquet**: Does not support appending (new file created for each batch)
 
 ## Delivery Guarantees
+
 `FileSink` provides at-least-once guarantees, and the results may contain duplicated data if there were errors during processing.
diff --git a/pyproject.toml b/pyproject.toml
@@ -47,7 +47,9 @@ iceberg_aws = ["pyiceberg[pyarrow,glue]>=0.7,<0.8"]
 bigquery = ["google-cloud-bigquery>=3.26.0,<3.27"]
 pubsub = ["google-cloud-pubsub>=2.23.1,<3"]
 postgresql = ["psycopg2-binary>=2.9.9,<3"]
-kinesis = ["boto3>=1.35.65,<2.0", "boto3-stubs[kinesis]>=1.35.65,<2.0"]
+aws = ["boto3>=1.35.65,<2.0", "boto3-stubs>=1.35.65,<2.0"]
+kinesis = ["quixstreams[aws]"]
+s3 = ["quixstreams[aws]"]
 
 [tool.setuptools.packages.find]
 include = ["quixstreams*"]

diff --git a/quixstreams/sinks/community/file/__init__.py b/quixstreams/sinks/community/file/__init__.py
@@ -1,9 +1,14 @@
-from .formats import InvalidFormatError, JSONFormat, ParquetFormat
+from .destinations import Destination, LocalDestination, S3Destination
+from .formats import Format, InvalidFormatError, JSONFormat, ParquetFormat
 from .sink import FileSink
 
 __all__ = [
-    "FileSink",
+    "Destination",
+    "LocalDestination",
+    "S3Destination",
+    "Format",
     "InvalidFormatError",
     "JSONFormat",
     "ParquetFormat",
+    "FileSink",
 ]
diff --git a/quixstreams/sinks/community/file/destinations/__init__.py b/quixstreams/sinks/community/file/destinations/__init__.py
@@ -0,0 +1,5 @@
+from .base import Destination
+from .local import LocalDestination
+from .s3 import S3Destination
+
+__all__ = ("Destination", "LocalDestination", "S3Destination")
diff --git a/quixstreams/sinks/community/file/destinations/base.py b/quixstreams/sinks/community/file/destinations/base.py
@@ -0,0 +1,96 @@
+import logging
+import re
+from abc import ABC, abstractmethod
+from pathlib import Path
+
+from quixstreams.sinks.base import SinkBatch
+from quixstreams.sinks.community.file.formats import Format
+
+__all__ = ("Destination",)
+
+logger = logging.getLogger(__name__)
+
+_UNSAFE_CHARACTERS_REGEX = re.compile(r"[^a-zA-Z0-9 ._]")
+
+
+class Destination(ABC):
+    """Abstract base class for defining where and how data should be stored.
+
+    Destinations handle the storage of serialized data, whether that's to local
+    disk, cloud storage, or other locations. They manage the physical writing of
+    data while maintaining a consistent directory/path structure based on topics
+    and partitions.
+    """
+
+    _base_directory: str = ""
+    _extension: str = ""
+
+    def set_directory(self, directory: str) -> None:
+        """Configure the base directory for storing files.
+
+        :param directory: The base directory path where files will be stored.
+        :raises ValueError: If the directory path contains invalid characters.
+            Only alphanumeric characters (a-zA-Z0-9), spaces, dots, and
+            underscores are allowed.
+        """
+        if _UNSAFE_CHARACTERS_REGEX.search(directory):
+            raise ValueError(
+                f"Invalid characters in directory path: {directory}. "
+                f"Only alphanumeric characters (a-zA-Z0-9), spaces ( ), "
+                "dots (.), and underscores (_) are allowed."
+            )
+        self._base_directory = directory
+        logger.info("Directory set to '%s'", directory)
+
+    def set_extension(self, format: Format) -> None:
+        """Set the file extension based on the format.
+
+        :param format: The Format instance that defines the file extension.
+        """
+        self._extension = format.file_extension
+        logger.info("File extension set to '%s'", self._extension)
+
+    @abstractmethod
+    def write(self, data: bytes, batch: SinkBatch) -> None:
+        """Write the serialized data to storage.
+
+        :param data: The serialized data to write.
+        :param batch: The batch information containing topic, partition and offset
+            details.
+        """
+        ...
+
+    def _path(self, batch: SinkBatch) -> Path:
+        """Generate the full path where the batch data should be stored.
+
+        :param batch: The batch information containing topic, partition and offset
+            details.
+        :return: A Path object representing the full file path.
+        """
+        return self._directory(batch) / self._filename(batch)
+
+    def _directory(self, batch: SinkBatch) -> Path:
+        """Generate the full directory path for a batch.
+
+        Creates a directory structure using the base directory, sanitized topic
+        name, and partition number.
+
+        :param batch: The batch information containing topic and partition details.
+        :return: A Path object representing the directory where files should be
+            stored.
+        """
+        topic = _UNSAFE_CHARACTERS_REGEX.sub("_", batch.topic)
+        return Path(self._base_directory) / topic / str(batch.partition)
+
+    def _filename(self, batch: SinkBatch) -> str:
+        """Generate the filename for a batch.
+
+        Creates a filename using the batch's starting offset as a zero-padded
+        number to ensure correct ordering. The offset is padded to cover the max
+        length of a signed 64-bit integer (19 digits), e.g., '0000000000000123456'.
+
+        :param batch: The batch information containing start_offset details.
+        :return: A string representing the filename with zero-padded offset and
+            extension.
+        """
+        return f"{batch.start_offset:019d}{self._extension}"