Sink: Amazon S3 #654

Merged: 6 commits, Nov 28, 2024
3 changes: 3 additions & 0 deletions docs/build/build.py
@@ -123,6 +123,9 @@
"quixstreams.sinks.community.iceberg",
"quixstreams.sinks.community.bigquery",
"quixstreams.sinks.community.file.sink",
"quixstreams.sinks.community.file.destinations.base",
"quixstreams.sinks.community.file.destinations.local",
"quixstreams.sinks.community.file.destinations.s3",
"quixstreams.sinks.community.file.formats.base",
"quixstreams.sinks.community.file.formats.json",
"quixstreams.sinks.community.file.formats.parquet",
108 changes: 108 additions & 0 deletions docs/connectors/sinks/amazon-s3-sink.md
@@ -0,0 +1,108 @@
# Amazon S3 Sink

!!! info

This is a **Community** connector. Test it before using in production.

To learn more about differences between Core and Community connectors, see the [Community and Core Connectors](../community-and-core.md) page.

This sink writes batches of data to Amazon S3 in various formats.
By default, the data will include the Kafka message key, value, and timestamp.

## How To Install

To use the S3 sink, you need to install the required dependencies:

```bash
pip install quixstreams[s3]
```

## How It Works

`FileSink` with `S3Destination` is a batching sink that writes data directly to Amazon S3.

It batches processed records in memory per topic partition and writes them to S3 objects in a specified bucket and prefix structure. Objects are organized by topic and partition, with each batch being written to a separate object named by its starting offset.

Batches are written to S3 during the commit phase of processing. This means the size of each batch (and therefore each S3 object) is influenced by your application's commit settings, either the `commit_interval` or the `commit_every` parameter.
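
For example, a minimal sketch of tuning these settings on the `Application` (the values are illustrative):

```python
from quixstreams import Application

# Less frequent commits produce fewer, larger S3 objects;
# more frequent commits produce more, smaller objects.
app = Application(
    broker_address="localhost:9092",
    commit_interval=30,    # flush a batch at most every 30 seconds...
    commit_every=10_000,   # ...or after 10,000 messages, if that happens sooner
)
```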

!!! note

The S3 bucket must already exist and be accessible. The sink does not create the bucket automatically. If the bucket does not exist or access is denied, an error will be raised when initializing the sink.
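
If the bucket does not exist yet, you can create it ahead of time with `boto3` (a minimal sketch; the bucket name and region are placeholders):

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")
# Regions other than us-east-1 require an explicit location constraint
s3.create_bucket(
    Bucket="my-bucket",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
)
```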

## How To Use

Create an instance of `FileSink` with `S3Destination` and pass it to the `StreamingDataFrame.sink()` method.

```python
import os

from quixstreams import Application
from quixstreams.sinks.community.file import FileSink
from quixstreams.sinks.community.file.destinations import S3Destination
from quixstreams.sinks.community.file.formats import JSONFormat


# Configure the sink to write JSON files to S3
file_sink = FileSink(
# Optional: defaults to current working directory
directory="data",
# Optional: defaults to "json"
# Available formats: "json", "parquet" or an instance of Format
format=JSONFormat(compress=True),
destination=S3Destination(
bucket="my-bucket",
# Optional: AWS credentials
aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
region_name="eu-west-2",
# Optional: Additional keyword arguments are passed to the boto3 client
endpoint_url="http://localhost:4566", # for LocalStack testing
)
)

app = Application(broker_address='localhost:9092', auto_offset_reset="earliest")
topic = app.topic('sink-topic')

sdf = app.dataframe(topic=topic)
sdf.sink(file_sink)

if __name__ == "__main__":
app.run()
```

!!! note
Instead of passing AWS credentials explicitly, you can set them using environment variables:
```bash
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_DEFAULT_REGION="eu-west-2"
```
Then you can create the destination with just the bucket name:
```python
s3_sink = S3Destination(bucket="my-bucket")
```

## S3 Object Organization

Objects in S3 follow this structure:
```
my-bucket/
└── data/
└── sink_topic/
├── 0/
│ ├── 0000000000000000000.jsonl
│ ├── 0000000000000000123.jsonl
│ └── 0000000000000001456.jsonl
└── 1/
├── 0000000000000000000.jsonl
├── 0000000000000000789.jsonl
└── 0000000000000001012.jsonl
```

Each object is named using the batch's starting offset (padded to 19 digits) and the appropriate file extension for the chosen format.

## Supported Formats

- **JSON**: Supports appending to existing files
- **Parquet**: Does not support appending (a new file is created for each batch); see the sketch below
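
For example, a sketch of configuring the sink for Parquet output (the `compression` argument is an assumption about `ParquetFormat`'s options; check the API reference):

```python
from quixstreams.sinks.community.file import FileSink
from quixstreams.sinks.community.file.destinations import S3Destination
from quixstreams.sinks.community.file.formats import ParquetFormat

parquet_sink = FileSink(
    directory="data",
    # Assumed option; ParquetFormat may expose different compression settings
    format=ParquetFormat(compression="snappy"),
    destination=S3Destination(bucket="my-bucket"),
)
```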

## Delivery Guarantees

`FileSink` provides at-least-once guarantees, and the results may contain duplicated data if there were errors during processing.
@@ -1,78 +1,76 @@
# File Sink
# Local File Sink

!!! info

This is a **Community** connector. Test it before using in production.

To learn more about differences between Core and Community connectors, see the [Community and Core Connectors](../community-and-core.md) page.

This sink writes batches of data to files on disk in various formats.
By default, the data will include the kafka message key, value, and timestamp.

Currently, it supports the following formats:

- JSON
- Parquet
This sink writes batches of data to local files in various formats.
By default, the data will include the kafka message key, value, and timestamp.

## How It Works

`FileSink` is a batching sink.
`FileSink` with `LocalDestination` is a batching sink that writes data to your local filesystem.

It batches processed records in memory per topic partition and writes them to files in a specified directory structure. Files are organized by topic and partition, with each batch being written to a separate file named by its starting offset.
It batches processed records in memory per topic partition and writes them to files in a specified directory structure. Files are organized by topic and partition. When append mode is disabled (default), each batch is written to a separate file named by its starting offset. When append mode is enabled, all records for a partition are written to a single file.

The sink can either create new files for each batch or append to existing files (when using formats that support appending).

## How To Use

Create an instance of `FileSink` and pass it to the `StreamingDataFrame.sink()` method.

For the full description of expected parameters, see the [File Sink API](../../api-reference/sinks.md#filesink) page.

```python
from quixstreams import Application
from quixstreams.sinks.community.file import FileSink
from quixstreams.sinks.community.file.destinations import LocalDestination
from quixstreams.sinks.community.file.formats import JSONFormat

# Configure the sink to write JSON files
file_sink = FileSink(
output_dir="./output",
format="json",
append=False # Set to True to append to existing files when possible
# Optional: defaults to current working directory
directory="data",
# Optional: defaults to "json"
# Available formats: "json", "parquet" or an instance of Format
format=JSONFormat(compress=True),
# Optional: defaults to LocalDestination(append=False)
destination=LocalDestination(append=True),
)

app = Application(broker_address='localhost:9092', auto_offset_reset="earliest")
topic = app.topic('sink_topic')

# Do some processing here
sdf = app.dataframe(topic=topic).print(metadata=True)
topic = app.topic('sink-topic')

# Sink results to the FileSink
sdf = app.dataframe(topic=topic)
sdf.sink(file_sink)

if __name__ == "__main__":
# Start the application
app.run()
```

## File Organization

Files are organized in the following directory structure:
```
output_dir/
└── sink_topic/
├── 0/
│ ├── 0000000000000000000.json
│ ├── 0000000000000000123.json
│ └── 0000000000000001456.json
└── 1/
├── 0000000000000000000.json
├── 0000000000000000789.json
└── 0000000000000001012.json
data/
└── sink_topic/
├── 0/
│ ├── 0000000000000000000.jsonl
│ ├── 0000000000000000123.jsonl
│ └── 0000000000000001456.jsonl
└── 1/
├── 0000000000000000000.jsonl
├── 0000000000000000789.jsonl
└── 0000000000000001012.jsonl
```

Each file is named using the batch's starting offset (padded to 19 digits) and the appropriate file extension for the chosen format.

## Supported Formats

- **JSON**: Supports appending to existing files
- **Parquet**: Does not support appending (new file created for each batch)

## Delivery Guarantees

`FileSink` provides at-least-once guarantees, and the results may contain duplicated data if there were errors during processing.
11 changes: 9 additions & 2 deletions pyproject.toml
@@ -48,7 +48,11 @@ iceberg_aws = ["pyiceberg[pyarrow,glue]>=0.7,<0.8"]
bigquery = ["google-cloud-bigquery>=3.26.0,<3.27"]
pubsub = ["google-cloud-pubsub>=2.23.1,<3"]
postgresql = ["psycopg2-binary>=2.9.9,<3"]
kinesis = ["boto3>=1.35.65,<2.0", "boto3-stubs[kinesis]>=1.35.65,<2.0"]
aws = ["boto3>=1.35.65,<2.0", "boto3-stubs>=1.35.65,<2.0"]
# AWS dependencies are separated by service to support
# different requirements in the future.
kinesis = ["quixstreams[aws]"]
s3 = ["quixstreams[aws]"]
redis=["redis[hiredis]>=5.2.0,<6"]

[tool.setuptools.packages.find]
@@ -100,7 +104,10 @@ log_cli_format = "[%(levelname)s] %(name)s: %(message)s"
markers = ["timeit"]

[[tool.mypy.overrides]]
module = "confluent_kafka.*"
module = [
"confluent_kafka.*",
"pyarrow.*",
]
ignore_missing_imports = true

[[tool.mypy.overrides]]
11 changes: 8 additions & 3 deletions quixstreams/sinks/community/file/__init__.py
@@ -1,9 +1,14 @@
from .formats import JSONFormat, ParquetFormat
from .sink import FileSink, InvalidFormatError
from .destinations import Destination, LocalDestination, S3Destination
from .formats import Format, InvalidFormatError, JSONFormat, ParquetFormat
from .sink import FileSink

__all__ = [
"FileSink",
"Destination",
"LocalDestination",
"S3Destination",
"Format",
"InvalidFormatError",
"JSONFormat",
"ParquetFormat",
"FileSink",
]
5 changes: 5 additions & 0 deletions quixstreams/sinks/community/file/destinations/__init__.py
@@ -0,0 +1,5 @@
from .base import Destination
from .local import LocalDestination
from .s3 import S3Destination

__all__ = ("Destination", "LocalDestination", "S3Destination")
96 changes: 96 additions & 0 deletions quixstreams/sinks/community/file/destinations/base.py
@@ -0,0 +1,96 @@
import logging
import re
from abc import ABC, abstractmethod
from pathlib import Path

from quixstreams.sinks.base import SinkBatch
from quixstreams.sinks.community.file.formats import Format

__all__ = ("Destination",)

logger = logging.getLogger(__name__)

_UNSAFE_CHARACTERS_REGEX = re.compile(r"[^a-zA-Z0-9 ._]")


class Destination(ABC):
"""Abstract base class for defining where and how data should be stored.

Destinations handle the storage of serialized data, whether that's to local
disk, cloud storage, or other locations. They manage the physical writing of
data while maintaining a consistent directory/path structure based on topics
and partitions.
"""

_base_directory: str = ""
_extension: str = ""

def set_directory(self, directory: str) -> None:
"""Configure the base directory for storing files.

:param directory: The base directory path where files will be stored.
:raises ValueError: If the directory path contains invalid characters.
Only alphanumeric characters (a-zA-Z0-9), spaces, dots, and
underscores are allowed.
"""
if _UNSAFE_CHARACTERS_REGEX.search(directory):
raise ValueError(
f"Invalid characters in directory path: {directory}. "
f"Only alphanumeric characters (a-zA-Z0-9), spaces ( ), "
"dots (.), and underscores (_) are allowed."
)
self._base_directory = directory
logger.info("Directory set to '%s'", directory)

def set_extension(self, format: Format) -> None:
"""Set the file extension based on the format.

:param format: The Format instance that defines the file extension.
"""
self._extension = format.file_extension
logger.info("File extension set to '%s'", self._extension)

@abstractmethod
def write(self, data: bytes, batch: SinkBatch) -> None:
"""Write the serialized data to storage.

:param data: The serialized data to write.
:param batch: The batch information containing topic, partition and offset
details.
"""
...

def _path(self, batch: SinkBatch) -> Path:
"""Generate the full path where the batch data should be stored.

:param batch: The batch information containing topic, partition and offset
details.
:return: A Path object representing the full file path.
"""
return self._directory(batch) / self._filename(batch)

def _directory(self, batch: SinkBatch) -> Path:
"""Generate the full directory path for a batch.

Creates a directory structure using the base directory, sanitized topic
name, and partition number.

:param batch: The batch information containing topic and partition details.
:return: A Path object representing the directory where files should be
stored.
"""
topic = _UNSAFE_CHARACTERS_REGEX.sub("_", batch.topic)
return Path(self._base_directory) / topic / str(batch.partition)

def _filename(self, batch: SinkBatch) -> str:
"""Generate the filename for a batch.

Creates a filename using the batch's starting offset as a zero-padded
number to ensure correct ordering. The offset is padded to cover the max
length of a signed 64-bit integer (19 digits), e.g., '0000000000000123456'.

:param batch: The batch information containing start_offset details.
:return: A string representing the filename with zero-padded offset and
extension.
"""
return f"{batch.start_offset:019d}{self._extension}"