reflow the spec document

wjones127 committed Oct 10, 2023
1 parent a35dfed commit 42d46f1

Showing 2 changed files with 149 additions and 126 deletions.
267 changes: 143 additions & 124 deletions docs/source/format/CDataInterface/PyCapsuleInterface.rst
@@ -52,43 +52,17 @@ Non-goals
* Standardize what public APIs should be used for import. This is left up to
individual libraries.



PyCapsule Standard
==================

When exporting Arrow data through Python, the C Data Interface / C Stream Interface
structures should be wrapped in capsules. Capsules avoid invalid access by
attaching a name to the pointer and avoid memory leaks by attaching a destructor.
Thus, they are much safer than passing pointers as integers.

`PyCapsule`_ allows for a ``name`` to be associated with the capsule, allowing
consumers to verify that the capsule contains the expected kind of data. To make sure
Arrow structures are recognized, the following names must be used:

.. list-table::
:widths: 25 25
@@ -120,98 +94,67 @@ releasing data the consumer is using.
Just like in the C Data Interface, the PyCapsule objects defined here can only
be consumed once.

For an example of a PyCapsule with a destructor, see `Create a PyCapsule`_.
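
The name check described above can be illustrated from Python itself. The sketch below is not part of the spec: it uses CPython's capsule C API through ``ctypes`` (so it only works on CPython), and a plain byte buffer stands in for a real ``ArrowSchema`` struct.

```python
import ctypes

# Keep a module-level reference to the name: the capsule stores the raw
# char* pointer, so the bytes object must outlive the capsule.
CAPSULE_NAME = b"arrow_schema"

PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
PyCapsule_IsValid.restype = ctypes.c_int
PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# A dummy buffer stands in for a real ArrowSchema struct; no destructor
# is attached here (NULL), unlike a real exporter.
buf = ctypes.create_string_buffer(72)
capsule = PyCapsule_New(ctypes.addressof(buf), CAPSULE_NAME, None)

# Consumers can cheaply verify they were handed the expected kind of capsule.
assert PyCapsule_IsValid(capsule, b"arrow_schema") == 1
assert PyCapsule_IsValid(capsule, b"arrow_array") == 0
```

A consumer that receives a capsule with the wrong name should raise an error rather than dereference the pointer.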


Export Protocol
===============

The interface consists of three separate protocols:

* ``ArrowSchemaExportable``, which defines the ``__arrow_c_schema__`` method.
* ``ArrowArrayExportable``, which defines the ``__arrow_c_array__`` method.
* ``ArrowStreamExportable``, which defines the ``__arrow_c_stream__`` method.
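
A consumer can probe these three dunder methods with ``hasattr`` to decide how to import an object. The helper below is a hypothetical sketch (the function and class names are illustrative, not part of the spec):

```python
from typing import Any

def describe_arrow_export(obj: Any) -> str:
    # Probe from richest to simplest: a stream, then a materialized
    # array, then a bare schema.
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    if hasattr(obj, "__arrow_c_schema__"):
        return "schema"
    raise TypeError(f"{type(obj).__name__} does not export Arrow data")

# A record batch typically implements both the schema and array protocols.
class DummyBatch:
    def __arrow_c_schema__(self): ...
    def __arrow_c_array__(self): ...

assert describe_arrow_export(DummyBatch()) == "array"
```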

ArrowSchema Export
------------------

Schemas, fields, and data types can implement the method ``__arrow_c_schema__``.

.. py:method:: __arrow_c_schema__(self) -> object

   Export the object as an ArrowSchema.

   :return: A PyCapsule containing a C ArrowSchema representation of the
       object. The capsule must have a name of ``"arrow_schema"``.


ArrowArray Export
-----------------

Arrays and record batches (contiguous tables) can implement the method
``__arrow_c_array__``.

.. py:method:: __arrow_c_array__(self, requested_schema: object | None = None) -> Tuple[object, object]

   Export the object as a pair of ArrowSchema and ArrowArray structures.

   :param requested_schema: A PyCapsule containing a C ArrowSchema representation
       of a requested schema. Conversion to this schema is best-effort. See
       `Schema Requests`_.
   :type requested_schema: PyCapsule or None

   :return: A pair of PyCapsules containing a C ArrowSchema and ArrowArray,
       respectively. The schema capsule should have the name ``"arrow_schema"``
       and the array capsule should have the name ``"arrow_array"``.


ArrowStream Export
------------------

Tables / DataFrames and streams can implement the method ``__arrow_c_stream__``.

.. py:method:: __arrow_c_stream__(self, requested_schema: object | None = None) -> object

   Export the object as an ArrowArrayStream.

   :param requested_schema: A PyCapsule containing a C ArrowSchema representation
       of a requested schema. Conversion to this schema is best-effort. See
       `Schema Requests`_.
   :type requested_schema: PyCapsule or None

   :return: A PyCapsule containing a C ArrowArrayStream representation of the
       object. The capsule must have a name of ``"arrow_array_stream"``.

Schema Requests
---------------
@@ -224,7 +167,7 @@ Arrow has several possible encodings for an array of strings: 32-bit offsets,
export to any one of these Arrow representations.

In order to allow the caller to request a specific representation, the
:meth:`__arrow_c_array__` and :meth:`__arrow_c_stream__` methods take an optional
``requested_schema`` parameter. This parameter is a PyCapsule containing an
``ArrowSchema``.
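
The best-effort semantics can be sketched with a toy exporter. This is purely illustrative: in the real protocol ``requested_schema`` is a PyCapsule containing an ``ArrowSchema``, and the return values are capsules; here plain strings stand in for both, and the class name is hypothetical.

```python
class IntColumn:
    """Toy exporter: one native encoding, honors requests it can satisfy."""

    _SUPPORTED = {"int32", "int64"}
    _NATIVE = "int64"

    def __arrow_c_array__(self, requested_schema=None):
        encoding = self._NATIVE
        # Best-effort: honor the request only when we support that
        # encoding; otherwise fall back to the native representation.
        if requested_schema in self._SUPPORTED:
            encoding = requested_schema
        # Real code returns (schema_capsule, array_capsule); strings
        # stand in for the capsules in this sketch.
        return (f"schema:{encoding}", f"array:{encoding}")

col = IntColumn()
assert col.__arrow_c_array__("int32") == ("schema:int32", "array:int32")
assert col.__arrow_c_array__("utf8") == ("schema:int64", "array:int64")
```

Note that the caller must therefore always inspect the schema it actually received, rather than assume the request was honored.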

@@ -242,12 +185,41 @@ schema transformations.
.. _PyCapsule: https://docs.python.org/3/c-api/capsule.html


Protocol Typehints
------------------

The following typehints can be copied into your library to annotate that a
function accepts an object implementing one of these protocols.

.. code-block:: python

    from typing import Tuple, Protocol
    from typing_extensions import Self

    class ArrowSchemaExportable(Protocol):
        def __arrow_c_schema__(self) -> object: ...

    class ArrowArrayExportable(Protocol):
        def __arrow_c_array__(
            self,
            requested_schema: object | None = None
        ) -> Tuple[object, object]:
            ...

    class ArrowStreamExportable(Protocol):
        def __arrow_c_stream__(
            self,
            requested_schema: object | None = None
        ) -> object:
            ...

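
Because these are structural (duck-typed) protocols, they also support ``isinstance`` checks when decorated with :func:`typing.runtime_checkable`. The following is a sketch, not part of the spec; ``MySchema`` is a hypothetical implementer:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ArrowSchemaExportable(Protocol):
    def __arrow_c_schema__(self) -> object: ...

class MySchema:
    def __arrow_c_schema__(self) -> object:
        return object()  # a real implementation returns a PyCapsule

# runtime_checkable only verifies the method exists, not its signature
# or return type.
assert isinstance(MySchema(), ArrowSchemaExportable)
assert not isinstance(object(), ArrowSchemaExportable)
```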
Examples
========

Create a PyCapsule
------------------


To create a PyCapsule, use the `PyCapsule_New <https://docs.python.org/3/c-api/capsule.html#c.PyCapsule_New>`_
function. The function must be passed a destructor function that will be called
to release the data the capsule points to. It must first call the release
@@ -265,23 +237,28 @@ Below is the code to create a PyCapsule for an ``ArrowSchema``. The code for
#include <Python.h>
#include <stdlib.h>

void ReleaseArrowSchemaPyCapsule(PyObject* capsule) {
  struct ArrowSchema* schema =
      (struct ArrowSchema*)PyCapsule_GetPointer(capsule, "arrow_schema");
  if (schema->release != NULL) {
    schema->release(schema);
  }
  free(schema);
}

PyObject* ExportArrowSchemaPyCapsule() {
  struct ArrowSchema* schema =
      (struct ArrowSchema*)malloc(sizeof(struct ArrowSchema));
  // Fill in ArrowSchema fields
  // ...
  return PyCapsule_New(schema, "arrow_schema", ReleaseArrowSchemaPyCapsule);
}
.. tab-item:: Cython

.. code-block:: cython
    cimport cpython
    from libc.stdlib cimport malloc, free

    cdef void release_arrow_schema_py_capsule(object schema_capsule):
        cdef ArrowSchema* schema = <ArrowSchema*>cpython.PyCapsule_GetPointer(
@@ -292,7 +269,10 @@ Below is the code to create a PyCapsule for an ``ArrowSchema``. The code for
        free(schema)

    cdef object export_arrow_schema_py_capsule():
        cdef ArrowSchema* schema = <ArrowSchema*>malloc(sizeof(ArrowSchema))
        # Fill in ArrowSchema fields
        # ...
        return cpython.PyCapsule_New(
            <void*>schema, 'arrow_schema', release_arrow_schema_py_capsule
        )
@@ -318,18 +298,18 @@ code for ``ArrowArray`` and ``ArrowArrayStream`` is similar.
#include <Python.h>
// If the capsule is not an ArrowSchema, will return NULL.
struct ArrowSchema* GetArrowSchemaPyCapsule(PyObject* capsule) {
  return PyCapsule_GetPointer(capsule, "arrow_schema");
}
.. tab-item:: Cython

.. code-block:: cython
    cimport cpython

    cdef ArrowSchema* get_arrow_schema_py_capsule(object capsule):
        return <ArrowSchema*>cpython.PyCapsule_GetPointer(capsule, 'arrow_schema')
Backwards Compatibility with PyArrow
------------------------------------
@@ -356,6 +336,8 @@ implement the PyCapsule interface:
# NEW METHOD
def from_arrow(arr):
    # Newer versions of PyArrow as well as other libraries with Arrow data
    # implement this method, so prefer it over _export_to_c.
    if hasattr(arr, "__arrow_c_array__"):
        schema_ptr, array_ptr = arr.__arrow_c_array__()
        return import_c_capsule_data(schema_ptr, array_ptr)
@@ -369,10 +351,10 @@ implement the PyCapsule interface:
raise TypeError(f"Cannot import {type(arr)} as Arrow array data.")
You may also wish to accept objects implementing the protocol in your
constructors. For example, in PyArrow, the :func:`array` and :func:`record_batch`
constructors accept any object that implements the :meth:`__arrow_c_array__`
method. Similarly, PyArrow's :func:`schema` constructor accepts any object
that implements the :meth:`__arrow_c_schema__` method.

Now, if your library has a function that exports to PyArrow, such as:

@@ -395,11 +377,48 @@ that implements the protocol:
# NEW METHOD
def to_arrow(self) -> pa.Array:
    # PyArrow added support for constructing arrays from objects implementing
    # __arrow_c_array__ in the same version it added the method for its own
    # arrays. So we can use hasattr to check if the method is available as
    # a proxy for checking the PyArrow version.
    if hasattr(pa.Array, "__arrow_c_array__"):
        warnings.warn("to_arrow() is deprecated. Instead, simply pass the "
                      "array to pyarrow.array().")
        return pa.array(self)
    else:
        array_export_ptr = make_array_export_ptr()
        schema_export_ptr = make_schema_export_ptr()
        self.export_c_data(array_export_ptr, schema_export_ptr)
        return pa.Array._import_from_c(array_export_ptr, schema_export_ptr)


Comparison with Other Protocols
===============================

Comparison to DataFrame Interchange Protocol
--------------------------------------------

`The DataFrame Interchange Protocol <https://data-apis.org/dataframe-protocol/latest/>`_
is another protocol in Python that allows for the sharing of data between libraries.
This protocol is complementary to the DataFrame Interchange Protocol. Many of
the objects that implement this protocol will also implement the DataFrame
Interchange Protocol.

This protocol is specific to Arrow-based data structures, while the DataFrame
Interchange Protocol allows non-Arrow data frames and arrays to be shared as well.
Because of this, these PyCapsules can support Arrow-specific features such as
nested columns.

This protocol is also much more minimal than the DataFrame Interchange Protocol.
It just handles data export, rather than defining accessors for details like
number of rows or columns.

In summary, if you are implementing this protocol, you should also consider
implementing the DataFrame Interchange Protocol.


Comparison to ``__arrow_array__`` protocol
------------------------------------------

The :ref:`arrow_array_protocol` protocol is a dunder method that
defines how PyArrow should import an object as an Arrow array. Unlike this
protocol, it is specific to PyArrow and isn't used by other libraries. It is
also limited to arrays and does not support schemas, tabular structures, or streams.