reflow the spec document

wjones127 committed Oct 10, 2023
1 parent a35dfed commit 42d46f1

Showing 2 changed files with 149 additions and 126 deletions.
267 changes: 143 additions & 124 deletions docs/source/format/CDataInterface/PyCapsuleInterface.rst
@@ -52,43 +52,17 @@ Non-goals
* Standardize what public APIs should be used for import. This is left up to
individual libraries.



PyCapsule Standard
==================

When exporting Arrow data through Python, the C Data Interface / C Stream Interface
structures should be wrapped in capsules. Capsules avoid invalid access by
attaching a name to the pointer and avoid memory leaks by attaching a destructor.
Thus, they are much safer than passing pointers as integers.

`PyCapsule`_ allows for a ``name`` to be associated with the capsule, allowing
consumers to verify that the capsule contains the expected kind of data. To make sure
Arrow structures are recognized, the following names must be used:

.. list-table::
:widths: 25 25
@@ -120,98 +94,67 @@ releasing data the consumer is using.
Just like in the C Data Interface, the PyCapsule objects defined here can only
be consumed once.

For an example of a PyCapsule with a destructor, see `Create a PyCapsule`_.
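
The name check described above can be illustrated from Python itself. The sketch below is not part of the spec: it uses CPython's capsule C API through ``ctypes`` (so it only works on CPython), and a plain byte buffer stands in for a real ``ArrowSchema`` struct.

```python
import ctypes

# Keep a module-level reference to the name: the capsule stores the raw
# char* pointer, so the bytes object must outlive the capsule.
CAPSULE_NAME = b"arrow_schema"

PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
PyCapsule_IsValid.restype = ctypes.c_int
PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# A dummy buffer stands in for a real ArrowSchema struct; no destructor
# is attached here (NULL), unlike a real exporter.
buf = ctypes.create_string_buffer(72)
capsule = PyCapsule_New(ctypes.addressof(buf), CAPSULE_NAME, None)

# Consumers can cheaply verify they were handed the expected kind of capsule.
assert PyCapsule_IsValid(capsule, b"arrow_schema") == 1
assert PyCapsule_IsValid(capsule, b"arrow_array") == 0
```

A consumer that receives a capsule with the wrong name should raise an error rather than dereference the pointer.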


Export Protocol
===============

The interface consists of three separate protocols:

* ``ArrowSchemaExportable``, which defines the ``__arrow_c_schema__`` method.
* ``ArrowArrayExportable``, which defines the ``__arrow_c_array__`` method.
* ``ArrowStreamExportable``, which defines the ``__arrow_c_stream__`` method.
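
A consumer can probe these three dunder methods with ``hasattr`` to decide how to import an object. The helper below is a hypothetical sketch (the function and class names are illustrative, not part of the spec):

```python
from typing import Any

def describe_arrow_export(obj: Any) -> str:
    # Probe from richest to simplest: a stream, then a materialized
    # array, then a bare schema.
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    if hasattr(obj, "__arrow_c_schema__"):
        return "schema"
    raise TypeError(f"{type(obj).__name__} does not export Arrow data")

# A record batch typically implements both the schema and array protocols.
class DummyBatch:
    def __arrow_c_schema__(self): ...
    def __arrow_c_array__(self): ...

assert describe_arrow_export(DummyBatch()) == "array"
```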

ArrowSchema Export
------------------

Schemas, fields, and data types can implement the method ``__arrow_c_schema__``.

.. py:method:: __arrow_c_schema__(self) -> object

   Export the object as an ArrowSchema.

   :return: A PyCapsule containing a C ArrowSchema representation of the
       object. The capsule must have a name of ``"arrow_schema"``.


ArrowArray Export
-----------------

Arrays and record batches (contiguous tables) can implement the method
``__arrow_c_array__``.

.. py:method:: __arrow_c_array__(self, requested_schema: object | None = None) -> Tuple[object, object]

   Export the object as a pair of ArrowSchema and ArrowArray structures.

   :param requested_schema: A PyCapsule containing a C ArrowSchema representation
       of a requested schema. Conversion to this schema is best-effort. See
       `Schema Requests`_.
   :type requested_schema: PyCapsule or None

   :return: A pair of PyCapsules containing a C ArrowSchema and ArrowArray,
       respectively. The schema capsule should have the name ``"arrow_schema"``
       and the array capsule should have the name ``"arrow_array"``.


ArrowStream Export
------------------

Tables / DataFrames and streams can implement the method ``__arrow_c_stream__``.

.. py:method:: __arrow_c_stream__(self, requested_schema: object | None = None) -> object

   Export the object as an ArrowArrayStream.

   :param requested_schema: A PyCapsule containing a C ArrowSchema representation
       of a requested schema. Conversion to this schema is best-effort. See
       `Schema Requests`_.
   :type requested_schema: PyCapsule or None

   :return: A PyCapsule containing a C ArrowArrayStream representation of the
       object. The capsule must have a name of ``"arrow_array_stream"``.

Schema Requests
---------------
@@ -224,7 +167,7 @@ Arrow has several possible encodings for an array of strings: 32-bit offsets,
export to any one of these Arrow representations.

In order to allow the caller to request a specific representation, the
:meth:`__arrow_c_array__` and :meth:`__arrow_c_stream__` methods take an optional
``requested_schema`` parameter. This parameter is a PyCapsule containing an
``ArrowSchema``.
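
The best-effort semantics can be sketched with a toy exporter. This is purely illustrative: in the real protocol ``requested_schema`` is a PyCapsule containing an ``ArrowSchema``, and the return values are capsules; here plain strings stand in for both, and the class name is hypothetical.

```python
class IntColumn:
    """Toy exporter: one native encoding, honors requests it can satisfy."""

    _SUPPORTED = {"int32", "int64"}
    _NATIVE = "int64"

    def __arrow_c_array__(self, requested_schema=None):
        encoding = self._NATIVE
        # Best-effort: honor the request only when we support that
        # encoding; otherwise fall back to the native representation.
        if requested_schema in self._SUPPORTED:
            encoding = requested_schema
        # Real code returns (schema_capsule, array_capsule); strings
        # stand in for the capsules in this sketch.
        return (f"schema:{encoding}", f"array:{encoding}")

col = IntColumn()
assert col.__arrow_c_array__("int32") == ("schema:int32", "array:int32")
assert col.__arrow_c_array__("utf8") == ("schema:int64", "array:int64")
```

Note that the caller must therefore always inspect the schema it actually received, rather than assume the request was honored.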

@@ -242,12 +185,41 @@ schema transformations.
.. _PyCapsule: https://docs.python.org/3/c-api/capsule.html


Protocol Typehints
------------------

The following typehints can be copied into your library to annotate that a
function accepts an object implementing one of these protocols.

.. code-block:: python

    from typing import Tuple, Protocol
    from typing_extensions import Self

    class ArrowSchemaExportable(Protocol):
        def __arrow_c_schema__(self) -> object: ...

    class ArrowArrayExportable(Protocol):
        def __arrow_c_array__(
            self,
            requested_schema: object | None = None
        ) -> Tuple[object, object]:
            ...

    class ArrowStreamExportable(Protocol):
        def __arrow_c_stream__(
            self,
            requested_schema: object | None = None
        ) -> object:
            ...

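
Because these are structural (duck-typed) protocols, they also support ``isinstance`` checks when decorated with :func:`typing.runtime_checkable`. The following is a sketch, not part of the spec; ``MySchema`` is a hypothetical implementer:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ArrowSchemaExportable(Protocol):
    def __arrow_c_schema__(self) -> object: ...

class MySchema:
    def __arrow_c_schema__(self) -> object:
        return object()  # a real implementation returns a PyCapsule

# runtime_checkable only verifies the method exists, not its signature
# or return type.
assert isinstance(MySchema(), ArrowSchemaExportable)
assert not isinstance(object(), ArrowSchemaExportable)
```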
Examples
========

Create a PyCapsule
------------------


To create a PyCapsule, use the `PyCapsule_New <https://docs.python.org/3/c-api/capsule.html#c.PyCapsule_New>`_
function. The function must be passed a destructor function that will be called
to release the data the capsule points to. It must first call the release
@@ -265,23 +237,28 @@ Below is the code to create a PyCapsule for an ``ArrowSchema``. The code for
#include <Python.h>
#include <stdlib.h>

void ReleaseArrowSchemaPyCapsule(PyObject* capsule) {
  struct ArrowSchema* schema =
      (struct ArrowSchema*)PyCapsule_GetPointer(capsule, "arrow_schema");
  if (schema->release != NULL) {
    schema->release(schema);
  }
  free(schema);
}

PyObject* ExportArrowSchemaPyCapsule() {
  struct ArrowSchema* schema =
      (struct ArrowSchema*)malloc(sizeof(struct ArrowSchema));
  // Fill in ArrowSchema fields
  // ...
  return PyCapsule_New(schema, "arrow_schema", ReleaseArrowSchemaPyCapsule);
}
.. tab-item:: Cython

.. code-block:: cython
    cimport cpython
    from libc.stdlib cimport malloc, free

    cdef void release_arrow_schema_py_capsule(object schema_capsule):
        cdef ArrowSchema* schema = <ArrowSchema*>cpython.PyCapsule_GetPointer(
@@ -292,7 +269,10 @@ Below is the code to create a PyCapsule for an ``ArrowSchema``. The code for
        free(schema)

    cdef object export_arrow_schema_py_capsule():
        cdef ArrowSchema* schema = <ArrowSchema*>malloc(sizeof(ArrowSchema))
        # Fill in ArrowSchema fields
        # ...
        return cpython.PyCapsule_New(
            <void*>schema, 'arrow_schema', release_arrow_schema_py_capsule
        )
@@ -318,18 +298,18 @@ code for ``ArrowArray`` and ``ArrowArrayStream`` is similar.
#include <Python.h>
// If the capsule is not an ArrowSchema, will return NULL.
struct ArrowSchema* GetArrowSchemaPyCapsule(PyObject* capsule) {
  return PyCapsule_GetPointer(capsule, "arrow_schema");
}
.. tab-item:: Cython

.. code-block:: cython
    cimport cpython

    cdef ArrowSchema* get_arrow_schema_py_capsule(object capsule):
        return <ArrowSchema*>cpython.PyCapsule_GetPointer(capsule, 'arrow_schema')
Backwards Compatibility with PyArrow
------------------------------------
@@ -356,6 +336,8 @@ implement the PyCapsule interface:
# NEW METHOD
def from_arrow(arr):
    # Newer versions of PyArrow as well as other libraries with Arrow data
    # implement this method, so prefer it over _export_to_c.
    if hasattr(arr, "__arrow_c_array__"):
        schema_ptr, array_ptr = arr.__arrow_c_array__()
        return import_c_capsule_data(schema_ptr, array_ptr)
@@ -369,10 +351,10 @@ implement the PyCapsule interface:
raise TypeError(f"Cannot import {type(arr)} as Arrow array data.")
You may also wish to accept objects implementing the protocol in your
constructors. For example, in PyArrow, the :func:`array` and :func:`record_batch`
constructors accept any object that implements the :meth:`__arrow_c_array__`
method. Similarly, PyArrow's :func:`schema` constructor accepts any object
that implements the :meth:`__arrow_c_schema__` method.

Now, if your library has a function that exports to PyArrow, such as:

@@ -395,11 +377,48 @@ that implements the protocol:
# NEW METHOD
def to_arrow(self) -> pa.Array:
    # PyArrow added support for constructing arrays from objects implementing
    # __arrow_c_array__ in the same version it added the method for its own
    # arrays. So we can use hasattr to check if the method is available as
    # a proxy for checking the PyArrow version.
    if hasattr(pa.Array, "__arrow_c_array__"):
        warnings.warn("to_arrow() is deprecated. Instead, simply pass the "
                      "array to pyarrow.array().")
        return pa.array(self)
    else:
        array_export_ptr = make_array_export_ptr()
        schema_export_ptr = make_schema_export_ptr()
        self.export_c_data(array_export_ptr, schema_export_ptr)
        return pa.Array._import_from_c(array_export_ptr, schema_export_ptr)


Comparison with Other Protocols
===============================

Comparison to DataFrame Interchange Protocol
--------------------------------------------

`The DataFrame Interchange Protocol <https://data-apis.org/dataframe-protocol/latest/>`_
is another protocol in Python that allows for the sharing of data between libraries.
This protocol is complementary to the DataFrame Interchange Protocol. Many of
the objects that implement this protocol will also implement the DataFrame
Interchange Protocol.

This protocol is specific to Arrow-based data structures, while the DataFrame
Interchange Protocol allows non-Arrow data frames and arrays to be shared as well.
Because of this, these PyCapsules can support Arrow-specific features such as
nested columns.

This protocol is also much more minimal than the DataFrame Interchange Protocol.
It just handles data export, rather than defining accessors for details like
number of rows or columns.

In summary, if you are implementing this protocol, you should also consider
implementing the DataFrame Interchange Protocol.


Comparison to ``__arrow_array__`` protocol
------------------------------------------

The :ref:`arrow_array_protocol` protocol is a dunder method that
defines how PyArrow should import an object as an Arrow array. Unlike this
protocol, it is specific to PyArrow and isn't used by other libraries. It is
also limited to arrays and does not support schemas, tabular structures, or streams.