Commit 55c03c3

Merge branch 'ershi/update-docs' into 'main'

Documentation cleanup pass

See merge request omniverse/warp!866

mmacklin committed Nov 19, 2024
2 parents 0a996f4 + cbf2f67 commit 55c03c3

Showing 15 changed files with 253 additions and 130 deletions.
13 changes: 5 additions & 8 deletions docs/basics.rst
@@ -80,7 +80,7 @@ To launch a 2D grid of threads to process a 1024x1024 image, we could write::

wp.launch(kernel=compute_image, dim=(1024, 1024), inputs=[img], device="cuda")

We retrieve a 2D thread index inside the kernel by using multiple assignment when calling ``wp.tid()``:
We retrieve a 2D thread index inside the kernel by using multiple assignment when calling :func:`wp.tid() <tid>`:

.. code-block:: python
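
    # illustrative sketch -- the kernel body below is assumed, not taken from the
    # original file; multiple assignment unpacks the 2D thread index from wp.tid()
    @wp.kernel
    def compute_image(img: wp.array2d(dtype=float)):
        i, j = wp.tid()
        img[i, j] = 1.0
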
@@ -92,7 +92,7 @@ We retrieve a 2D thread index inside the kernel by using multiple assignment whe
Arrays
------

Memory allocations are exposed via the ``wp.array`` type. Arrays wrap an underlying memory allocation that may live in
Memory allocations are exposed via the :class:`wp.array <array>` type. Arrays wrap an underlying memory allocation that may live in
either host (CPU) or device (GPU) memory. Arrays are strongly typed and store a linear sequence of built-in values
(``float``, ``int``, ``vec3``, ``matrix33``, etc).
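
As a minimal sketch (assuming a CUDA-capable device; ``device="cpu"`` works equally well), arrays can be created zero-initialized or from existing NumPy data::

    import numpy as np
    import warp as wp

    # zero-initialized array of 1024 vec3 values in GPU memory
    a = wp.zeros(1024, dtype=wp.vec3, device="cuda")

    # device array created from existing NumPy data (dtype is inferred)
    b = wp.array(np.arange(10, dtype=np.float32), device="cuda")

    # copy the contents back to the host as a NumPy array
    print(b.numpy())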

@@ -199,13 +199,10 @@ Users can define their own structures using the ``@wp.struct`` decorator, for ex
active: int
indices: wp.array(dtype=int)

As with kernel parameters, all attributes of a struct must have valid type hints at class definition time.

Structs may be used as a ``dtype`` for ``wp.arrays``, and may be passed to kernels directly as arguments,
please see :ref:`Structs Reference <Structs>` for more details.

.. note::

As with kernel parameters, all attributes of a struct must have valid type hints at class definition time.
Structs may be used as a ``dtype`` for ``wp.arrays`` and may be passed to kernels directly as arguments.
See :ref:`Structs Reference <Structs>` for more details on structs.
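
A minimal sketch of both uses (the ``FooData`` struct, its fields, and the ``apply_scale`` kernel are illustrative names, not part of the original documentation)::

    import warp as wp

    @wp.struct
    class FooData:
        active: int
        scale: float
        values: wp.array(dtype=float)

    @wp.kernel
    def apply_scale(data: FooData):
        tid = wp.tid()
        if data.active != 0:
            data.values[tid] = data.values[tid] * data.scale

    # construct a struct instance on the host and pass it to a kernel
    data = FooData()
    data.active = 1
    data.scale = 2.0
    data.values = wp.zeros(64, dtype=float, device="cuda")

    wp.launch(apply_scale, dim=64, inputs=[data], device="cuda")

    # structs may also be used as an array dtype
    foo_array = wp.zeros(8, dtype=FooData, device="cuda")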

.. _Compilation Model:

4 changes: 2 additions & 2 deletions docs/configuration.rst
@@ -41,8 +41,8 @@ Basic Global Settings
|``verify_fp`` | Boolean | ``False`` | If ``True``, Warp will check that inputs and outputs are finite before |
| | | | and/or after various operations. **Has performance implications.** |
+------------------------------------------------+---------+-------------+--------------------------------------------------------------------------+
|``verify_cuda`` | Boolean | ``False`` | If ``True``, Warp will check for CUDA errors after every launch and |
| | | | memory operation. CUDA error verification cannot be used during graph |
|``verify_cuda`` | Boolean | ``False`` | If ``True``, Warp will check for CUDA errors after every launch |
| | | | operation. CUDA error verification cannot be used during graph |
| | | | capture. **Has performance implications.** |
+------------------------------------------------+---------+-------------+--------------------------------------------------------------------------+
|``print_launches`` | Boolean | ``False`` | If ``True``, Warp will print details of every kernel launch to standard |
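
A minimal sketch of enabling some of these settings (``verify_fp`` and ``print_launches`` are the flags described in the table above; they are typically set near the start of a program)::

    import warp as wp

    # enable extra validation while debugging -- both flags have performance costs
    wp.config.verify_fp = True
    wp.config.print_launches = True

    wp.init()
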
8 changes: 4 additions & 4 deletions docs/debugging.rst
@@ -29,8 +29,8 @@ In addition, formatted C-style printing is available through the ``wp.printf()``
Printing Launches
-----------------

For complex applications it can be difficult to understand the order-of-operations that lead to a bug. To help diagnose
these issues Warp supports a simple option to print out all launches and arguments to the console::
For complex applications, it can be difficult to understand the order-of-operations that lead to a bug. To help diagnose
these issues, Warp supports a simple option to print out all launches and arguments to the console::

wp.config.print_launches = True

@@ -89,7 +89,7 @@ If a CUDA error is suspected a simple verification method is to enable::

wp.config.verify_cuda = True

This setting will check the CUDA context after every operation to ensure that it is still valid. If an error is
encountered it will raise an exception that often helps to narrow down the problematic kernel.
This setting will check the CUDA context after every :func:`wp.launch() <warp.launch>` to ensure that it is still valid.
If an error is encountered, an exception will be raised that often helps to narrow down the problematic kernel.

.. note:: Verifying CUDA state at each launch requires synchronizing the CPU and GPU, which has significant overhead. Users should ensure this setting is only used during debugging.
28 changes: 21 additions & 7 deletions docs/faq.rst
@@ -12,23 +12,32 @@ and implementation differences.

Compared to Numba, Warp supports a smaller subset of Python, but
offers auto-differentiation of kernel programs, which is useful for
machine learning. Compared to Taichi Warp uses C++/CUDA as an
machine learning. Unlike Numba and Taichi, Warp uses C++/CUDA as an
intermediate representation, which makes it convenient to implement and
expose low-level routines. In addition, we are building in
data structures to support geometry processing (meshes, sparse volumes,
expose low-level routines and leverage existing C++ libraries in kernels.
In addition, Warp has built-in data structures to support geometry processing (meshes, sparse volumes,
point clouds, USD data) as first-class citizens that are not exposed in
other runtimes.

Warp does not offer a full tensor-based programming model like PyTorch
and JAX, but is designed to work well with these frameworks through data
sharing mechanisms like ``__cuda_array_interface__``. For computations
that map well to tensors (e.g.: neural-network inference), it makes sense
to use these existing tools. For problems with a lot of e.g.: sparsity,
to use these existing tools. For problems with a lot of sparsity,
conditional logic, heterogeneous workloads (like the ones we often find in
simulation and graphics), then the kernel-based programming model like
simulation and graphics), etc., kernel-based programming models like
the one in Warp are often more convenient since users have control over
individual threads.
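
As a minimal sketch of such data sharing (assuming PyTorch with CUDA support is installed; the ``scale`` kernel is illustrative), a PyTorch tensor can be wrapped as a Warp array without copying via ``wp.from_torch()``::

    import torch
    import warp as wp

    @wp.kernel
    def scale(a: wp.array(dtype=float), s: float):
        tid = wp.tid()
        a[tid] = a[tid] * s

    t = torch.ones(1024, device="cuda")

    # zero-copy view of the PyTorch tensor as a Warp array
    a = wp.from_torch(t)

    wp.launch(scale, dim=a.size, inputs=[a, 2.0], device="cuda")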

What are some examples of projects that use Warp?
-------------------------------------------------

* `NCLaw <https://github.com/PingchuanMa/NCLaw>`__: Implements a differentiable MPM simulator using Warp.
* `XLB <https://github.com/Autodesk/XLB>`__: A lattice Boltzmann solver with a backend option using Warp.
* `warp-mpm <https://github.com/zeshunzong/warp-mpm>`__: An MPM simulator using Warp and used in
`Neural Stress Fields for Reduced-order Elastoplasticity and Fracture <https://zeshunzong.github.io/reduced-order-mpm/>`__
and `PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics <https://xpandora.github.io/PhysGaussian/>`__.

Does Warp support all of the Python language?
---------------------------------------------

@@ -42,7 +51,7 @@ When should I call ``wp.synchronize()``?
----------------------------------------

One of the common sources of confusion for new users is when calls to
``wp.synchronize()`` are necessary. The answer is “almost never”!
:func:`wp.synchronize() <warp.synchronize>` are necessary. The answer is “almost never”!
Synchronization is quite expensive and should generally be avoided
unless necessary. Warp naturally takes care of synchronization between
operations (e.g.: kernel launches, device memory copies).
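
As a minimal sketch (the ``fill`` kernel is illustrative), reading results back to the host is one of the operations that already waits for outstanding work::

    import warp as wp

    @wp.kernel
    def fill(a: wp.array(dtype=float)):
        tid = wp.tid()
        a[tid] = 1.0

    a = wp.zeros(1024, dtype=float, device="cuda")
    wp.launch(fill, dim=a.size, inputs=[a], device="cuda")

    # the host copy below blocks until the launch has completed,
    # so no explicit wp.synchronize() call is needed here
    print(a.numpy())
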
@@ -103,7 +112,12 @@ Does Warp support multi-GPU programming?
Yes! Since version ``0.4.0`` we support allocating, launching, and
copying between multiple GPUs in a single process. We follow the naming
conventions of PyTorch and use aliases such as ``cuda:0``, ``cuda:1``,
``cpu`` to identify individual devices.
``cpu`` to identify individual devices. For more information, see the
:doc:`modules/devices` documentation.

Warp applications can also be parallelized over multiple GPUs using
`mpi4py <https://github.com/mpi4py/mpi4py>`__. Warp arrays on the GPU may be
passed directly to MPI calls if mpi4py is built against a CUDA-aware MPI installation.
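
As a minimal sketch of single-process multi-GPU usage (assuming at least two CUDA devices are present)::

    import warp as wp

    # allocate on two GPUs in the same process using the PyTorch-style aliases
    a0 = wp.zeros(1024, dtype=float, device="cuda:0")
    a1 = wp.zeros(1024, dtype=float, device="cuda:1")

    # copy data directly between the devices
    wp.copy(a1, a0)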

Should I switch to Warp over IsaacGym/PhysX?
----------------------------------------------
1 change: 1 addition & 0 deletions docs/modules/allocators.rst
@@ -177,6 +177,7 @@ Threshold values between 0 and 1 are interpreted as fractions of available memor

This is a simple optimization that can improve the performance of programs without modifying the existing code in any way.
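
A minimal sketch of adjusting the threshold (assuming a CUDA device ``cuda:0``; the value ``0.5`` is an arbitrary example)::

    import warp as wp

    wp.init()

    # retain up to half of the device's memory in the pool before releasing it
    wp.set_mempool_release_threshold("cuda:0", 0.5)

    print(wp.get_mempool_release_threshold("cuda:0"))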

.. autofunction:: warp.get_mempool_release_threshold
.. autofunction:: warp.set_mempool_release_threshold

Graph Allocations
2 changes: 1 addition & 1 deletion docs/modules/concurrency.rst
@@ -554,7 +554,7 @@ Stream synchronization can be a tricky business, even for experienced CUDA devel
wp.launch(kernel, dim=a.size, inputs=[a], stream=s)
This snippet has a stream synchronization problem that is difficult to detect at first glance.
It's quite possible that the code will work just fine, but it introduces undefined behaviour,
It's quite possible that the code will work just fine, but it introduces undefined behavior,
which may lead to incorrect results that manifest only once in a while. The issue is that the kernel is launched
on stream ``s``, which is different than the stream used for creating array ``a``. The array is allocated and
initialized on the current stream of device ``cuda:0``, which means that it might not be ready when stream ``s``
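
One possible way to avoid this kind of problem (a sketch that reuses the ``kernel`` and stream names from the snippet above) is to perform the allocation and the launch on the same stream::

    s = wp.Stream("cuda:0")

    with wp.ScopedStream(s):
        # the allocation/initialization and the launch are now ordered on the
        # same stream, so the kernel sees the initialized contents of a
        a = wp.zeros(n, dtype=float)
        wp.launch(kernel, dim=a.size, inputs=[a])
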
46 changes: 34 additions & 12 deletions docs/modules/devices.rst
@@ -24,8 +24,12 @@ It is possible to explicitly target a specific device with each Warp API call us

.. autoclass:: warp.context.Device
:members:
.. autofunction:: set_device
.. autofunction:: get_device

Warp also provides functions that can be used to query the available devices on the system:

.. autofunction:: get_devices
.. autofunction:: get_cuda_devices
.. autofunction:: get_cuda_device_count
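
For example, to see what Warp detects on the current system (output varies by machine)::

    import warp as wp

    print(wp.get_devices())            # all registered devices, CPU and GPUs
    print(wp.get_cuda_devices())       # CUDA devices only
    print(wp.get_cuda_device_count())  # number of CUDA devices available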

Default Device
--------------
Expand All @@ -36,6 +40,8 @@ Calling :func:`wp.get_device() <warp.get_device>` without an argument
will return an instance of :class:`warp.context.Device` for the default device.

During Warp initialization, the default device is set to ``"cuda:0"`` if CUDA is available. Otherwise, the default device is ``"cpu"``.
If the default device is changed, :func:`wp.get_preferred_device() <warp.get_preferred_device>` can be used to get
the *original* default device.

:func:`wp.set_device() <warp.set_device>` can be used to change the default device::

@@ -53,9 +59,10 @@

.. note::

For CUDA devices, ``wp.set_device()`` does two things: it sets the Warp default device and it makes the device's CUDA context current. This helps to minimize the number of CUDA context switches in blocks of code targeting a single device.
For CUDA devices, :func:`wp.set_device() <warp.set_device>` does two things: It sets the Warp default device and it makes the device's CUDA context current. This helps to minimize the number of CUDA context switches in blocks of code targeting a single device.

For PyTorch users, this function is similar to ``torch.cuda.set_device()``. It is still possible to specify a different device in individual API calls, like in this snippet::
For PyTorch users, this function is similar to :func:`torch.cuda.set_device()`.
It is still possible to specify a different device in individual API calls, like in this snippet::

# set default device
wp.set_device("cuda:0")
@@ -73,10 +80,15 @@ For PyTorch users, this function is similar to ``torch.cuda.set_device()``. It
wp.copy(b, a)
wp.copy(c, a)

.. autofunction:: set_device
.. autofunction:: get_device
.. autofunction:: get_preferred_device

Scoped Devices
--------------

Another way to manage the default device is using ``wp.ScopedDevice`` objects. They can be arbitrarily nested and restore the previous default device on exit::
Another way to manage the default device is using :class:`wp.ScopedDevice <ScopedDevice>` objects.
They can be arbitrarily nested and restore the previous default device on exit::

with wp.ScopedDevice("cpu"):
# alloc and launch on "cpu"
@@ -95,9 +107,7 @@ Another way to manage the default device is using ``wp.ScopedDevice`` objects.
# launch on "cuda:0"
wp.launch(kernel, dim=b.size, inputs=[b])

.. note::

For CUDA devices, ``wp.ScopedDevice`` makes the device's CUDA context current and restores the previous CUDA context on exit. This is handy when running Warp scripts as part of a bigger pipeline, because it avoids any side effects of changing the CUDA context in the enclosed code.
.. autoclass:: ScopedDevice

Example: Using ``wp.ScopedDevice`` with multiple GPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -176,11 +186,17 @@ In this snippet, we use PyTorch to manage the current CUDA device and invoke a W
Device Synchronization
----------------------

CUDA kernel launches and memory operations can execute asynchronously. This allows for overlapping compute and memory operations on different devices. Warp allows synchronizing the host with outstanding asynchronous operations on a specific device::
CUDA kernel launches and memory operations can execute asynchronously.
This allows for overlapping compute and memory operations on different devices.
Warp allows synchronizing the host with outstanding asynchronous operations on a specific device::

wp.synchronize_device("cuda:1")

The ``wp.synchronize_device()`` function offers more fine-grained synchronization than ``wp.synchronize()``, as the latter waits for *all* devices to complete their work.
:func:`wp.synchronize_device() <synchronize_device>` offers more fine-grained synchronization than
:func:`wp.synchronize() <synchronize>`, as the latter waits for *all* devices to complete their work.

.. autofunction:: synchronize_device
.. autofunction:: synchronize

Custom CUDA Contexts
--------------------
@@ -193,7 +209,9 @@ Applications built on the CUDA Driver API work with CUDA contexts directly and c

The special device alias ``"cuda"`` can be used to target the current CUDA context, whether this is a primary or custom context.

In addition, Warp allows registering new device aliases for custom CUDA contexts, so that they can be explicitly targeted by name. If the ``CUcontext`` pointer is available, it can be used to create a new device alias like this::
In addition, Warp allows registering new device aliases for custom CUDA contexts using
:func:`wp.map_cuda_device() <map_cuda_device>` so that they can be explicitly targeted by name.
If the ``CUcontext`` pointer is available, it can be used to create a new device alias like this::

wp.map_cuda_device("foo", ctypes.c_void_p(context_ptr))

@@ -207,6 +225,9 @@ In either case, mapping the custom CUDA context allows us to target the context
a = wp.zeros(n)
wp.launch(kernel, dim=a.size, inputs=[a])

.. autofunction:: map_cuda_device
.. autofunction:: unmap_cuda_device

.. _peer_access:

CUDA Peer Access
@@ -253,7 +274,8 @@ It's possible to temporarily enable or disable peer access using a scoped manage
.. note::

Peer access does not accelerate memory transfers between arrays allocated using the :ref:`stream-ordered memory pool allocators<mempool_allocators>` introduced in Warp 0.14.0. To accelerate memory pool transfers, :ref:`memory pool access<mempool_access>` should be enabled instead.
Peer access does not accelerate memory transfers between arrays allocated using the :ref:`stream-ordered memory pool allocators<mempool_allocators>` introduced in Warp 0.14.0.
To accelerate memory pool transfers, :ref:`memory pool access<mempool_access>` should be enabled instead.

.. autofunction:: warp.is_peer_access_supported
.. autofunction:: warp.is_peer_access_enabled
15 changes: 8 additions & 7 deletions docs/modules/differentiability.rst
@@ -8,11 +8,12 @@ Differentiability
By default, Warp generates a forward and backward (adjoint) version of each kernel definition. The backward version of a kernel can be used
to compute gradients of loss functions that can be back propagated to machine learning frameworks like PyTorch.

Arrays that participate in the chain of computation which require gradients should be created with ``requires_grad=True``, for example::
Arrays that participate in the chain of computation which require gradients must be created with ``requires_grad=True``, for example::

a = wp.zeros(1024, dtype=wp.vec3, device="cuda", requires_grad=True)

The ``wp.Tape`` class can then be used to record kernel launches, and replay them to compute the gradient of a scalar loss function with respect to the kernel inputs::
The :class:`wp.Tape <Tape>` class can then be used to record kernel launches and replay them to compute the gradient of
a scalar loss function with respect to the kernel inputs::

tape = wp.Tape()

@@ -25,22 +26,22 @@ The ``wp.Tape`` class can then be used to record kernel launches, and replay the
# reverse pass
tape.backward(l)

After the backward pass has completed, the gradients with respect to the inputs are available from the ``array.grad`` attribute::
After the backward pass has completed, the gradients with respect to the inputs are available from the :py:attr:`array.grad` attribute::

# gradient of loss with respect to input a
print(a.grad)

Note that gradients are accumulated on the participating buffers, so if you wish to reuse the same buffers for multiple backward passes you should first zero the gradients::

tape.zero()
Note that gradients are accumulated on the participating buffers, so if you wish to reuse the same buffers for multiple
backward passes you should first zero the gradients using :meth:`Tape.zero()`.
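
A minimal sketch of this pattern (``compute_loss``, ``a``, and ``l`` are assumed to be defined as in the example above)::

    for i in range(100):
        tape = wp.Tape()
        with tape:
            wp.launch(compute_loss, dim=1024, inputs=[a, l])

        # reverse pass accumulates gradients into a.grad
        tape.backward(l)

        # ... apply a gradient step to a using a.grad ...

        # zero the accumulated gradients before the next backward pass
        tape.zero()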

.. autoclass:: Tape
:members:

Copying is Differentiable
#########################

``wp.copy()``, ``wp.clone()``, and ``array.assign()`` are differentiable functions and can participate in the computation graph recorded on the tape. Consider the following examples and their
:func:`wp.copy() <copy>`, :func:`wp.clone() <clone>`, and :meth:`array.assign()` are differentiable functions and can
participate in the computation graph recorded on the tape. Consider the following examples and their
PyTorch equivalents (for comparison):

``wp.copy()``: