Commit 55c03c3

Merge branch 'ershi/update-docs' into 'main'

Documentation cleanup pass

See merge request omniverse/warp!866

mmacklin committed Nov 19, 2024
2 parents 0a996f4 + cbf2f67 commit 55c03c3

Showing 15 changed files with 253 additions and 130 deletions.
13 changes: 5 additions & 8 deletions docs/basics.rst
@@ -80,7 +80,7 @@ To launch a 2D grid of threads to process a 1024x1024 image, we could write::

wp.launch(kernel=compute_image, dim=(1024, 1024), inputs=[img], device="cuda")

We retrieve a 2D thread index inside the kernel by using multiple assignment when calling ``wp.tid()``:
We retrieve a 2D thread index inside the kernel by using multiple assignment when calling :func:`wp.tid() <tid>`:

.. code-block:: python
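
    # illustrative sketch -- the kernel body below is assumed, not taken from the
    # original file; multiple assignment unpacks the 2D thread index from wp.tid()
    @wp.kernel
    def compute_image(img: wp.array2d(dtype=float)):
        i, j = wp.tid()
        img[i, j] = 1.0
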
@@ -92,7 +92,7 @@ We retrieve a 2D thread index inside the kernel by using multiple assignment whe
Arrays
------

Memory allocations are exposed via the ``wp.array`` type. Arrays wrap an underlying memory allocation that may live in
Memory allocations are exposed via the :class:`wp.array <array>` type. Arrays wrap an underlying memory allocation that may live in
either host (CPU) or device (GPU) memory. Arrays are strongly typed and store a linear sequence of built-in values
(``float``, ``int``, ``vec3``, ``matrix33``, etc).
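
As a minimal sketch (assuming a CUDA-capable device; ``device="cpu"`` works equally well), arrays can be created zero-initialized or from existing NumPy data::

    import numpy as np
    import warp as wp

    # zero-initialized array of 1024 vec3 values in GPU memory
    a = wp.zeros(1024, dtype=wp.vec3, device="cuda")

    # device array created from existing NumPy data (dtype is inferred)
    b = wp.array(np.arange(10, dtype=np.float32), device="cuda")

    # copy the contents back to the host as a NumPy array
    print(b.numpy())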

@@ -199,13 +199,10 @@ Users can define their own structures using the ``@wp.struct`` decorator, for ex
active: int
indices: wp.array(dtype=int)

As with kernel parameters, all attributes of a struct must have valid type hints at class definition time.

Structs may be used as a ``dtype`` for ``wp.arrays``, and may be passed to kernels directly as arguments,
please see :ref:`Structs Reference <Structs>` for more details.

.. note::

As with kernel parameters, all attributes of a struct must have valid type hints at class definition time.
Structs may be used as a ``dtype`` for ``wp.arrays`` and may be passed to kernels directly as arguments.
See :ref:`Structs Reference <Structs>` for more details on structs.
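
A minimal sketch of both uses (the ``FooData`` struct, its fields, and the ``apply_scale`` kernel are illustrative names, not part of the original documentation)::

    import warp as wp

    @wp.struct
    class FooData:
        active: int
        scale: float
        values: wp.array(dtype=float)

    @wp.kernel
    def apply_scale(data: FooData):
        tid = wp.tid()
        if data.active != 0:
            data.values[tid] = data.values[tid] * data.scale

    # construct a struct instance on the host and pass it to a kernel
    data = FooData()
    data.active = 1
    data.scale = 2.0
    data.values = wp.zeros(64, dtype=float, device="cuda")

    wp.launch(apply_scale, dim=64, inputs=[data], device="cuda")

    # structs may also be used as an array dtype
    foo_array = wp.zeros(8, dtype=FooData, device="cuda")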

.. _Compilation Model:

4 changes: 2 additions & 2 deletions docs/configuration.rst
@@ -41,8 +41,8 @@ Basic Global Settings
|``verify_fp`` | Boolean | ``False`` | If ``True``, Warp will check that inputs and outputs are finite before |
| | | | and/or after various operations. **Has performance implications.** |
+------------------------------------------------+---------+-------------+--------------------------------------------------------------------------+
|``verify_cuda`` | Boolean | ``False`` | If ``True``, Warp will check for CUDA errors after every launch and |
| | | | memory operation. CUDA error verification cannot be used during graph |
|``verify_cuda`` | Boolean | ``False`` | If ``True``, Warp will check for CUDA errors after every launch |
| | | | operation. CUDA error verification cannot be used during graph |
| | | | capture. **Has performance implications.** |
+------------------------------------------------+---------+-------------+--------------------------------------------------------------------------+
|``print_launches`` | Boolean | ``False`` | If ``True``, Warp will print details of every kernel launch to standard |
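
A minimal sketch of enabling some of these settings (``verify_fp`` and ``print_launches`` are the flags described in the table above; they are typically set near the start of a program)::

    import warp as wp

    # enable extra validation while debugging -- both flags have performance costs
    wp.config.verify_fp = True
    wp.config.print_launches = True

    wp.init()
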
8 changes: 4 additions & 4 deletions docs/debugging.rst
@@ -29,8 +29,8 @@ In addition, formatted C-style printing is available through the ``wp.printf()``
Printing Launches
-----------------

For complex applications it can be difficult to understand the order-of-operations that lead to a bug. To help diagnose
these issues Warp supports a simple option to print out all launches and arguments to the console::
For complex applications, it can be difficult to understand the order-of-operations that lead to a bug. To help diagnose
these issues, Warp supports a simple option to print out all launches and arguments to the console::

wp.config.print_launches = True

@@ -89,7 +89,7 @@ If a CUDA error is suspected a simple verification method is to enable::

wp.config.verify_cuda = True

This setting will check the CUDA context after every operation to ensure that it is still valid. If an error is
encountered it will raise an exception that often helps to narrow down the problematic kernel.
This setting will check the CUDA context after every :func:`wp.launch() <warp.launch>` to ensure that it is still valid.
If an error is encountered, an exception will be raised that often helps to narrow down the problematic kernel.

.. note:: Verifying CUDA state at each launch requires synchronizing the CPU and GPU, which has significant overhead. Users should ensure this setting is only used during debugging.
28 changes: 21 additions & 7 deletions docs/faq.rst
@@ -12,23 +12,32 @@ and implementation differences.

Compared to Numba, Warp supports a smaller subset of Python, but
offers auto-differentiation of kernel programs, which is useful for
machine learning. Compared to Taichi Warp uses C++/CUDA as an
machine learning. Unlike Numba and Taichi, Warp uses C++/CUDA as an
intermediate representation, which makes it convenient to implement and
expose low-level routines. In addition, we are building in
data structures to support geometry processing (meshes, sparse volumes,
expose low-level routines and leverage existing C++ libraries in kernels.
In addition, Warp has built-in data structures to support geometry processing (meshes, sparse volumes,
point clouds, USD data) as first-class citizens that are not exposed in
other runtimes.

Warp does not offer a full tensor-based programming model like PyTorch
and JAX, but is designed to work well with these frameworks through data
sharing mechanisms like ``__cuda_array_interface__``. For computations
that map well to tensors (e.g.: neural-network inference), it makes sense
to use these existing tools. For problems with a lot of e.g.: sparsity,
to use these existing tools. For problems with a lot of sparsity,
conditional logic, heterogeneous workloads (like the ones we often find in
simulation and graphics), then the kernel-based programming model like
simulation and graphics), etc., kernel-based programming models like
the one in Warp are often more convenient since users have control over
individual threads.
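
As a minimal sketch of such data sharing (assuming PyTorch with CUDA support is installed; the ``scale`` kernel is illustrative), a PyTorch tensor can be wrapped as a Warp array without copying via ``wp.from_torch()``::

    import torch
    import warp as wp

    @wp.kernel
    def scale(a: wp.array(dtype=float), s: float):
        tid = wp.tid()
        a[tid] = a[tid] * s

    t = torch.ones(1024, device="cuda")

    # zero-copy view of the PyTorch tensor as a Warp array
    a = wp.from_torch(t)

    wp.launch(scale, dim=a.size, inputs=[a, 2.0], device="cuda")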

What are some examples of projects that use Warp?
-------------------------------------------------

* `NCLaw <https://github.com/PingchuanMa/NCLaw>`__: Implements a differentiable MPM simulator using Warp.
* `XLB <https://github.com/Autodesk/XLB>`__: A lattice Boltzmann solver with a backend option using Warp.
* `warp-mpm <https://github.com/zeshunzong/warp-mpm>`__: An MPM simulator using Warp and used in
`Neural Stress Fields for Reduced-order Elastoplasticity and Fracture <https://zeshunzong.github.io/reduced-order-mpm/>`__
and `PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics <https://xpandora.github.io/PhysGaussian/>`__.

Does Warp support all of the Python language?
---------------------------------------------

@@ -42,7 +51,7 @@ When should I call ``wp.synchronize()``?
----------------------------------------

One of the common sources of confusion for new users is when calls to
``wp.synchronize()`` are necessary. The answer is “almost never”!
:func:`wp.synchronize() <warp.synchronize>` are necessary. The answer is “almost never”!
Synchronization is quite expensive and should generally be avoided
unless necessary. Warp naturally takes care of synchronization between
operations (e.g.: kernel launches, device memory copies).
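
As a minimal sketch (the ``fill`` kernel is illustrative), reading results back to the host is one of the operations that already waits for outstanding work::

    import warp as wp

    @wp.kernel
    def fill(a: wp.array(dtype=float)):
        tid = wp.tid()
        a[tid] = 1.0

    a = wp.zeros(1024, dtype=float, device="cuda")
    wp.launch(fill, dim=a.size, inputs=[a], device="cuda")

    # the host copy below blocks until the launch has completed,
    # so no explicit wp.synchronize() call is needed here
    print(a.numpy())
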
@@ -103,7 +112,12 @@ Does Warp support multi-GPU programming?
Yes! Since version ``0.4.0`` we support allocating, launching, and
copying between multiple GPUs in a single process. We follow the naming
conventions of PyTorch and use aliases such as ``cuda:0``, ``cuda:1``,
``cpu`` to identify individual devices.
``cpu`` to identify individual devices. For more information, see the
:doc:`modules/devices` documentation.

Warp applications can also be parallelized over multiple GPUs using
`mpi4py <https://github.com/mpi4py/mpi4py>`__. Warp arrays on the GPU may be
passed directly to MPI calls if mpi4py is built against a CUDA-aware MPI installation.
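
As a minimal sketch of single-process multi-GPU usage (assuming at least two CUDA devices are present)::

    import warp as wp

    # allocate on two GPUs in the same process using the PyTorch-style aliases
    a0 = wp.zeros(1024, dtype=float, device="cuda:0")
    a1 = wp.zeros(1024, dtype=float, device="cuda:1")

    # copy data directly between the devices
    wp.copy(a1, a0)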

Should I switch to Warp over IsaacGym/PhysX?
----------------------------------------------
1 change: 1 addition & 0 deletions docs/modules/allocators.rst
@@ -177,6 +177,7 @@ Threshold values between 0 and 1 are interpreted as fractions of available memor

This is a simple optimization that can improve the performance of programs without modifying the existing code in any way.
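
A minimal sketch of adjusting the threshold (assuming a CUDA device ``cuda:0``; the value ``0.5`` is an arbitrary example)::

    import warp as wp

    wp.init()

    # retain up to half of the device's memory in the pool before releasing it
    wp.set_mempool_release_threshold("cuda:0", 0.5)

    print(wp.get_mempool_release_threshold("cuda:0"))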

.. autofunction:: warp.get_mempool_release_threshold
.. autofunction:: warp.set_mempool_release_threshold

Graph Allocations
2 changes: 1 addition & 1 deletion docs/modules/concurrency.rst
@@ -554,7 +554,7 @@ Stream synchronization can be a tricky business, even for experienced CUDA devel
wp.launch(kernel, dim=a.size, inputs=[a], stream=s)
This snippet has a stream synchronization problem that is difficult to detect at first glance.
It's quite possible that the code will work just fine, but it introduces undefined behaviour,
It's quite possible that the code will work just fine, but it introduces undefined behavior,
which may lead to incorrect results that manifest only once in a while. The issue is that the kernel is launched
on stream ``s``, which is different than the stream used for creating array ``a``. The array is allocated and
initialized on the current stream of device ``cuda:0``, which means that it might not be ready when stream ``s``
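
One possible way to avoid this kind of problem (a sketch that reuses the ``kernel`` and stream names from the snippet above) is to perform the allocation and the launch on the same stream::

    s = wp.Stream("cuda:0")

    with wp.ScopedStream(s):
        # the allocation/initialization and the launch are now ordered on the
        # same stream, so the kernel sees the initialized contents of a
        a = wp.zeros(n, dtype=float)
        wp.launch(kernel, dim=a.size, inputs=[a])
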
46 changes: 34 additions & 12 deletions docs/modules/devices.rst
@@ -24,8 +24,12 @@ It is possible to explicitly target a specific device with each Warp API call us

.. autoclass:: warp.context.Device
:members:
.. autofunction:: set_device
.. autofunction:: get_device

Warp also provides functions that can be used to query the available devices on the system:

.. autofunction:: get_devices
.. autofunction:: get_cuda_devices
.. autofunction:: get_cuda_device_count
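
For example, to see what Warp detects on the current system (output varies by machine)::

    import warp as wp

    print(wp.get_devices())            # all registered devices, CPU and GPUs
    print(wp.get_cuda_devices())       # CUDA devices only
    print(wp.get_cuda_device_count())  # number of CUDA devices available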

Default Device
--------------
Expand All @@ -36,6 +40,8 @@ Calling :func:`wp.get_device() <warp.get_device>` without an argument
will return an instance of :class:`warp.context.Device` for the default device.

During Warp initialization, the default device is set to ``"cuda:0"`` if CUDA is available. Otherwise, the default device is ``"cpu"``.
If the default device is changed, :func:`wp.get_preferred_device() <warp.get_preferred_device>` can be used to get
the *original* default device.

:func:`wp.set_device() <warp.set_device>` can be used to change the default device::

@@ -53,9 +59,10 @@

.. note::

For CUDA devices, ``wp.set_device()`` does two things: it sets the Warp default device and it makes the device's CUDA context current. This helps to minimize the number of CUDA context switches in blocks of code targeting a single device.
For CUDA devices, :func:`wp.set_device() <warp.set_device>` does two things: It sets the Warp default device and it makes the device's CUDA context current. This helps to minimize the number of CUDA context switches in blocks of code targeting a single device.

For PyTorch users, this function is similar to ``torch.cuda.set_device()``. It is still possible to specify a different device in individual API calls, like in this snippet::
For PyTorch users, this function is similar to :func:`torch.cuda.set_device()`.
It is still possible to specify a different device in individual API calls, like in this snippet::

# set default device
wp.set_device("cuda:0")
@@ -73,10 +80,15 @@ For PyTorch users, this function is similar to ``torch.cuda.set_device()``. It
wp.copy(b, a)
wp.copy(c, a)

.. autofunction:: set_device
.. autofunction:: get_device
.. autofunction:: get_preferred_device

Scoped Devices
--------------

Another way to manage the default device is using ``wp.ScopedDevice`` objects. They can be arbitrarily nested and restore the previous default device on exit::
Another way to manage the default device is using :class:`wp.ScopedDevice <ScopedDevice>` objects.
They can be arbitrarily nested and restore the previous default device on exit::

with wp.ScopedDevice("cpu"):
# alloc and launch on "cpu"
@@ -95,9 +107,7 @@ Another way to manage the default device is using ``wp.ScopedDevice`` objects.
# launch on "cuda:0"
wp.launch(kernel, dim=b.size, inputs=[b])

.. note::

For CUDA devices, ``wp.ScopedDevice`` makes the device's CUDA context current and restores the previous CUDA context on exit. This is handy when running Warp scripts as part of a bigger pipeline, because it avoids any side effects of changing the CUDA context in the enclosed code.
.. autoclass:: ScopedDevice

Example: Using ``wp.ScopedDevice`` with multiple GPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -176,11 +186,17 @@ In this snippet, we use PyTorch to manage the current CUDA device and invoke a W
Device Synchronization
----------------------

CUDA kernel launches and memory operations can execute asynchronously. This allows for overlapping compute and memory operations on different devices. Warp allows synchronizing the host with outstanding asynchronous operations on a specific device::
CUDA kernel launches and memory operations can execute asynchronously.
This allows for overlapping compute and memory operations on different devices.
Warp allows synchronizing the host with outstanding asynchronous operations on a specific device::

wp.synchronize_device("cuda:1")

The ``wp.synchronize_device()`` function offers more fine-grained synchronization than ``wp.synchronize()``, as the latter waits for *all* devices to complete their work.
:func:`wp.synchronize_device() <synchronize_device>` offers more fine-grained synchronization than
:func:`wp.synchronize() <synchronize>`, as the latter waits for *all* devices to complete their work.

.. autofunction:: synchronize_device
.. autofunction:: synchronize

Custom CUDA Contexts
--------------------
@@ -193,7 +209,9 @@ Applications built on the CUDA Driver API work with CUDA contexts directly and c

The special device alias ``"cuda"`` can be used to target the current CUDA context, whether this is a primary or custom context.

In addition, Warp allows registering new device aliases for custom CUDA contexts, so that they can be explicitly targeted by name. If the ``CUcontext`` pointer is available, it can be used to create a new device alias like this::
In addition, Warp allows registering new device aliases for custom CUDA contexts using
:func:`wp.map_cuda_device() <map_cuda_device>` so that they can be explicitly targeted by name.
If the ``CUcontext`` pointer is available, it can be used to create a new device alias like this::

wp.map_cuda_device("foo", ctypes.c_void_p(context_ptr))

@@ -207,6 +225,9 @@ In either case, mapping the custom CUDA context allows us to target the context
a = wp.zeros(n)
wp.launch(kernel, dim=a.size, inputs=[a])

.. autofunction:: map_cuda_device
.. autofunction:: unmap_cuda_device

.. _peer_access:

CUDA Peer Access
@@ -253,7 +274,8 @@ It's possible to temporarily enable or disable peer access using a scoped manage
.. note::

Peer access does not accelerate memory transfers between arrays allocated using the :ref:`stream-ordered memory pool allocators<mempool_allocators>` introduced in Warp 0.14.0. To accelerate memory pool transfers, :ref:`memory pool access<mempool_access>` should be enabled instead.
Peer access does not accelerate memory transfers between arrays allocated using the :ref:`stream-ordered memory pool allocators<mempool_allocators>` introduced in Warp 0.14.0.
To accelerate memory pool transfers, :ref:`memory pool access<mempool_access>` should be enabled instead.

.. autofunction:: warp.is_peer_access_supported
.. autofunction:: warp.is_peer_access_enabled
15 changes: 8 additions & 7 deletions docs/modules/differentiability.rst
@@ -8,11 +8,12 @@ Differentiability
By default, Warp generates a forward and backward (adjoint) version of each kernel definition. The backward version of a kernel can be used
to compute gradients of loss functions that can be back propagated to machine learning frameworks like PyTorch.

Arrays that participate in the chain of computation which require gradients should be created with ``requires_grad=True``, for example::
Arrays that participate in the chain of computation which require gradients must be created with ``requires_grad=True``, for example::

a = wp.zeros(1024, dtype=wp.vec3, device="cuda", requires_grad=True)

The ``wp.Tape`` class can then be used to record kernel launches, and replay them to compute the gradient of a scalar loss function with respect to the kernel inputs::
The :class:`wp.Tape <Tape>` class can then be used to record kernel launches and replay them to compute the gradient of
a scalar loss function with respect to the kernel inputs::

tape = wp.Tape()

@@ -25,22 +26,22 @@ The ``wp.Tape`` class can then be used to record kernel launches, and replay the
# reverse pass
tape.backward(l)

After the backward pass has completed, the gradients with respect to the inputs are available from the ``array.grad`` attribute::
After the backward pass has completed, the gradients with respect to the inputs are available from the :py:attr:`array.grad` attribute::

# gradient of loss with respect to input a
print(a.grad)

Note that gradients are accumulated on the participating buffers, so if you wish to reuse the same buffers for multiple backward passes you should first zero the gradients::

tape.zero()
Note that gradients are accumulated on the participating buffers, so if you wish to reuse the same buffers for multiple
backward passes you should first zero the gradients using :meth:`Tape.zero()`.
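
A minimal sketch of this pattern (``compute_loss``, ``a``, and ``l`` are assumed to be defined as in the example above)::

    for i in range(100):
        tape = wp.Tape()
        with tape:
            wp.launch(compute_loss, dim=1024, inputs=[a, l])

        # reverse pass accumulates gradients into a.grad
        tape.backward(l)

        # ... apply a gradient step to a using a.grad ...

        # zero the accumulated gradients before the next backward pass
        tape.zero()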

.. autoclass:: Tape
:members:

Copying is Differentiable
#########################

``wp.copy()``, ``wp.clone()``, and ``array.assign()`` are differentiable functions and can participate in the computation graph recorded on the tape. Consider the following examples and their
:func:`wp.copy() <copy>`, :func:`wp.clone() <clone>`, and :meth:`array.assign()` are differentiable functions and can
participate in the computation graph recorded on the tape. Consider the following examples and their
PyTorch equivalents (for comparison):

``wp.copy()``: