Commit

Merge branch 'dcp_async_save' of github.com:pytorch/tutorials into dcp_async_save
LucasLLC committed Jul 18, 2024
2 parents 51a9b61 + 0a483b1 commit 6acfa55
Showing 5 changed files with 13 additions and 9 deletions.
2 changes: 1 addition & 1 deletion advanced_source/cpp_export.rst
@@ -203,7 +203,7 @@ minimal ``CMakeLists.txt`` to build it could look as simple as:
 add_executable(example-app example-app.cpp)
 target_link_libraries(example-app "${TORCH_LIBRARIES}")
-set_property(TARGET example-app PROPERTY CXX_STANDARD 14)
+set_property(TARGET example-app PROPERTY CXX_STANDARD 17)
 The last thing we need to build the example application is the LibTorch
 distribution. You can always grab the latest stable release from the `download
2 changes: 1 addition & 1 deletion advanced_source/super_resolution_with_onnxruntime.py
@@ -9,7 +9,7 @@
 * ``torch.onnx.export`` is based on TorchScript backend and has been available since PyTorch 1.2.0.
 In this tutorial, we describe how to convert a model defined
-in PyTorch into the ONNX format using the TorchScript ``torch.onnx.export` ONNX exporter.
+in PyTorch into the ONNX format using the TorchScript ``torch.onnx.export`` ONNX exporter.
 The exported model will be executed with ONNX Runtime.
 ONNX Runtime is a performance-focused engine for ONNX models,
4 changes: 2 additions & 2 deletions intermediate_source/inductor_debug_cpu.py
@@ -87,9 +87,9 @@ def neg1(x):
 # +-----------------------------+----------------------------------------------------------------+
 # | ``fx_graph_transformed.py`` | Transformed FX graph, after pattern match                      |
 # +-----------------------------+----------------------------------------------------------------+
-# | ``ir_post_fusion.txt``      | Inductor IR before fusion                                      |
+# | ``ir_pre_fusion.txt``       | Inductor IR before fusion                                      |
 # +-----------------------------+----------------------------------------------------------------+
-# | ``ir_pre_fusion.txt``       | Inductor IR after fusion                                       |
+# | ``ir_post_fusion.txt``      | Inductor IR after fusion                                       |
 # +-----------------------------+----------------------------------------------------------------+
 # | ``output_code.py``          | Generated Python code for graph, with C++/Triton kernels       |
 # +-----------------------------+----------------------------------------------------------------+
12 changes: 8 additions & 4 deletions recipes_source/distributed_async_checkpoint_recipe.rst
@@ -1,12 +1,13 @@
 Asynchronous Saving with Distributed Checkpoint (DCP)
 =====================================================

+**Author:** `Lucas Pasqualin <https://github.com/lucasllc>`__, `Iris Zhang <https://github.com/wz337>`__, `Rodrigo Kumpera <https://github.com/kumpera>`__, `Chien-Chin Huang <https://github.com/fegin>`__
+
 Checkpointing is often a bottle-neck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow.
 One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example
 from the `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
 to show how this can be integrated quite easily with ``torch.distributed.checkpoint.async_save``.

-**Author**: , `Lucas Pasqualin <https://github.com/lucasllc>`__, `Iris Zhang <https://github.com/wz337>`__, `Rodrigo Kumpera <https://github.com/kumpera>`__, `Chien-Chin Huang <https://github.com/fegin>`__
-
 .. grid:: 2

@@ -156,9 +157,12 @@ If the above optimization is still not performant enough, you can take advantage
 Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaining a pinned memory buffer between
 checkpoint requests users can take advantage of direct memory access to speed up this copy.

-.. note:: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
-   any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps, leading to the same
-   peak memory pressure being sustained through the application life.
+.. note::
+   The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without
+   the pinned memory optimization (as demonstrated above), any checkpointing buffers are released as soon as
+   checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps,
+   leading to the same peak memory pressure being sustained through the application life.


.. code-block:: python
2 changes: 1 addition & 1 deletion recipes_source/distributed_device_mesh.rst
@@ -156,4 +156,4 @@ they can be used to describe the layout of devices across the cluster.
 For more information, please see the following:

 - `2D parallel combining Tensor/Sequance Parallel with FSDP <https://github.com/pytorch/examples/blob/main/distributed/tensor_parallelism/fsdp_tp_example.py>`__
-- `Composable PyTorch Distributed with PT2 <chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://static.sched.com/hosted_files/pytorch2023/d1/%5BPTC%2023%5D%20Composable%20PyTorch%20Distributed%20with%20PT2.pdf>`__
+- `Composable PyTorch Distributed with PT2 <https://static.sched.com/hosted_files/pytorch2023/d1/%5BPTC%2023%5D%20Composable%20PyTorch%20Distributed%20with%20PT2.pdf>`__
