From 057232755b52422f4f0c8040154c90ede33ade2d Mon Sep 17 00:00:00 2001
From: Aravinda Kumar <76619616+surprisedPikachu007@users.noreply.github.com>
Date: Thu, 11 Jul 2024 01:29:18 +0530
Subject: [PATCH 1/7] Update distributed_device_mesh.rst (#2965)

fixed a typo in the link
---
 recipes_source/distributed_device_mesh.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/recipes_source/distributed_device_mesh.rst b/recipes_source/distributed_device_mesh.rst
index dbc4a81043..d41d6c1df1 100644
--- a/recipes_source/distributed_device_mesh.rst
+++ b/recipes_source/distributed_device_mesh.rst
@@ -156,4 +156,4 @@ they can be used to describe the layout of devices across the cluster.
 For more information, please see the following:
 
 - `2D parallel combining Tensor/Sequance Parallel with FSDP `__
-- `Composable PyTorch Distributed with PT2 `__
+- `Composable PyTorch Distributed with PT2 `__

From 25ea481f26589f6259e9409b1487581c4bde7e00 Mon Sep 17 00:00:00 2001
From: Lucas Pasqualin
Date: Wed, 10 Jul 2024 18:35:11 -0400
Subject: [PATCH 2/7] Update recipes_source/distributed_async_checkpoint_recipe.rst

Co-authored-by: Svetlana Karslioglu
---
 recipes_source/distributed_async_checkpoint_recipe.rst | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/recipes_source/distributed_async_checkpoint_recipe.rst b/recipes_source/distributed_async_checkpoint_recipe.rst
index 7d81a53c37..a8a7d35de6 100644
--- a/recipes_source/distributed_async_checkpoint_recipe.rst
+++ b/recipes_source/distributed_async_checkpoint_recipe.rst
@@ -156,9 +156,12 @@ If the above optimization is still not performant enough, you can take advantage
 Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers.
 By maintaining a pinned memory buffer between checkpoint requests users can take advantage of direct memory access to speed up this copy.
 
-.. note:: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
-any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps, leading to the same
-peak memory pressure being sustained through the application life.
+.. note::
+   The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without
+   the pinned memory optimization (as demonstrated above), any checkpointing buffers are released as soon as
+   checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps,
+   leading to the same
+   peak memory pressure being sustained through the application life.
 
 .. code-block:: python

From e6b3ac2e964f76350cf2422f7c12fea393112952 Mon Sep 17 00:00:00 2001
From: Lucas Pasqualin
Date: Wed, 10 Jul 2024 18:35:28 -0400
Subject: [PATCH 3/7] Update recipes_source/distributed_async_checkpoint_recipe.rst

Co-authored-by: Svetlana Karslioglu
---
 recipes_source/distributed_async_checkpoint_recipe.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/recipes_source/distributed_async_checkpoint_recipe.rst b/recipes_source/distributed_async_checkpoint_recipe.rst
index a8a7d35de6..11e7dadeb6 100644
--- a/recipes_source/distributed_async_checkpoint_recipe.rst
+++ b/recipes_source/distributed_async_checkpoint_recipe.rst
@@ -1,6 +1,8 @@
 Asynchronous Saving with Distributed Checkpoint (DCP)
 =====================================================
 
+**Author:** `Lucas Pasqualin `__, `Iris Zhang `__, `Rodrigo Kumpera `__, `Chien-Chin Huang `__
+
 Checkpointing is often a bottle-neck in the critical path for distributed training workloads, incurring larger and larger costs as both model
 and world sizes grow. One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously.
 Below, we expand the save example
 from the `Getting Started with Distributed Checkpoint Tutorial `__

From f4ec793acaf0349d9a543beebaf2a6bbde012696 Mon Sep 17 00:00:00 2001
From: Lucas Pasqualin
Date: Wed, 10 Jul 2024 18:35:41 -0400
Subject: [PATCH 4/7] Update recipes_source/distributed_async_checkpoint_recipe.rst

Co-authored-by: Svetlana Karslioglu
---
 recipes_source/distributed_async_checkpoint_recipe.rst | 1 -
 1 file changed, 1 deletion(-)

diff --git a/recipes_source/distributed_async_checkpoint_recipe.rst b/recipes_source/distributed_async_checkpoint_recipe.rst
index 11e7dadeb6..712e7dce42 100644
--- a/recipes_source/distributed_async_checkpoint_recipe.rst
+++ b/recipes_source/distributed_async_checkpoint_recipe.rst
@@ -8,7 +8,6 @@ One excellent strategy for offsetting this cost is to checkpoint in parallel, as
 from the `Getting Started with Distributed Checkpoint Tutorial `__
 to show how this can be integrated quite easily with ``torch.distributed.checkpoint.async_save``.
 
-**Author**: , `Lucas Pasqualin `__, `Iris Zhang `__, `Rodrigo Kumpera `__, `Chien-Chin Huang `__
 
 .. grid:: 2

From c32ce5883b3b9f67f0f345325c69436ece8446bb Mon Sep 17 00:00:00 2001
From: ZincCat <52513999+zinccat@users.noreply.github.com>
Date: Mon, 15 Jul 2024 09:43:50 -0700
Subject: [PATCH 5/7] Update cpp_export.rst (#2970)

Updated specified c++ version from 14 to 17
---
 advanced_source/cpp_export.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/advanced_source/cpp_export.rst b/advanced_source/cpp_export.rst
index 5dedbdaaa6..45556a5320 100644
--- a/advanced_source/cpp_export.rst
+++ b/advanced_source/cpp_export.rst
@@ -203,7 +203,7 @@ minimal ``CMakeLists.txt`` to build it could look as simple as:
   add_executable(example-app example-app.cpp)
   target_link_libraries(example-app "${TORCH_LIBRARIES}")
-  set_property(TARGET example-app PROPERTY CXX_STANDARD 14)
+  set_property(TARGET example-app PROPERTY CXX_STANDARD 17)
 
 The last thing we need to build the example application is the LibTorch
 distribution. You can always grab the latest stable release from the `download

From 5efa2e52aafdd94ef9ae6fbfa8c63fe888a15374 Mon Sep 17 00:00:00 2001
From: Bas Krahmer
Date: Wed, 17 Jul 2024 17:22:21 +0200
Subject: [PATCH 6/7] Typo (#2974)

Typo
---
 advanced_source/super_resolution_with_onnxruntime.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/advanced_source/super_resolution_with_onnxruntime.py b/advanced_source/super_resolution_with_onnxruntime.py
index ecb0ba4fe4..264678ee17 100644
--- a/advanced_source/super_resolution_with_onnxruntime.py
+++ b/advanced_source/super_resolution_with_onnxruntime.py
@@ -9,7 +9,7 @@
 * ``torch.onnx.export`` is based on TorchScript backend and has been available since PyTorch 1.2.0.
 
 In this tutorial, we describe how to convert a model defined
-in PyTorch into the ONNX format using the TorchScript ``torch.onnx.export` ONNX exporter.
+in PyTorch into the ONNX format using the TorchScript ``torch.onnx.export`` ONNX exporter.
 The exported model will be executed with ONNX Runtime.
 ONNX Runtime is a performance-focused engine for ONNX models,

From 2f2db747605e73e168d614c4f3a680ab6a286f78 Mon Sep 17 00:00:00 2001
From: Haechan An <48047392+AnHaechan@users.noreply.github.com>
Date: Thu, 18 Jul 2024 00:24:08 +0900
Subject: [PATCH 7/7] FIX: typo in inductor_debug_cpu.py (#2938)

Co-authored-by: Svetlana Karslioglu
---
 intermediate_source/inductor_debug_cpu.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/intermediate_source/inductor_debug_cpu.py b/intermediate_source/inductor_debug_cpu.py
index 94dee3ba15..370180d968 100644
--- a/intermediate_source/inductor_debug_cpu.py
+++ b/intermediate_source/inductor_debug_cpu.py
@@ -87,9 +87,9 @@ def neg1(x):
 # +-----------------------------+----------------------------------------------------------------+
 # | ``fx_graph_transformed.py`` | Transformed FX graph, after pattern match                      |
 # +-----------------------------+----------------------------------------------------------------+
-# | ``ir_post_fusion.txt``      | Inductor IR before fusion                                      |
+# | ``ir_pre_fusion.txt``       | Inductor IR before fusion                                      |
 # +-----------------------------+----------------------------------------------------------------+
-# | ``ir_pre_fusion.txt``       | Inductor IR after fusion                                       |
+# | ``ir_post_fusion.txt``      | Inductor IR after fusion                                       |
 # +-----------------------------+----------------------------------------------------------------+
 # | ``output_code.py``          | Generated Python code for graph, with C++/Triton kernels       |
 # +-----------------------------+----------------------------------------------------------------+
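The seven commits above are in ``git format-patch`` (mbox) form, one mail per patch. A minimal round-trip sketch of how such a series is produced and re-applied — a throwaway repository with illustrative names, not the actual pytorch/tutorials checkout:

```shell
# Produce and re-apply a format-patch series like the one above,
# in a throwaway repository (repo path, file name, and identity are illustrative).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com
git commit -q --allow-empty -m "base"
echo "fixed a typo in the link" > distributed_device_mesh.rst
git add distributed_device_mesh.rst
git commit -q -m "Update distributed_device_mesh.rst"
git format-patch -1 --stdout > series.mbox   # emits the same mbox layout seen above
git reset -q --hard HEAD~1                   # drop the commit again...
git am -q series.mbox                        # ...and re-create it from the mbox
git log --oneline
```

``git am`` preserves the ``From:``/``Date:``/``Subject:`` headers of each mail as author metadata, which is why the series above carries them per patch.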