flyteorg · davidmirror-ops · Oct 31, 2024 · Oct 29, 2024 · Oct 30, 2024 · Oct 30, 2024
@@ -123,3 +123,36 @@ Caching/Memoization
 
 Flyte supports memoization of task outputs to ensure that identical invocations of a task are not executed repeatedly, thereby saving compute resources and execution time. For example, if you wish to run the same piece of code multiple times, you can reuse the output instead of re-computing it.
 For more information on memoization, refer to the :std:doc:`/user_guide/development_lifecycle/caching`.
+
+### Retries and Spot Instances
+
+Tasks can define a retry strategy to handle different types of failures:
+
+1. **System Retries**: Used for infrastructure-level failures outside of user control:
+
+   - Spot instance preemptions
+   - Network issues
+   - Service unavailability
+   - Hardware failures
+
+   *Important*: When running on spot/interruptible instances, preemptions count against the system retry budget, not the user retry budget. The last system retry attempt automatically runs on a non-preemptible instance to avoid a further preemption.
+
+2. **User Retries**: Specified in the ``@task`` decorator (via the ``retries`` parameter), used for:
+
+   - Application-level errors
+   - Invalid input handling
+   - Business logic failures
+
+.. code-block:: python
+
+   from flytekit import task
+
+   @task(retries=3)  # Sets user retry budget to 3
+   def my_task() -> None:
+       ...
+
+### Alternative Retry Behavior
+
+Starting with version 1.10.0, Flyte offers a simplified retry behavior in which system and user retries count towards a single retry budget defined in the task decorator. To enable this:
+
+1. Set ``configmap.core.propeller.node-config.ignore-retry-cause`` to ``true`` in the Helm values.
+2. Define ``retries`` in the ``@task`` decorator to set the total retry budget.
+
+With this setting, the last retries automatically run on non-spot instances.
+
+This provides a simpler, more predictable retry behavior while maintaining reliability.
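
A minimal sketch of how the two retry models described in this hunk might account for failures. It is a plain-Python illustration, not Flyte or flytekit code: `can_retry`, the `SYSTEM`/`USER` labels, and the `system_retries` argument are hypothetical names, and the size of the system retry budget is assumed to be a platform-level value that the text above does not specify.

```python
from typing import Sequence

# Hypothetical failure labels for this sketch; not Flyte identifiers.
SYSTEM = "system"  # e.g. a spot instance preemption
USER = "user"      # e.g. an application-level error


def can_retry(
    failures: Sequence[str],
    retries: int,
    system_retries: int,
    ignore_retry_cause: bool = False,
) -> bool:
    """Return True if another attempt is still allowed after `failures`.

    retries            -- value passed as retries= in the @task decorator
    system_retries     -- assumed platform-level system retry budget
    ignore_retry_cause -- models the ignore-retry-cause setting
    """
    user_failures = failures.count(USER)
    system_failures = failures.count(SYSTEM)

    if ignore_retry_cause:
        # Simplified model: every failure draws from the single task budget.
        return user_failures + system_failures <= retries

    # Default model: each failure type draws from its own budget.
    return user_failures <= retries and system_failures <= system_retries


history = [SYSTEM, SYSTEM, USER]  # two preemptions, then one application error
print(can_retry(history, retries=1, system_retries=3))                           # True
print(can_retry(history, retries=1, system_retries=3, ignore_retry_cause=True))  # False
```

With separate budgets, the two preemptions do not consume the single user retry, so the task may still be retried. Under the simplified model, the same three failures exceed the budget of one.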
@@ -273,6 +273,47 @@ the resources that you need. In this case, that need is distributed
 training, but Flyte also provides integrations for {ref}`Spark <plugins-spark-k8s>`,
 {ref}`Ray <kube-ray-op>`, {ref}`MPI <kf-mpi-op>`, {ref}`Snowflake <snowflake_agent>`, and more.
 
+## Retries and Spot Instances
+
+When running tasks on spot/interruptible instances, it's important to understand how retries work:
+
+```python
+from flytekit import task
+
+@task(
+    retries=3,               # User retry budget
+    interruptible=True       # Enables running on spot instances
+)
+def my_task() -> None:
+    ...
+```
+
+### Default Retry Behavior
+- Spot instance preemptions count against the system retry budget (not user retries)
+- The last system retry automatically runs on a non-preemptible instance
+- User retries (specified in the `@task` decorator) are used only for application errors
+
+### Simplified Retry Behavior
+Flyte also offers a simplified retry model where both system and user retries count towards a single budget:
+
+```python
+from flytekit import task
+
+@task(
+    retries=5,               # Total retry budget for both system and user errors
+    interruptible=True
+)
+def my_task() -> None:
+    ...
+```
+
+To enable this behavior:
+1. Set `configmap.core.propeller.node-config.ignore-retry-cause=true` in the platform configuration
+2. Define the total retry budget with `retries` in the `@task` decorator
+
+With this setting, the last retries automatically run on non-spot instances.
+
+Choose the retry model that best fits your use case:
+- Default: separate budgets for system and user errors
+- Simplified: a single retry budget for both, with the last retries running on non-spot instances
+
 Even though Flyte itself is a powerful compute engine and orchestrator for
 data engineering, machine learning, and analytics, perhaps you have existing
 code that leverages other platforms. Flyte recognizes the pain of migrating code,