fixed [Docs] Spot/interruptible docs imply retries come from the user… Closes #3956 #5938

Merged
merged 3 commits into from
Oct 31, 2024
33 changes: 33 additions & 0 deletions docs/user_guide/concepts/main_concepts/tasks.rst
@@ -123,3 +123,36 @@ Caching/Memoization

Flyte supports memoization of task outputs to ensure that identical invocations of a task are not executed repeatedly, thereby saving compute resources and execution time. For example, if you wish to run the same piece of code multiple times, you can reuse the output instead of re-computing it.
For more information on memoization, refer to the :std:doc:`/user_guide/development_lifecycle/caching`.

### Retries and Spot Instances

Tasks can define a retry strategy to handle different types of failures:

1. **System Retries**: Used for infrastructure-level failures outside of user control:

   - Spot instance preemptions
   - Network issues
   - Service unavailability
   - Hardware failures

   *Important*: When running on spot/interruptible instances, preemptions count against the system retry budget, not the user retry budget. The last system retry attempt automatically runs on a non-preemptible instance, so that attempt cannot be lost to another preemption.

2. **User Retries**: Specified in the ``@task`` decorator (via the ``retries`` parameter), used for:

   - Application-level errors
   - Invalid input handling
   - Business logic failures

.. code-block:: python

   from flytekit import task

   @task(retries=3)  # Sets user retry budget to 3
   def my_task() -> None:
       ...

### Alternative Retry Behavior

Starting with 1.10.0, Flyte offers a simplified retry behavior in which system and user retries both count towards a single retry budget defined in the task decorator. To enable it:

1. Set ``configmap.core.propeller.node-config.ignore-retry-cause`` to ``true`` in the Helm values.
2. Define ``retries`` in the ``@task`` decorator to set the total retry budget.

With this setting, the last retries automatically run on non-spot instances.

This leaves a single budget to reason about, regardless of what caused a failure.
41 changes: 41 additions & 0 deletions docs/user_guide/flyte_fundamentals/optimizing_tasks.md
@@ -273,6 +273,47 @@ the resources that you need. In this case, that need is distributed
training, but Flyte also provides integrations for {ref}`Spark <plugins-spark-k8s>`,
{ref}`Ray <kube-ray-op>`, {ref}`MPI <kf-mpi-op>`, {ref}`Snowflake <snowflake_agent>`, and more.

## Retries and Spot Instances

When running tasks on spot/interruptible instances, it's important to understand how retries work:

```python
from flytekit import task

@task(
    retries=3,  # User retry budget
    interruptible=True,  # Enables running on spot instances
)
def my_task() -> None:
    ...
```

### Default Retry Behavior
- Spot instance preemptions count against the system retry budget (not user retries)
- The last system retry automatically runs on a non-preemptible instance
- User retries (specified in `@task` decorator) are only used for application errors
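
The bullets above amount to a small accounting rule. The sketch below is a toy model of that rule, not Flyte's scheduling code: `RetryBudgets`, `record_failure`, `next_attempt_is_preemptible`, and the `system_retries=3` value are hypothetical names and numbers chosen for illustration. Only `user_retries` corresponds to the `retries` argument of `@task`.

```python
from dataclasses import dataclass


@dataclass
class RetryBudgets:
    """Toy model of the default behavior: two independent retry budgets."""

    user_retries: int    # mirrors `retries` in the @task decorator
    system_retries: int  # platform-side budget; the value used below is an assumption
    user_used: int = 0
    system_used: int = 0

    def record_failure(self, kind: str) -> bool:
        """Charge one failure to its budget; return True if a retry is still allowed."""
        if kind == "system":  # e.g. a spot preemption
            self.system_used += 1
            return self.system_used <= self.system_retries
        if kind == "user":  # e.g. an exception raised by task code
            self.user_used += 1
            return self.user_used <= self.user_retries
        raise ValueError(f"unknown failure kind: {kind!r}")

    def next_attempt_is_preemptible(self) -> bool:
        """Model the last system retry as running on a non-preemptible node."""
        return self.system_used < self.system_retries


budgets = RetryBudgets(user_retries=3, system_retries=3)
for _ in range(3):
    assert budgets.record_failure("system")  # three preemptions, retry still allowed

print(budgets.user_used)                      # 0: the user budget is untouched
print(budgets.next_attempt_is_preemptible())  # False: the final system retry leaves spot
```

In this model a fourth preemption would exhaust the system budget while the user budget is still full, which is the distinction the default behavior draws.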

### Simplified Retry Behavior
Flyte also offers a simplified retry model where both system and user retries count towards a single budget:

```python
from flytekit import task

@task(
    retries=5,  # Total retry budget for both system and user errors
    interruptible=True,
)
def my_task() -> None:
    ...
```

To enable this behavior:
1. Set `configmap.core.propeller.node-config.ignore-retry-cause=true` in the platform configuration.
2. Define the total retry budget with `retries` in the `@task` decorator.

Once enabled, the last retries automatically run on non-spot instances.
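
The dotted setting in step 1 is a path into nested configuration. Here is a minimal sketch of how that path expands, assuming the common convention that each dot separates one nesting level. `expand_dotted_key` is a hypothetical helper written for this example and is not part of Flyte or Helm.

```python
import json


def expand_dotted_key(dotted_key: str, value: object) -> dict:
    """Turn 'a.b.c' plus a value into {'a': {'b': {'c': value}}}."""
    nested = value
    # Wrap the value from the innermost key outwards.
    for part in reversed(dotted_key.split(".")):
        nested = {part: nested}
    return nested


values = expand_dotted_key(
    "configmap.core.propeller.node-config.ignore-retry-cause", True
)
print(json.dumps(values, indent=2))
```

The printed structure shows the nesting to look for in a values file. Check the chart you deploy for the exact location of these keys.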

Choose the retry model that best fits your use case:
- Default: separate budgets for system and user errors
- Simplified: a single retry budget shared by system and user errors, with the last retries running on non-spot instances
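
To make the difference between the two models concrete, the sketch below replays the same failure history under each one. It is an illustrative model only: `can_still_retry` and the `system_retries=3` value are assumptions made for this example, and the `ignore_retry_cause` flag simply stands in for the platform setting of the same name.

```python
from typing import Sequence


def can_still_retry(
    failures: Sequence[str],
    retries: int,
    system_retries: int,
    ignore_retry_cause: bool,
) -> bool:
    """Return True if every failure in `failures` can be followed by a retry."""
    user_failures = sum(1 for kind in failures if kind == "user")
    system_failures = sum(1 for kind in failures if kind == "system")
    if ignore_retry_cause:
        # Simplified model: every failure draws from the single `retries` budget,
        # so `system_retries` plays no role here.
        return user_failures + system_failures <= retries
    # Default model: each kind of failure draws from its own budget.
    return user_failures <= retries and system_failures <= system_retries


# Two preemptions followed by two application errors.
history = ["system", "system", "user", "user"]

print(can_still_retry(history, retries=3, system_retries=3, ignore_retry_cause=False))  # True
print(can_still_retry(history, retries=3, system_retries=3, ignore_retry_cause=True))   # False
print(can_still_retry(history, retries=5, system_retries=3, ignore_retry_cause=True))   # True
```

Under the simplified model, `retries` has to cover preemptions as well as application errors, so a task that runs on spot instances will usually want a larger value than it would under the default model.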

Even though Flyte itself is a powerful compute engine and orchestrator for
data engineering, machine learning, and analytics, perhaps you have existing
code that leverages other platforms. Flyte recognizes the pain of migrating code,