Commit

[FLINK-34454][doc/annotation] Use claim mode instead of restore mode everywhere
Zakelly committed Feb 26, 2024
1 parent 16ba3f3 commit bdbf9ff
Showing 12 changed files with 32 additions and 28 deletions.
2 changes: 1 addition & 1 deletion docs/content.zh/docs/deployment/cli.md
@@ -315,7 +315,7 @@ $ ./bin/flink run \
```
This is useful if your program dropped an operator that was part of the savepoint.

-You can also select the [restore mode]({{< ref "docs/ops/state/savepoints" >}}#restore-mode)
+You can also select the [claim mode]({{< ref "docs/ops/state/savepoints" >}}#claim-mode)
which should be used for the savepoint. The mode controls who takes ownership of the files of
the specified savepoint.

16 changes: 8 additions & 8 deletions docs/content.zh/docs/ops/state/savepoints.md
@@ -182,16 +182,16 @@ $ bin/flink run -s :savepointPath [:runArgs]

默认情况下,resume 操作将尝试将 Savepoint 的所有状态映射回你要还原的程序。 如果删除了运算符,则可以通过 `--allowNonRestoredState`(short:`-n`)选项跳过无法映射到新程序的状态:

-#### Restore 模式
+#### Claim 模式

-`Restore 模式` 决定了在 restore 之后谁拥有Savepoint 或者 [externalized checkpoint]({{< ref "docs/ops/state/checkpoints" >}}/#resuming-from-a-retained-checkpoint)的文件的所有权。在这种语境下 Savepoint 和 externalized checkpoint 的行为相似。
+`Claim 模式` 决定了在 restore 之后谁拥有Savepoint 或者 [externalized checkpoint]({{< ref "docs/ops/state/checkpoints" >}}/#resuming-from-a-retained-checkpoint)的文件的所有权。在这种语境下 Savepoint 和 externalized checkpoint 的行为相似。
这里我们将它们都称为“快照”,除非另有明确说明。

-如前所述,restore 模式决定了谁来接管我们从中恢复的快照文件的所有权。快照可被用户或者 Flink 自身拥有。如果快照归用户所有,Flink 不会删除其中的文件,而且 Flink 不能依赖该快照中文件的存在,因为它可能在 Flink 的控制之外被删除。
+如前所述,claim 模式决定了谁来接管我们从中恢复的快照文件的所有权。快照可被用户或者 Flink 自身拥有。如果快照归用户所有,Flink 不会删除其中的文件,而且 Flink 不能依赖该快照中文件的存在,因为它可能在 Flink 的控制之外被删除。

-每种 restore 模式都有特定的用途。尽管如此,我们仍然认为默认的 *NO_CLAIM* 模式在大多数情况下是一个很好的折中方案,因为它在提供明确的所有权归属的同时只给恢复后第一个 checkpoint 带来较小的代价。
+每种 claim 模式都有特定的用途。尽管如此,我们仍然认为默认的 *NO_CLAIM* 模式在大多数情况下是一个很好的折中方案,因为它在提供明确的所有权归属的同时只给恢复后第一个 checkpoint 带来较小的代价。

-你可以通过如下方式指定 restore 模式:
+你可以通过如下方式指定 claim 模式:
```shell
$ bin/flink run -s :savepointPath -claimMode :mode -n [:runArgs]
```
@@ -205,15 +205,15 @@ $ bin/flink run -s :savepointPath -claimMode :mode -n [:runArgs]
一旦第一个全量的 checkpoint 完成后,所有后续的 checkpoint 会照常创建。所以,一旦一个 checkpoint 成功制作,就可以删除原快照。在此之前不能删除原快照,因为没有任何完成的 checkpoint,Flink 会在故障时尝试从初始的快照恢复。

<div style="text-align: center">
-{{< img src="/fig/restore-mode-no_claim.svg" alt="NO_CLAIM restore mode" width="70%" >}}
+{{< img src="/fig/restore-mode-no_claim.svg" alt="NO_CLAIM mode" width="70%" >}}
</div>

**CLAIM**

另一个可选的模式是 *CLAIM* 模式。该模式下 Flink 将声称拥有快照的所有权,并且本质上将其作为 checkpoint 对待:控制其生命周期并且可能会在其永远不会被用于恢复的时候删除它。因此,手动删除快照和从同一个快照上启动两个作业都是不安全的。Flink 会保持[配置数量]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}}/#state-checkpoints-num-retained)的 checkpoint。

<div style="text-align: center">
-{{< img src="/fig/restore-mode-claim.svg" alt="CLAIM restore mode" width="70%" >}}
+{{< img src="/fig/restore-mode-claim.svg" alt="CLAIM mode" width="70%" >}}
</div>

{{< hint info >}}
@@ -228,7 +228,7 @@ $ bin/flink run -s :savepointPath -claimMode :mode -n [:runArgs]
Legacy 模式是 Flink 在 1.15 之前的工作方式。该模式下 Flink 永远不会删除初始恢复的 checkpoint。同时,用户也不清楚是否可以删除它。导致该的问题原因是, Flink 会在用来恢复的 checkpoint 之上创建增量的 checkpoint,因此后续的 checkpoint 都有可能会依赖于用于恢复的那个 checkpoint。总而言之,恢复的 checkpoint 的所有权没有明确的界定。

<div style="text-align: center">
-{{< img src="/fig/restore-mode-legacy.svg" alt="LEGACY restore mode" width="70%" >}}
+{{< img src="/fig/restore-mode-legacy.svg" alt="LEGACY claim mode" width="70%" >}}
</div>

{{< hint warning >}}
2 changes: 1 addition & 1 deletion docs/content/docs/deployment/cli.md
@@ -313,7 +313,7 @@ $ ./bin/flink run \
```
This is useful if your program dropped an operator that was part of the savepoint.

-You can also select the [restore mode]({{< ref "docs/ops/state/savepoints" >}}#restore-mode)
+You can also select the [claim mode]({{< ref "docs/ops/state/savepoints" >}}#claim-mode)
which should be used for the savepoint. The mode controls who takes ownership of the files of
the specified savepoint.
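
Reviewer note: the ownership rule this paragraph describes can be read as a small lookup from mode to owner. The sketch below is only an illustration of that reading, based on the savepoints documentation changed in this same commit. `ClaimModeSketch`, `Mode`, `Owner`, and both methods are hypothetical names, not Flink's `RestoreMode` API.

```java
public class ClaimModeSketch {
    /** Hypothetical mirror of the three mode names used in the docs. */
    enum Mode { CLAIM, NO_CLAIM, LEGACY }

    /** Who owns the snapshot files once the job has been restored. */
    enum Owner { FLINK, USER, UNDEFINED }

    /** Maps a mode to the owner described in the savepoints docs. */
    static Owner ownerAfterRestore(Mode mode) {
        switch (mode) {
            case CLAIM:
                // Flink treats the snapshot like one of its own checkpoints.
                return Owner.FLINK;
            case NO_CLAIM:
                // The user keeps the snapshot; Flink does not delete its files.
                return Owner.USER;
            case LEGACY:
                // Pre-1.15 behaviour: ownership is not well-defined.
                return Owner.UNDEFINED;
            default:
                throw new IllegalArgumentException("Unknown snapshot claim mode");
        }
    }

    /** Flink may delete the restored snapshot only if it owns it. */
    static boolean flinkMayDeleteSnapshot(Mode mode) {
        return ownerAfterRestore(mode) == Owner.FLINK;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> owner=" + ownerAfterRestore(mode)
                    + ", flinkMayDelete=" + flinkMayDeleteSnapshot(mode));
        }
    }
}
```

Read this way, "claim mode" names what the option decides (who claims the files), which is the motivation for the rename.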

18 changes: 9 additions & 9 deletions docs/content/docs/ops/state/savepoints.md
@@ -210,20 +210,20 @@ This submits a job and specifies a savepoint to resume from. You may give a path

By default, the resume operation will try to map all state of the savepoint back to the program you are restoring with. If you dropped an operator, you can allow to skip state that cannot be mapped to the new program via `--allowNonRestoredState` (short: `-n`) option:

-#### Restore mode
+#### Claim mode

-The `Restore Mode` determines who takes ownership of the files that make up a Savepoint or [externalized checkpoints]({{< ref "docs/ops/state/checkpoints" >}}/#resuming-from-a-retained-checkpoint) after restoring it.
+The `Claim Mode` determines who takes ownership of the files that make up a Savepoint or [externalized checkpoints]({{< ref "docs/ops/state/checkpoints" >}}/#resuming-from-a-retained-checkpoint) after restoring it.
Both savepoints and externalized checkpoints behave similarly in this context.
-Here, they are just called "snapshots" unless explicitely noted otherwise.
+Here, they are just called "snapshots" unless explicitly noted otherwise.

-As mentioned, the restore mode determines who takes over ownership of the files of the snapshots that we are restoring from.
+As mentioned, the claim mode determines who takes over ownership of the files of the snapshots that we are restoring from.
Snapshots can be owned either by a user or Flink itself.
If a snapshot is owned by a user, Flink will not delete its files, moreover, Flink can not depend on the existence of the files from such a snapshot, as it might be deleted outside of Flink's control.

-Each restore mode serves a specific purposes.
+Each claim mode serves a specific purposes.
Still, we believe the default *NO_CLAIM* mode is a good tradeoff in most situations, as it provides clear ownership with a small price for the first checkpoint after the restore.

-You can pass the restore mode as:
+You can pass the claim mode as:
```shell
$ bin/flink run -s :savepointPath -claimMode :mode -n [:runArgs]
```
@@ -243,7 +243,7 @@ Consequently, once a checkpoint succeeds you can manually delete the original sn
this earlier, because without any completed checkpoints Flink will - upon failure - try to recover from the initial snapshot.

<div style="text-align: center">
-{{< img src="/fig/restore-mode-no_claim.svg" alt="NO_CLAIM restore mode" width="70%" >}}
+{{< img src="/fig/restore-mode-no_claim.svg" alt="NO_CLAIM claim mode" width="70%" >}}
</div>

**CLAIM**
@@ -256,7 +256,7 @@ a [configured number]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing
of checkpoints.

<div style="text-align: center">
-{{< img src="/fig/restore-mode-claim.svg" alt="CLAIM restore mode" width="70%" >}}
+{{< img src="/fig/restore-mode-claim.svg" alt="CLAIM mode" width="70%" >}}
</div>

{{< hint info >}}
@@ -279,7 +279,7 @@ is that Flink might immediately build an incremental checkpoint on top of the re
subsequent checkpoints depend on the restored checkpoint. Overall, the ownership is not well-defined.

<div style="text-align: center">
-{{< img src="/fig/restore-mode-legacy.svg" alt="LEGACY restore mode" width="70%" >}}
+{{< img src="/fig/restore-mode-legacy.svg" alt="LEGACY claim mode" width="70%" >}}
</div>

{{< hint warning >}}
@@ -25,7 +25,11 @@

import static org.apache.flink.configuration.description.TextElement.text;

-/** Defines how Flink should restore from a given savepoint or retained checkpoint. */
+/**
+* Defines state files ownership when Flink restore from a given savepoint or retained checkpoint.
+* TODO: Rename 'RestoreMode' to 'RecoveryClaimMode' in Flink 2.0. Any related variable names should
+* be adjusted accordingly.
+*/
@PublicEvolving
public enum RestoreMode implements DescribedEnum {
CLAIM(
@@ -1907,7 +1907,7 @@ public boolean restoreSavepoint(
checkpointProperties = CheckpointProperties.forUnclaimedSnapshot();
break;
default:
-throw new IllegalArgumentException("Unknown snapshot restore mode");
+throw new IllegalArgumentException("Unknown snapshot claim mode");
}

// Load the savepoint as a checkpoint into the system
@@ -38,7 +38,7 @@ public interface CheckpointRecoveryFactory {
* @param sharedStateRegistryFactory Simple factory to produce {@link SharedStateRegistry}
* objects.
* @param ioExecutor Executor used to run (async) deletes.
-* @param restoreMode the restore mode with which the job is restoring.
+* @param restoreMode the claim mode with which the job is restoring.
* @return {@link CompletedCheckpointStore} instance for the job
*/
CompletedCheckpointStore createRecoveredCompletedCheckpointStore(
@@ -60,7 +60,7 @@ public EmbeddedCompletedCheckpointStore(int maxRetainedCheckpoints) {
this(
maxRetainedCheckpoints,
Collections.emptyList(),
-/* Using the default restore mode in tests to detect any breaking changes early. */
+/* Using the default claim mode in tests to detect any breaking changes early. */
RestoreMode.DEFAULT);
}

@@ -61,7 +61,7 @@ public StandaloneCompletedCheckpointStore(int maxNumberOfCheckpointsToRetain) {
maxNumberOfCheckpointsToRetain,
SharedStateRegistry.DEFAULT_FACTORY,
Executors.directExecutor(),
-/* Using the default restore mode in tests to detect any breaking changes early. */
+/* Using the default mode in tests to detect any breaking changes early. */
RestoreMode.DEFAULT);
}

@@ -217,10 +217,10 @@ public void registerAllAfterRestored(CompletedCheckpoint checkpoint, RestoreMode
checkpoint
.getRestoredProperties()
.map(props -> props.getCheckpointType().getSharingFilesStrategy()));
-// In NO_CLAIM and LEGACY restore modes, shared state of the initial checkpoints must be
+// In NO_CLAIM and LEGACY claim modes, shared state of the initial checkpoints must be
// preserved. This is achieved by advancing highestRetainCheckpointID here, and then
// checking entry.createdByCheckpointID against it on checkpoint subsumption.
-// In CLAIM restore mode, the shared state of the initial checkpoints must be
+// In CLAIM mode, the shared state of the initial checkpoints must be
// discarded as soon as it becomes unused - so highestRetainCheckpointID is not updated.
if (mode != RestoreMode.CLAIM) {
highestNotClaimedCheckpointID =
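
Reviewer note: the comment in this hunk describes a two-step rule. Restoring in any mode other than CLAIM advances a watermark, and on subsumption an entry is only discarded if it was created after that watermark. The sketch below is a simplified illustration of that rule under stated assumptions, not Flink's shared state registry: `SharedStateRetentionSketch` and its methods are hypothetical, and shared state entries are reduced to string keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SharedStateRetentionSketch {
    enum Mode { CLAIM, NO_CLAIM, LEGACY }

    // Shared state key -> ID of the checkpoint that created the entry.
    private final Map<String, Long> createdByCheckpointId = new HashMap<>();

    // Entries created at or below this checkpoint ID are never discarded.
    private long highestNotClaimedCheckpointId = -1L;

    /** Registers state of a restored checkpoint and advances the watermark unless claimed. */
    void registerRestored(String key, long checkpointId, Mode mode) {
        createdByCheckpointId.putIfAbsent(key, checkpointId);
        if (mode != Mode.CLAIM) {
            highestNotClaimedCheckpointId =
                    Math.max(highestNotClaimedCheckpointId, checkpointId);
        }
    }

    /** Removes and returns entries that are unreferenced and created after the watermark. */
    List<String> discardUnused(Set<String> stillReferenced) {
        List<String> discarded = new ArrayList<>();
        createdByCheckpointId.entrySet().removeIf(entry -> {
            boolean unused = !stillReferenced.contains(entry.getKey());
            boolean flinkOwned = entry.getValue() > highestNotClaimedCheckpointId;
            if (unused && flinkOwned) {
                discarded.add(entry.getKey());
                return true;
            }
            return false;
        });
        return discarded;
    }

    public static void main(String[] args) {
        SharedStateRetentionSketch noClaim = new SharedStateRetentionSketch();
        noClaim.registerRestored("sst-1", 5L, Mode.NO_CLAIM);
        System.out.println(noClaim.discardUnused(Collections.emptySet())); // prints []

        SharedStateRetentionSketch claim = new SharedStateRetentionSketch();
        claim.registerRestored("sst-1", 5L, Mode.CLAIM);
        System.out.println(claim.discardUnused(Collections.emptySet())); // prints [sst-1]
    }
}
```

In the NO_CLAIM case the restored entry survives even though nothing references it, because the user still owns the initial snapshot. In the CLAIM case it is discarded as soon as it becomes unused.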
@@ -1406,7 +1406,7 @@ private void checkForcedFullSnapshotSupport(CheckpointOptions checkpointOptions)
String.format(
"Configured state backend (%s) does not support enforcing a full"
+ " snapshot. If you are restoring in %s mode, please"
-+ " consider choosing %s restore mode.",
++ " consider choosing %s mode.",
stateBackend, RestoreMode.NO_CLAIM, RestoreMode.CLAIM));
} else if (checkpointOptions.getCheckpointType().isSavepoint()) {
SavepointType savepointType = (SavepointType) checkpointOptions.getCheckpointType();
@@ -699,7 +699,7 @@ void testForceFullSnapshotOnIncompatibleStateBackend() throws Exception {
.hasMessage(
"Configured state backend (OnlyIncrementalStateBackend) does not"
+ " support enforcing a full snapshot. If you are restoring in"
-+ " NO_CLAIM mode, please consider choosing CLAIM restore mode.");
++ " NO_CLAIM mode, please consider choosing CLAIM mode.");
}
}
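
Reviewer note: the message template and this test expectation have to change together, because the test compares the full string. The standalone sketch below reproduces the new template with a hypothetical `Mode` enum standing in for `RestoreMode`. `ForcedFullSnapshotMessageSketch` and `message` are illustrative names, not part of Flink.

```java
public class ForcedFullSnapshotMessageSketch {
    /** Hypothetical stand-in for the enum constants used as format arguments. */
    enum Mode { CLAIM, NO_CLAIM, LEGACY }

    /** Builds the error text using the updated template from this commit. */
    static String message(String stateBackend) {
        return String.format(
                "Configured state backend (%s) does not support enforcing a full"
                        + " snapshot. If you are restoring in %s mode, please"
                        + " consider choosing %s mode.",
                stateBackend, Mode.NO_CLAIM, Mode.CLAIM);
    }

    public static void main(String[] args) {
        // The backend name is taken from the test expectation in this commit.
        System.out.println(message("OnlyIncrementalStateBackend"));
    }
}
```

With the backend name from the test, the formatted string matches the updated expectation exactly, which is why the old template's extra word "restore" had to be dropped in both places.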

