From 7ec28dd62b80e28a6160508d47e5f888e983d47a Mon Sep 17 00:00:00 2001 From: Marius Poke Date: Thu, 30 Nov 2023 14:26:57 +0100 Subject: [PATCH] docs: update Gaia processes (#2818) * update Gaia processes * Update STATE-COMPATIBILITY.md Co-authored-by: Simon Noetzlin * update cosmos sdk link --------- Co-authored-by: Simon Noetzlin --- CONTRIBUTING.md | 13 ++- RELEASE_PROCESS.md | 88 +++++++++++------ STATE-COMPATIBILITY.md | 211 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 277 insertions(+), 35 deletions(-) create mode 100644 STATE-COMPATIBILITY.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index e729811d4dc..97ee84b16ac 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -243,7 +243,8 @@ where: - `section` is one of `dependencies`, `improvements`, `features`, `bug-fixes`, `state-breaking`, `api-breaking`, - and _**if multiple apply, create multiple files**_; + and _**if multiple apply, create multiple files**_, + not necessarily with the same `short-description` or content; - `pr-number` is the PR number; - `short-description` is a short (4 to 6 word), hyphen separated description of the change; - `component` is used for changes that affect one of the components defined in the [config](.changelog/config.toml), e.g., `tests`, `globalfee`. @@ -272,14 +273,18 @@ where `${description}` is a detailed description of the changelog entry. For example, ```bash # add an entry for bumping IBC to v4.4.2 -unclog add -i "2554-bump-ibc" -p 2554 -s "dependencies" -m "Bump [ibc-go](https://github.com/cosmos/ibc-go) to [v4.4.2](https://github.com/cosmos/ibc-go/releases/tag/v4.4.2)" +unclog add -i "2554-bump-ibc" -p 2554 -s dependencies -m "Bump [ibc-go](https://github.com/cosmos/ibc-go) to [v4.4.2](https://github.com/cosmos/ibc-go/releases/tag/v4.4.2)" # add an entry for changing the global fee module; # note that the entry is added to both state-breaking and api-breaking sections -unclog add -i "2424-params" -p 2424 -c globalfee -s "state-breaking" -m "Add \`bypass-min-fee-msg-types\` and \`maxTotalBypassMinFeeMsgGagUsage\` to globalfee params" -unclog add -i "2424-params" -p 2424 -c globalfee -s "api-breaking" -m "Add \`bypass-min-fee-msg-types\` and \`maxTotalBypassMinFeeMsgGagUsage\` to globalfee params" +unclog add -i "2424-params" -p 2424 -c globalfee -s state-breaking -m "Add \`bypass-min-fee-msg-types\` and \`maxTotalBypassMinFeeMsgGagUsage\` to globalfee params" +unclog add -i "2424-params" -p 2424 -c globalfee -s api-breaking -m "Add \`bypass-min-fee-msg-types\` and \`maxTotalBypassMinFeeMsgGagUsage\` to globalfee params" ``` +**Note:** `unclog add` requires an editor. This can be set either by configuring +an `$EDITOR` environment variable or by manually specify an editor binary path +via the `--editor` flag. + **Note:** Changelog entries should answer the question: "what is important about this change for users to know?" or "what problem does this solve for users?". It should not simply be a reiteration of the title of the associated PR, unless the diff --git a/RELEASE_PROCESS.md b/RELEASE_PROCESS.md index ceb1794cf4a..499cc6132f6 100644 --- a/RELEASE_PROCESS.md +++ b/RELEASE_PROCESS.md @@ -1,10 +1,12 @@ # Release Process - [Release Process](#release-process) + - [Breaking Changes](#breaking-changes) - [Major Release Procedure](#major-release-procedure) - [Changelog](#changelog) - [Creating a new release branch](#creating-a-new-release-branch) - [Cutting a new release](#cutting-a-new-release) + - [Update the changelog on main](#update-the-changelog-on-main) - [Release Notes](#release-notes) - [Tagging Procedure](#tagging-procedure) - [Test building artifacts](#test-building-artifacts) @@ -19,20 +21,19 @@ This document outlines the release process for Cosmos Hub (Gaia). Gaia follows [semantic versioning](https://semver.org), but with the following deviations to account for state-machine and API breaking changes: - State-machine breaking changes will result in an increase of the major version X (X.y.z). -- Emergency releases & API breaking changes (changes in node interactions e.g. queries) will result in an increase of the minor version Y (x.Y.z | x > 0). +- Emergency releases & API breaking changes will result in an increase of the minor version Y (x.Y.z | x > 0). - All other changes will result in an increase of the patch version Z (x.y.Z | x > 0). -**State compatibility**: -It is critical for the patch releases to be state-machine compatible with prior releases in the same minor version. -For example, v9.2.1 must be compatible with v9.2.0. +**Note:** In case a major release is deprecated before ending up on the network (due to potential bugs), +it is replaced by a minor release (eg: `v14.0.0` → `v14.1.0`). +As a result, this minor release is considered state-machine breaking. -Minor releases will be handled on a case-by-case basis, but generally should only arise in case of safety or security issues that require a co-ordinated network upgrade without a governance process. +### Breaking Changes -This is to ensure determinism, i.e. that given the same input, the nodes will always produce the same output. -State-incompatibility is allowed for major upgrades because all nodes in the network perform it at the same time. -Therefore, after the upgrade, the nodes continue functioning in a deterministic way. +A change is considered to be ***state-machine breaking*** if it requires a coordinated upgrade for the network to preserve [state compatibility](./STATE-COMPATIBILITY.md). +Note that when bumping the dependencies of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk), [IBC](https://github.com/cosmos/ibc-go), and [ICS](https://github.com/cosmos/interchain-security) we will only treat patch releases as non state-machine breaking. -**Note**: State-machine breaking changes include changes that impact the amount of gas needed to execute a transaction as it results in a different `apphash` after the code is executed. +A change is considered to be ***API breaking*** if it modifies the provided API. This includes events, queries, CLI interfaces. ## Major Release Procedure @@ -41,7 +42,7 @@ A _major release_ is an increment of the first number (eg: `v9.1.0` → `v10.0.0 **Note**: Generally, PRs should target either `main` or a long-lived feature branch (see [CONTRIBUTING.md](./CONTRIBUTING.md#pull-requests)). An exception are PRs open via the Github mergify integration (i.e., backported PRs). -* Once the team feels that `main` is _**feature complete**_, we create a `release/vY` branch (going forward known a release branch), +* Once the team feels that `main` is _**feature complete**_, we create a `release/vY` branch (going forward known as release branch), where `Y` is the version number, with the minor and patch part substituted to `x` (eg: 11.x). * Update the [GitHub mergify integration](./.mergify.yml) by adding instructions for automatically backporting commits from `main` to the `release/vY` using the `A:backport/vY` label. * **PRs targeting directly a release branch can be merged _only_ when exceptional circumstances arise**. @@ -76,43 +77,68 @@ Unreleased changes are collected on `main` in `.changelog/unreleased/`. However, `.changelog/` on `main` contains also existing releases (e.g., `v10.0.0`). Thus, when creating a new release branch (e.g., `release/v11.x`), the following steps are necessary: -- move to the release branch, e.g., `release/v11.x` +- create a new release branch, e.g., `release/v11.x` + ```bash + git checkout main + git pull + git checkout -b release/v11.x + ``` - delete all the sub-folders in `.changelog/` except `unreleased/` + ```bash + find ./.changelog -mindepth 1 -maxdepth 1 -type d -not -name unreleased | xargs rm -r + ``` - replace the content of `.changelog/epilogue.md` with the following text -```md -## Previous Versions + ```md + ## Previous Versions -[CHANGELOG of previous versions](https://github.com/cosmos/gaia/blob/main/CHANGELOG.md) -``` + [CHANGELOG of previous versions](https://github.com/cosmos/gaia/blob/main/CHANGELOG.md) + ``` +- push the release branch upstream + ```bash + git push + ``` #### Cutting a new release Before cutting a _**release candidate**_ (e.g., `v11.0.0-rc0`), the following steps are necessary: - move to the release branch, e.g., `release/v11.x` + ```bash + git checkout release/v11.x + ``` - move all entries in ".changelog/unreleased" to the release version, e.g., `v11.0.0`, i.e., -```bash -unclog release v11.0.0 -``` -- update the CHANGELOG.md, i.e., -```bash -unclog build > CHANGELOG.md -``` + ```bash + unclog release v11.0.0 + ``` +- update `CHANGELOG.md`, i.e., + ```bash + unclog build > CHANGELOG.md + ``` +- open a PR (from this new created branch) against the release branch, e.g., `release/v11.x` + +Now you can cut the release candidate, e.g., v11.0.0-rc0 (follow the [Tagging Procedure](#tagging-procedure)). -❗Once the **final release** is cut, the new changelog section must be added to main: +#### Update the changelog on main + +Once the **final release** is cut, the new changelog section must be added to main: - checkout a new branch from the `main` branch, i.e., + ```bash + git checkout main + git pull + git checkout -b /backport_changelog + ``` - bring the new changelog section from the release branch into this branch, e.g., -```bash -git checkout release/v11.x .changelog/v11.0.0 -``` + ```bash + git checkout release/v11.x .changelog/v11.0.0 + ``` - remove duplicate entries that are both in `.changelog/unreleased/` and the new changelog section, e.g., `.changelog/v11.0.0` -- update the CHANGELOG.md -```bash -unclog build > CHANGELOG.md -``` +- update `CHANGELOG.md`, i.e., + ```bash + unclog build > CHANGELOG.md + ``` - open a PR (from this new created branch) against `main` - + ### Release Notes Release notes will be created using the `RELEASE_NOTES.md` from the release branch. diff --git a/STATE-COMPATIBILITY.md b/STATE-COMPATIBILITY.md new file mode 100644 index 00000000000..262444f45d2 --- /dev/null +++ b/STATE-COMPATIBILITY.md @@ -0,0 +1,211 @@ +# State-Compatibility + +- [State-Compatibility](#state-compatibility) + - [Scope](#scope) + - [Validating State-Compatibility](#validating-state-compatibility) + - [AppHash](#apphash) + - [LastResultsHash](#lastresultshash) + - [Major Sources of State-incompatibility](#major-sources-of-state-incompatibility) + - [Creating Additional State](#creating-additional-state) + - [Changing Proto Field Definitions](#changing-proto-field-definitions) + - [Returning Different Errors Given Same Input](#returning-different-errors-given-same-input) + - [Variability in Gas Usage](#variability-in-gas-usage) + - [Secondary Limitations To Keep In Mind](#secondary-limitations-to-keep-in-mind) + - [Network Requests to External Services](#network-requests-to-external-services) + - [Randomness](#randomness) + - [Parallelism and Shared State](#parallelism-and-shared-state) + - [Hardware Errors](#hardware-errors) + + +It is critical for the patch and minor releases to be state-machine compatible with prior releases in the same minor version. +For example, v13.0.2 must be state-machine compatible with v13.0.1. +_An exception are minor releases that are either emergency releases or replacements of deprecated major releases_. + +This is to ensure **determinism**, i.e., given the same input, the nodes will always produce the same output. + +State-incompatibility is allowed for major upgrades because all nodes in the network perform it at the same time. Therefore, after the upgrade, the nodes continue functioning in a deterministic way. + +## Scope + +The state-machine scope includes the following areas: + +- All ICS messages including: + - Every msg's ValidateBasic method + - Every msg's MsgServer method + - Net gas usage, in all execution paths + - Error result returned + - State changes (namely every store write) +- AnteHandlers in "DeliverTx" mode +- All `BeginBlock`/`EndBlock` logic + +The following are **NOT** in the state-machine scope: + +- Events +- Queries that are not whitelisted +- CLI interfaces + +## Validating State-Compatibility + +CometBFT ensures state compatibility by validating a number of hashes that can be found [here](https://github.com/cometbft/cometbft/blob/v0.38.2/proto/tendermint/types/types.proto#L59-L66). + +`AppHash` and `LastResultsHash` are the common sources of problems stemming from our work. +To avoid these problems, let's now examine how these hashes work. + +### AppHash + +**Note:** The following explanation is simplified for clarity. + +An app hash is a hash of hashes of every store's Merkle root that is returned by ABCI's `Commit()` from Cosmos-SDK to CometBFT. +Cosmos-SDK [takes an app hash of the application state](https://github.com/cosmos/cosmos-sdk/blob/v0.47.6/store/rootmulti/store.go#L468), and propagates it to CometBFT which, in turn, compares it to the app hash of the rest of the network. +Then, CometBFT ensures that the app hash of the local node matches the app hash of the network. + +### LastResultsHash + +`LastResultsHash` is the root hash of all results from the transactions in the block returned by the ABCI's `DeliverTx`. + +The [`LastResultsHash`](https://github.com/cometbft/cometbft/blob/v0.34.29/types/results.go#L47-L54) +in CometBFT [v0.34.29](https://github.com/cometbft/cometbft/releases/tag/v0.34.29) contains: + +1. Tx `GasWanted` + +2. Tx `GasUsed` + > `GasUsed` being Merkelized means that we cannot freely reorder methods that consume gas. + > We should also be careful of modifying any validation logic since changing the + > locations where we error or pass might affect transaction gas usage. + > + > There are plans to remove this field from being Merkelized in a subsequent CometBFT release, + > at which point we will have more flexibility in reordering operations / erroring. + +3. Tx response `Data` + + > The `Data` field includes the proto marshalled Tx response. Therefore, we cannot + > change these in patch releases. + +4. Tx response `Code` + + > This is an error code that is returned by the transaction flow. In the case of + > success, it is `0`. On a general error, it is `1`. Additionally, each module + > defines its custom error codes. + > + > As a result, it is important to avoid changing custom error codes or change + > the semantics of what is valid logic in transaction flows. + +Note that all of the above stem from `DeliverTx` execution path, which handles: + +- `AnteHandler`'s marked as deliver tx +- `msg.ValidateBasic` +- execution of a message from the message server + +The `DeliverTx` return back to the CometBFT is defined [here](https://github.com/cosmos/cosmos-sdk/blob/d11196aad04e57812dbc5ac6248d35375e6603af/baseapp/abci.go#L293-L303). + +## Major Sources of State-incompatibility + +### Creating Additional State + +By erroneously creating database entries that exist in Version A but not in +Version B, we can cause the app hash to differ across nodes running +these versions in the network. Therefore, this must be avoided. + +### Changing Proto Field Definitions + +For example, if we change a field that gets persisted to the database, +the app hash will differ across nodes running these versions in the network. + +Additionally, this affects `LastResultsHash` because it contains a `Data` field that is a marshaled proto message. + +### Returning Different Errors Given Same Input + +```go +// Version A +func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error { + if amount.IsNegative() { + return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive or zero") + } + return nil +} +``` + +```go +// Version B +func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error { + if amount.IsNegative() || amount.IsZero() { + return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive") + } + return nil +} +``` + +Note that now an amount of 0 can be valid in "Version A", but invalid in "Version B". +Therefore, if some nodes are running "Version A" and others are running "Version B", +the final app hash might not be deterministic. + +Additionally, a different error message does not matter because it +is not included in any hash. However, an error code `sdkerrors.ErrInvalidRequest` does. +It translates to the `Code` field in the `LastResultsHash` and participates in +its validation. + +### Variability in Gas Usage + +For transaction flows (or any other flow that consumes gas), it is important +that the gas usage is deterministic. + +Currently, gas usage is being Merklized in the state. As a result, reordering functions +becomes risky. + +Suppose my gas limit is 2000 and 1600 is used up before entering +`someInternalMethod`. Consider the following: + +```go +func someInternalMethod(ctx sdk.Context) { + object1 := readOnlyFunction1(ctx) # consumes 1000 gas + object2 := readOnlyFunction2(ctx) # consumes 500 gas + doStuff(ctx, object1, object2) +} +``` + +- It will run out of gas with `gasUsed = 2600` where 2600 getting merkelized +into the tx results. + +```go +func someInternalMethod(ctx sdk.Context) { + object2 := readOnlyFunction2(ctx) # consumes 500 gas + object1 := readOnlyFunction1(ctx) # consumes 1000 gas + doStuff(ctx, object1, object2) +} +``` + +- It will run out of gas with `gasUsed = 2100` where 2100 is getting merkelized +into the tx results. + +Therefore, we introduced a state-incompatibility by merklezing diverging gas +usage. + +## Secondary Limitations To Keep In Mind + +### Network Requests to External Services + +It is critical to avoid performing network requests to external services +since it is common for services to be unavailable or rate-limit. + +Imagine a service that returns exchange rates when clients query its HTTP endpoint. +This service might experience downtime or be restricted in some geographical areas. + +As a result, nodes may get diverging responses where some +get successful responses while others errors, leading to state breakage. + +### Randomness + +Randomness cannot be used in the state machine, as the state machine must be deterministic. + +**Note:** Iteration order over `map`s is non-deterministic, so to be deterministic +you must gather the keys, and sort them all prior to iterating over all values. + +### Parallelism and Shared State + +Threads and Goroutines might preempt differently in different hardware. Therefore, +they should be avoided for the sake of determinism. Additionally, it is hard +to predict when the multi-threaded state can be updated. + +### Hardware Errors + +This is out of the developer's control but is mentioned for completeness. \ No newline at end of file