bump docs, add TODOs for buggy behavior (#15023)
makramkd authored Oct 31, 2024
1 parent 50c1b3d commit a21f733
Showing 7 changed files with 125 additions and 52 deletions.
112 changes: 79 additions & 33 deletions core/capabilities/ccip/launcher/README.md
@@ -8,62 +8,108 @@ particular, there are three kinds of events that would affect a particular capab
1. DON Creation: when `addDON` is called on the CR, the capabilities of this new DON are specified.
If CCIP is one of those capabilities, the launcher will launch a commit and an execution plugin
with the OCR configuration specified in the DON creation process. See
[Types.sol](../../../../contracts/src/v0.8/ccip/capability/libraries/Types.sol) for more details
on what the OCR configuration contains.
[CCIPHome.sol](../../../../contracts/src/v0.8/ccip/capability/CCIPHome.sol), specifically `struct OCR3Config`,
for more details on what the OCR configuration contains.
2. DON update: when `updateDON` is called on the CR, capabilities of the DON can be updated. In the
CCIP use case specifically, `updateDON` is used to update OCR configuration of that DON. Updates
follow the blue/green deployment pattern (explained in detail below with a state diagram). In this
follow the active/candidate deployment pattern (explained in detail below with a state diagram). In this
scenario the launcher must either launch brand new instances of the commit and execution plugins
(in the event a green deployment is made) or promote the currently running green instance to be
the blue instance.
(in the event a candidate deployment is made) or promote the currently running candidate instance to be
the active instance.
3. DON deletion: when `deleteDON` is called on the CR, the launcher must shut down all running plugins
related to that DON. When a DON is deleted it effectively means that it should no longer function.
DON deletion is permanent.
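
The three events above can be detected by comparing two snapshots of the registry, which is how the launcher's package documentation describes its approach. The following is a minimal sketch of that idea, not the launcher's actual diffing code: `DonID`, `DON`, `ConfigVersion`, and `diffDONs` are hypothetical, simplified names chosen for illustration.

```go
package main

import "fmt"

// DonID and DON are simplified, hypothetical stand-ins for the registry types.
type DonID uint32

type DON struct {
	ID            DonID
	ConfigVersion uint32 // assumption: changes whenever updateDON is called
}

// diffResult groups DONs by the kind of event that affected them.
type diffResult struct {
	added, updated, removed map[DonID]DON
}

// diffDONs compares the previous and current registry snapshots.
func diffDONs(prev, curr map[DonID]DON) diffResult {
	res := diffResult{
		added:   make(map[DonID]DON),
		updated: make(map[DonID]DON),
		removed: make(map[DonID]DON),
	}
	for id, don := range curr {
		old, existed := prev[id]
		switch {
		case !existed:
			res.added[id] = don // corresponds to addDON
		case old.ConfigVersion != don.ConfigVersion:
			res.updated[id] = don // corresponds to updateDON
		}
	}
	for id, don := range prev {
		if _, ok := curr[id]; !ok {
			res.removed[id] = don // corresponds to deleteDON
		}
	}
	return res
}

func main() {
	prev := map[DonID]DON{
		1: {ID: 1, ConfigVersion: 1},
		2: {ID: 2, ConfigVersion: 1},
	}
	curr := map[DonID]DON{
		2: {ID: 2, ConfigVersion: 2},
		3: {ID: 3, ConfigVersion: 1},
	}
	d := diffDONs(prev, curr)
	fmt.Printf("added=%d updated=%d removed=%d\n", len(d.added), len(d.updated), len(d.removed))
}
```

Each of the three resulting maps would then be handed to the corresponding handler (launch, update, or shut down).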

## Architecture Diagram

![CCIP Capability Launcher](ccip_capability_launcher.png)
![CCIP Capability Launcher](launcher_arch.png)

The above diagram shows how the CCIP capability launcher interacts with the rest of the components
in the CCIP system.

The CCIP capability job, which is created on the Chainlink node, will spin up the CCIP capability
launcher alongside the home chain reader, which reads the [CCIPConfig.sol](../../../../contracts/src/v0.8/ccip/capability/CCIPConfig.sol)
The CCIP capability job, which is created on the Chainlink node, will spin up the following services, in
the following order:

* Home chain contract reader
* Home chain capability registry syncer
* Home chain CCIPHome reader
* CCIP Capability Launcher

The order in which these services are started matters because of the dependencies between them:
the capability launcher depends on the home chain `CCIPHome` reader and the home chain capability registry syncer,
which in turn depend on the home chain contract reader.
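
Ordered startup of this kind can be sketched as follows. This is an illustrative sketch only: the `service` interface, the `recorder` fake, and `startInOrder` are hypothetical names and do not reflect the actual service types used by the Chainlink node.

```go
package main

import "fmt"

// service is a hypothetical, minimal lifecycle interface used only for this sketch.
type service interface {
	Name() string
	Start() error
	Close() error
}

// startInOrder starts services in slice order. If one fails to start, the
// services that already started are closed in reverse order.
func startInOrder(svcs []service) error {
	for i, s := range svcs {
		if err := s.Start(); err != nil {
			for j := i - 1; j >= 0; j-- {
				_ = svcs[j].Close() // best-effort cleanup in this sketch
			}
			return fmt.Errorf("start %s: %w", s.Name(), err)
		}
	}
	return nil
}

// recorder is a fake service that appends lifecycle events to a shared log.
type recorder struct {
	name      string
	failStart bool
	events    *[]string
}

func (r recorder) Name() string { return r.name }

func (r recorder) Start() error {
	if r.failStart {
		return fmt.Errorf("%s is unavailable", r.name)
	}
	*r.events = append(*r.events, "start "+r.name)
	return nil
}

func (r recorder) Close() error {
	*r.events = append(*r.events, "close "+r.name)
	return nil
}

func main() {
	var events []string
	svcs := []service{
		recorder{name: "home chain contract reader", events: &events},
		recorder{name: "home chain capability registry syncer", events: &events},
		recorder{name: "home chain CCIPHome reader", events: &events},
		recorder{name: "CCIP capability launcher", events: &events},
	}
	if err := startInOrder(svcs); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range events {
		fmt.Println(e)
	}
}
```

Closing in reverse order on failure means a dependent service is never left running after the service it depends on has been closed.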

The home chain `CCIPHome` reader reads the [CCIPHome.sol](../../../../contracts/src/v0.8/ccip/capability/CCIPHome.sol)
contract deployed on the home chain (typically Ethereum Mainnet, though in theory it could be any chain).

Injected into the launcher is the [OracleCreator](../types/types.go) object which knows how to spin up CCIP
oracles (both bootstrap and plugin oracles). This is used by the launcher at the appropriate time in order
to create oracle instances but not start them right away.

After all the required oracles have been created, the launcher will start and shut them down as required
in order to match the configuration that was posted on-chain in the CR and the CCIPConfig.sol contract.

in order to match the configuration that was posted on-chain in the Capability Registry and the CCIPHome.sol contract.

## Config State Diagram

![CCIP Config State Machine](ccip_config_state_machine.png)

CCIP's blue/green deployment paradigm is intentionally kept as simple as possible.

Every CCIP DON starts in the `Init` state. Upon DON creation, which must provide a valid OCR
configuration, the CCIP DON will move into the `Running` state. In this state, the DON is
presumed to be fully functional from a configuration standpoint.

When we want to update configuration, we propose a new configuration to the CR that consists of
an array of two OCR configurations:

1. The first element of the array is the current OCR configuration that is running (termed "blue").
2. The second element of the array is the future OCR configuration that we want to run (termed "green").

Various checks are done on-chain in order to validate this particular state transition, in particular,
related to config counts. Doing this will move the state of the configuration to the `Staging` state.

In the `Staging` state, there are effectively four plugins running - one (commit, execution) pair for the
blue configuration, and one (commit, execution) pair for the green configuration. However, only the blue
configuration will actually be writing on-chain, where as the green configuration will be "dry running",
CCIP's active/candidate deployment paradigm is intentionally kept as simple as possible.

The state diagram below, copied from the doc comment in CCIPHome.sol, describes the allowed states and transitions:

```solidity
/// @dev This contract is a state machine with the following states:
/// - Init: The initial state of the contract, no config has been set, or all configs have been revoked.
/// [0, 0]
///
/// - Candidate: A new config has been set, but it has not been promoted yet, or all active configs have been revoked.
/// [0, 1]
///
/// - Active: A non-zero config has been promoted and is active, there is no candidate configured.
/// [1, 0]
///
/// - ActiveAndCandidate: A non-zero config has been promoted and is active, and a new config has been set as candidate.
/// [1, 1]
///
/// The following state transitions are allowed:
/// - Init -> Candidate: setCandidate()
/// - Candidate -> Active: promoteCandidateAndRevokeActive()
/// - Candidate -> Candidate: setCandidate()
/// - Candidate -> Init: revokeCandidate()
/// - Active -> ActiveAndCandidate: setCandidate()
/// - Active -> Init: promoteCandidateAndRevokeActive()
/// - ActiveAndCandidate -> Active: promoteCandidateAndRevokeActive()
/// - ActiveAndCandidate -> Active: revokeCandidate()
/// - ActiveAndCandidate -> ActiveAndCandidate: setCandidate()
///
/// This means the following calls are not allowed at the following states:
/// - Init: promoteCandidateAndRevokeActive(), as there is no config to promote.
/// - Init: revokeCandidate(), as there is no config to revoke
/// - Active: revokeCandidate(), as there is no candidate to revoke
/// Note that we explicitly do allow promoteCandidateAndRevokeActive() to be called when there is an active config but
/// no candidate config. This is the only way to remove the active config. The alternative would be to set some unusable
/// config as candidate and promote that, but fully clearing it is cleaner.
///
///       ┌─────────────┐   setCandidate     ┌─────────────┐
///       │             ├───────────────────►│             │ setCandidate
///       │    Init     │   revokeCandidate  │  Candidate  │◄───────────┐
///       │    [0,0]    │◄───────────────────┤    [0,1]    │────────────┘
///       │             │  ┌─────────────────┤             │
///       └─────────────┘  │  promote-       └─────────────┘
///                  ▲     │  Candidate
///        promote-  │     │
///        Candidate │     │
///                  │     │
///       ┌──────────┴──┐  │  promote-       ┌─────────────┐
///       │             │◄─┘  Candidate OR   │  Active &   │ setCandidate
///       │    Active   │    revokeCandidate │  Candidate  │◄───────────┐
///       │    [1,0]    │◄───────────────────┤    [1,1]    │────────────┘
///       │             ├───────────────────►│             │
///       └─────────────┘    setSecondary    └─────────────┘
///
```
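
The transition rules listed in the doc comment can be modelled in a few lines of Go. This is a sketch under a simplifying assumption: each slot is reduced to a boolean ("is a non-zero config present?"), whereas the contract tracks actual configurations. The type and function names below are hypothetical and only mirror the contract's function names for readability.

```go
package main

import (
	"errors"
	"fmt"
)

// configState mirrors the [active, candidate] pair from the doc comment:
// each field records whether a non-zero config occupies that slot.
type configState struct {
	active, candidate bool
}

var (
	stateInit               = configState{active: false, candidate: false} // [0, 0]
	stateCandidate          = configState{active: false, candidate: true}  // [0, 1]
	stateActive             = configState{active: true, candidate: false}  // [1, 0]
	stateActiveAndCandidate = configState{active: true, candidate: true}   // [1, 1]
)

var (
	errNoCandidate      = errors.New("no candidate config to revoke")
	errNothingToPromote = errors.New("no config to promote or revoke")
)

// setCandidate is allowed in every state and always fills the candidate slot.
func setCandidate(s configState) configState {
	return configState{active: s.active, candidate: true}
}

// revokeCandidate clears the candidate slot; it fails in Init and Active.
func revokeCandidate(s configState) (configState, error) {
	if !s.candidate {
		return s, errNoCandidate
	}
	return configState{active: s.active}, nil
}

// promoteCandidateAndRevokeActive moves the candidate (if any) into the
// active slot. With no candidate it clears the active config (Active -> Init).
// It fails only in Init, where there is nothing to promote or revoke.
func promoteCandidateAndRevokeActive(s configState) (configState, error) {
	if s == stateInit {
		return s, errNothingToPromote
	}
	return configState{active: s.candidate}, nil
}

func main() {
	s := setCandidate(stateInit)              // Init -> Candidate
	s, _ = promoteCandidateAndRevokeActive(s) // Candidate -> Active
	s = setCandidate(s)                       // Active -> ActiveAndCandidate
	fmt.Println(s == stateActiveAndCandidate)
	s, _ = revokeCandidate(s) // ActiveAndCandidate -> Active
	_, err := revokeCandidate(s)
	fmt.Println(s == stateActive, err)
}
```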

In the `Active & Candidate` state, there are effectively four plugins running: one (commit, execution) pair for the
active configuration, and one (commit, execution) pair for the candidate configuration. However, only the active
configuration will transmit OCR reports on-chain, whereas the candidate configuration will be "dry running",
i.e., doing everything except transmitting.

This allows us to test out new configurations without committing to them immediately.
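
The dry-run idea can be illustrated with a small sketch. Everything here is hypothetical and simplified for illustration: `oracle`, `deployment`, `handleReport`, and `promote` are not the launcher's real types or methods, and the config digests are placeholder strings.

```go
package main

import "fmt"

// oracle is a hypothetical stand-in for one running plugin instance.
type oracle struct {
	configDigest string
	sent         []string // reports transmitted on-chain, kept for illustration
}

// deployment holds the active instance and an optional candidate instance.
type deployment struct {
	active    *oracle
	candidate *oracle // nil when no candidate is configured
}

// handleReport models a round in which both instances would produce the
// report, but only the active instance transmits it.
func (d *deployment) handleReport(report string) {
	if d.active != nil {
		d.active.sent = append(d.active.sent, report)
	}
	// The candidate "dry runs": it does the same work but never transmits.
}

// promote makes the candidate the new active instance and drops the old
// active one. With no candidate, this clears the active instance, which
// matches the Active -> Init transition in the state machine.
func (d *deployment) promote() {
	d.active, d.candidate = d.candidate, nil
}

func main() {
	d := &deployment{
		active:    &oracle{configDigest: "digest-a"},
		candidate: &oracle{configDigest: "digest-b"},
	}
	d.handleReport("report-1")
	fmt.Println(len(d.active.sent), len(d.candidate.sent))

	d.promote()
	d.handleReport("report-2")
	fmt.Println(d.active.configDigest, d.active.sent)
}
```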

Finally, from the `Staging` state, there is only one transition, which is to promote the green configuration
to be the new blue configuration, and go back into the `Running` state.
Binary file not shown.
Binary file not shown.
18 changes: 18 additions & 0 deletions core/capabilities/ccip/launcher/doc.go
@@ -0,0 +1,18 @@
// Package launcher provides the functionality to launch and manage OCR instances.
// For system-level documentation and diagrams, please refer to the README.md
// in this package directory.
//
// The CCIP launcher, at a high level, consumes updates from the Capabilities Registry,
// in particular, DON additions, updates, and removals, and depending on the changes,
// launches (in the case of additions and updates) or shuts down (in the case of removals)
// CCIP OCR instances.
//
// It achieves this by diffing the current state of the registry with the previous state,
// and then launching or shutting down instances as necessary. See the launcher's tick()
// method for the main logic.
//
// Diffing logic is contained within diff.go, and the main logic for launching and shutting
// down instances is contained within launcher.go.
//
// Active/candidate deployment support is provided by the ccipDeployment struct in deployment.go.
package launcher
45 changes: 27 additions & 18 deletions core/capabilities/ccip/launcher/launcher.go
@@ -29,16 +29,17 @@ var (
_ registrysyncer.Launcher = (*launcher)(nil)
)

// New creates a new instance of the CCIP launcher.
func New(
capabilityID string,
p2pID ragep2ptypes.PeerID,
myP2PID ragep2ptypes.PeerID,
lggr logger.Logger,
homeChainReader ccipreader.HomeChain,
tickInterval time.Duration,
oracleCreator cctypes.OracleCreator,
) *launcher {
return &launcher{
p2pID: p2pID,
myP2PID: myP2PID,
capabilityID: capabilityID,
lggr: lggr,
homeChainReader: homeChainReader,
@@ -57,8 +58,12 @@ func New(
type launcher struct {
services.StateMachine

capabilityID string
p2pID ragep2ptypes.PeerID
// capabilityID is the fully qualified capability registry ID of the CCIP capability.
// this is <capability_name>@<capability-semver>, e.g "[email protected]".
capabilityID string

// myP2PID is the peer ID of the node running this launcher.
myP2PID ragep2ptypes.PeerID
lggr logger.Logger
homeChainReader ccipreader.HomeChain
stopChan chan struct{}
@@ -129,6 +134,7 @@ func (l *launcher) Start(context.Context) error {
})
}

// monitor calls tick() at regular intervals to check for changes in the capability registry.
func (l *launcher) monitor() {
defer l.wg.Done()
ticker := time.NewTicker(l.tickInterval)
@@ -144,6 +150,8 @@ func (l *launcher) monitor() {
}
}

// tick gets the latest registry state and processes the diff between the current and latest state.
// This may lead to starting or stopping OCR instances.
func (l *launcher) tick() error {
// Ensure that the home chain reader is healthy.
// For new jobs it may be possible that the home chain reader is not yet ready
@@ -197,7 +205,7 @@ func (l *launcher) processUpdate(updated map[registrysyncer.DonID]registrysyncer

futDeployment, err := updateDON(
l.lggr,
l.p2pID,
l.myP2PID,
l.homeChainReader,
*prevDeployment,
don,
@@ -231,24 +239,26 @@ func (l *launcher) processAdded(added map[registrysyncer.DonID]registrysyncer.DO
for donID, don := range added {
dep, err := createDON(
l.lggr,
l.p2pID,
l.myP2PID,
l.homeChainReader,
don,
l.oracleCreator,
)
if err != nil {
return err
return fmt.Errorf("processAdded: call createDON %d: %w", donID, err)
}
if dep == nil {
// not a member of this DON.
continue
}

// TODO: this doesn't seem to be correct; a newly added DON will not have an active
// instance but a candidate instance.
if err := dep.StartActive(); err != nil {
if shutdownErr := dep.CloseActive(); shutdownErr != nil {
l.lggr.Errorw("Failed to shutdown active instance after failed start", "donId", donID, "err", shutdownErr)
}
return fmt.Errorf("failed to start oracles for CCIP DON %d: %w", donID, err)
return fmt.Errorf("processAdded: start oracles for CCIP DON %d: %w", donID, err)
}

// update state.
@@ -314,26 +324,24 @@ func updateDON(
don.ID, err)
}

commitBgd, err := createFutureActiveCandidateDeployment(don.ID, prevDeployment, commitOCRConfigs, oracleCreator, cctypes.PluginTypeCCIPCommit)
commitAcd, err := createFutureActiveCandidateDeployment(don.ID, prevDeployment, commitOCRConfigs, oracleCreator, cctypes.PluginTypeCCIPCommit)
if err != nil {
return nil, fmt.Errorf("failed to create future active-candidate deployment for CCIP commit plugin: %w, don id: %d", err, don.ID)
}

execBgd, err := createFutureActiveCandidateDeployment(don.ID, prevDeployment, execOCRConfigs, oracleCreator, cctypes.PluginTypeCCIPExec)
execAcd, err := createFutureActiveCandidateDeployment(don.ID, prevDeployment, execOCRConfigs, oracleCreator, cctypes.PluginTypeCCIPExec)
if err != nil {
return nil, fmt.Errorf("failed to create future active-candidate deployment for CCIP exec plugin: %w, don id: %d", err, don.ID)
}

return &ccipDeployment{
commit: commitBgd,
exec: execBgd,
commit: commitAcd,
exec: execAcd,
}, nil
}

// valid cases:
// a) len(ocrConfigs) == 2 && !prevDeployment.HasCandidateInstance(pluginType): this is a new candidate instance.
// b) len(ocrConfigs) == 1 && prevDeployment.HasCandidateInstance(): this is a promotion of candidate->active.
// All other cases are invalid. This is enforced in the ccip config contract.
// TODO: this is not technically correct, CCIPHome has other transitions
// that are not covered here, e.g revokeCandidate.
func createFutureActiveCandidateDeployment(
donID uint32,
prevDeployment ccipDeployment,
@@ -344,13 +352,13 @@ func createFutureActiveCandidateDeployment(
var deployment activeCandidateDeployment
if isNewCandidateInstance(pluginType, ocrConfigs, prevDeployment) {
// this is a new candidate instance.
greenOracle, err := oracleCreator.Create(donID, cctypes.OCR3ConfigWithMeta(ocrConfigs[1]))
candidateOracle, err := oracleCreator.Create(donID, cctypes.OCR3ConfigWithMeta(ocrConfigs[1]))
if err != nil {
return activeCandidateDeployment{}, fmt.Errorf("failed to create CCIP commit oracle: %w", err)
}

deployment.active = prevDeployment.commit.active
deployment.candidate = greenOracle
deployment.candidate = candidateOracle
} else if isPromotion(pluginType, ocrConfigs, prevDeployment) {
// this is a promotion of candidate->active.
deployment.active = prevDeployment.commit.candidate
@@ -409,6 +417,7 @@ func createDON(
return nil, fmt.Errorf("failed to create CCIP exec oracle: %w", err)
}

// TODO: incorrect, should be setting candidate?
return &ccipDeployment{
commit: activeCandidateDeployment{
active: commitOracle,
Binary file added core/capabilities/ccip/launcher/launcher_arch.png
2 changes: 1 addition & 1 deletion core/capabilities/ccip/launcher/launcher_test.go
@@ -419,7 +419,7 @@ func Test_launcher_processDiff(t *testing.T) {
l := &launcher{
dons: tt.fields.dons,
regState: tt.fields.regState,
p2pID: tt.fields.p2pID,
myP2PID: tt.fields.p2pID,
lggr: tt.fields.lggr,
homeChainReader: tt.fields.homeChainReader,
oracleCreator: tt.fields.oracleCreator,