Establishing Backward Compatibility in our Data Systems #295
Replies: 7 comments 4 replies
-
Excellent job, sir 👏🏻 My two cents about some parts of the RFC:

Multiple Publishers and Versioning
Should it not be each stream instead of the subject here?
The same here; I guess we should have each stream as a separate long-running process instead of each subject, because there are too many subjects to track.
I like having multiple publishers running in parallel for each stream. As we discussed, we can separate resources, handlers, etc., so that issues in one stream don't affect the others. However, the idea of splitting this into versions is, I think, too much for now.

Versioning suggestion
As I suggested, we can have an … Since we control entirely how our users consume our streams, this approach makes it easy for us and our users to get the data from the correct place; they just need to update the crates. Also, for users that can't update the crate but can access the updated data, we can just add a config option … Both approaches I mentioned will simplify resource and deployment management.

Data Prioritization Strategies
Chain Reorganization
For this part, I guess we should think a bit more and get an opinion from @Voxelot about the cases in which this could happen and what implications it could have.

Other
Some other suggestions I would like to see here are:
-
Ah, yes, it should be …
-
Not sure if I misunderstood, but that's what I meant here:
We could use an environment variable here or simply work with static guarantees, i.e., users using …
-
Republishing covers patch fixes that will not require any action from our users, i.e., they won't need to update their crates since it's purely a data change. This is why we need to figure out whether we should republish with data consistency or data availability prioritized in mind (?)
-
Yes, that would be appreciated. I understand that chain re-orgs are not possible ATM. But if there were data changes originating from the chain, maybe from Regenesis or related, we would have to republish. This could be a breaking interface change or simply a data patch, but in general, we will republish and document the update in the CHANGELOG, allowing users to determine whether they need to re-aggregate their data or take certain actions. In the future, we could introduce an aggregation API to support indexing in specific stores (e.g., our planned SurrealDB). This would enable automated data migration management for users, streamlining the handling of these scenarios.
-
It may be challenging to generalize since each breaking interface change often comes with unique nuances. How difficult will it be to encourage high-value DApps to migrate to the latest version? Are there any instances of corrupt data in older versions that cannot be republished? What is the version distribution of DApps currently consuming our streams? It seems this will become clearer as we gain adoption. WDYT?
-
This may be somewhat outside the scope of this RFC, but I agree: ensuring data integrity is critical, and data discrepancies could originate from various sources. We definitely need a combination of robust validations, checksums, and extensive unit and end-to-end tests. Achieving this level of stability will be key for reliable data integrity, and increasing our test coverage is essential. Ideally, we can leverage tools from similar repositories within Fuel to enable more advanced mocking and simulations, which would significantly enhance our testing capabilities.
-
Overview
This RFC proposes a robust architectural framework to ensure that our data systems maintain a stable, backward-compatible interface as we release changes to production. Our primary objective is to minimize disruptions for downstream consumers, particularly to improve the developer experience for DApp integrators and consumers. This architecture will facilitate a controlled evolution of our data systems with minimized impact on DApps, regardless of the pace of updates.
Problem Context and Instability Risks
To build a reliable, scalable system, we must first understand the main sources of instability, which impact both DApp compatibility and data integrity. We categorize these risks as follows:
Data Correctness
Structural and Interface Changes
Proposed Solution Overview
Given the project's early phase, some degree of data inaccuracy is to be expected. For this reason, we recommend primarily supporting presentational DApps, while transaction-based DApps are advised to limit reliance on the data stream until further stabilization.
To address these risks, we propose two complementary solutions that can be implemented concurrently:
Controlled Republishing
Description
Controlled Republishing supports patch-level updates and corrects erroneous data in a way that minimizes DX disruptions. By isolating republishing to specific data streams, we can correct targeted streams without requiring global updates across DApp codebases.
High-Level Implementation Summary
Streams Isolation: Each stream (e.g., blocks, transactions) is managed as an independent, long-running process. A last_published checkpoint is maintained per stream to track the state of each republish operation. Sample checkpoint structure:
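(The sample structure itself did not survive the formatting above; the following is a minimal sketch assuming the checkpoint is a small serde-serializable record, with illustrative field names rather than the final schema.)

```rust
use serde::{Deserialize, Serialize};

/// Illustrative per-stream checkpoint; field names are assumptions.
#[derive(Debug, Serialize, Deserialize)]
pub struct StreamCheckpoint {
    /// Stream this checkpoint belongs to, e.g. "blocks" or "transactions".
    pub stream: String,
    /// Height of the last block published through the normal publishing path.
    pub last_published: u64,
    /// Height reached by the current republish run, if one is in progress.
    pub last_republished: Option<u64>,
    /// Number of republish runs requested so far (see republish_count below).
    pub republish_count: u64,
}
```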
Environment-Based Republishing Controls: Utilize environment variables to granularly control republishing across all streams or specific ones:
ALL_REPUBLISH_COUNT: specifies the number of blocks to republish.
ALL_REPUBLISH_START_BLOCK: defines the starting block for republishing, defaulting to the genesis block.
{STREAM}_REPUBLISH_COUNT: sets a republish count per stream.
{STREAM}_REPUBLISH_START_BLOCK: specifies the starting block per stream.
RepublisherFactory and Republisher Processes: A factory generates a dedicated Republisher process for each stream, which continuously monitors its environment variables. If republish_count is incremented, republishing initiates; otherwise, normal publishing resumes.
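As a rough sketch only, assuming Tokio and the environment-variable names listed above, the factory and per-stream republish controls might fit together roughly like this; RepublishConfig, republish, and publish_next are placeholders, not the actual implementation:

```rust
use std::env;

/// Republish settings for a single stream; per-stream variables
/// ({STREAM}_REPUBLISH_*) take precedence over the ALL_REPUBLISH_* ones.
#[derive(Debug, Clone, Copy, Default)]
struct RepublishConfig {
    count: u64,
    start_block: u64, // defaults to 0, i.e. the genesis block
}

fn env_u64(name: &str) -> Option<u64> {
    env::var(name).ok().and_then(|v| v.parse().ok())
}

fn republish_config(stream: &str) -> RepublishConfig {
    let prefix = stream.to_uppercase();
    RepublishConfig {
        count: env_u64(&format!("{prefix}_REPUBLISH_COUNT"))
            .or_else(|| env_u64("ALL_REPUBLISH_COUNT"))
            .unwrap_or(0),
        start_block: env_u64(&format!("{prefix}_REPUBLISH_START_BLOCK"))
            .or_else(|| env_u64("ALL_REPUBLISH_START_BLOCK"))
            .unwrap_or(0),
    }
}

/// Factory role: spawn one long-running Republisher task per stream.
fn spawn_republishers(streams: &[&str]) {
    for stream in streams {
        let stream = stream.to_string();
        tokio::spawn(async move {
            let mut seen_count = 0u64;
            loop {
                let cfg = republish_config(&stream);
                if cfg.count > seen_count {
                    // republish_count was incremented: replay cfg.count blocks
                    // starting at cfg.start_block, updating the checkpoint.
                    republish(&stream, cfg).await;
                    seen_count = cfg.count;
                } else {
                    // No pending republish: resume normal publishing.
                    publish_next(&stream).await;
                }
            }
        });
    }
}

// Placeholders standing in for the real publishing logic.
async fn republish(_stream: &str, _cfg: RepublishConfig) {}
async fn publish_next(_stream: &str) {}
```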
Data Prioritization Strategies: republished entries can overwrite existing data through the Put API from the key-value store. This approach maximizes data availability but may temporarily expose inconsistent data.
Question: Which should we prioritize more, data availability or consistency?
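To make the question concrete, here is a small sketch assuming the key-value store is NATS KV accessed through the async-nats crate (bucket and key handling are illustrative): put overwrites the entry unconditionally, favouring availability, while a revision-guarded update would favour consistency by rejecting concurrent or out-of-order writes.

```rust
use async_nats::jetstream::kv::Store;
use bytes::Bytes;

/// Availability-first republish: overwrite the entry in place. Readers always
/// find a value, but may briefly observe a mix of old and new data across keys.
async fn republish_available(store: &Store, key: &str, payload: Bytes) -> anyhow::Result<()> {
    store.put(key, payload).await?;
    Ok(())
}

/// Consistency-first republish: only write if the entry is still at the
/// revision we last read; a concurrent write makes this fail instead of
/// silently interleaving old and new data.
async fn republish_consistent(
    store: &Store,
    key: &str,
    payload: Bytes,
    expected_revision: u64,
) -> anyhow::Result<()> {
    store.update(key, payload, expected_revision).await?;
    Ok(())
}
```

If availability is the priority, the put-based path suffices; if consistency matters more, the republisher would likely need to carry the expected revision along with its per-stream checkpoint.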
Versioned Publishing
Description
Versioned Publishing addresses structural or interface-breaking changes that could disrupt DApps relying on specific data formats or API versions. As shown below, Versioned Publishing introduces non-trivial complexity and resource requirements, so it should be reserved for cases where API stability has been achieved and backward compatibility for high-value DApps is essential.
High-Level Implementation Summary
Subject and Payload Versioning:
To streamline version management and reduce code complexity, we introduce an ALLOW_REPUBLISHING environment variable that disables republishing for previous versions. This approach encourages users who require data corrections to upgrade to the latest version, simplifying maintenance. Otherwise, supporting multiple versions would require versioned Subjects and Payloads (e.g., BlockSubjectV_0_1 for the BlockV_0_1 payload, and BlockSubjectV_0_2 for the BlockV_0_2 payload), significantly increasing codebase complexity and the maintenance burden. This variable enables us to keep the system manageable and maintain a cleaner architecture as we evolve our data schemas.
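For illustration only, this is roughly the shape of the versioned Subjects and Payloads the paragraph above wants to avoid, alongside the ALLOW_REPUBLISHING gate it introduces instead; the fields, the extra field in the newer payload, and the subject string format are assumptions, not the existing crate API:

```rust
#![allow(non_camel_case_types)] // names follow the RFC's own examples

use serde::{Deserialize, Serialize};

// Every breaking schema change would fork a new payload type...
#[derive(Debug, Serialize, Deserialize)]
pub struct BlockV_0_1 {
    pub height: u64,
    pub id: String,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct BlockV_0_2 {
    pub height: u64,
    pub id: String,
    pub producer: String, // hypothetical field added by the breaking change
}

// ...and a matching subject type per version, multiplying the surface to maintain.
pub struct BlockSubjectV_0_1 {
    pub height: u64,
}

impl BlockSubjectV_0_1 {
    pub fn subject(&self) -> String {
        format!("blocks.v0_1.{}", self.height) // subject format is illustrative
    }
}

pub struct BlockSubjectV_0_2 {
    pub height: u64,
}

impl BlockSubjectV_0_2 {
    pub fn subject(&self) -> String {
        format!("blocks.v0_2.{}", self.height)
    }
}

/// With the ALLOW_REPUBLISHING switch, republishing for older versions can
/// simply be turned off instead of maintaining all of the types above.
pub fn republishing_allowed() -> bool {
    std::env::var("ALLOW_REPUBLISHING")
        .map(|v| v == "true" || v == "1")
        .unwrap_or(false)
}
```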
Instance Management
Deploy a separate Publisher instance for each supported version. For instance, maintain a 0.1 Publisher alongside a 0.2 Publisher to ensure compatibility; these would point to the buckets fuel_{stream}_0_1 and fuel_{stream}_0_2 in the NATS cluster, respectively. As new versions are introduced, additional publisher instances will be deployed, incrementally increasing deployment complexity.

Governance and Operational Considerations
Conclusion
The combination of Controlled Republishing and Versioned Publishing offers a flexible, scalable approach to maintaining a stable data interface amid rapid system evolution. Controlled Republishing allows us to address data inconsistencies with minimal impact on developers, while Versioned Publishing provides essential backward compatibility for critical DApps reliant on legacy schemas. Together, these strategies enable a controlled, developer-friendly upgrade path that minimizes disruption and enhances resilience, ensuring that both existing and future DApps can confidently interact with our evolving data systems.