Data Parsing & Data Optimization for NATS #110
Replies: 5 comments
-
Really nice work here @0xterminator, congrats 👏🏻

tldr: Let's keep it simple and focus on one compression and one serialization method for now, plus JSON for SDK use. We can add more options later if needed.

I appreciate the concept of a "universal data parsing" system that can leverage multiple serialization and compression methods. However, from a product perspective, and considering our current stage of development, I have some concerns. At this phase, introducing multiple options might create unnecessary complexity and overhead that we should avoid.

Instead, I suggest we take a more targeted approach. Let's select a single, high-performance algorithm for both compression and serialization that best meets our current needs. We should, however, include JSON as an additional option - not for inter-service messaging, but for SDK integration and high-level usage after data consumption.

This approach doesn't mean we're closing doors for the future. As we gather more user feedback and encounter specific use cases requiring different algorithms, we can always add support for them incrementally. By focusing on a single method for each, we can deeply optimize our implementation, ensure robust performance, and simplify our codebase, testing procedures, and ongoing maintenance efforts. It'll also allow us to provide more detailed and focused documentation for developers.

This strategy allows us to move forward efficiently while leaving room for future enhancements based on real-world usage and requirements. It strikes a balance between functionality, simplicity, and adaptability - which I believe is crucial at our current stage.
-
I liked the multiple options. That way, we can focus on benchmarking and pick the most optimal one based on what you described. The same goes for data encoding/decoding. We should just keep portability in mind, as the client for the stream can also be built in languages like JavaScript.
-
Why not consider using the Fuel ABI encoding/decoding for this? All our SDKs already come with that encoder/decoder, and it's fairly compact all things considered. I'd be open to seeing which options are the best over the wire. I'd be curious to hear what @Voxelot has to say.
-
After running a series of benchmark tests, I am pasting a block of results that depicts the overall conclusions quite nicely:
-
Keys

The benchmark tests quoted above have a few key items to be explained:
Analysis of Results

The results above have been analyzed in terms of average time and statistical significance.

serialize (no compression):
serialize + compression:
deserialize (no compression):
decompress_deserialize:

Below are the significant results:
Key Takeaways:

Binary formats are generally faster: binary serialization formats like Bincode and Postcard tend to perform better in terms of both time and stability under the various compression schemes.

Compression size analysis:

A few analyses were performed, and the following compressed and serialized data sizes were compared in the benchmarks according to the serialization and compression schemes:

json 45700 bytes

It seems that postcard delivers the best data sizes, but that needs to be analyzed further with varying data types.

Conclusions

Bincode and postcard are the two main alternatives to be considered.

bincode shows the fastest serialization, averaging ~9.1 µs overall. Applied compression does not contribute much to data reduction with bincode, which is also due to the low entropy of the data we have. Serialized data sizes are slightly higher compared to postcard. bincode with zstd (fastest level) seems to bring the best performance.

postcard, on the other hand, shows a ~13.5 µs average time. Here again, compression brings little value, which is again due to the low entropy present in the data. Serialized data sizes tend to be smaller compared to bincode. postcard with zlib (fastest level) seems to bring the best performance.

I would suggest using bincode (+ zstd + fastest) vs. postcard (+ zlib + fastest) in cloud-based infrastructure with real-time data and comparing the overall performance based on the following criteria:
-
Goals
In the context of fuel-streams and NATS in particular, there are two important aspects related to data that we need to analyze well before we attempt to optimize the data transport for production.
In the context of NATS, we generally aim to have small and easily transportable data packages sent over the wire, especially as the NATS cluster is likely to run on a different cloud infra and publishers/consumers may be located at different geographical locations. There could also be data delays, latencies, relays, etc. slowing down the data transport between the moment it gets published over the wire and the time by which it gets consumed and unwrapped into its expected, ready-to-use format. To minimize the latter, compression and serialization often contribute to keeping the data packages small and compact, thus reducing the load on the message bus system, improving delivery times, and at the same time reducing the storage required for keeping the data persistently. There are two major criteria we need to consider when evaluating the success of our optimization:
When publishing:
When consuming:
The main mathematical goal of our discussion and testing will be to minimize the aggregated sum of all times for the publishing and, respectively, the consuming parts. Of course, each of these could be compared in isolation to see which serialization/compression works best on its own (and also combined) given the published data morphology we have. We might also quickly have a look at the transport time based on package size.
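As a sketch of that objective (notation mine; the individual terms correspond to the publishing and consuming phases listed above):

$$
\min_{\text{ser},\ \text{comp}} \; T_{\text{total}} \;=\; \underbrace{t_{\text{serialize}} + t_{\text{compress}} + t_{\text{publish}}}_{\text{publishing}} \;+\; \underbrace{t_{\text{deliver}} + t_{\text{decompress}} + t_{\text{deserialize}}}_{\text{consuming}}
$$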
Parsing Technology
These are some good articles with comparisons and benchmarks that shed some light on the topics of data serialization and compression:
Since we are building an e2e Rust solution, it is highly recommended to utilize a native, sound, and robust Rust-based library. Since we are also heavily tokio-oriented, we shall also make use of a tokio-based solution.
For the current implementation, and based on the scope of the articles mentioned, we shall examine the following crates for serialization/deserialization:
which support the serde-based `bincode`, `serde_json`, and `postcard` schemes. We also provide support for prost (protobuf) schemas via the `prost` crate. We shall also consider the following for compression/decompression:
```toml
async-compression = { version = "0.4.11", features = ["all"] }
```
async-compression is an async, tokio-based library with support for the most common compression formats:
`Zlib`, `Gzip`, `Brotli`, `Bz`, `Lzma`, `Deflate`, and `Zstd`.
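For a feel of the API, here is a minimal zstd round-trip over in-memory buffers (my own sketch, assuming the `tokio` and `zstd` features are enabled):

```rust
use async_compression::tokio::write::{ZstdDecoder, ZstdEncoder};
use async_compression::Level;
use tokio::io::AsyncWriteExt;

async fn zstd_roundtrip(raw: &[u8]) -> std::io::Result<Vec<u8>> {
    // Compress at the fastest level into an in-memory buffer.
    let mut encoder = ZstdEncoder::with_quality(Vec::new(), Level::Fastest);
    encoder.write_all(raw).await?;
    encoder.shutdown().await?; // finalize the zstd frame
    let compressed = encoder.into_inner();

    // Decompress back into the original bytes.
    let mut decoder = ZstdDecoder::new(Vec::new());
    decoder.write_all(&compressed).await?;
    decoder.shutdown().await?;
    Ok(decoder.into_inner())
}
```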
Implementation
A data parser builder is suggested as in the MR: #110. The idea behind it is to expose interfaces for setting `CompressionType`, `CompressionLevel`, and `SerializationType`. The `SerializationType` and `CompressionType` enums allow setting the schema types using an appropriate variant.
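As an illustration only (variant names are my assumption; the authoritative definitions are in the MR), the enums might take a shape like:

```rust
// Hypothetical sketch of the configuration enums; the exact
// variants and naming live in the MR and may differ.
#[derive(Debug, Clone, Copy)]
pub enum SerializationType {
    Bincode,
    Postcard,
    Json,
}

#[derive(Debug, Clone, Copy)]
pub enum CompressionType {
    None,
    Zlib,
    Gzip,
    Brotli,
    Zstd,
}

#[derive(Debug, Clone, Copy)]
pub enum CompressionLevel {
    Fastest,
    Default,
    Best,
}
```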
The latter allows building a universal data parser.
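A hypothetical construction, assuming builder-style setters (the exact method names are defined in the MR):

```rust
// Illustrative only: method names are assumptions, not the MR's exact API.
let parser = DataParser::builder()
    .serialization_type(SerializationType::Bincode)
    .compression_type(CompressionType::Zstd)
    .compression_level(CompressionLevel::Fastest)
    .build();
```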
The parser would then expose interfaces for these operations.
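Sketched here as a trait with hypothetical signatures (the real interfaces live in the MR); note the combined and standalone variants discussed below:

```rust
use serde::{de::DeserializeOwned, Serialize};

// Hypothetical interface shape; names and error type are illustrative.
pub trait DataParserApi {
    type Error;

    // Standalone operations.
    async fn serialize<T: Serialize + Sync>(&self, data: &T) -> Result<Vec<u8>, Self::Error>;
    async fn deserialize<T: DeserializeOwned>(&self, bytes: &[u8]) -> Result<T, Self::Error>;
    async fn compress(&self, raw: &[u8]) -> Result<Vec<u8>, Self::Error>;
    async fn decompress(&self, raw: &[u8]) -> Result<Vec<u8>, Self::Error>;

    // Combined convenience operations.
    async fn serialize_and_compress<T: Serialize + Sync>(&self, data: &T) -> Result<Vec<u8>, Self::Error>;
    async fn decompress_and_deserialize<T: DeserializeOwned>(&self, bytes: &[u8]) -> Result<T, Self::Error>;
}
```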
We suggest having combined methods for serialization + compression and decompression + deserialization, as well as standalone versions of these operations. The data parser internally wraps various compression/decompression and serialization/deserialization methods, which can easily be set via the CompressionType, CompressionLevel, and SerializationType enums suggested above. As can be seen, serialization and deserialization of structures that implement serde::Serialize and serde::DeserializeOwned (which most of the fuel-core-types do) can be handled by all wrapped serializers.
A test example might look like the following.
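A sketch, reusing the hypothetical names from above and a toy payload (real tests would use fuel-core-types data):

```rust
#[derive(Debug, PartialEq, serde::Serialize, serde::Deserialize)]
struct TestBlock {
    height: u64,
    hash: String,
}

#[tokio::test]
async fn serialize_compress_roundtrip() {
    // Hypothetical builder API, as sketched above.
    let parser = DataParser::builder()
        .serialization_type(SerializationType::Bincode)
        .compression_type(CompressionType::Zstd)
        .compression_level(CompressionLevel::Fastest)
        .build();

    let block = TestBlock { height: 42, hash: "0xabc".into() };

    // Round-trip: the decoded value must equal the original.
    let bytes = parser.serialize_and_compress(&block).await.unwrap();
    let decoded: TestBlock = parser.decompress_and_deserialize(&bytes).await.unwrap();
    assert_eq!(block, decoded);
}
```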
Data Types
The data types which ought to be serialized/deserialized are all described in the notion document:
https://www.notion.so/fuellabs/Fuel-Streams-Tech-Specs-f18465f40b5d433fbae90427649bea9a#c3b979f13e004dcea75df54766b4a9ae
along with the data format specification. One can use this structure and data morphology when applying an appropriate compression.
Testing Strategy
1. Objective
The primary objective of the testing strategy is to establish a benchmarking protocol using the `criterion` crate to identify the most efficient serialization and compression methods for optimizing data payloads sent over NATS. The goal is to minimize the total time taken for:

2. Scope
This benchmark will assess multiple serialization and compression combinations to determine which provides the best balance of speed and payload size for communication via NATS. The scope includes:
3. Benchmarking Setup
Hardware and Environment
Test Data
Define a standard test dataset based on a given test struct which includes diverse data types (strings, integers, custom structs).
Ensure the data is real data taken from fuel-core-types.
4. Test Methodology
Testing Phases
Each test should be run multiple times to calculate the average and standard deviation for each phase.
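A minimal `criterion` harness along these lines could drive each phase (the test struct and benchmark names here are placeholders, not the final dataset):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use serde::{Deserialize, Serialize};

// Placeholder payload; real benchmarks would use data from fuel-core-types.
#[derive(Serialize, Deserialize)]
struct TestBlock {
    height: u64,
    hash: String,
    transactions: Vec<String>,
}

fn sample() -> TestBlock {
    TestBlock {
        height: 1_000_000,
        hash: "0xdeadbeef".repeat(4),
        transactions: (0..100).map(|i| format!("tx-{i}")).collect(),
    }
}

fn bench_serialize(c: &mut Criterion) {
    let data = sample();
    // criterion repeats each closure many times and reports the mean,
    // standard deviation and outliers per benchmark.
    c.bench_function("bincode/serialize", |b| {
        b.iter(|| bincode::serialize(black_box(&data)).unwrap())
    });
    c.bench_function("postcard/serialize", |b| {
        b.iter(|| postcard::to_allocvec(black_box(&data)).unwrap())
    });
}

criterion_group!(benches, bench_serialize);
criterion_main!(benches);
```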
5. Evaluation Criteria
6. Reporting
Present the results in a clear format, using tables and graphs to compare the time and efficiency of each method.
Include raw data logs for transparency and further analysis.
7. Recommendations
Based on the benchmark results, recommend the best serialization and compression combinations for different types of data payloads.
Suggest configuration settings for optimal performance with NATS.