Data Serialization Format (CLIO) #6
Replies: 3 comments 3 replies
-
While I agree that speed is very important in most decentralized applications, my sense is that a peer-to-peer cryptographic network would be limited more by bandwidth than by speed, as most of the bottleneck is in distributing the information horizontally rather than within the vertical of a given transaction. More precisely, if each user is effectively a node within the system, each node can simultaneously compute and generate a transaction to be broadcast with no effect on others’ transactions, while the broadcasting itself does have an effect. One could even make the transaction take longer, by making the device generate some proof of work, as a means to prevent spamming of the system. I’m curious, what are the advantages of using “->” for addressing a substructure rather than “.”? It seems more confusing to introduce a different means for accessing the data, forcing the person programming to remember whether to put a “.” or a “->”. I’m not much of a programmer though, so take that particular opinion with a grain of salt.
-
Is Canonical Representation a feature or a handicap? If the object schema changes trivially, the serialization could fail. So it's a trade-off -- the gains come from assuming the schema will not change. In a world where related services cannot be updated simultaneously, schema updates might become very delicate. The trade-off only becomes economical in a very fast-paced environment, imo. Love it!
-
@bytemaster minor corrections: in the Example Usage section, for the second reflection macro, the final two parameters should be reversed. Instead of: It should be: Also, search for
-
The purpose of this discussion is to document the thoughts that went into the choice of a "custom" format and to gather community feedback on alternatives. There are many desirable properties that a format can have; some of them are mutually exclusive. Here is a brief list:
Self Describing
A self-describing format is one that does not require the reader to have a schema in order to interact with it. Examples of this include JSON, BSON, and XML. This is in contrast to schema-driven formats like Protocol Buffers, FlatBuffers, and Cap'nProto.
Human Readable / Editable
A human readable format can be easily opened and manipulated in a text editor. This makes it easier to develop and debug.
Canonical Representation
A canonical representation means that for a given format there is a single, unambiguous, standardized way to write the data that can be reliably reproduced. Formats like JSON do not specify a canonical order for the keys and the "same" object can be represented several different ways. Likewise, Protocol Buffers lacks a canonical representation in its standard despite attempts to add Canonical Encoding Rules.
Backward / Forward Compatibility
The only constant in the universe is change. A backward compatible format allows new versions of a program to read old formats. A forward compatible format allows old code to interact with new formats. JSON, Protocol Buffers, and Cap'nProto all support backward/forward compatibility.
Encoding / Decoding Speed
The data format can have a huge impact on performance. Human Readable formats are almost always slower than Binary formats. That said, even among binary formats there can be massive differences in speed. Every time data is read or written to the network or database there is a potential encode/decode step. Manipulating JSON objects or unpacking Protocol Buffer blobs can become a major limit on throughput.
A corollary to speed is efficiency. When building a protocol that needs to minimize resource usage efficiency can be critical to achieving the desired scale.
Data Size
Data size has a direct impact on network bandwidth and database storage. Human Readable formats are often larger while binary formats can typically be smaller. Compression techniques can be utilized to close the gap at the expense of increased encoding/decoding time.
Language Support
Ideally a format should be easy to read and write from many languages. Self-describing formats, like JSON, are widely used because tools exist in almost every language. Schema-driven formats, such as Protocol Buffers and Cap'nProto, rely upon code generators.
What Properties are Desirable for Clarion IO?
Clarion relies heavily on cryptographic signatures and hash functions, which means a canonical representation is an absolute requirement. Typical decentralized systems suffer from poor performance and scalability, which means that Clarion should utilize a fast, efficient format. This pushes us toward binary-encoded messages and away from human readable ones.
The following feature matrix is borrowed and modified from Cap'nProto:
The primary difference between Clarion IO and all other approaches is that no code-generation tooling is required, which means the developer gets to use their native types. This makes the serialization library unintrusive. The other approaches' code generators push the schema's code style into your application, and if you convert to/from native types then it defeats the purpose of zero-copy.
The Clarion IO reflection system enables it to encode/decode data structures in the following formats:
It will even generate a ProtoBuf schema file which can be used by other tools. With a little bit of work it should be possible to write code generators for other languages that can read the native flat format.
Example Usage
You can then initialize an object using the conventional C++ approach:
Then to serialize it into a buffer suitable for the wire or a database:
API Magic
Here is where the magic happens, without decoding the data we can access the fields like so:
You will notice that the syntax is almost identical to accessing the same data off of the tester object, with the exception of using '->' instead of '.' to traverse dynamically sized sub-objects. This keeps the code clean and readable.
Read Access Performance Magic
Now for the biggest surprise of them all, the performance. Here is a simple algorithm implemented in terms of the native C++ data structure (not the serialized binary decoder).
Here is the same algorithm using the serialized form:
Here is the result of running this benchmark:
Implementing the algorithm in terms of the binary decoder is 40% faster than the same algorithm implemented in terms of the native code. The C++ optimizer was able to eliminate all of the overhead from the decoder interface and allow the algorithm to benefit from the improved memory locality. This is demonstrated by the comparison to the debug build, which doesn't optimize away the decoder overhead. In debug mode the copy-free parsing is 10x slower than accessing the native C++ struct.
Encode / Decode Performance
I took the complex data structure described above and packed/unpacked it into various formats leveraging the reflection macro. In this case I am not using the code generated by a ProtoBuf schema, but instead my own custom implementation of ProtoBuf serialization/deserialization. I am leveraging RapidJSON for parsing. The resulting numbers are indicative of the relative complexity of each format and, like all benchmarks, should be taken with a grain of salt.
For this particular data structure the Clarion Flat Buffer format is 40% larger than ProtoBuf and 52% larger than the EOSIO binary format (but still 22% smaller than JSON). This is largely due to the extra padding from not using variable-length integers and from the extra pointers and size information encoded. Since most of this extra padding is 0's, using a fast compression algorithm borrowed from Cap'nProto we can get a "compressed" version where Clarion Flat Buffer is only 25% bigger than EOSIO binary and actually 3% smaller than similarly compressed ProtoBuf.
We can conclude from this information that using Cap'nProto compression adds about 9 ms to the decompression time; therefore unpacking a compressed buffer into a zero-copy read is 25% faster than reading the EOSIO binary format without compression, while only being 9% larger on the wire. The biggest drawback of Clarion Flat Buffers is the initial serialization time, which takes about 77% longer than EOSIO binary; however, the EOSIO binary format lacks many of the other desirable properties. When compared to ProtoBuf, Clarion Flat Buffers wins by almost every metric except uncompressed wire size.
Utilizing in Browser
The Clarion IO library can be compiled to a WebAssembly module and used to convert to/from JSON and ProtoBuf formats. This technique can also be used with any language that supports embedding WebAssembly or linking to a C library. Eventually code generation can be used.
Why didn't I compare to Cap'nProto, SBE, or FlatBuffers?
One of my design goals was to eliminate the use of a schema language and code generation. The wire formats for these other protocols are complex and would require significant effort to implement in terms of Clarion IO's reflection framework. All things considered I would expect them to perform in a similar manner to Clarion Flat format.
Wire Format
Trivial C Structs
For C++ structs which are "trivially copyable" the wire format is the same as the C++ memory layout.
Strings
[32 bit size] [ size bytes ] [null term]
Structs with Dynamic Size Fields
Fixed-size struct fields are encoded sequentially, while dynamic fields are encoded as a uint32_t offset pointer to memory allocated after all fixed-size fields/offset pointers in the struct.
Example:
As an optimization, if the field dynamicB was an empty string then it could be represented by a 'null' offset_ptr which would serialize like so:
This works because a string of size 0 and a null offset_ptr are both represented as a 32-bit integer with the same value (0), and 0 also happens to be a null terminator, which allows the .c_str() API to work properly as well.
Arrays (std::vector)
Arrays of trivial C structs or types are packed sequentially, whereas arrays of dynamically sized types are packed as an array of offset pointers.
[32 bit size ] [size bytes]
If a struct contains a dynamic array that happens to be empty, it can utilize the same "null" offset_ptr optimization that is used for strings. This is because a null offset_ptr and an empty array have the same binary format.
Optionals (std::optional)
These are implemented as an offset_ptr to the type, which means an excluded optional takes up 4 zero-bytes. Fortunately this is compressed out by the Cap'nProto compression algorithm.
Variants (std::variant<T1,T2,...>)
Variants are encoded as a 64 bit "type" followed by 8 bytes of data. If the content of the current type of the variant happens to be of dynamic size, then the last 4 bytes are interpreted as an offset ptr. If the variant contains a simple type (char, int, uint64_t, double, etc) then it can be stored "in-line".
The type id is a unique number derived from either a base32 encoding of the struct type name (assuming the type name contains only valid base32 characters and is fewer than 12 characters long) or the hash of the type name. Hash collisions should be rare and can be detected at compile time. If one is found, it is easy to rename your type to avoid the collision.
Forward Compatibility
The most "generic" type one could use is a vector<variant<>>. This allows you to add any number of fields in the future and to have them be of any type. Alternatively, you can add a vector<variant<>> as the last field in any struct to enable forward extensions. In principle this is similar to how ProtoBuf requires that you never reuse a sequence number; in this case you never reuse a type id. Nodes that don't understand the variant type can safely ignore it. With the empty-vector optimization, the overhead of having unused forward compatibility for all of your types is 4 zero-bytes per type. With Cap'nProto compression this is largely eliminated.
Other Use Cases
With this serialization format, EOSIO smart contracts could gain added performance by avoiding the deserialization step every time they read an object from the database. If this were used as a block and transaction format, then entire blockchains could gain increases in efficiency and performance.
What do you think? How could the format be improved?