Data Serialization Format (CLIO) #6
Replies: 3 comments 3 replies
-
While I agree that speed is very important in most decentralized applications, my sense is that a peer-to-peer cryptographic network would be limited more by bandwidth than by speed, as most of the bottleneck is in distributing the information horizontally rather than within the vertical of a given transaction. More precisely, if each user is effectively a node within the system, each node can simultaneously compute and generate a transaction to be broadcast with no effect on others’ transactions, while the broadcasting itself does have an effect. One could even make the transaction take longer, by making the device generate some proof of work, as a means to prevent spamming of the system. I’m curious, what are the advantages of using “->” for addressing a substructure rather than “.”? It seems more confusing to introduce a different means for accessing the data, forcing the person programming to remember whether to put a “.” or a “->”. I’m not much of a programmer though, so take that particular opinion with a grain of salt.
-
Is Canonical Representation a feature or a handicap? If the object schema changes trivially, the serialization could fail. So it's a trade-off -- the gains come from assuming the schema will not change. In a world where related services cannot be updated simultaneously, schema updates might become very delicate. The trade-off only becomes economical in a very fast-paced environment, imo. Love it!
-
@bytemaster minor corrections: in the Example Usage section, for the second reflection macro, the final two parameters should be reversed. Instead of: It should be: Also, search for
-
The purpose of this discussion is to document the thoughts that went into the choice of a "custom" format and to gather community feedback on alternatives. There are many desirable properties that a format can have; some of them are mutually exclusive. Here is a brief list:
Self Describing
A self-describing format is one that does not require the reader to have a schema in order to interact with it. Examples of this include JSON, BSON, and XML. This is in contrast to schema-driven formats like Protocol Buffers, FlatBuffers, and Cap'nProto.
Human Readable / Editable
A human readable format can be easily opened and manipulated in a text editor. This makes it easier to develop and debug.
Canonical Representation
A canonical representation means that for a given format there is a single, unambiguous, standardized way to write the data that can be reliably reproduced. Formats like JSON do not specify a canonical order for the keys and the "same" object can be represented several different ways. Likewise, Protocol Buffers lacks a canonical representation in its standard despite attempts to add Canonical Encoding Rules.
Backward / Forward Compatibility
The only constant in the universe is change. A backward compatible format allows new versions of a program to read old formats. A forward compatible format allows old code to interact with new formats. JSON, Protocol Buffers, and Cap'nProto all support backward/forward compatibility.
Encoding / Decoding Speed
The data format can have a huge impact on performance. Human Readable formats are almost always slower than Binary formats. That said, even among binary formats there can be massive differences in speed. Every time data is read or written to the network or database there is a potential encode/decode step. Manipulating JSON objects or unpacking Protocol Buffer blobs can become a major limit on throughput.
A corollary to speed is efficiency. When building a protocol that needs to minimize resource usage efficiency can be critical to achieving the desired scale.
Data Size
Data size has a direct impact on network bandwidth and database storage. Human Readable formats are often larger while binary formats can typically be smaller. Compression techniques can be utilized to close the gap at the expense of increased encoding/decoding time.
Language Support
Ideally a format should be easy to read and write from many languages. Self-describing formats, like JSON, are widely used because tools exist in almost every language. Schema-driven formats, such as Protocol Buffers and Cap'nProto, rely upon code generators.
What Properties are Desirable for Clarion IO?
Clarion relies heavily on cryptographic signatures and hash functions, which means a canonical representation is an absolute requirement. Typical decentralized systems suffer from poor performance and scalability, which means that Clarion should utilize a fast, efficient format. This pushes us toward binary-encoded messages and away from human readable ones.
The following feature matrix is borrowed and modified from Cap'nProto:
The primary difference between Clarion IO and all other approaches is that no code-generation tooling is required, which means the developer gets to use their native types. This makes the serialization library unintrusive. The other approaches' code generators push the schema's code style into your application, and if you convert to/from native types then it defeats the purpose of zero-copy.
The Clarion IO reflection system enables it to encode/decode data structures in the following formats:
It will even generate a ProtoBuf schema file which can be used by other tools. With a little bit of work it should be possible to write code generators for other languages that can read the native flat format.
Example Usage
You can then initialize an object using the conventional C++ approach:
Then to serialize it into a buffer suitable for the wire or a database:
API Magic
Here is where the magic happens, without decoding the data we can access the fields like so:
You will notice that the syntax is almost identical to accessing the same data off of the tester object, with the exception of using '->' instead of '.' to traverse dynamically sized sub-objects. This keeps the code clean and readable.
Read Access Performance Magic
Now for the biggest surprise of them all, the performance. Here is a simple algorithm implemented in terms of the native C++ data structure (not the serialized binary decoder).
Here is the same algorithm using the serialized form:
Here is the result of running this benchmark:
Implementing the algorithm in terms of the binary decoder is 40% faster than the same algorithm implemented in terms of the native code. The C++ optimizer was able to eliminate all of the overhead from the decoder interface and allow the algorithm to benefit from the improved memory locality. This is demonstrated by the comparison to the debug build, which doesn't optimize away the decoder overhead. In debug mode the copy-free parsing is 10x slower than accessing the native C++ struct.
Encode / Decode Performance
I took the complex data structure described above and packed/unpacked it into various formats leveraging the reflection macro. In this case I am not using the code generated by a ProtoBuf schema, but instead my own custom implementation of ProtoBuf serialization/deserialization. I am leveraging RapidJSON for parsing. The resulting numbers are indicative of the relative complexity of each format and, like all benchmarks, should be taken with a grain of salt.
For this particular data structure the Clarion Flat Buffer format is 40% larger than ProtoBuf and 52% larger than the EOSIO binary format (but still 22% smaller than JSON). This is largely due to the extra padding from not using variable-length integers and from the extra pointers and size information encoded. Since most of this extra padding is 0's, using a fast compression algorithm borrowed from Cap'nProto we can get a "compressed" version where Clarion Flat Buffer is only 25% bigger than EOSIO binary and actually 3% smaller than similarly compressed ProtoBuf.
We can conclude from this information that using Cap'nProto compression adds about 9 ms to the decompression time; therefore unpacking a compressed buffer into a zero-copy read is 25% faster than reading the EOSIO binary format without compression, while only being 9% larger on the wire. The biggest drawback of Clarion Flat Buffers is the initial serialization time, which takes about 77% longer than EOSIO binary; however, the EOSIO binary format lacks many of the other desirable properties. When compared to ProtoBuf, Clarion Flat Buffers wins by almost every metric except uncompressed wire size.
Utilizing in Browser
The Clarion IO library can be compiled to a WebAssembly module and used to convert to/from JSON and ProtoBuf formats. This technique can also be used with any language that supports embedding WebAssembly or linking to a C library. Eventually code generation can be used.
Why didn't I compare to Cap'nProto, SBE, or FlatBuffers?
One of my design goals was to eliminate the use of a schema language and code generation. The wire formats for these other protocols are complex and would require significant effort to implement in terms of Clarion IO's reflection framework. All things considered I would expect them to perform in a similar manner to Clarion Flat format.
Wire Format
Trivial C Structs
For C++ structs which are "trivially copyable" the wire format is the same as the C++ memory layout.
Strings
[32 bit size] [ size bytes ] [null term]
Structs with Dynamic Size Fields
Fixed-size struct fields are encoded sequentially, while dynamic fields are encoded as a uint32_t offset pointer to memory allocated after all fixed-size fields/offset pointers in the struct.
Example:
As an optimization, if the field dynamicB was an empty string then it could be represented by a 'null' offset_ptr which would serialize like so:
This works because a string of size 0 and a null offset_ptr are both represented as a 32-bit integer with the same value (0), and 0 also happens to be a null terminator, which allows the .c_str() API to work properly as well.
Arrays (std::vector)
Arrays of trivial C structs or types are packed sequentially, whereas arrays of dynamically sized types are packed as an array of offset pointers.
[32 bit size ] [size bytes]
If a struct contains a dynamic array that happens to be empty, it can utilize the same "null" offset_ptr optimization that is used for strings. This is because a null offset_ptr and an empty array have the same binary format.
Optionals (std::optional)
These are implemented as an offset_ptr to the type, which means an excluded optional takes up 4 zero-bytes. Fortunately this is compressed out by the Cap'nProto compression algorithm.
Variants (std::variant<T1,T2,...>)
Variants are encoded as a 64 bit "type" followed by 8 bytes of data. If the content of the current type of the variant happens to be of dynamic size, then the last 4 bytes are interpreted as an offset ptr. If the variant contains a simple type (char, int, uint64_t, double, etc) then it can be stored "in-line".
The type id is a unique number derived from either a base32 encoding of the struct type name (assuming the type name contains only valid base32 characters and is fewer than 12 characters long) or the hash of the type name. Hash collisions should be rare and can be detected at compile time. If one is found, it is easy to rename your type to avoid the collision.
Forward Compatibility
The most "generic" type one could use is a vector<variant<>>. This allows you to add any number of fields in the future and to have them be of any type. Alternatively, you can add a vector<variant<>> as the last field in any struct to enable forward extensions. In principle this is similar to how ProtoBuf requires that you never reuse a sequence number; in this case you never reuse a type id. Nodes that don't understand the variant type can safely ignore it. With the empty-vector optimization, the overhead of having unused forward compatibility for all of your types is 4 zero-bytes per type. With Cap'nProto compression this is largely eliminated.
Other Use Cases
With this serialization format, EOSIO smart contracts could gain added performance by avoiding the deserialization step every time they read an object from the database. If this were used as a block and transaction format, then entire blockchains could gain increases in efficiency and performance.
What do you think? How could the format be improved?