Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace JSON-based marshal with dense binary encoding #2589

Open
warner opened this issue Mar 6, 2021 · 13 comments
Open

replace JSON-based marshal with dense binary encoding #2589

warner opened this issue Mar 6, 2021 · 13 comments
Assignees
Labels
enhancement New feature or request marshal package: marshal SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Mar 6, 2021

What is the Problem Being Solved?

(I thought we had a ticket for this already, but I couldn't find it.. feel free to close this as a duplicate if you find one)

Our marshal library currently uses a JSON-based encoding scheme to convert capability-bearing object graphs into pure data ("capdata", which is an object with a body string and slots array of object reference strings). It works by building a tree of objects from the original object graph, then using the built-in JSON.stringify to convert the tree into a single string. The deserialize function does the same thing in reverse.

The swingset comms vat has a related encoding scheme to convert messages to/from the "comms protocol", which adds additional control messages to the mix (deliver, notify, dropImport, and sequence numbers). It basically takes the two-part capdata object and encodes it into a single string.

The resulting output formats are pretty easy to specify, and nicely neutral (it would be pretty straightforward to port them to other languages than JavaScript), and convenient for humans to read the encoded output. But they're awfully fluffy. Common JS objects like undefined and BigInt(1) expand to 23- and 33- byte strings respectively.

#1661 touches on shrinking the representation of undefined, but points out that it would be a distraction: the real task is to replace the encoding entirely, with some dense binary format.

Given our heritage and community, CapnProto is a likely choice. #1827 explores the idea (in a different context). We might also consider CBOR (RFC 8949), which I think is the encoding of preference in the IPFS world.

This touches on the question of how we want to represent binary data in our system: currently we can use the JS ByteArray in vat code, but they are mutable, and marshal can't handle them, so we're left in the awkward position of recommending base64-encoded strings, eww. #46 ("large string device") touches on this, as the data it manages might be arbitrary bytes.

Compatibility Issues

Every vat within a single kernel should probably use the same encoding scheme, because the messages created by liveslots in one vat are sent into the kernel, (sometimes) stored in kernel promise queues, and delivered to other vats. The kernel should not be in the business of interpreting the body of capdata, so I don't think I'd want it responsible for conversion. User-level vat code does not generally see the output of marshal, because liveslots and/or the virtual objects layer abstracts that away, although I believe userspace is given access to serialize (but not deserialize) for some reason I don't currently understand.

Converting an existing kernel to the new format could be tricky. We'd need to convert all vats' liveslots code at the same time, and simultaneously convert anything on the run-queue and the various unresolved-promise message queues (which assumes we could write a function that converts an old-format message to a new-format message, which at least sounds plausible).

Two different swingsets, connected by a comms connection, have a similar problem. We might lean more heavily on that hypothetical format-conversion function, and make the comms vat tolerant of the other version (upconverting upon receipt, downconverting on transmit).

@warner warner added enhancement New feature or request marshal package: marshal SwingSet package: SwingSet labels Mar 6, 2021
@dtribble
Copy link
Member

dtribble commented Mar 7, 2021

We should also switch before this to a more efficient text representation. That makes integration and debugging in browsers and other web infrastructure much easier. There's no reason not to have a format that is more readable and incidentally more efficient than the JSON encoding.

@erights
Copy link
Member

erights commented Mar 7, 2021

There's no reason not to have a format that is more readable and incidentally more efficient than the JSON encoding.

There are plenty of reasons. Whether this is the right thing to do or not, it is a tradeoff. For one thing, any text format other than JSON will need a parser written in JS, so it will be hard to beat JSON.parse on speed.

@dckc
Copy link
Member

dckc commented Mar 8, 2021

I'm struggling to understand the problem statement. All I can find is "they're awfully fluffy". What are the symptoms of this problem? What suggests these symptoms are going to have sufficient impact that it's worth a change?

@dckc
Copy link
Member

dckc commented Mar 12, 2021

@erights
Copy link
Member

erights commented Mar 12, 2021

Wow, that is much more different than I expected! Definitely something we should look at closely.

@dckc
Copy link
Member

dckc commented Sep 6, 2022

@dckc
Copy link
Member

dckc commented Sep 6, 2022

@dtribble also had an idea:

  • $ is slot
  • + or - is bigint
  • # is constant (e.g., #NaN)
  • ' is quote so that any of the above can be escaped
'{"body":"{\"current\":{\"Electorate\":{\"type\":\"invitation\",\"value\":{\"brand\":\"$0.Alleged:
  Zoe Invitation brand\",\"value\":[{\"description\":\"questionPoser\",\"handle\":\"$1.Alleged:
  InvitationHandle\",\"installation\":\"$2.Alleged:
  BundleInstallation\",\"instance\":\"$3.Alleged:
  InstanceHandle\"}]}},\"GiveStableFee\":{\"type\":\"ratio\",\"value\":{\"denominator\":{\"brand\":\"$4.Alleged:
  IST brand\",\"value\":\"+10000\",\"numerator\":{\"brand\":\"$4\",\"value\":\"+3\"}}},\"MintLimit\":{\"type\":\"amount\",\"value\":{\"brand\":\"$5.Alleged:
  AUSD brand\",\"value\":\"+20000000000000\"}},\"WantStableFee\":{\"type\":\"ratio\",\"value\":{\"denominator\":{\"brand\":\"$4\",\"value\":\"+10000\"},\"numerator\":{\"brand\":\"$4\",\"value\":\"+1\"}}}}}}",
  "slots":["board03125","board03928","board05311","board00613","board0074","board05815"]}'

@erights
Copy link
Member

erights commented Sep 6, 2022

Provided there is an outer novel qclass record to prevent misinterpretation across the version skew, I endorse @dtribble 's idea. Both in absolute terms and in comparison to the current state of encodePassable. @warner 's criticisms of encodePassable at endojs/endo#1260 (comment) are currently all correct, and we only know how to mitigate some of these.

@erights
Copy link
Member

erights commented May 19, 2023

See endojs/endo#1349 .
Should be closed by whatever closes that.

Well, except the replacing part. It will co-exist with smallcaps, not replace it.

@erights
Copy link
Member

erights commented Jun 23, 2023

Hopefully fixed by endojs/endo#1594

Though still, except the replacing part. It will co-exist with smallcaps, not replace it.

@mhofman
Copy link
Member

mhofman commented Jan 4, 2024

I am wondering more and more if direct CBOR serialization is not a better approach to avoid all the string manipulation logic, even with more efficient passable encodings, and approaches like endojs/endo#1478 which effectively defers some serialization with marshal only taking care of encoding into a serializable object.

CBOR seems to support streaming and can easily nest pre-serialized data by wrapping it (which can be handled using a list of ArrayBuffers in the related APIs). The main problem is that the story for "frozen" ArrayBuffer in JS is still not good at all (and handling of binary data still very primitive).

@mhofman
Copy link
Member

mhofman commented Oct 21, 2024

If we ever get around to using the compact-order encoding from endojs/endo#1594, updating liveslots might be complex. Besides migrating existing data in legacy encoding, the legacy encoding had the property that all output sorted before | but all compact-order sorts after | because it's prefixed with ~. Some liveslots logic does rely on this ordering of data before metadata in collections.

@gibson042
Copy link
Member

Endo issue: endojs/endo#2600

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request marshal package: marshal SwingSet package: SwingSet
Projects
None yet
Development

No branches or pull requests

7 participants