Somewhat Organized ES5 Feedback #25

AljoschaMeyer · 2022-12-23T18:43:31Z

And here a more structured version of the feedback. Happy holidays?

Supersedes and closes #23

Here are the main issues I found, and my first instincts on how to approach them:

Timestamps do not use hashes for deterministic tie-breaking.
- Introduce deterministic tie-breaking and everything becomes fine.
Path syntax is highly ambiguous and fairly complicated.
- Burn it down and create something nice; don't avoid breaking changes but try to make common (non-edge) cases work the same as before.
Attachments have some inconsistencies about their number and homogeneity.
- Fairly straightforward to make consistent, not much conceptual work shall be necessary.
Attachments rely on the more-or-less undefined notion of "file extensions"
- Burn it down and try to avoid the whole issue in the first place.
The whole specification is written to define es5 and reject everything else, but somehow older formats are supposed to magically work as well, as well as future formats?
- Precisely define how to handle old and unknown formats.
Miscellaneous issues, often around inappropriate requirements language (SHOULD, MUST, etc), concepts that might be out of scope for this particular specification, minor inconsistencies, and technicalities.
- Fix them on a case-by-case basis, none of these are terribly far-reaching. Clearly define the intended scope of this specification and rigorously adhere to it.

What follows are the same comments as in #23, but clustered by topic.

List of the separate clusters:

Timestamps
Paths
Attachments
Documents and ES5
Miscellaneous: everything else

I left some technicalities in the topic in which they appear rather than moving them into the Miscellaneous category.

Timestamps

if existingDoc exists && existingDoc.timestamp >= newDoc.timestamp; return "ignored an obsolete doc"

This uses an effectful tiebreaker rather than a pure one such as comparing hashes. Any chance I can convince you to change this?

"Newest" is determined by comparing the timestamp field in the documents. See the next section for details about trusting timestamps.

What happens in case of equal timestamps?

For this reason, peers MUST NOT save two documents at the same path with the same timestamp.

This is a very delicate topic in conjunction with the admirable aspiration "It's safe to use the identity keypair from multiple devices simultaneously.". As written, the specification successfully ensures that a single peer is free from conflicting timestamps, but at the cost of forcing distinct peers to choose nondeterministically between any two documents with equal (path, identity, timestamp). The consequence of this that you cannot rely on distinct peers eventually settling on the same view of the share, even after an arbitrarily long time without new document creations. Which is to say, earthstar is not even eventually consistent. I can go into more details in a different context if you'd like. But this is probably unintended and bad?

Again I would recommend using hashes as tiebreakers (as freedom of hash collisions is indeed assumed to hold) to make those choices deterministic for and across all peers.

Paths

NAME = one ALPHA_LOWER followed by 0 to 14 ALPHA_LOWER_OR_DIGIT

14 inclusively or exclusively?

[the name] MUST be 1 to 15 characters long, inclusive.

I'm sensing conflicting signals here.

A path MUST NOT contain // (because each PATH_SEGMENT must have at least one PATH_CHARACTER)

But / is a PATH_CHARACTER, so // is a PATH_SEGMENT of two PATH_CHARACTERs.

A path MUST contain one or more ! characters, anywhere, IF AND ONLY IF the document is ephemeral

This usage of MUST (effectively a definition) does not conform to the meaning outlined in rfc2119. I also expect ephemerality to be defined elsewhere too, and duplicate definitions are usually a bad idea.

! - used if and only if the document is ephemeral

See above, perhaps rephrase in the spirit of "hey, if you come across a path that contains !, that means the document is ephemeral".

/ - starts paths; separates path segments

Tricky again. The whole path definition might require a careful rewrite because of its lack of escaping.

+@. - used in share and author addresses but allowed elsewhere too

What is this supposed to tell me?

A path ending with a file extension denotes a document has an attachment

"file extension" is not defined (and good luck =/). Later: nvm, might want a forward pointer here though. Slightly later still: nvm the nvm, the section on file extensions doesn't define them either.

If you're handling path segments (the parts between slashes),

Just marking this as another thing that needs to be rewritten when tackling the / situation.

The tilde + identity address pattern can occur anywhere in the path: beginning, middle or end.

Neither "beginning", "middle" or "end" are defined.

Note that documents are mutable but their path can never change (or it would be a different document!) so the ownership of a particular path/document is permanent.

I genuinely don't know which of several interpretations of "path/document" is intended here.

general remark by Aljoscha

Does the specification gain anything by introducing the concept of path segments, as opposed to considering a path as a flat sequence of characters? Humans might like to group paths into meaningful segments based on the occurrence of the / character, but they can indulge those silly human desires without forcing the specification and implementations to deal with that complexity.

allowing only documents with attachments with file extensions means that documents with and without attachments can never collide on the same path.

Is that so? . is a regular PATH_CHARACTER just like Q. I guess this boils down to the definition of a file extension (which would already have been complicated without . being a PATH_CHARACTER).

Attachments

Newly arrived attachments are only saved if a <- this word is important corresponding document is known

This sentence heavily implies that attachments can arrive unrelated to and independently from documents. I personally think this is an elegant and ultimately simple choice. Yet, the attachment ingestion algorithm that follows only checks for ingesting an attachment in combination with a particular document. How about letting the attachment ingestion algorithm start by querying for documents that refer to the would-be-new attachment, and abort if the set of such documents is empty?

Documents may have arbitrary binary bytes associated with them, referred to as attachments.

Attachments have already been defined at this point, reference that definition boldly rather than sorta kinda reintroducing the terminology.

For example, documents with JPEG image attachments should have paths ending in .jpg, and documents with MP3 audio attachments should have paths ending in .mp3.

So I cannot reference multiple attachments of different file extensions without violating a "should" (changed to SHOULD in my pr btw). That makes me sad =(

Documents without attachments MUST NOT use the filename extension to indicate how to interpret their contents.

If you invoke the power of MUST NOT, do not give into the temptation of listing possible reasons for the disallowed behavior; you might encourage people with other reasons. In particular, I'm also not allowed to use filename extensions that have nothing to do with my content.

Furthermore, "using" a filename extension is not defined, be consistent in talking about "paths ending with a file extension".

Instead, documents SHOULD enclose the file extension of the text contents in parentheses, excluding .:

Do you have recommendations for documents without a clear interpretation of the text contents? And in particular for "directories" like they are used in the wiki-like examples throughout the pacification? Are those an anti-pattern because paths are supposed to correspond to text files rather than directories?

displayNamePath = "/about/~" + identityAddress + "/displayName.txt"

(way up high in the specification) Contradicted by the file extension path rules. (And also by the example 2 lines below that definition, now that I'm looking closely at this:).

/about/[email protected]/displayName

Should be "[...]/displayName(txt)"? Or should only Windows users do that? =P

To erase an attachment

Should this also be reflected in the IngestAttachment pseudocode?

Documents and ES5

This example document is shown as JSON though it can exist in many serialization formats:

Remove trailing , of the example to make it valid JSON.

text: string; // an arbitary string of utf-8

Is there no length limit?

format: "es.5"; // the format version that this document adheres to.

Does that mean that so-called documents of older or newer format are not really documents?

timestamp: number; // integer. when the document was created

"integer" is not defined, unless you really mean this (you don't). Later: well, this is another good place for a forward pointer.

Here we use the words "fields" and "properties" to mean the same thing.

And what is that thing?

Extra fields are FORBIDDEN as part of this core document schema.

Can this be reformulated to "Extra fields are FORBIDDEN." without losing information?

All string fields MUST BE limited to PRINTABLE_ASCII characters except for text, which is utf-8, or string fields can be null if specified above.

No string field is specified above as allowing null.

All number fields MUST BE integers, and cannot be NaN or Infinity, but they can be null if specified above

See above.

The order of fields is unspecified except for hashing and signing purposes (see section below).

Does this sentence carry meaning? "order of fields" is not a well-defined concept, and listing things a specification does not define within that specification is probably not a good idea anyways.

Document Validity

This section should probably backpoint to the document ingestion algorithm.

Invalid documents MUST be individually ignored when peers are syncing, and the sync MUST NOT be halted just because an invalid document was encountered. Continue syncing in case there are more valid documents.

"syncing" is neither defined as a concept nor specified in this document, why do we suddenly have MUSTs relating to it?

Documents can be temporarily invalid depending on their timestamp and the current wall clock.

Doesn't this contradict that "Invalid documents MUST be individually ignored when peers are syncing". If that rule was followed, no invalid documents can exist in the "storage" (would be nice to have a well-defined name for the data held by a peer) at any time. Because they cannot originate from a local write either by the definition of how to assign timestamps.

signature and shareSignature are each a base32 string with a leading b. For the es.5 format it must be 104 characters long including the b.

And for formats I do not know about (yet)? What about es.3? Do I have to know about that one? And what's so special about this es.5 anyways that it's being mentioned here?

Inside a MUST is not a good place for details that are meant to be helpful.

share is a valid share address string which matches the local share we are intending to write the document to

Does this differ from "share is a valid share address stringEND_OF_LINE"?

Signature is cryptographically valid

What about shareSignature?

The text field contains arbitrary utf-8 encoded data. If the data is not valid utf-8, the document is still considered valid but the library's behavior is undefined when trying to access the content.

That second sentence is a really bad idea. Luckily, the first sentence ensures that the "If the data is not valid utf-8" condition is never satisfied.

If you really absolutely undoubtedly wish to introduce undefined behavior, you need to reformulate the second sentence. In that case, please give me a chance to talk you out of it or at least discuss some better alternatives than introducing "undefined behavior" ("undefined" is not defined in this context btw, and it is probably a good idea to keep things that way).

This is measured as "bytes when encoded as utf-8", not naive string length.

"naive string length" is not defined and doesn't need to be, considered dropping those last four words.

When a document has an associated attachment, the text field's contents MUST have a length greater than zero and SHOULD be used to describe the contents of the attachment. This description can be used by peers to evaluate the content of an attachment before downloading it.

This whole paragraph assumes that documents have at most one attachment, which is not true.

The text is the sha256 hash of the text data.

The textHash (typo already fixed in the pr) is just an arbitrary string adhering to the formatting rules of a hash. It MUST be equal to the sha256 hash of the text in order for the documents to be valid however.

The format is a short string

"short string" is not defined.

The current format version is es.5

Cool. What does that mean for me? And how do I have to react to older formats? How to newer ones? How to incomparable ones (e.g. es7, eS.5, ohmygoshitssonicetomeetyouimaverylongstring.42, `` <- the empty string, let's pretend this was a string with a length of 2^53 bytes)?

General note: be careful that you don't exclude documents adhering to an old format from the set of valid documents (unless that is exactly your intention). Currently, you pretty much do.

If the specification is changed in a way that breaks forwards or backwards compatability

Do changes that break forward compatibility exist? Does forward compatibility exist?

the format version MUST be incremented. The version number SHOULD be a single integer, not a semver.

You are specifying how to write earthstar specifications here, which is not part of the stated scope of this specification. The intended audience doesn't care about any of this, the particularly if the MUSTy guarantee they might gleam from this is followed by a (completely unnecessary? and) confusing SHOULD.

The path MUST contain at least one ! character, anywhere, IF AND ONLY IF the document is ephemeral (has non-absent deleteAfter).

Now this is the definition of being ephemeral that I was looking forward to. And I'm not disappointed =)

Timestamps MUST NOT be from the future from the perspective of the peer accepting them in a sync; but a limited tolerance is allowed to account for clock skew between devices.

Using "MUST NOT" and then immediately giving wiggling room goes against rfc2119 (See what you've done, you've made me memorize the number of an rfc =/).

The recommended value for the future tolerance threshold is 10 minutes, but this can be adjusted depending on the clock accuracy of devices in a deployment scenario.

This contradicts "In order to mitigate scenarios 2, 3, and 4, peers MUST consider documents with timestamps further than ten minutes in the future invalid."

In addition libraries MUST not return the attachments for expired documents.

"attachments for a document" is not defined.

This is the responsibility of the Formatter.

Which is not required to exist, so should it really be mentioned here?

Regular, non-ephemeral documents omit the deleteAfter field.

Implied by definition of ephemeral documents. This would need a MUST, but probably easiest to just cut it?

The ed25519 signature by the author, encoded in base32 with a leading b.

signature of what? And again, you cannot assume that this is the case, you have to state that the document is invalid if it is not the case.

The ed25519 signature by the share, encoded in base32 with a leading b.

See above. also, who computes this? probably not the share, right?

An attachmentHash field with the sha256 hash of the attachment encoded as base32.

No leading b?

// All string fields must be printable ASCII only

But text can be arbitrary utf8?

// Fields must have one of the following types:

Why are you telling me this? Which Fields exists is predetermined by the format.

Miscellaneous

Apps can also query for the full set of document versions at a path; the older ones are called history documents. [...] Earthstar libraries MUST actually delete the older, overwritten document.

Isn't this a contradiction?

Apps MUST assume that any combination of docs may be missing.

Use SHOULD instead?

Apps MUST assume that any combination of attachments may be absent.

SHOULD? the specification is not about defining apps - who would be enforcing MUST requirements of apps, and why?

Identity Display Names and Profile Info

Does this section really belong to the spec?

They are not part of this lower-level specification.

I'm feeling validated in my observations.

This is a lot of specification real estate for nothing but an app-level SHOULD. But I will admit that this is more about style then a "real flaw", whatever that even means.

A peer MUST NOT hold attachments with no corresponding documents.

Any particular reason you prescribe this? If I fork the reference implementation and make it add always add the byte string "I like trains" to the attachment database on start up, would you consider it to not be an implementation of earthstar anymore? Later: I see, this is about enforcing deletion of removed content. Perhaps foreshadow this here already?

// Check if it's obsolete
// (Do we already have an attachment with the same hash, for the same format?)
let existingAttachment = query({hash: doc.attachmentHash, format: doc.format });
if existingAttachment exists
return "already have this attachment"

Isn't this unnecessary, considering that ingestion is idempotent?

[share address] Private keys (called "secrets") are also 32 bytes of binary data (just the secret integer), encoded as base32 in the same way as the public key.

Probably shouldn't put secrets in the list that introduces the public parts of the public address. Especially as they aren't referenced in this (sub-) section.

[Identity address] Private keys (called "secrets") are also 32 bytes of binary data (just the secret integer), encoded as base32 in the same way as the public key.

Again, probably best to simply omit this.

Note that anyone can instantiate a replica for a share

Neither "replica" nor their "instantiation" have been defined.

/** Whether to fetch all historical versions of a document or just the latest versions. */

Aren't old versions required to be deleted?

// "path DESC" is the reverse of that

"reverse of that" is not well-defined in this context

/** Only fetch documents which come after a certain point. */

"coming after a certain point" is not defined

localIndex

What is that? is not elaborated upon, it never occurs outside the pseudocode.

Only documents after this localIndex.

"after a localIndex" is not defined

Only documents after this path.

"after a path" is not defined

// then apply filters, if any

Either specify their semantics or don't put them in the specification.

Paths can encode information about which identities are allowed to write to them. Documents that break these rules are invalid and will be ignored.

This should be mentioned in the document ingestion algorithm, at the very least as a comment to the isValid(newDoc) check.

Preconditions that make this work:

Again, the reader doesn't need to care.

This query format will become standardized because it will be used for querying from one peer to another.

Bold predictions =P

The text was updated successfully, but these errors were encountered:

AljoschaMeyer mentioned this issue Dec 23, 2022

es5 spec feedback #23

Closed

sgwilym self-assigned this Jan 3, 2023

sgwilym added completeness documentation Improvements or additions to documentation labels Jan 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Somewhat Organized ES5 Feedback #25

Somewhat Organized ES5 Feedback #25

AljoschaMeyer commented Dec 23, 2022

Somewhat Organized ES5 Feedback #25

Somewhat Organized ES5 Feedback #25

Comments

AljoschaMeyer commented Dec 23, 2022

Timestamps

Paths

Attachments

Documents and ES5

Miscellaneous