From 2bce64003019df7ad7ce77153838a82b365b2474 Mon Sep 17 00:00:00 2001 From: Oscar Westra van Holthe - Kind Date: Thu, 17 Aug 2023 10:15:22 +0200 Subject: [PATCH 1/5] AVRO-3833: Clarify uniqueness of names and aliases --- doc/content/en/docs/++version++/Specification/_index.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/content/en/docs/++version++/Specification/_index.md b/doc/content/en/docs/++version++/Specification/_index.md index 30494e073f2..b6a61c743f1 100755 --- a/doc/content/en/docs/++version++/Specification/_index.md +++ b/doc/content/en/docs/++version++/Specification/_index.md @@ -180,9 +180,9 @@ For example, 16-byte quantity may be declared with: ``` ### Names {#names} -Record, enums and fixed are named types. Each has a fullname that is composed of two parts; a name and a namespace, separated by a dot. Equality of names is defined on the fullname. +Record, enums and fixed are named types. Each has a fullname that is composed of two parts; a name and a namespace, separated by a dot. Equality of names is defined on the fullname: it is an error to specify two different types with the same name. -Record fields and enum symbols have names as well (but no namespace). Equality of fields and enum symbols is defined on the name of the field/symbol within its scope (the record/enum that defines it). Fields and enum symbols across scopes are never equal. +Record fields and enum symbols have names as well (but no namespace). Equality of field names and enum symbols is defined within their scope (the record/enum that defines them): it is an error to define multiple fields or enum symbols with the same name in a single type. Fields and enum symbols across scopes are never equal, so field names and enum symbols can be reused in a different type. 
The name portion of the fullname of named types, record field names, and enum symbols must: @@ -266,6 +266,8 @@ Aliases function by re-writing the writer's schema using aliases from the reader A type alias may be specified either as a fully namespace-qualified, or relative to the namespace of the name it is an alias for. For example, if a type named "a.b" has aliases of "c" and "x.y", then the fully qualified names of its aliases are "a.c" and "x.y". +Aliases are alternative names, and thus subject to the same uniqueness constraints as names. Aliases should be valid names, but this is not required: any string is accepted as an alias. This allows schema evolution to correct illegal names in old schemata. + ## Data Serialization and Deserialization Binary encoded Avro data does not include type information or field names. The benefit is that the serialized data is small, but as a result a schema must always be used in order to read Avro data correctly. The best way to ensure that the schema is structurally identical to the one used to write the data is to use the exact same schema. From 36fc8c86491a7a07e62f8148b19f1a223c1de096 Mon Sep 17 00:00:00 2001 From: Oscar Westra van Holthe - Kind Date: Thu, 17 Aug 2023 10:49:37 +0200 Subject: [PATCH 2/5] AVRO-3833: Add section explaining schema fixes --- .../en/docs/++version++/Specification/_index.md | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/doc/content/en/docs/++version++/Specification/_index.md b/doc/content/en/docs/++version++/Specification/_index.md index b6a61c743f1..ea187d0bc2a 100755 --- a/doc/content/en/docs/++version++/Specification/_index.md +++ b/doc/content/en/docs/++version++/Specification/_index.md @@ -259,7 +259,7 @@ Complex types (`record`, `enum`, `array`, `map`, `fixed`) have no namespace, but A schema or protocol may not contain multiple definitions of a fullname. 
Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.) -### Aliases +### Aliases {#aliases} Named types and fields may have aliases. An implementation may optionally use aliases to map a writer's schema to the reader's. This facilitates both schema evolution as well as processing disparate datasets. Aliases function by re-writing the writer's schema using aliases from the reader's schema. For example, if the writer's schema was named "Foo" and the reader's schema is named "Bar" and has an alias of "Foo", then the implementation would act as though "Foo" were named "Bar" when reading. Similarly, if data was written as a record with a field named "x" and is read as a record with a field named "y" with alias "x", then the implementation would act as though "x" were named "y" when reading. @@ -268,6 +268,20 @@ A type alias may be specified either as a fully namespace-qualified, or relative Aliases are alternative names, and thus subject to the same uniqueness constraints as names. Aliases should be valid names, but this is not required: any string is accepted as an alias. This allows schema evolution to correct illegal names in old schemata. +## Fixing an invalid, but previously accepted, schema +Over time, rules and validations on schemas have changed. It is therefore possible that a schema used to work with an older version of Avro, but now fails to parse. + +This can have several reasons, as listed below. Each reason also describes a fix, which can be applied using [schema resolution]({{< ref "#schema-resolution" >}}): you fix the problems in the schema in a way that is compatible, and then you can use the new schema to read the old data. + +### Invalid names +Invalid names of types and fields can be corrected by renaming (using an [alias]({{< ref "#aliases" >}})). 
This works for simple names, namespaces and fullnames. + +Ths fix is twofold: first, you add the invalid name as an alias to the type/field. Then, you change the name to any valid name. + +### Invalid defaults +Default values are only used to fill in missing data when reading. Invalid defaults create invalid values in these cases. The fix is to correct the default values. + + ## Data Serialization and Deserialization Binary encoded Avro data does not include type information or field names. The benefit is that the serialized data is small, but as a result a schema must always be used in order to read Avro data correctly. The best way to ensure that the schema is structurally identical to the one used to write the data is to use the exact same schema. From eb6ef91a886b5280b896598e31c58a3169e2a099 Mon Sep 17 00:00:00 2001 From: Oscar Westra van Holthe - Kind Date: Thu, 17 Aug 2023 13:37:06 +0200 Subject: [PATCH 3/5] AVRO-3833: Fix typo --- doc/content/en/docs/++version++/Specification/_index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/content/en/docs/++version++/Specification/_index.md b/doc/content/en/docs/++version++/Specification/_index.md index ea187d0bc2a..c589b8d10f3 100755 --- a/doc/content/en/docs/++version++/Specification/_index.md +++ b/doc/content/en/docs/++version++/Specification/_index.md @@ -276,7 +276,7 @@ This can have several reasons, as listed below. Each reason also describes a fix ### Invalid names Invalid names of types and fields can be corrected by renaming (using an [alias]({{< ref "#aliases" >}})). This works for simple names, namespaces and fullnames. -Ths fix is twofold: first, you add the invalid name as an alias to the type/field. Then, you change the name to any valid name. +This fix is twofold: first, you add the invalid name as an alias to the type/field. Then, you change the name to any valid name. ### Invalid defaults Default values are only used to fill in missing data when reading. 
Invalid defaults create invalid values in these cases. The fix is to correct the default values. From 79944cf4c19e283eae8ed7af88f7d5e33497a4f7 Mon Sep 17 00:00:00 2001 From: Oscar Westra van Holthe - Kind Date: Mon, 4 Sep 2023 16:23:27 +0200 Subject: [PATCH 4/5] AVRO-3833: Apply review improvements --- doc/content/en/docs/++version++/Specification/_index.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/content/en/docs/++version++/Specification/_index.md b/doc/content/en/docs/++version++/Specification/_index.md index c589b8d10f3..b5c1c3e6678 100755 --- a/doc/content/en/docs/++version++/Specification/_index.md +++ b/doc/content/en/docs/++version++/Specification/_index.md @@ -179,10 +179,10 @@ For example, 16-byte quantity may be declared with: {"type": "fixed", "size": 16, "name": "md5"} ``` -### Names {#names} -Record, enums and fixed are named types. Each has a fullname that is composed of two parts; a name and a namespace, separated by a dot. Equality of names is defined on the fullname: it is an error to specify two different types with the same name. +### Names +Record, enums and fixed are named types. Each has a fullname that is composed of two parts: a name and a namespace, separated by a dot. Equality of names is defined on the fullname – it is an error to specify two different types with the same name. -Record fields and enum symbols have names as well (but no namespace). Equality of field names and enum symbols is defined within their scope (the record/enum that defines them): it is an error to define multiple fields or enum symbols with the same name in a single type. Fields and enum symbols across scopes are never equal, so field names and enum symbols can be reused in a different type. +Record fields and enum symbols have names as well (but no namespace). Equality of field names and enum symbols is defined within their scope (the record/enum that defines them). 
It is an error to define multiple fields or enum symbols with the same name in a single type. Fields and enum symbols across scopes are never equal, so field names and enum symbols can be reused in a different type. The name portion of the fullname of named types, record field names, and enum symbols must: @@ -259,7 +259,7 @@ Complex types (`record`, `enum`, `array`, `map`, `fixed`) have no namespace, but A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.) -### Aliases {#aliases} +### Aliases Named types and fields may have aliases. An implementation may optionally use aliases to map a writer's schema to the reader's. This facilitates both schema evolution as well as processing disparate datasets. Aliases function by re-writing the writer's schema using aliases from the reader's schema. For example, if the writer's schema was named "Foo" and the reader's schema is named "Bar" and has an alias of "Foo", then the implementation would act as though "Foo" were named "Bar" when reading. Similarly, if data was written as a record with a field named "x" and is read as a record with a field named "y" with alias "x", then the implementation would act as though "x" were named "y" when reading. 
From dedcffb8eeffa047778c97567eb479f5c719acd4 Mon Sep 17 00:00:00 2001 From: Oscar Westra van Holthe - Kind Date: Wed, 27 Sep 2023 10:40:26 +0200 Subject: [PATCH 5/5] AVRO-3833: Minor clarification --- doc/content/en/docs/++version++/Specification/_index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/content/en/docs/++version++/Specification/_index.md b/doc/content/en/docs/++version++/Specification/_index.md index b5c1c3e6678..c137594a8da 100755 --- a/doc/content/en/docs/++version++/Specification/_index.md +++ b/doc/content/en/docs/++version++/Specification/_index.md @@ -266,7 +266,7 @@ Aliases function by re-writing the writer's schema using aliases from the reader A type alias may be specified either as a fully namespace-qualified, or relative to the namespace of the name it is an alias for. For example, if a type named "a.b" has aliases of "c" and "x.y", then the fully qualified names of its aliases are "a.c" and "x.y". -Aliases are alternative names, and thus subject to the same uniqueness constraints as names. Aliases should be valid names, but this is not required: any string is accepted as an alias. This allows schema evolution to correct illegal names in old schemata. +Aliases are alternative names, and thus subject to the same uniqueness constraints as names. Aliases should be valid names, but this is not required: any string is accepted as an alias. When aliases are used "to map a writer's schema to the reader's" (see above), this allows schema evolution to correct illegal names in old schemata. ## Fixing an invalid, but previously accepted, schema Over time, rules and validations on schemas have changed. It is therefore possible that a schema used to work with an older version of Avro, but now fails to parse.
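
Reviewer note: the alias rules touched by this series can be illustrated with a minimal sketch. It is not part of the patch and does not use any Avro library; `resolve_alias` and `fix_invalid_name` are hypothetical helper names, schemas are plain Python dicts, and the name pattern is taken from the "Names" section of the Avro specification, which the hunks above do not quote, so treat it as an assumption. The first helper mirrors the "a.b" with aliases "c" and "x.y" example; the second mirrors the twofold fix described under "Invalid names".

```python
import re

# Assumed name rule from the Avro specification's "Names" section:
# start with [A-Za-z_], then only [A-Za-z0-9_].
VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def resolve_alias(fullname: str, alias: str) -> str:
    """Qualify a type alias relative to the namespace of the name it aliases."""
    if "." in alias:
        return alias  # already namespace-qualified
    namespace, _, _ = fullname.rpartition(".")
    return f"{namespace}.{alias}" if namespace else alias


def fix_invalid_name(schema: dict, new_name: str) -> dict:
    """Return a shallow copy of a named type or field with its name repaired.

    Step 1: keep the old, invalid name as an alias.
    Step 2: replace the name with a valid one.
    """
    # A replacement may be a simple name or a dotted fullname,
    # so every dot-separated part is checked.
    if not all(VALID_NAME.match(part) for part in new_name.split(".")):
        raise ValueError(f"replacement name is not valid: {new_name!r}")
    old_name = schema["name"]
    aliases = list(schema.get("aliases", []))
    if old_name not in aliases:
        aliases.append(old_name)
    fixed = dict(schema)  # the input schema is left unmodified
    fixed["aliases"] = aliases
    fixed["name"] = new_name
    return fixed
```

Under these assumptions, `resolve_alias("a.b", "c")` gives `"a.c"` and `resolve_alias("a.b", "x.y")` gives `"x.y"`, while `fix_invalid_name({"type": "record", "name": "my-record", "fields": []}, "MyRecord")` yields a record named `MyRecord` that carries `my-record` as an alias. A reader whose implementation supports aliases could then use the repaired schema to read data written with the old one.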