
Commit

Allow specified fields to be absent
Ignore some required fields when matching write and read schemata. This
allows parsing records that are augmented with metadata after parsing,
e.g. by adding a "parsed at" timestamp or similar.
opwvhk committed May 17, 2024
1 parent 05c3d04 commit 8bbf40a
Showing 4 changed files with 188 additions and 173 deletions.
34 changes: 23 additions & 11 deletions doc/index.md
Original file line number Diff line number Diff line change
@@ -15,34 +15,46 @@ This document describes the various functionality in more detail.
Parsing
-------

The main day-to-day use of this library is to parse records in various formats into Avro. As such,
you won't find a converter for (for example) CSV files: these are container files with multiple
records.
The main day-to-day use of this library is to parse single records in various formats into Avro. As
a result, you won't find a converter for (for example) CSV files: these are container files with
multiple records.

The following formats can be converted to Avro:

| Format | Parser constructor |
|--------------------|----------------------------------------------------------------------------------------------|
| JSON (with schema) | `opwvhk.avro.json.JsonAsAvroParser#JsonAsAvroParser(URI, boolean, Schema, GenericData)` |
| JSON (unvalidated) | `opwvhk.avro.json.JsonAsAvroParser#JsonAsAvroParser(Schema, GenericData)` |
| XML (with XSD) | `opwvhk.avro.xml.XmlAsAvroParser#XmlAsAvroParser(URL, String, boolean, Schema, GenericData)` |
| XML (unvalidated) | `opwvhk.avro.xml.XmlAsAvroParser#XmlAsAvroParser(Schema, GenericData)` |
| Format | Parser class |
|--------|-------------------------------------|
| JSON | `opwvhk.avro.json.JsonAsAvroParser` |
| XML | `opwvhk.avro.xml.XmlAsAvroParser` |
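
A minimal sketch of constructing both parsers follows. The record schema and the JSON schema URI are illustrative assumptions, not part of the library's documentation; the constructor shapes follow the signatures shown in this commit's diff.

```java
import java.net.URI;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

import opwvhk.avro.json.JsonAsAvroParser;

public class ParserConstruction {
	public static void main(String[] args) {
		// Read schema: describes the records the parser should produce.
		Schema readSchema = new Schema.Parser().parse("""
				{"type": "record", "name": "Person", "fields": [
					{"name": "name", "type": "string"}
				]}""");

		// Validating variant: also pass the location of the JSON (write) schema.
		JsonAsAvroParser validating = new JsonAsAvroParser(
				URI.create("https://example.com/person.schema.json"), readSchema, GenericData.get());

		// Unvalidated variant: only the read schema and the Avro model.
		JsonAsAvroParser lenient = new JsonAsAvroParser(readSchema, GenericData.get());
	}
}
```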

Parsers all use both a write schema and a read schema, just like Avro does. The write schema is used
to validate the input, and the read schema is used to describe the result.
Parsers require a read schema and an Avro model, which determine the Avro record type to parse
data into and how to create these records, respectively. Additionally, they support a
format-dependent "write schema" (i.e., a JSON schema, XSD, …), which is used to check schema
compatibility and can also be used for input validation.

### Schema evolution

When parsing/converting data, the conversion can apply implicit conversions that "fit". This
includes widening conversions (like int→long), lossy conversions (like decimal→float or
anything→string), and parsing dates. With a write schema, binary conversions (from
hexadecimal/base64 encoded text) are also supported.

In addition, the read schema is used for schema evolution:

* removing fields: fields that are not present in the read schema will be ignored
* adding fields: fields that are not present in the input will be filled with the default values
from the read schema
* renaming fields: field aliases are also used to match incoming data, effectively renaming these
fields
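
The three evolution rules above can be sketched with a read schema like the following. The field names and the example input are illustrative assumptions; the evolution behaviour is as described in the list above.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

import opwvhk.avro.json.JsonAsAvroParser;

public class EvolutionExample {
	public static void main(String[] args) {
		// Read schema: renames "addr" to "address" via an alias, and adds
		// "country" with a default value.
		Schema readSchema = new Schema.Parser().parse("""
				{"type": "record", "name": "Person", "fields": [
					{"name": "name", "type": "string"},
					{"name": "address", "aliases": ["addr"], "type": "string"},
					{"name": "country", "type": "string", "default": "unknown"}
				]}""");

		JsonAsAvroParser parser = new JsonAsAvroParser(readSchema, GenericData.get());
		// For input like {"name": "Alice", "addr": "Main St 1", "note": "x"}:
		// - "addr" matches "address" through the alias,
		// - "country" is filled with its default ("unknown"),
		// - "note" has no counterpart in the read schema and is ignored.
	}
}
```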

### Source schema optional but encouraged

The parsers support as much functionality as possible when the write (source) schema is omitted.
However, omitting it is discouraged, because significant functionality is then unavailable:

* No check on required fields:
The parsers will happily generate incomplete records, which **will** break when using them.
* No check on compatibility:
Incompatible data cannot be detected, which **will** break the parsing process.
* No input validation:
Without a schema, a parser cannot validate input. This can cause unpredictable failures later on.
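
The new `fieldsAllowedMissing` constructor parameter introduced in this commit offers a middle ground for the first point: specific required fields can be exempted from the check, so that records can be parsed and then augmented afterwards. A sketch, where the schema URI and field names are illustrative assumptions:

```java
import java.net.URI;
import java.util.Set;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

import opwvhk.avro.json.JsonAsAvroParser;

public class AllowMissingFields {
	public static void main(String[] args) {
		// Read schema with a required "parsedAt" field that the JSON input
		// does not contain; it is meant to be filled in after parsing.
		Schema readSchema = new Schema.Parser().parse("""
				{"type": "record", "name": "Event", "fields": [
					{"name": "payload", "type": "string"},
					{"name": "parsedAt", "type": "string"}
				]}""");

		// Exempt "parsedAt" from the required-field check between write and
		// read schema; all other required fields are still enforced.
		JsonAsAvroParser parser = new JsonAsAvroParser(
				URI.create("https://example.com/event.schema.json"), true, readSchema,
				Set.of(readSchema.getField("parsedAt")), GenericData.get());
		// After parsing, the caller sets "parsedAt" (e.g. to a timestamp)
		// before using the record.
	}
}
```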

5 changes: 3 additions & 2 deletions src/main/java/opwvhk/avro/io/AsAvroParserBase.java
@@ -390,8 +390,9 @@ protected ValueResolver createResolver(WriteSchema writeSchema, Schema readSchem
*
* <ul>
*
* <li>There is no early detection of incompatible types; exceptions due to incompatibilities will happen (at best) while parsing, but can occur later
* .</li>
* <li>
* There is no early detection of incompatible types; exceptions due to incompatibilities will happen (at best) while parsing, but can occur later.
* </li>
*
* <li>
* Specifically, it is impossible to determine if required fields may be missing. This will not cause problems when parsing, but will cause problems
23 changes: 12 additions & 11 deletions src/main/java/opwvhk/avro/json/JsonAsAvroParser.java
@@ -106,25 +106,26 @@ private static boolean isValidEnum(SchemaProperties writeType, Schema readSchema
* @param model the Avro model used to create records
*/
public JsonAsAvroParser(URI jsonSchemaLocation, Schema readSchema, GenericData model) {
this(jsonSchemaLocation, true, readSchema, model);
this(jsonSchemaLocation, true, readSchema, Set.of(), model);
}

/**
* Create a JSON parser using only the specified Avro schema. The parse result will match the schema, but can be invalid if {@code validateInput} is set to
* {@code false}.
*
* @param jsonSchemaLocation the location of the JSON (write) schema (schema of the JSON data to parse)
* @param validateInput if {@code true}, validate the input when parsing
* @param readSchema the read schema (schema of the resulting records)
* @param model the Avro model used to create records
* @param jsonSchemaLocation the location of the JSON (write) schema (schema of the JSON data to parse)
* @param validateInput if {@code true}, validate the input when parsing
* @param readSchema the read schema (schema of the resulting records)
* @param fieldsAllowedMissing fields in the read schema that are allowed to be missing, even when this yields invalid records
* @param model the Avro model used to create records
*/
public JsonAsAvroParser(URI jsonSchemaLocation, boolean validateInput, Schema readSchema, GenericData model) {
this(model, analyseJsonSchema(jsonSchemaLocation), readSchema, validateInput);
public JsonAsAvroParser(URI jsonSchemaLocation, boolean validateInput, Schema readSchema, Set<Schema.Field> fieldsAllowedMissing, GenericData model) {
this(model, analyseJsonSchema(jsonSchemaLocation), readSchema, fieldsAllowedMissing, validateInput);
}

private static SchemaProperties analyseJsonSchema(URI jsonSchemaLocation) {
SchemaAnalyzer schemaAnalyzer = new SchemaAnalyzer();
return schemaAnalyzer.parseJsonProperties(jsonSchemaLocation);
return schemaAnalyzer.parseJsonProperties(jsonSchemaLocation);
}

/**
@@ -135,11 +136,11 @@ private static SchemaProperties analyseJsonSchema(URI jsonSchemaLocation) {
* @param model the Avro model used to create records
*/
public JsonAsAvroParser(Schema readSchema, GenericData model) {
this(model, null, readSchema, false);
this(model, null, readSchema, Set.of(), false);
}

private JsonAsAvroParser(GenericData model, SchemaProperties schemaProperties, Schema readSchema, boolean validateInput) {
super(model, schemaProperties, readSchema, Set.of());
private JsonAsAvroParser(GenericData model, SchemaProperties schemaProperties, Schema readSchema, Set<Schema.Field> fieldsAllowedMissing, boolean validateInput) {
super(model, schemaProperties, readSchema, fieldsAllowedMissing);
resolver = createResolver(schemaProperties, readSchema);
mapper = new ObjectMapper();
if (validateInput) {
