
Commit

Allow specified fields to be absent
Ignore some required fields when matching write and read schemata. This
allows parsing records that are augmented with metadata after parsing,
e.g. by adding a "parsed at" timestamp or similar.
opwvhk committed May 17, 2024
1 parent 05c3d04 commit 8bbf40a
Showing 4 changed files with 188 additions and 173 deletions.
34 changes: 23 additions & 11 deletions doc/index.md
Original file line number Diff line number Diff line change
@@ -15,34 +15,46 @@ This document describes the various functionality in more detail.
Parsing
-------

The main day-to-day use of this library is to parse records in various formats into Avro. As such,
you won't find a converter for (for example) CSV files: these are container files with multiple
records.
The main day-to-day use of this library is to parse single records in various formats into Avro. As
a result, you won't find a converter for (for example) CSV files: these are container files with
multiple records.

The following formats can be converted to Avro:

| Format | Parser constructor |
|--------------------|----------------------------------------------------------------------------------------------|
| JSON (with schema) | `opwvhk.avro.json.JsonAsAvroParser#JsonAsAvroParser(URI, boolean, Schema, GenericData)` |
| JSON (unvalidated) | `opwvhk.avro.json.JsonAsAvroParser#JsonAsAvroParser(Schema, GenericData)` |
| XML (with XSD) | `opwvhk.avro.xml.XmlAsAvroParser#XmlAsAvroParser(URL, String, boolean, Schema, GenericData)` |
| XML (unvalidated) | `opwvhk.avro.xml.XmlAsAvroParser#XmlAsAvroParser(Schema, GenericData)` |
| Format | Parser class |
|--------|-------------------------------------|
| JSON | `opwvhk.avro.json.JsonAsAvroParser` |
| XML | `opwvhk.avro.xml.XmlAsAvroParser` |
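
A minimal sketch of constructing both parsers follows. The record schema and the JSON schema URI are illustrative assumptions, not part of the library's documentation; the constructor shapes follow the signatures shown in this commit's diff.

```java
import java.net.URI;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

import opwvhk.avro.json.JsonAsAvroParser;

public class ParserConstruction {
	public static void main(String[] args) {
		// Read schema: describes the records the parser should produce.
		Schema readSchema = new Schema.Parser().parse("""
				{"type": "record", "name": "Person", "fields": [
					{"name": "name", "type": "string"}
				]}""");

		// Validating variant: also pass the location of the JSON (write) schema.
		JsonAsAvroParser validating = new JsonAsAvroParser(
				URI.create("https://example.com/person.schema.json"), readSchema, GenericData.get());

		// Unvalidated variant: only the read schema and the Avro model.
		JsonAsAvroParser lenient = new JsonAsAvroParser(readSchema, GenericData.get());
	}
}
```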

Parsers all use both a write schema and a read schema, just like Avro does. The write schema is used
to validate the input, and the read schema is used to describe the result.
Parsers require a read schema and an Avro model, which determine the Avro record type to parse
data into and how to create these records, respectively. Additionally, they support a
format-dependent "write schema" (i.e., a JSON schema, XSD, …), which is used to check schema
compatibility and can also be used for input validation.

### Schema evolution

When parsing/converting data, the conversion can apply implicit conversions that "fit". This
includes widening conversions (like int→long), lossy conversions (like decimal→float or
anything→string), and parsing dates. With a write schema, binary conversions (from
hexadecimal/base64 encoded text) are also supported.

In addition, the read schema is used for schema evolution:

* removing fields: fields that are not present in the read schema will be ignored
* adding fields: fields that are not present in the input will be filled with the default values
from the read schema
* renaming fields: field aliases are also used to match incoming data, effectively renaming these
fields
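
The three evolution rules above can be sketched with a read schema like the following. The field names and the example input are illustrative assumptions; the evolution behaviour is as described in the list above.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

import opwvhk.avro.json.JsonAsAvroParser;

public class EvolutionExample {
	public static void main(String[] args) {
		// Read schema: renames "addr" to "address" via an alias, and adds
		// "country" with a default value.
		Schema readSchema = new Schema.Parser().parse("""
				{"type": "record", "name": "Person", "fields": [
					{"name": "name", "type": "string"},
					{"name": "address", "aliases": ["addr"], "type": "string"},
					{"name": "country", "type": "string", "default": "unknown"}
				]}""");

		JsonAsAvroParser parser = new JsonAsAvroParser(readSchema, GenericData.get());
		// For input like {"name": "Alice", "addr": "Main St 1", "note": "x"}:
		// - "addr" matches "address" through the alias,
		// - "country" is filled with its default ("unknown"),
		// - "note" has no counterpart in the read schema and is ignored.
	}
}
```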

### Source schema optional but encouraged

The parsers support as much functionality as possible when the write (source) schema is omitted.
However, omitting it is discouraged, because significant functionality is then unavailable:

* No check on required fields:
The parsers will happily generate incomplete records, which **will** break when using them.
* No check on compatibility:
Incompatible data cannot be detected, which **will** break the parsing process.
* No input validation:
Without a schema, a parser cannot validate input. This can cause unpredictable failures later on.
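
The new `fieldsAllowedMissing` constructor parameter introduced in this commit offers a middle ground for the first point: specific required fields can be exempted from the check, so that records can be parsed and then augmented afterwards. A sketch, where the schema URI and field names are illustrative assumptions:

```java
import java.net.URI;
import java.util.Set;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

import opwvhk.avro.json.JsonAsAvroParser;

public class AllowMissingFields {
	public static void main(String[] args) {
		// Read schema with a required "parsedAt" field that the JSON input
		// does not contain; it is meant to be filled in after parsing.
		Schema readSchema = new Schema.Parser().parse("""
				{"type": "record", "name": "Event", "fields": [
					{"name": "payload", "type": "string"},
					{"name": "parsedAt", "type": "string"}
				]}""");

		// Exempt "parsedAt" from the required-field check between write and
		// read schema; all other required fields are still enforced.
		JsonAsAvroParser parser = new JsonAsAvroParser(
				URI.create("https://example.com/event.schema.json"), true, readSchema,
				Set.of(readSchema.getField("parsedAt")), GenericData.get());
		// After parsing, the caller sets "parsedAt" (e.g. to a timestamp)
		// before using the record.
	}
}
```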

5 changes: 3 additions & 2 deletions src/main/java/opwvhk/avro/io/AsAvroParserBase.java
@@ -390,8 +390,9 @@ protected ValueResolver createResolver(WriteSchema writeSchema, Schema readSchem
*
* <ul>
*
* <li>There is no early detection of incompatible types; exceptions due to incompatibilities will happen (at best) while parsing, but can occur later
* .</li>
* <li>
* There is no early detection of incompatible types; exceptions due to incompatibilities will happen (at best) while parsing, but can occur later.
* </li>
*
* <li>
* Specifically, it is impossible to determine if required fields may be missing. This will not cause problems when parsing, but will cause problems
23 changes: 12 additions & 11 deletions src/main/java/opwvhk/avro/json/JsonAsAvroParser.java
@@ -106,25 +106,26 @@ private static boolean isValidEnum(SchemaProperties writeType, Schema readSchema
* @param model the Avro model used to create records
*/
public JsonAsAvroParser(URI jsonSchemaLocation, Schema readSchema, GenericData model) {
this(jsonSchemaLocation, true, readSchema, model);
this(jsonSchemaLocation, true, readSchema, Set.of(), model);
}

/**
* Create a JSON parser using only the specified Avro schema. The parse result will match the schema, but can be invalid if {@code validateInput} is set to
* {@code false}.
*
* @param jsonSchemaLocation the location of the JSON (write) schema (schema of the JSON data to parse)
* @param validateInput if {@code true}, validate the input when parsing
* @param readSchema the read schema (schema of the resulting records)
* @param model the Avro model used to create records
* @param jsonSchemaLocation the location of the JSON (write) schema (schema of the JSON data to parse)
* @param validateInput if {@code true}, validate the input when parsing
* @param readSchema the read schema (schema of the resulting records)
* @param fieldsAllowedMissing fields in the read schema that are allowed to be missing, even when this yields invalid records
* @param model the Avro model used to create records
*/
public JsonAsAvroParser(URI jsonSchemaLocation, boolean validateInput, Schema readSchema, GenericData model) {
this(model, analyseJsonSchema(jsonSchemaLocation), readSchema, validateInput);
public JsonAsAvroParser(URI jsonSchemaLocation, boolean validateInput, Schema readSchema, Set<Schema.Field> fieldsAllowedMissing, GenericData model) {
this(model, analyseJsonSchema(jsonSchemaLocation), readSchema, fieldsAllowedMissing, validateInput);
}

private static SchemaProperties analyseJsonSchema(URI jsonSchemaLocation) {
SchemaAnalyzer schemaAnalyzer = new SchemaAnalyzer();
return schemaAnalyzer.parseJsonProperties(jsonSchemaLocation);
return schemaAnalyzer.parseJsonProperties(jsonSchemaLocation);
}

/**
@@ -135,11 +136,11 @@ private static SchemaProperties analyseJsonSchema(URI jsonSchemaLocation) {
* @param model the Avro model used to create records
*/
public JsonAsAvroParser(Schema readSchema, GenericData model) {
this(model, null, readSchema, false);
this(model, null, readSchema, Set.of(), false);
}

private JsonAsAvroParser(GenericData model, SchemaProperties schemaProperties, Schema readSchema, boolean validateInput) {
super(model, schemaProperties, readSchema, Set.of());
private JsonAsAvroParser(GenericData model, SchemaProperties schemaProperties, Schema readSchema, Set<Schema.Field> fieldsAllowedMissing, boolean validateInput) {
super(model, schemaProperties, readSchema, fieldsAllowedMissing);
resolver = createResolver(schemaProperties, readSchema);
mapper = new ObjectMapper();
if (validateInput) {
