Commit 6f17658: Improve documentation
opwvhk committed Feb 9, 2024
1 parent 4f3dcd3 commit 6f17658
Showing 2 changed files with 137 additions and 39 deletions.
70 changes: 31 additions & 39 deletions README.md
Avro Conversions
================

Provides parsers to read various formats as Avro records (currently JSON and XML), as well as tools
to manipulate and/or describe schemas. The latter includes the conversion of XML Schema definitions
(XSD) and JSON Schema into Avro schemas.

Documentation
-------------

The documentation is split into two parts. The impatient among us can start with the "Quickstart"
section below. For the others, there is [more elaborate documentation](doc/index.md).


Quickstart
----------

Want to jump right in? Here's a simple example that reads an Avro schema, a JSON schema, and a JSON
file, and parses the latter using the first two:

```java
import java.net.URL;

import opwvhk.avro.SchemaManipulator;
import opwvhk.avro.json.JsonAsAvroParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class Example {
    public static void main(String[] args) throws Exception {
        // Expected are 3 arguments:
        // * the location of an Avro schema file
        // * the location of a JSON schema file
        // * the location of a JSON file

        StringBuilder buffer = new StringBuilder();
        // Read an Avro schema; other methods can convert an XSD or JSON schema.
        Schema readSchema = SchemaManipulator.startFromAvro(new URL(args[0]))
                // By default, renaming fields/schemata will also add the old name as an alias.
                .renameWithoutAliases()
                .renameField("newName", "path", "to", "field")
                .renameSchema("NewName", "OldName")
                .alsoDocumentAsMarkdownTable(buffer)
                .finish();

        // Print the schema as a Markdown table.
        System.out.println(buffer);

        // Create a parser that reads JSON, validated according to the JSON schema, into Avro
        // records using the read schema.
        GenericData model = GenericData.get();
        URL jsonSchemaLocation = new URL(args[1]);
        JsonAsAvroParser parser = new JsonAsAvroParser(jsonSchemaLocation, readSchema, model);

        // The record type depends on the model; GenericData.get() yields GenericRecord
        // (you can also use SpecificData or ReflectData for other record types).
        GenericRecord record = parser.parse(new URL(args[2]));
        System.out.println(record);
    }
}
```
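XML works much the same way; a sketch under assumptions (the `String` constructor argument is taken here to name the root element to parse, and the argument order follows the constructor `XmlAsAvroParser(URL, String, Schema, GenericData)`):

```java
import java.net.URL;

import opwvhk.avro.SchemaManipulator;
import opwvhk.avro.xml.XmlAsAvroParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class XmlExample {
    public static void main(String[] args) throws Exception {
        // Convert the XSD into an Avro read schema; unwrapArrays(3) collapses
        // wrapper fields whose names differ only in the last (up to) 3
        // characters, e.g. "items" wrapping repeated "item" elements.
        URL xsdLocation = new URL(args[0]);
        Schema readSchema = SchemaManipulator.startFromXsd(xsdLocation)
                .unwrapArrays(3)
                .finish();

        // args[1] is assumed here to be the root element name to parse.
        XmlAsAvroParser parser = new XmlAsAvroParser(xsdLocation, args[1], readSchema, GenericData.get());
        GenericRecord record = parser.parse(new URL(args[2]));
        System.out.println(record);
    }
}
```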
106 changes: 106 additions & 0 deletions doc/index.md
Avro Conversions - Documentation
================================

Usage
-----

This library parses data into Avro records, and manipulates the schemas needed to do so. As such,
this documentation is split into two parts: parsing and schema manipulation.

Want to get started quickly? See the [Quickstart](../README.md#quickstart) section in the README.

This document describes the various functionality in more detail.


Parsing
-------

The main day-to-day use of this library is to parse records in various formats into Avro. As such,
you won't find a converter for (for example) CSV files: those are container files holding multiple
records.

The following formats can be converted to Avro:

| Format | Parser constructor | Parser class |
|--------------------|-----------------------------------------------------|-------------------------------------|
| JSON (with schema) | `JsonAsAvroParser(URI, Schema, GenericData)` | `opwvhk.avro.json.JsonAsAvroParser` |
| JSON (unvalidated) | `JsonAsAvroParser(Schema, GenericData)` | `opwvhk.avro.json.JsonAsAvroParser` |
| XML (with XSD) | `XmlAsAvroParser(URL, String, Schema, GenericData)` | `opwvhk.avro.xml.XmlAsAvroParser` |
| XML (unvalidated) | `XmlAsAvroParser(Schema, GenericData)` | `opwvhk.avro.xml.XmlAsAvroParser` |
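As an illustration, the two JSON constructors from the table might be used as follows (a sketch; the read schema and input locations are placeholders supplied by the caller):

```java
import java.net.URI;
import java.net.URL;

import opwvhk.avro.json.JsonAsAvroParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class ParserChoice {
    static GenericRecord parseWithValidation(URI jsonSchema, Schema readSchema, URL data) throws Exception {
        // With a JSON schema: input is validated, so invalid data fails early.
        JsonAsAvroParser parser = new JsonAsAvroParser(jsonSchema, readSchema, GenericData.get());
        return parser.parse(data);
    }

    static GenericRecord parseLeniently(Schema readSchema, URL data) throws Exception {
        // Without a JSON schema: same result for valid input, but no early failure on invalid input.
        JsonAsAvroParser parser = new JsonAsAvroParser(readSchema, GenericData.get());
        return parser.parse(data);
    }
}
```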

Parsers all use both a write schema and a read schema, just like Avro does. The write schema is used
to validate the input, and the read schema is used to describe the result.

When parsing/converting data, the conversion can apply implicit conversions that "fit". This
includes widening conversions (like int→long), lossy conversions (like decimal→float or
anything→string), and parsing dates. With a write schema, binary conversions (from
hexadecimal/base64 encoded text) are also supported.
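For example (a sketch with made-up record and field names), a read schema declaring a `long` and a `date` can be filled from JSON carrying an integer and an ISO date string:

```java
import opwvhk.avro.json.JsonAsAvroParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

class ImplicitConversions {
    public static void main(String[] args) throws Exception {
        // Hypothetical read schema: "id" is a long (a JSON int like 42 is widened),
        // and "at" is a date (parsed from an ISO-8601 string like "2024-02-09").
        Schema readSchema = new Schema.Parser().parse("""
                {"type": "record", "name": "Event", "fields": [
                    {"name": "id", "type": "long"},
                    {"name": "at", "type": {"type": "int", "logicalType": "date"}}
                ]}""");
        // Without a write schema the conversions still apply, but nothing is validated.
        JsonAsAvroParser parser = new JsonAsAvroParser(readSchema, GenericData.get());
        // Parsing {"id": 42, "at": "2024-02-09"} would then yield an Event record.
    }
}
```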


### Source schema optional but encouraged

The parsers support as much functionality as possible when the write (source) schema is omitted.
However, this is discouraged. The reason is that significant functionality is missing:

* No check on required fields:
The parsers will happily generate incomplete records, which **will** break when using them.
* No input validation:
Without a schema, a parser cannot validate input. This can cause unpredictable failures later on.

In short: use a write (source) schema whenever possible.


### Supported conversions

When parsing, these Avro types are supported:

| Avro | JSON (schema) | JSON | XML (schema) | XML |
|-------------------------|---------------|------|--------------|-----|
| Record |||||
| Map |||||
| Array |||||
| Enum |||||
| Boolean |||||
| Integer |||||
| Long |||||
| Float |||||
| Double |||||
| String |||||
| Fixed (hex) |||||
| Fixed (base64) |||||
| Bytes (hex) |||||
| Bytes (base64) |||||
| Decimal |||||
| Datetime (millis) |||||
| Datetime (micros) |||||
| Local Datetime (millis) |||||
| Local Datetime (micros) |||||
| Date |||||
| Time (millis) |||||
| Time (micros) |||||



Schema manipulations
--------------------

The class to convert schemas into Avro schemas and other manipulations is
`opwvhk.avro.SchemaManipulator`.

There are multiple starting points:
* `startFromAvro(String)` to parse an Avro schema from a String
* `startFromAvro(URL)` to read an Avro schema from a location
* `startFromJsonSchema(URL)` to read a JSON schema and convert it into an Avro schema
* `startFromXsd(URL)` to read an XML Schema Definition and convert it into an Avro schema

Next, you can rename schemas and fields or unwrap arrays (especially useful for XML). See the
various methods on the class for details.
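A sketch combining several of these manipulations (all paths and names here are invented for illustration):

```java
import java.net.URL;

import opwvhk.avro.SchemaManipulator;
import org.apache.avro.Schema;

class Manipulate {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaManipulator.startFromXsd(new URL(args[0]))
                // Rename a schema and a field (new name first); by default the
                // old names are kept as aliases.
                .renameSchema("Order", "OrderType")
                .renameField("customerId", "order", "customer", "id")
                // Unwrap all single-array-field wrappers whose names differ in
                // at most the last 3 characters (e.g. "items" wrapping "item").
                .unwrapArrays(3)
                .finish();
        System.out.println(schema);
    }
}
```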

Note that by default, any rename also adds the previous name as an alias. This allows you to use
the same source schema (be it a JSON schema or XSD) as input for both the parser and schema
manipulation. The advantage is that this causes fields and schemata to be *renamed while parsing*.

And finally, you can document the schema in a Markdown table. This can be your goal (using
`asMarkdownTable()`) or a by-product (using `alsoDocumentAsMarkdownTable(StringBuilder)` and
`finish()`).
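A sketch of both variants (the schema location is a placeholder):

```java
import java.net.URL;

import opwvhk.avro.SchemaManipulator;
import org.apache.avro.Schema;

class Document {
    public static void main(String[] args) throws Exception {
        // As the goal: produce only the Markdown documentation.
        String markdown = SchemaManipulator.startFromAvro(new URL(args[0]))
                .asMarkdownTable();
        System.out.println(markdown);

        // As a by-product: capture the Markdown while also obtaining the schema.
        StringBuilder buffer = new StringBuilder();
        Schema schema = SchemaManipulator.startFromAvro(new URL(args[0]))
                .alsoDocumentAsMarkdownTable(buffer)
                .finish();
        System.out.println(buffer);
    }
}
```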
