Commit 6f17658: Improve documentation
opwvhk committed Feb 9, 2024
1 parent 4f3dcd3 commit 6f17658
Showing 2 changed files with 137 additions and 39 deletions.
70 changes: 31 additions & 39 deletions README.md
Avro Conversions
================

Provides parsers to read various formats as Avro records (currently JSON and XML), as well as tools
to manipulate and/or describe schemas. The latter includes the conversion of XML Schema definitions
(XSD) and JSON Schema into Avro schemas.

Documentation
-------------

The documentation is split into two parts. The impatient among us can start with the "Quickstart"
section below. For the others, there is [more elaborate documentation](doc/index.md).


Quickstart
----------

Want to jump right in? Here's a simple example that reads an Avro schema, a JSON schema, and a JSON
file, and parses the latter using the first two:

```java
import java.net.URL;

import opwvhk.avro.SchemaManipulator;
import opwvhk.avro.json.JsonAsAvroParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class Example {
    public static void main(String[] args) throws Exception {
        // Expected are 3 arguments:
        // * the location of an Avro schema file
        // * the location of a JSON schema file
        // * the location of a JSON file

        StringBuilder buffer = new StringBuilder();
        // Read an Avro schema; other methods can convert an XSD or JSON schema.
        Schema readSchema = SchemaManipulator.startFromAvro(new URL(args[0]))
                // By default, renaming fields/schemata will also add the old name as an alias.
                .renameWithoutAliases()
                .renameField("newName", "path", "to", "field")
                .renameSchema("NewName", "OldName")
                .alsoDocumentAsMarkdownTable(buffer)
                .finish();

        // Print the schema as a Markdown table.
        System.out.println(buffer);

        // Create a parser that reads JSON, validated according to the JSON schema, into Avro
        // records using the read schema.
        GenericData model = GenericData.get();
        URL jsonSchemaLocation = new URL(args[1]);
        JsonAsAvroParser parser = new JsonAsAvroParser(jsonSchemaLocation, readSchema, model);

        // The record type depends on the model; GenericData.get() yields GenericRecord
        // (you can also use SpecificData or ReflectData for other record types).
        GenericRecord record = parser.parse(new URL(args[2]));
        System.out.println(record);
    }
}
```
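XML works much the same way; a sketch under assumptions (the `String` constructor argument is taken here to name the root element to parse, and the argument order follows the constructor `XmlAsAvroParser(URL, String, Schema, GenericData)`):

```java
import java.net.URL;

import opwvhk.avro.SchemaManipulator;
import opwvhk.avro.xml.XmlAsAvroParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class XmlExample {
    public static void main(String[] args) throws Exception {
        // Convert the XSD into an Avro read schema; unwrapArrays(3) collapses
        // wrapper fields whose names differ only in the last (up to) 3
        // characters, e.g. "items" wrapping repeated "item" elements.
        URL xsdLocation = new URL(args[0]);
        Schema readSchema = SchemaManipulator.startFromXsd(xsdLocation)
                .unwrapArrays(3)
                .finish();

        // args[1] is assumed here to be the root element name to parse.
        XmlAsAvroParser parser = new XmlAsAvroParser(xsdLocation, args[1], readSchema, GenericData.get());
        GenericRecord record = parser.parse(new URL(args[2]));
        System.out.println(record);
    }
}
```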
106 changes: 106 additions & 0 deletions doc/index.md
Avro Conversions - Documentation
================================

Usage
-----

This library parses data into Avro records, and manipulates the schemas needed to do so. As such,
this documentation is split into two parts: parsing and schema manipulation.

Want to get started quickly? See the [Quickstart](../README.md#quickstart) section in the README.

This document describes the various functionality in more detail.


Parsing
-------

The main day-to-day use of this library is to parse records in various formats into Avro. As such,
you won't find a converter for (for example) CSV files: those are container files holding multiple
records.

The following formats can be converted to Avro:

| Format | Parser constructor | Parser class |
|--------------------|-----------------------------------------------------|-------------------------------------|
| JSON (with schema) | `JsonAsAvroParser(URI, Schema, GenericData)` | `opwvhk.avro.json.JsonAsAvroParser` |
| JSON (unvalidated) | `JsonAsAvroParser(Schema, GenericData)` | `opwvhk.avro.json.JsonAsAvroParser` |
| XML (with XSD) | `XmlAsAvroParser(URL, String, Schema, GenericData)` | `opwvhk.avro.xml.XmlAsAvroParser` |
| XML (unvalidated) | `XmlAsAvroParser(Schema, GenericData)` | `opwvhk.avro.xml.XmlAsAvroParser` |
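As an illustration, the two JSON constructors from the table might be used as follows (a sketch; the read schema and input locations are placeholders supplied by the caller):

```java
import java.net.URI;
import java.net.URL;

import opwvhk.avro.json.JsonAsAvroParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class ParserChoice {
    static GenericRecord parseWithValidation(URI jsonSchema, Schema readSchema, URL data) throws Exception {
        // With a JSON schema: input is validated, so invalid data fails early.
        JsonAsAvroParser parser = new JsonAsAvroParser(jsonSchema, readSchema, GenericData.get());
        return parser.parse(data);
    }

    static GenericRecord parseLeniently(Schema readSchema, URL data) throws Exception {
        // Without a JSON schema: same result for valid input, but no early failure on invalid input.
        JsonAsAvroParser parser = new JsonAsAvroParser(readSchema, GenericData.get());
        return parser.parse(data);
    }
}
```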

Parsers all use both a write schema and a read schema, just like Avro does. The write schema is used
to validate the input, and the read schema is used to describe the result.

When parsing/converting data, the conversion can apply implicit conversions that "fit". This
includes widening conversions (like int→long), lossy conversions (like decimal→float or
anything→string), and parsing dates. With a write schema, binary conversions (from
hexadecimal/base64 encoded text) are also supported.
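For example (a sketch with made-up record and field names), a read schema declaring a `long` and a `date` can be filled from JSON carrying an integer and an ISO date string:

```java
import opwvhk.avro.json.JsonAsAvroParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

class ImplicitConversions {
    public static void main(String[] args) throws Exception {
        // Hypothetical read schema: "id" is a long (a JSON int like 42 is widened),
        // and "at" is a date (parsed from an ISO-8601 string like "2024-02-09").
        Schema readSchema = new Schema.Parser().parse("""
                {"type": "record", "name": "Event", "fields": [
                    {"name": "id", "type": "long"},
                    {"name": "at", "type": {"type": "int", "logicalType": "date"}}
                ]}""");
        // Without a write schema the conversions still apply, but nothing is validated.
        JsonAsAvroParser parser = new JsonAsAvroParser(readSchema, GenericData.get());
        // Parsing {"id": 42, "at": "2024-02-09"} would then yield an Event record.
    }
}
```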


### Source schema optional but encouraged

The parsers support as much functionality as possible when the write (source) schema is omitted.
However, this is discouraged. The reason is that significant functionality is missing:

* No check on required fields:
The parsers will happily generate incomplete records, which **will** break when using them.
* No input validation:
Without a schema, a parser cannot validate input. This can cause unpredictable failures later on.

In short: use a write (source) schema whenever possible.


### Supported conversions

When parsing, these Avro types are supported:

| Avro | JSON (schema) | JSON | XML (schema) | XML |
|-------------------------|---------------|------|--------------|-----|
| Record |||||
| Map |||||
| Array |||||
| Enum |||||
| Boolean |||||
| Integer |||||
| Long |||||
| Float |||||
| Double |||||
| String |||||
| Fixed (hex) |||||
| Fixed (base64) |||||
| Bytes (hex) |||||
| Bytes (base64) |||||
| Decimal |||||
| Datetime (millis) |||||
| Datetime (micros) |||||
| Local Datetime (millis) |||||
| Local Datetime (micros) |||||
| Date |||||
| Time (millis) |||||
| Time (micros) |||||



Schema manipulations
--------------------

The class to convert schemas into Avro schemas and other manipulations is
`opwvhk.avro.SchemaManipulator`.

There are multiple starting points:
* `startFromAvro(String)` to parse an Avro schema from a String
* `startFromAvro(URL)` to read an Avro schema from a location
* `startFromJsonSchema(URL)` to read a JSON schema and convert it into an Avro schema
* `startFromXsd(URL)` to read an XML Schema Definition and convert it into an Avro schema

Next, you can rename schemas and fields or unwrap arrays (especially useful for XML). See the
various methods on the class for details.
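A sketch combining several of these manipulations (all paths and names here are invented for illustration):

```java
import java.net.URL;

import opwvhk.avro.SchemaManipulator;
import org.apache.avro.Schema;

class Manipulate {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaManipulator.startFromXsd(new URL(args[0]))
                // Rename a schema and a field (new name first); by default the
                // old names are kept as aliases.
                .renameSchema("Order", "OrderType")
                .renameField("customerId", "order", "customer", "id")
                // Unwrap all single-array-field wrappers whose names differ in
                // at most the last 3 characters (e.g. "items" wrapping "item").
                .unwrapArrays(3)
                .finish();
        System.out.println(schema);
    }
}
```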

Note that by default, any rename also adds the previous name as an alias. This allows you to use
the same source schema (be it a JSON schema or XSD) as input for both the parser and schema
manipulation. The advantage is that this causes fields and schemata to be *renamed while parsing*.

And finally, you can document the schema in a Markdown table. This can be your goal (using
`asMarkdownTable()`) or a by-product (using `alsoDocumentAsMarkdownTable(StringBuilder)` and
`finish()`).
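A sketch of both variants (the schema location is a placeholder):

```java
import java.net.URL;

import opwvhk.avro.SchemaManipulator;
import org.apache.avro.Schema;

class Document {
    public static void main(String[] args) throws Exception {
        // As the goal: produce only the Markdown documentation.
        String markdown = SchemaManipulator.startFromAvro(new URL(args[0]))
                .asMarkdownTable();
        System.out.println(markdown);

        // As a by-product: capture the Markdown while also obtaining the schema.
        StringBuilder buffer = new StringBuilder();
        Schema schema = SchemaManipulator.startFromAvro(new URL(args[0]))
                .alsoDocumentAsMarkdownTable(buffer)
                .finish();
        System.out.println(buffer);
    }
}
```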
