This repository has been archived by the owner on Dec 20, 2018. It is now read-only.
Fails to write record containing map of array of record #272
Due to #92, the generated schema for the Spark Row has all of its fields nullable, which gives the following output schema:

{
"type": "record",
"name": "topLevelRecord",
"fields": [{
"name": "properties",
"type": [{
"type": "map",
"values": [{
"type": "array",
"items": [{
"type": "record",
"name": "properties",
"fields": [{
"name": "string",
"type": ["string", "null"]
}
]
}, "null"]
}, "null"]
}, "null"]
}
]
}

When I try to use this schema with the JSON object to generate a new Avro file, it fails with:

Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch object
  at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:445)
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
  at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
  at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:99)
  at org.apache.avro.tool.Main.run(Main.java:87)
  at org.apache.avro.tool.Main.main(Main.java:76)

Is the generated schema invalid, or is it a bug in Avro?
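As a quick sanity check, the generated schema does parse as legal Avro on its own. The sketch below is my own minimal check (the class name is hypothetical; it assumes Avro 1.8.x on the classpath): it inlines the schema above and inspects the top-level field.

```java
import org.apache.avro.Schema;

public class SchemaCheck {
    public static void main(String[] args) {
        // The generated schema from above, inlined as a string
        String json = "{\"type\":\"record\",\"name\":\"topLevelRecord\",\"fields\":[{"
                + "\"name\":\"properties\",\"type\":[{"
                + "\"type\":\"map\",\"values\":[{"
                + "\"type\":\"array\",\"items\":[{"
                + "\"type\":\"record\",\"name\":\"properties\",\"fields\":[{"
                + "\"name\":\"string\",\"type\":[\"string\",\"null\"]}]},\"null\"]},\"null\"]},\"null\"]}]}";
        // Schema.Parser throws SchemaParseException if the schema were malformed
        Schema schema = new Schema.Parser().parse(json);
        // The field type is a union of [map, null]
        System.out.println(schema.getField("properties").schema().getType());
    }
}
```

Since this parses without error, the schema itself appears to be valid Avro; the exception points at how the JSON input encodes the union branches instead.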
It is linked to the nullable issue (#92): since every field is wrapped in a union with "null", Avro's JSON encoding requires each non-null union value to be wrapped in an object keyed by the branch name. The expected input data for the generated schema is therefore:

{
"properties": {
"map": {
"object": {
"array": [
{"properties": {"string": {"string": "one"}}},
{"properties": {"string": {"string": "two"}}}
]
}
}
}
}

instead of the initial input:

{
"properties": {
"object": [
{ "string": "one" },
{ "string": "two" }
]
}
}
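The same wrapping rule shows up with a much smaller schema. This sketch is my own illustration, not code from the issue (the class name is hypothetical; it assumes Avro 1.8.x): it decodes the branch-wrapped JSON successfully and shows the unwrapped form being rejected.

```java
import org.apache.avro.AvroTypeException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class UnionJsonDemo {
    public static void main(String[] args) throws Exception {
        // A single nullable field, like the ones spark-avro generates
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"r\",\"fields\":"
                + "[{\"name\":\"s\",\"type\":[\"string\",\"null\"]}]}");
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        // Avro's JSON encoding wraps a non-null union value in {"<branch>": value}
        GenericRecord ok = reader.read(null,
                DecoderFactory.get().jsonDecoder(schema, "{\"s\":{\"string\":\"one\"}}"));
        System.out.println(ok);

        // The plain, unwrapped form is rejected with an AvroTypeException
        try {
            reader.read(null,
                    DecoderFactory.get().jsonDecoder(schema, "{\"s\":\"one\"}"));
        } catch (AvroTypeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is why the expected input above gains an extra level of wrapping at every nullable field once the whole schema is made nullable.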
I found a workaround: use SchemaConverters to derive the matching StructType from the Avro schema, then create a new DataFrame with that StructType:

public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(Main.class.getClassLoader().getResourceAsStream("schema.json"));
    // Derive the Spark StructType that matches the Avro schema
    DataType dataType = SchemaConverters.toSqlType(schema).dataType();
    StructType structType = (StructType) dataType;

    URL avroResource = Main.class.getClassLoader().getResource("event.avro");
    if (avroResource == null) {
        throw new RuntimeException("Missing resource event.avro");
    }

    SparkSession sparkSession = SparkSession.builder()
            .appName("com.laymain.sandbox.avro")
            .master("local[*]")
            .getOrCreate();

    Dataset<Row> dataset = sparkSession
            .read()
            .format("com.databricks.spark.avro")
            .load(avroResource.getPath());

    // Rebuild the DataFrame with the converted StructType before writing it back out
    dataset
            .sqlContext()
            .createDataFrame(dataset.rdd(), structType)
            .write()
            .mode(SaveMode.Overwrite)
            .format("com.databricks.spark.avro")
            .save("output");
}
Hi,
Spark-avro fails to write a record that contains a map of arrays of records, with the following error:
schema.json
event.json
Avro file generated using avro-tools:
java -jar avro-tools-1.8.2.jar fromjson --schema-file schema.json event.json > event.avro
Spark code: