Namespace name is set when undefined #255
Hi, the behavior you mentioned is from this commit: From avro spec:
So a namespace with a leading dot should be OK. Can you specify where the problem is?
The problem is that when the namespace is null, the namespace value should be empty or null. When the namespace is not specified, we are expecting namespace to be
I am actually having the same issue as above, and it is coming back to bite us because we are trying to load the data into a BigQuery table. We don't set the namespace, and one is automatically generated that begins with a dot. We then get the following error:
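To make the leading-dot report concrete, here is a minimal sketch of how such a namespace could come about. It is not the library's actual code: `naiveChildNamespace` and `guardedChildNamespace` are hypothetical helpers, and the assumption (inferred only from the reports in this thread) is that the nested namespace is built by joining the parent namespace and the record name with a dot.

```scala
// Hypothetical model, not spark-avro's implementation.
object NamespaceSketch {
  // Naive join: with an empty parent namespace the result starts with a dot.
  def naiveChildNamespace(parentNamespace: String, recordName: String): String =
    s"$parentNamespace.$recordName"

  // Guarded join: omit the separator when no parent namespace was set.
  def guardedChildNamespace(parentNamespace: String, recordName: String): String =
    if (parentNamespace == null || parentNamespace.isEmpty) recordName
    else s"$parentNamespace.$recordName"
}
```

With no namespace set, the naive join gives `.TestRecord`, while the guarded variant gives `TestRecord`. Whether the guarded value or an empty namespace is the right result is a design decision this sketch does not settle.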
After this change was merged, loading a dataset, then saving and loading it again with the same schema fails, because every nested record is prefixed with an invalid namespace (example below).
import org.apache.avro.Schema
val schema = new Schema.Parser().parse("""{
"type": "record",
"name": "TestRecord",
"namespace": "a.b.c",
"fields": [
{
"name": "key",
"type": [
{
"type": "record",
"name": "key",
"fields": [
{
"name": "email",
"type": [ "string", "null"]
}
]
},
"null"
]
}
]
}""")
// Read the original file with the explicit schema (option values must be strings)
val df = sql.read.format("com.databricks.spark.avro")
  .option("avroSchema", schema.toString)
  .load("/tmp/random.avro")
df.show(false) // so far so good
// Write it back out with the same record name, namespace, and schema
df.write.format("com.databricks.spark.avro")
  .option("recordName", "TestRecord")
  .option("recordNamespace", "a.b.c")
  .option("avroSchema", schema.toString)
  .save("/tmp/random.out")
// Load the file that was just written, again with the same schema
val loaded = sql.read.format("com.databricks.spark.avro")
  .option("recordName", "TestRecord")
  .option("recordNamespace", "a.b.c")
  .option("avroSchema", schema.toString)
  .load("/tmp/random.out")
loaded.show(false) // Failure! AvroTypeException is thrown
Caused by: org.apache.avro.AvroTypeException: Found a.b.c.key.key, expecting union
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1$$anon$1.next(DefaultSource.scala:228)
at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1$$anon$1.next(DefaultSource.scala:205)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
Even worse: each time you save/load/save your dataset, it prepends a field name to the namespace.
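The growth across round trips can be pictured with a small simulation. This is only a sketch under an assumed model: `namespaceAfterWrites` and `fullName` are hypothetical helpers, and the assumption is that each write adds the field name as one more namespace segment. Where exactly the segment is added is a guess. One write under this model does reproduce the `a.b.c.key.key` name from the stack trace above.

```scala
// Hypothetical model of namespace growth over repeated writes.
object RoundTripSketch {
  // Assumption: every write adds the field name as one more namespace segment.
  def namespaceAfterWrites(initial: String, fieldName: String, writes: Int): String =
    (1 to writes).foldLeft(initial)((ns, _) => s"$ns.$fieldName")

  // An Avro full name is the namespace and the record name joined by a dot.
  def fullName(namespace: String, recordName: String): String =
    s"$namespace.$recordName"
}
```

Under this model the namespace never settles: a second write of the same data yields `a.b.c.key.key` as the namespace, a third adds yet another `key`, and so on.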
I made a patched release by undoing the #249 PR: https://jitpack.io/#relateiq/spark-avro
The fix is here: apache/spark#21974
@gengliangwang try adding the test case I suggested above ^
This is related to linkedin/goavro#96
Since 4.0.0, we are seeing namespaces defined on nested records even though we never explicitly set them. Is this defined in the spec?
If we set
Map("recordName" -> "usageData", "recordNamespace" -> "abc")
then the namespace becomes "abc.usageData".