DB-6045 Support compression for avro external table #2065

changli6 · 2018-07-17T17:28:48Z

No description provided.

jyuanca · 2018-07-18T15:29:05Z

hbase_sql/src/main/java/com/splicemachine/derby/stream/spark/SparkDataSetProcessor.java

+                }
+                else if (storedAs.toLowerCase().equals("a")) {
+                    empty.write().partitionBy(partitionByCols.toArray(new String[partitionByCols.size()]))
+                            .mode(SaveMode.Append).format("com.databricks.spark.avro").save(location);


Why does option("compression", compression) not work for Avro?

I tried that, but it did not work.
The example in the Spark-Avro documentation (https://docs.databricks.com/spark/latest/data-sources/read-avro.html) sets the compression codec through a Spark conf setting instead:

You can also specify Avro compression options:
import com.databricks.spark.avro._
// configuration to use deflate compression
spark.conf.set("spark.sql.avro.compression.codec", "deflate")

Others have raised the same confusion about the compression codec setting, but received no response
(databricks/spark-avro#259).

jyuanca · 2018-07-18T15:33:43Z

hbase_sql/src/main/java/com/splicemachine/derby/stream/spark/SparkDataSetProcessor.java

+                if (compression.toLowerCase().equals("zlib"))
+                    compression = "deflate";
+                spliceSpark.conf().set("spark.sql.avro.compression.codec",compression);
+            }


The Spark session and its configuration are shared by all Spark jobs. If two "create external table" statements are being executed and one wants to compress its Avro files while the other does not, or they want different compression formats, setting a global configuration may not work.

Thanks @jyuanca. A global session setting is improper here; let me address it.
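The concern can be illustrated with a toy model. This is a minimal sketch in plain Java, not Spark code: `sharedConf` is a hypothetical stand-in for the session configuration, and the class and method names are invented for illustration. Only the config key and the zlib-to-deflate mapping come from the diff above. Two jobs each publish their codec in the shared map, and the first job's write then picks up the second job's setting.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model of two jobs sharing one session-wide configuration (not Spark code). */
final class SharedCodecConfDemo {
    static final String CODEC_KEY = "spark.sql.avro.compression.codec";

    /** Hypothetical stand-in for the session configuration every job reads and writes. */
    static final Map<String, String> sharedConf = new HashMap<>();

    /** Mirrors the diff: the table-level name "zlib" is passed on as "deflate". */
    static String toAvroCodec(String compression) {
        String c = compression.toLowerCase(Locale.ROOT);
        return c.equals("zlib") ? "deflate" : c;
    }

    /** Step 1 of a job: publish the desired codec in the shared configuration. */
    static void configure(String compression) {
        sharedConf.put(CODEC_KEY, toAvroCodec(compression));
    }

    /** Step 2 of a job: the write uses whatever codec the shared conf holds right now. */
    static String codecUsedByWrite() {
        return sharedConf.get(CODEC_KEY);
    }

    /** Interleaves two jobs: A configures, B configures, then A writes. */
    static String interleavedRun() {
        configure("zlib");          // job A asks for deflate
        configure("snappy");        // job B overwrites the shared setting
        return codecUsedByWrite();  // job A's write now sees job B's codec
    }
}
```

In this model `interleavedRun()` returns job B's codec even though job A requested deflate, which is the kind of cross-job interference described in the comment above.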

jyuanca · 2018-07-18T20:20:22Z

...e_sql/src/test/java/com/splicemachine/derby/impl/sql/execute/operations/ExternalTableIT.java

+        int insertCount = methodWatcher.executeUpdate(String.format("insert into compressed_zlib_avro_test values ('XXXX')," +
+                "('YYYY')"));
+        Assert.assertEquals("insertCount is wrong",2,insertCount);
+        ResultSet rs = methodWatcher.executeQuery("select * from compressed_zlib_avro_test");
        Assert.assertEquals("COL1 |\n" +


The test cannot tell whether the table is actually compressed.
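One way a test could check this, sketched under assumptions: an Avro object container file starts with the magic bytes `Obj` plus 1, followed by a metadata map whose `avro.codec` entry names the block codec (treated as "null" when absent). The helper below parses just that header with the Java standard library. The class name `AvroHeaderCodec` is hypothetical and not part of this PR, and the test would still need to locate a data file under the table's location.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads the codec name from an Avro container file header. */
final class AvroHeaderCodec {
    /** Reads one Avro long (variable-length, zigzag-encoded). */
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated Avro header");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zigzag encoding
    }

    /** Reads a length-prefixed byte sequence (Avro "bytes" and "string" share this layout). */
    static byte[] readBytes(InputStream in) throws IOException {
        byte[] buf = new byte[(int) readLong(in)];
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    /** Returns the avro.codec metadata value, or "null" if the header has none. */
    static String codecOf(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        new DataInputStream(in).readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1)
            throw new IOException("not an Avro container file");
        String codec = "null";
        // The metadata map is a series of blocks, terminated by a block count of 0.
        for (long n = readLong(in); n != 0; n = readLong(in)) {
            if (n < 0) { n = -n; readLong(in); } // negative count: skip the block byte size
            for (long i = 0; i < n; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                byte[] value = readBytes(in);
                if (key.equals("avro.codec")) codec = new String(value, StandardCharsets.UTF_8);
            }
        }
        return codec;
    }

    /** Convenience overload for header bytes already in memory. */
    static String codecOf(byte[] header) {
        try {
            return codecOf(new ByteArrayInputStream(header));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An integration test could open one of the files written for compressed_zlib_avro_test and assert that the returned codec is "deflate"; how to find that file depends on the storage layout and is left open here.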

changli6 · 2018-07-19T11:29:59Z

A global, session-level Avro compression setting may affect other Avro-related Spark jobs, and spark-avro does not yet support a DataFrame-level compression setting. We have to defer this feature until Spark implements a DataFrame-level setting for Avro files.

DB-6045 Support compression for avro external table

50c8e3a

changli6 assigned dgomezferro, wb14123, msirek, jyuanca, OlegMazurov, wtang16 and yxia92 Jul 17, 2018

jyuanca reviewed Jul 18, 2018


changli6 closed this Jul 19, 2018
