Custom InputFile / OutputFile providers for Spark #107
InputFile and OutputFile instances are supplied by a table's TableOperations. That provides a way to supply your own implementations by overriding the data source used by Spark.

We use MetacatTables to connect to our metastore instead of HadoopTables or HiveTables. The default Iceberg source for DSv2 uses HadoopTables (and will be moving to HiveTables). To override that, we have a subclass of the IcebergSource, called IcebergMetacatSource, that overrides how tables are loaded. I'd recommend using the same approach for your integration. The only thing you need to change is to supply a different ServiceLoader config with your source instead of the default.

Also, this will eventually be cleaner when Spark adds catalog support. You'll point your Spark configuration at a catalog implementation directly, and that catalog will allow you to instantiate tables with the right TableOperations.
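For illustration, swapping in your own source this way comes down to shipping a different ServiceLoader provider-configuration file on the classpath. Assuming the source is registered through Spark's DataSourceRegister interface (the package name below is hypothetical), the file could look like:

```
# META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
com.example.iceberg.IcebergMetacatSource
```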
The current DSv2 reader and writer implementations don't do that right now, though; they primarily use … Additionally, the …
This would require rewriting much of the Hadoop table operations logic, right? Are we perhaps saying that's the right layer of abstraction to work with? We wanted to treat Iceberg as a bit of a black box (see #92 (comment)): Iceberg is just a storage and organizational layer that takes care of the conventions of file paths in the metastore and the backing store; all we want to do is change how the bytes are written to a location that Iceberg has selected. (Comment edited because the thought didn't flow smoothly on the first iteration)
Finally, I don't see …
I'm closing this because discussion has moved to the Apache repo: apache/iceberg#12
It would be useful to allow the custom metadata described in #106 to be consumed by the Spark Data Source. For example, it would be helpful to encrypt the files upon writing them. But different users of the data source will want to use that custom metadata in different ways to inform how the data is read or written.
We therefore propose supporting a data source option that service loads an instance of the below interface in the data source reader and writer layer.
Below is an API sketch for such a plugin:
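A minimal sketch of what such a plugin could look like, assuming it hands out Iceberg's InputFile and OutputFile abstractions (the package, interface body, and method names below are assumptions, not the original proposal's code):

```java
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// Sketch of a pluggable IO layer for the Spark data source. An implementation
// could, for example, return files that encrypt bytes on write and decrypt
// them on read, driven by the table's custom metadata.
public interface IcebergSparkIO {
  // File used by the reader to open data at the given location.
  InputFile newInputFile(String location);

  // File used by the writer to create data at the given location.
  OutputFile newOutputFile(String location);
}
```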
It is difficult, however, to make IcebergSparkIO Serializable. If we're not careful, then, we would have to service load the implementation on every executor, possibly multiple times. We propose instead to service load a provider class that can be passed the data source options data structure, so that the plugin only has to be service loaded once and can be serialized for distribution to the executor nodes. Therefore we also require the below interface:
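A corresponding sketch of that provider, assuming it is built from Spark's data source options map (names and signatures here are illustrative):

```java
import java.io.Serializable;
import java.util.Map;

// Sketch of the provider: it is service loaded once on the driver, is
// Serializable so it can be shipped to executors, and builds the (possibly
// non-serializable) IcebergSparkIO from the data source options there.
public interface IcebergSparkIOProvider extends Serializable {
  IcebergSparkIO create(Map<String, String> options);
}
```

With this split, only the lightweight provider crosses the wire; each executor calls create(options) locally to obtain its IcebergSparkIO.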