feat: Unity Catalog writes using daft.DataFrame.write_deltalake() (#3522)
This pull request covers the four workflows below, which were tested internally (on Databricks on Azure and AWS) after building the package in a local environment. A combined code sketch appears at the end of this description.

- Load an existing table in Unity Catalog and append to it without a schema change: `df.write_deltalake(uc_table, mode='append')` on an existing UC table retrieved via `unity.load_table(table_name)`.
- Load an existing table in Unity Catalog and overwrite it without a schema change: `df.write_deltalake(uc_table, mode='overwrite')` on an existing UC table retrieved via `unity.load_table(table_name)`.
- Load an existing table in Unity Catalog and overwrite it with a schema change: `df.write_deltalake(uc_table, mode='overwrite', schema_mode='overwrite')` on an existing UC table retrieved via `unity.load_table(table_name)`.
- Create a new table in Unity Catalog using the Daft engine and populate it with data: register a new table in UC without any schema using `unity.load_table(table_name, storage_path="<some_valid_cloud_storage_path>")`, then write with `df.write_deltalake(uc_table, mode='overwrite', schema_mode='overwrite')`.

A few notes:

- `deltalake` (0.22.3) does not support writes to tables with Deletion Vectors enabled. For appends to an existing table, ensure Deletion Vectors are disabled on the target table to avoid `CommitFailedError: Unsupported reader features required: [DeletionVectors]`.
- The pinned dependency `httpx==0.27.2` works around a defect in unitycatalog-python that also affects all previous Daft versions; this PR fixes it.
- If Daft performs a schema update, readers see the new schema immediately, since the Delta log is self-contained. For the schema to update in the Unity Catalog UI, however, run `REPAIR TABLE catalog.schema.table SYNC METADATA;` from Databricks compute so that UC metadata matches the Delta log.
- In this version, appending to an existing table after changing its schema is not supported; only overwrites are supported.
- For AWS, the environment variable `export AWS_S3_ALLOW_UNSAFE_RENAME=true` had to be set.
- There appears to be a defect with the `allow_unsafe_rename` parameter of `df.write_deltalake`, as it did not work during internal testing. Once confirmed, this could be logged as a new issue.

---------

Co-authored-by: Kev Wang <[email protected]>
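For reference, here is a minimal end-to-end sketch of the four workflows above. The endpoint, token, table names, and storage path are placeholders, and the `UnityCatalog` client setup is assumed from Daft's Unity Catalog integration; the `load_table` and `write_deltalake` calls mirror the ones in this description.

```python
import daft
from daft.unity_catalog import UnityCatalog

# Placeholder endpoint/token -- substitute your workspace URL and access token.
unity = UnityCatalog(
    endpoint="https://<workspace>.cloud.databricks.com",
    token="<databricks-personal-access-token>",
)

df = daft.from_pydict({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# 1. Append to an existing UC table (no schema change; Deletion Vectors
#    must be disabled on the target table).
uc_table = unity.load_table("my_catalog.my_schema.my_table")
df.write_deltalake(uc_table, mode="append")

# 2. Overwrite an existing UC table (no schema change).
df.write_deltalake(uc_table, mode="overwrite")

# 3. Overwrite an existing UC table, replacing its schema as well.
df.write_deltalake(uc_table, mode="overwrite", schema_mode="overwrite")

# 4. Register a new, schema-less UC table at a storage path, then populate it.
new_table = unity.load_table(
    "my_catalog.my_schema.my_new_table",
    storage_path="s3://my-bucket/path/to/my_new_table",  # any valid cloud storage path
)
df.write_deltalake(new_table, mode="overwrite", schema_mode="overwrite")
```

On AWS, remember to `export AWS_S3_ALLOW_UNSAFE_RENAME=true` before running the script, per the notes above.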