title | slug | keyword | license |
---|---|---|---|
Gravitino connector - Iceberg catalog | /trino-connector/catalog-iceberg | gravitino connector trino | Copyright 2023 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2. |
Apache Iceberg is an open table format for huge analytic datasets. The Iceberg catalog allows Trino to query data stored in files written in the Iceberg format, as defined in the Iceberg Table Spec. The catalog supports Apache Iceberg table spec versions 1 and 2.
To use Iceberg, you need:
- Network access from the Trino coordinator and workers to the distributed object storage.
- Access to a Hive metastore service (HMS), an AWS Glue catalog, a JDBC catalog, a REST catalog, or a Nessie server.
- Data files stored in a supported file format, which can be configured using file format configuration properties per catalog:
  - ORC
  - Parquet (default)
The Gravitino connector currently supports basic Iceberg table creation statements, such as defining fields, allowing null values, and adding comments. However, it does not support advanced features like partitioning, sorting, and distribution.
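For example, a minimal table definition with nullable columns and comments is supported; the statement below is a sketch using illustrative schema and table names:

```sql
CREATE TABLE "test.iceberg_test".database_01.employees
(
    name varchar COMMENT 'employee name',
    salary int
)
COMMENT 'basic employee table';
```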
The Gravitino connector does not support `CREATE TABLE AS SELECT`.
The Gravitino connector supports the following `ALTER TABLE` operations (see the examples below):
- Rename table
- Add a column
- Drop a column
- Rename a column
- Change a column type
- Set a table property
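For example, renaming a column and changing a column type follow standard Trino `ALTER TABLE` syntax; the statements below are a sketch against the `table_01` table used later in this document:

```sql
ALTER TABLE "test.iceberg_test".database_01.table_01 RENAME COLUMN name TO full_name;
ALTER TABLE "test.iceberg_test".database_01.table_01 ALTER COLUMN salary SET DATA TYPE bigint;
```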
The Gravitino connector supports most `SELECT` statements. Currently, it does not support certain query optimizations, such as pushdown and pruning.
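For example, the filter in the query below runs successfully, but is evaluated by Trino rather than pushed down to the underlying Iceberg data (names are illustrative):

```sql
SELECT name FROM "test.iceberg_test".database_01.table_01 WHERE salary > 10;
```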
Iceberg's tables and schemas do not support properties.
You need to complete the following steps before you can use the Iceberg catalog in Trino through Gravitino.
- Create a metalake and catalog in Gravitino. Assuming the metalake name is `test` and the catalog name is `iceberg_test`, you can use the following code to create them in Gravitino:
```bash
curl -X POST -H "Content-Type: application/json" \
-d '{
  "name": "test",
  "comment": "comment",
  "properties": {}
}' http://gravitino-host:8090/api/metalakes
```

```bash
curl -X POST -H "Content-Type: application/json" \
-d '{
  "name": "iceberg_test",
  "type": "RELATIONAL",
  "comment": "comment",
  "provider": "lakehouse-iceberg",
  "properties": {
    "uri": "thrift://hive-host:9083",
    "catalog-backend": "hive",
    "warehouse": "hdfs://hdfs-host:9000/user/iceberg/warehouse"
  }
}' http://gravitino-host:8090/api/metalakes/test/catalogs
```
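To verify that the catalog was created, you can list the catalogs in the `test` metalake (a sketch assuming the standard Gravitino REST endpoint):

```bash
curl http://gravitino-host:8090/api/metalakes/test/catalogs
```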
For more information about the Iceberg catalog, please refer to the Iceberg catalog documentation.
- Set the value of the configuration `gravitino.metalake` to the metalake you have created, named `test`, and start the Trino container.
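As a sketch, the Trino catalog configuration for the Gravitino connector might look like the following, assuming the properties file is `etc/catalog/gravitino.properties` and the Gravitino server address shown earlier:

```text
# etc/catalog/gravitino.properties (sketch; the file path and gravitino.uri value are assumptions)
connector.name=gravitino
gravitino.uri=http://gravitino-host:8090
gravitino.metalake=test
```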
Use the Trino CLI to connect to the Trino container and run a query.
List all Gravitino-managed catalogs:

```sql
SHOW CATALOGS;
```
The results are similar to:
```text
      Catalog
-------------------
 gravitino
 jmx
 system
 test.iceberg_test
(4 rows)

Query 20231017_082503_00018_6nt3n, FINISHED, 1 node
```
The `gravitino` catalog is a catalog defined by the Trino catalog configuration.
The `test.iceberg_test` catalog is the catalog you created in Gravitino.
Other catalogs are regular user-configured Trino catalogs.
Create a new schema named `database_01` in the `test.iceberg_test` catalog:

```sql
CREATE SCHEMA "test.iceberg_test".database_01;
```
Create a new table named `table_01` in the schema `"test.iceberg_test".database_01`:

```sql
CREATE TABLE "test.iceberg_test".database_01.table_01
(
    name varchar,
    salary int
);
```
Insert data into the table `table_01`:

```sql
INSERT INTO "test.iceberg_test".database_01.table_01 (name, salary) VALUES ('ice', 12);
```
Insert data into the table `table_01` from a `SELECT` query:

```sql
INSERT INTO "test.iceberg_test".database_01.table_01 (name, salary) SELECT * FROM "test.iceberg_test".database_01.table_01;
```
Query the `table_01` table:

```sql
SELECT * FROM "test.iceberg_test".database_01.table_01;
```
Add a new column `age` to the `table_01` table:

```sql
ALTER TABLE "test.iceberg_test".database_01.table_01 ADD COLUMN age int;
```
Drop the column `age` from the `table_01` table:

```sql
ALTER TABLE "test.iceberg_test".database_01.table_01 DROP COLUMN age;
```
Rename the `table_01` table to `table_02`:

```sql
ALTER TABLE "test.iceberg_test".database_01.table_01 RENAME TO "test.iceberg_test".database_01.table_02;
```
Drop a schema:

```sql
DROP SCHEMA "test.iceberg_test".database_01;
```

Drop a table:

```sql
DROP TABLE "test.iceberg_test".database_01.table_01;
```
Before running any `INSERT` statements for Iceberg tables in Trino, you must check that the user Trino uses to access HDFS has access to the warehouse directory. You can override this username by setting the `HADOOP_USER_NAME` system property in the Trino JVM config, replacing `hdfs_user` with the appropriate username:

```text
-DHADOOP_USER_NAME=hdfs_user
```
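For example, a sketch of an `etc/jvm.config` excerpt with this property added (the file path and the surrounding flag are assumptions; keep your existing JVM options):

```text
# etc/jvm.config (excerpt; the -Xmx value is illustrative)
-Xmx8G
-DHADOOP_USER_NAME=hdfs_user
```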