Iceberg¶
Dataiku can interact with Iceberg catalogs to read and write Iceberg tables.
Iceberg tables are defined in a catalog, but the catalog usually holds only a small amount of information: the tables’ data and metadata are stored in files. These files are accessed from the client side, for instance by Dataiku.
Connecting to a catalog¶
Leveraging Iceberg in Dataiku starts with creating a connection to an Iceberg catalog. Each Iceberg connection in Dataiku targets a single warehouse in the underlying Iceberg catalog. The Iceberg ecosystem is still evolving, and there are many implementations of the Iceberg catalog spec. Dataiku connects using the standard Iceberg Java libraries.
REST catalogs¶
Many Iceberg catalogs use the REST API, or implement the REST API on top of their own interfaces. The Polaris catalog is a well-known implementation, and the Snowflake Open Catalog is notably a Polaris catalog. The Nessie catalog also offers a REST interface on top of its own, and likewise the Glue catalog offers a REST interface.
To connect to a REST catalog from Dataiku, your Iceberg connection needs:
the catalog type set to REST
the URI field pointing to the REST endpoint of the catalog
the name of a Warehouse defined in the catalog
credentials, either User/Password or OAuth2, depending on what’s available on the catalog
More often than not, additional Catalog properties are needed. For example, most implementations will ask for a scope property; a typical value on Polaris catalogs is scope -> PRINCIPAL_ROLE:ALL. Similarly, to add HTTP request headers to calls to the catalog, you need to add catalog properties prefixed with header.. For example, vended credentials can be activated by adding the header.X-Iceberg-Access-Delegation -> vended-credentials catalog property, which sets the X-Iceberg-Access-Delegation header on requests to the catalog.
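For illustration, here is roughly what the same settings look like when connecting to such a REST catalog from a Python script with the pyiceberg library (this is not how Dataiku itself connects; the URI, warehouse name and OAuth2 credentials below are placeholders):

```python
from pyiceberg.catalog import load_catalog

# Hypothetical Polaris-style REST catalog; URI, warehouse and credential are placeholders.
catalog = load_catalog(
    "my_rest_catalog",
    **{
        "type": "rest",
        "uri": "https://polaris.example.com/api/catalog",
        "warehouse": "my_warehouse",
        # OAuth2 client credentials, formatted as <client_id>:<client_secret>
        "credential": "my-client-id:my-client-secret",
        # Most implementations expect a scope; a common value on Polaris catalogs:
        "scope": "PRINCIPAL_ROLE:ALL",
        # 'header.'-prefixed properties become HTTP headers on catalog requests,
        # here asking the catalog to vend scoped storage credentials.
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
    },
)

print(catalog.list_namespaces())
```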
Glue¶
On AWS, you can use Glue as a catalog to store Iceberg table definitions. The data files will be stored on S3, and you can additionally query them via AWS Athena. This catalog also handles Iceberg tables stored in S3 Tables.
Credentials for accessing Glue and for accessing S3 (to get the files) come from the default AWS credentials provider chain. To ensure a specific set of credentials is used, pass the name of an S3 connection as S3/Glue credentials: Dataiku will then use that connection to provide credentials for accessing not only S3 but also Glue.
To connect to a Glue catalog from Dataiku, your Iceberg connection needs:
the catalog type set to Glue
the Glue id, which is the AWS account id
as Warehouse, the path to some location in S3 where the tables’ metadata and data will be stored
the Region set to the desired AWS region
The Glue catalog maps Glue “databases” to Iceberg namespaces. This implies that the Glue catalog only has one level of namespaces (the Glue databases) and doesn’t handle hierarchical namespaces.
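As an outside-of-Dataiku illustration, here is a minimal pyiceberg sketch of a Glue catalog; the property names follow pyiceberg’s Glue catalog options, the account id, region and warehouse location are placeholders, and credentials are assumed to come from the default AWS chain:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical account id, region and warehouse path.
catalog = load_catalog(
    "my_glue_catalog",
    **{
        "type": "glue",
        "glue.id": "123456789012",      # AWS account id holding the Glue catalog
        "glue.region": "eu-west-1",
        # Location under which table metadata and data files are written
        "warehouse": "s3://my-bucket/iceberg-warehouse/",
    },
)

# Glue "databases" appear as single-level Iceberg namespaces
print(catalog.list_namespaces())
```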
Glue as REST catalog¶
On top of the classical Glue catalog, AWS Glue offers a REST interface which can be used to set up a REST Iceberg catalog.
To connect to Glue as a REST catalog from Dataiku, your Iceberg connection needs:
the catalog type set to REST
the URI field pointing to the https://glue.<aws-region>.amazonaws.com/iceberg endpoint
the AWS account ID as Warehouse
the Authentication type set to AWS Signature V4
the Signing name set to glue
Typically you also need to pass an S3 connection in the S3/Glue credentials field, so that Dataiku can get credentials for Glue and for the S3 objects of the Iceberg tables. If left empty, the default AWS credentials provider chain is used.
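For comparison, a sketch of the same REST-plus-SigV4 setup with pyiceberg; the SigV4 property names follow pyiceberg’s REST catalog options, the region and account id are placeholders, and AWS credentials come from the default provider chain:

```python
from pyiceberg.catalog import load_catalog

# Region and account id are placeholders.
catalog = load_catalog(
    "glue_rest",
    **{
        "type": "rest",
        "uri": "https://glue.eu-west-1.amazonaws.com/iceberg",
        "warehouse": "123456789012",      # AWS account id
        "rest.sigv4-enabled": "true",     # sign requests with AWS Signature V4
        "rest.signing-name": "glue",
        "rest.signing-region": "eu-west-1",
    },
)
```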
S3Tables as REST catalog¶
S3 Tables is an Iceberg catalog entirely managed by AWS. Data and metadata are stored in special S3 buckets called table buckets. To use S3 Tables from Dataiku, you need to set up a REST Iceberg connection with:
the catalog type set to REST
the URI field pointing to the https://s3tables.<aws-region>.amazonaws.com/iceberg endpoint
the Warehouse set to the ARN of the table bucket to use
the Authentication type set to AWS Signature V4
the Signing name set to s3tables
an additional Catalog property: io-impl -> com.dataiku.dss.shadelibawssk2.org.apache.iceberg.aws.s3.S3FileIO
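The io-impl value above appears to be a class name shaded inside Dataiku’s Java libraries, so it is specific to the Dataiku connection. For illustration, the equivalent setup from pyiceberg does not need it; the table bucket ARN and region below are placeholders:

```python
from pyiceberg.catalog import load_catalog

# Table bucket ARN and region are placeholders.
catalog = load_catalog(
    "s3tables",
    **{
        "type": "rest",
        "uri": "https://s3tables.eu-west-1.amazonaws.com/iceberg",
        "warehouse": "arn:aws:s3tables:eu-west-1:123456789012:bucket/my-table-bucket",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": "eu-west-1",
    },
)
```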
Snowflake¶
Snowflake offers several catalogs for Iceberg. In Dataiku, the “Snowflake” catalog corresponds to the “Snowflake as a catalog” one.
The setup needed for the Snowflake Iceberg catalog is close to the setup needed for a standard Snowflake connection, since it’s calling the Snowflake APIs under the hood:
the catalog type set to Snowflake
the Host set to the host of your Snowflake account
the Role set to the role to connect as
the Authentication type can be User/Password, OAuth2 or Keypair
Nessie¶
Nessie catalogs serve the Iceberg catalog on two URLs, each implementing a different interface:
the http[s]://host:port/iceberg endpoint serves a REST catalog
the http[s]://host:port/api/v2 endpoint serves a Nessie catalog
To use a Nessie catalog through its Nessie interface in Dataiku, the setup is as follows:
the catalog type set to Nessie
the URI set to the URI of the Nessie catalog’s Nessie interface (ending in /api/v2)
the Warehouse set to the name of a warehouse defined in Nessie
the Authentication type can be User/Password or OAuth2
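Dataiku’s Nessie catalog type uses the Nessie interface described above. For reference, the /iceberg endpoint can also be consumed as a plain REST catalog by clients that only speak the REST spec, such as pyiceberg; in this sketch the host, warehouse and token are placeholders:

```python
from pyiceberg.catalog import load_catalog

# Host, warehouse and bearer token are placeholders.
catalog = load_catalog(
    "nessie_rest",
    **{
        "type": "rest",
        "uri": "https://nessie.example.com:19120/iceberg",  # Nessie's REST interface
        "warehouse": "my_warehouse",
        "token": "my-bearer-token",  # or OAuth2 'credential' + 'scope' properties
    },
)
```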
Hadoop¶
Hadoop catalogs encode the namespace and table definitions as a folder hierarchy on an HDFS-compatible filesystem. To use an HDFS-compatible filesystem as an Iceberg catalog, the Iceberg connection should have:
the Catalog type set to Hadoop
the Warehouse set to an HDFS URI, like hdfs://host:port/path or s3a://bucket/path/
Note that Dataiku performs the access to the data and metadata files itself, and therefore needs to choose a Hadoop identity for these accesses. If Use impersonated user is left unticked, Dataiku uses the identity of the UNIX user running the Dataiku backend. If it is ticked, Dataiku uses the identity of the Dataiku user performing the action.
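pyiceberg has no Hadoop catalog implementation, so as a client-side illustration here is a minimal PySpark sketch of a Hadoop catalog, assuming the Iceberg Spark runtime jar is already on the classpath and using placeholder names and paths:

```python
from pyspark.sql import SparkSession

# Catalog name and warehouse URI are placeholders; the Iceberg Spark runtime
# jar must already be available to Spark (see "Supported engines" below).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.hadoop_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_cat.type", "hadoop")
    .config("spark.sql.catalog.hadoop_cat.warehouse", "hdfs://namenode:8020/iceberg/warehouse")
    .getOrCreate()
)

# Namespaces and tables appear as a folder hierarchy under the warehouse path
spark.sql("SHOW NAMESPACES IN hadoop_cat").show()
```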
Hive¶
This catalog lets Dataiku access Iceberg tables whose definition is held by a Hive metastore. The Iceberg connection’s settings should have:
the Catalog type set to Hive
the Metastore Uri set to the URI of the Hive metastore (usually a thrift://host:port)
the Warehouse set to an HDFS URI pointing to a Hive warehouse
Note that Dataiku performs the access to the data and metadata files itself, and therefore needs to choose a Hadoop identity for these accesses. If Use impersonated user is left unticked, Dataiku uses the identity of the UNIX user running the Dataiku backend. If it is ticked, Dataiku uses the identity of the Dataiku user performing the action.
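For illustration, the same settings expressed with pyiceberg’s Hive catalog (this requires pyiceberg’s Hive extra; the metastore URI and warehouse path are placeholders):

```python
from pyiceberg.catalog import load_catalog

# Metastore URI and warehouse path are placeholders.
catalog = load_catalog(
    "my_hive_catalog",
    **{
        "type": "hive",
        "uri": "thrift://metastore-host:9083",
        "warehouse": "hdfs://namenode:8020/user/hive/warehouse",
    },
)
```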
JDBC¶
Any SQL database with a JDBC driver can be used to hold an Iceberg catalog. The database only contains the table and namespace names and the location of the metadata; the metadata and the data live as files on cloud storage. The Iceberg project intends this type of catalog for testing and debugging purposes.
To use a SQL database to hold an Iceberg catalog, you need to create an Iceberg connection with:
the catalog type set to JDBC
the URI set to the JDBC URL of the SQL database (like jdbc:postgresql://host:port/db)
the Warehouse set to the URI of the location where the data and metadata files are stored (like s3://bucket/path/)
the Driver class set to the class name of the JDBC driver to use
the Driver jars directory pointing to the location of the jar of the JDBC driver on the local filesystem
the Authentication type can be User/Password or OAuth2
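pyiceberg’s counterpart is its SQL catalog, which takes a SQLAlchemy-style URL where Dataiku and the Java libraries take a JDBC URL; a sketch with placeholder values:

```python
from pyiceberg.catalog import load_catalog

# Database URL and warehouse location are placeholders; note the
# SQLAlchemy-style URI rather than a JDBC URL.
catalog = load_catalog(
    "my_sql_catalog",
    **{
        "type": "sql",
        "uri": "postgresql+psycopg2://user:password@host:5432/iceberg_db",
        "warehouse": "s3://my-bucket/iceberg-warehouse/",
    },
)
```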
Namespaces¶
Iceberg catalogs have a notion of namespacing: tables are grouped in namespaces, and namespaces can themselves be grouped in parent namespaces, yielding a hierarchical structure. Some catalogs, like Glue or Hive, don’t support a nested hierarchy. Dataiku can use namespaces and sub-namespaces:
Iceberg datasets have a Namespace field in which the namespace of the table can be set
Iceberg connections have a Default namespace field with a value to use when the Namespace field of the dataset is left empty
to point to a sub-namespace, you pass the namespaces dot-separated, like: namespace1.namespace2.namespace3 (which is namespace3 inside namespace2 inside namespace1)
Namespaces are referred to by their name, and the Iceberg spec allows almost all names to be used. To use namespace names containing dots, quoting should be used. For example, ns1."ns.2"."ns-3" denotes ns-3 inside ns.2 inside ns1.
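In client libraries such as pyiceberg, identifiers are passed as tuples, which makes the nesting explicit and sidesteps quoting; a small sketch assuming an already-configured catalog named my_rest_catalog and a hypothetical table my_table:

```python
from pyiceberg.catalog import load_catalog

# Assumes catalog properties are defined elsewhere (e.g. in .pyiceberg.yaml)
catalog = load_catalog("my_rest_catalog")

# Nested namespaces are passed as tuples, so dots in names need no quoting:
catalog.create_namespace(("ns1", "ns.2", "ns-3"))

# Equivalent to the quoted dotted form ns1."ns.2"."ns-3"
table = catalog.load_table(("ns1", "ns.2", "ns-3", "my_table"))
```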
Credentials¶
To process Iceberg tables, the client side (Dataiku, for instance) needs to access the files, which often live on external systems like cloud storage, for which credentials are required.
The Iceberg spec has provisions for vended credentials: credentials that the catalog prepares and scopes to just what is needed to access the table’s contents, and returns along with the table’s definition when it is fetched. You can activate them by adding a catalog property to the connection in Dataiku: vended-credentials-enabled -> true.
When vended credentials are not available, or won’t last long enough to cover the duration of the jobs submitted by Dataiku, you can specify names of Dataiku connections to grab credentials from in the Cloud storage credentials section. These credentials will be renewed as needed.
Unless Add endpoint properties is ticked, only credentials-related values are passed from the connections referenced in the Cloud storage credentials section to the Iceberg libraries. When ticked, additional properties for connecting to the cloud storage are passed, typically the S3 endpoint and GCS endpoint. This is needed for example for data stored on MinIO, whose endpoint and path style differ from the S3 defaults.
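Outside Dataiku, the same idea of handing the storage endpoint and credentials to the Iceberg client can be sketched with pyiceberg FileIO properties; the endpoint and keys below are placeholders for a MinIO-style setup:

```python
from pyiceberg.catalog import load_catalog

# Placeholders for an S3-compatible endpoint (e.g. MinIO) and static credentials.
catalog = load_catalog(
    "my_rest_catalog",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",
        "warehouse": "my_warehouse",
        # FileIO properties: where and how the client reads/writes the data files
        "s3.endpoint": "https://minio.example.com:9000",
        "s3.access-key-id": "my-access-key",
        "s3.secret-access-key": "my-secret-key",
    },
)
```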
Supported engines¶
Iceberg tables can be read from and written to by Dataiku, so all recipes using the Dataiku Stream engine will work. Additionally, you can use Spark and, if configured, a Trino connection to process the tables.
If you are not using a Spark provided by a CDP cluster or the Spark standalone package, you need to ensure the Spark installation has the jars needed to handle Iceberg tables.
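As a generic, non-Dataiku illustration of what “having the jars” means, a standalone PySpark session can pull the Iceberg Spark runtime through spark.jars.packages; the artifact version below is only an example and must match your Spark and Scala versions:

```python
from pyspark.sql import SparkSession

# Example coordinates only: pick the iceberg-spark-runtime artifact matching
# your Spark and Scala versions.
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1",
    )
    .getOrCreate()
)
```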
Trino can create catalogs (in the Trino sense) to handle Iceberg tables. To leverage a Trino cluster to process Iceberg datasets in Dataiku, you need to create a Trino connection to the cluster, and:
reference it as Associated Trino conn. in the Iceberg connection
pass the name of the Trino catalog to use as Associated Trino catalog. This Trino catalog should point to the Iceberg catalog of the Iceberg connection
Running SQL recipes off Iceberg datasets will then become possible.
Notes¶
Dataiku doesn’t support Iceberg’s bucket partitioning scheme, only identity, year, month, day and hour. Dataiku will not be able to push down partition filters for dimensions using other partitioning types.
While Iceberg supports using ORC or Avro as file format for the data and metadata files, Dataiku only supports Parquet, both for reading and writing.
Some Iceberg data types are not supported by Dataiku: variant, geometry, geography, uuid.
While Iceberg offers time travel capabilities over Iceberg tables’ data, Dataiku doesn’t support this feature.