Trino/Starburst

Warning

Tier 2 support: Connection to Trino/Starburst is covered by Tier 2 support

Note

You might want to start with our resources on data connections in the Knowledge Base.

DSS supports connecting to Trino clusters, including Starburst ones:

  • Reading and writing datasets

  • Executing SQL recipes

  • Performing visual recipes in-database

  • Using the live engine for charts
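
For example, reading and writing Trino-backed datasets from a Python code recipe uses the standard dataiku package. A minimal sketch, with hypothetical dataset names trino_input and trino_output:

  import dataiku

  # Read the input dataset; DSS streams the rows from Trino over JDBC
  input_dataset = dataiku.Dataset("trino_input")
  df = input_dataset.get_dataframe()

  # ... transform df with pandas as needed ...

  # Write the result back to a Trino-backed dataset; if automatic fast-write
  # is configured on the connection (see "Writing data into Trino" below),
  # DSS stages the data in cloud storage instead of issuing row-by-row INSERTs
  output_dataset = dataiku.Dataset("trino_output")
  output_dataset.write_with_schema(df)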

Connection setup (Dataiku Cloud Stacks or Dataiku Custom)

The Trino JDBC driver comes preinstalled in DSS and does not need to be installed manually.

  • Fill in the settings of the connection using your Trino/Starburst information.

  • (Optional but recommended) Fill in the auto-fast-write settings - see Writing data into Trino below
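
If you manage connections programmatically, the same settings can be reviewed through the DSS public API. A minimal sketch, assuming the dataikuapi package, an admin API key, and a hypothetical connection named “trino-prod”:

  import dataikuapi

  # Instance URL and API key are placeholders
  client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")

  # Fetch the connection definition; the exact fields under "params" depend
  # on your DSS version, so print them rather than relying on specific keys
  connection = client.get_connection("trino-prod")
  definition = connection.get_definition()
  print(definition["params"])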

Authenticate using JWT tokens

DSS can authenticate to Trino using JWT tokens, which DSS retrieves from an OAuth2 provider. The provider authenticates the user and gives DSS an access token in JWT form; the Trino cluster verifies the validity of the token, then uses the identity and roles defined in it for its own permission system.

OAuth2 access is performed either using “per-user” credentials, or using “global” client credentials.
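
Conceptually, the identity and roles travel in the JWT’s payload. The snippet below is an illustration only: it builds and decodes a dummy, unsigned token, and the claim names (sub, roles) are assumptions that depend on your OAuth2 provider.

  import base64
  import json

  # Build a dummy JWT-like payload for illustration (not a signed token)
  claims_in = {"sub": "alice@example.com", "roles": ["analyst"]}
  payload = base64.urlsafe_b64encode(json.dumps(claims_in).encode()).rstrip(b"=")
  token = b".".join([b"<header>", payload, b"<signature>"]).decode()

  # What the Trino cluster conceptually does: extract identity and roles
  # from the payload (after verifying the signature, omitted here)
  payload_b64 = token.split(".")[1]
  payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
  claims = json.loads(base64.urlsafe_b64decode(payload_b64))
  print(claims["sub"], claims["roles"])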

Writing data into Trino

Loading data into a Trino dataset in DSS using regular SQL INSERT statements is inefficient and should only be used for extremely small datasets.

The recommended way to load data into a Trino table is through a bulk COPY from files stored in cloud storage.

DSS can automatically use this fast-load method. For that, you need an S3, GCS or Azure Blob connection, to which DSS automatically writes the temporary data before using a bulk COPY INTO statement in Trino. Then, in the settings of the Trino connection:

  • Enable “Automatic fast-write”

  • In “Auto fast write connection”, enter the name of the cloud storage connection to use

  • In “Path in connection”, enter a relative path to the root of the cloud storage connection, such as “db-tmp”. This is a temporary path used to store temporary upload files. It should not be a path containing datasets.

  • In “Catalog for fast-load”, enter the name of a Trino catalog that covers the cloud storage location of the connection you entered for “Auto fast write connection”. The catalog needs to allow external locations, which implies it is a catalog using the Hive connector. If left empty, DSS will use the same catalog as the target table.
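
For scripted setups, these settings can also be applied through the public API. A hedged sketch, assuming the dataikuapi package; the parameter keys below are guesses mirroring the UI labels, so inspect the printed definition on your instance before relying on them:

  import dataikuapi

  client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
  connection = client.get_connection("trino-prod")  # hypothetical name
  definition = connection.get_definition()

  params = definition["params"]
  print(params)  # check the real key names on your instance first

  # All four keys below are assumptions mirroring the UI labels above
  params["autoFastWrite"] = True            # "Automatic fast-write"
  params["fastWriteConnection"] = "s3-tmp"  # "Auto fast write connection"
  params["fastWritePath"] = "db-tmp"        # "Path in connection"
  params["fastWriteCatalog"] = "hive"       # "Catalog for fast-load"

  connection.set_definition(definition)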

DSS will now automatically use the optimal cloud-to-Trino copy mechanism when executing a recipe that needs to load data “from the outside” into Trino, such as a code recipe.

Note that this does not apply when running visual recipes directly in-database, as the data does not move outside of the database.

Explicit sync from cloud

In addition to the automatic fast-write that happens transparently each time a recipe must write into Trino, the Sync recipe also has an explicit “Cloud to Trino” engine. This is faster than automatic fast-write because it does not copy to the temporary location in the cloud storage first.

It will be used automatically if the following constraints are met:

  • The source dataset is stored on S3, GCS or Azure Blob Storage

  • The destination dataset is stored on Trino

Note that the Trino fast-path may still fail:

  • if the catalog of the target dataset is not a Hive catalog in Trino, and “Catalog for fast-load” isn’t set on the Trino connection

  • if the Trino/Starburst cluster has no actual access to the cloud storage (for example Starburst clusters can only work on one cloud storage at a time)

In that case, the Sync recipe’s engine should be forced to DSS, and “automatic fast-write” should be set up on the Trino connection.
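
The engine can be forced in the recipe’s settings in the UI, or through the public API. A hedged sketch, assuming the dataikuapi package; the project key, recipe name, and the engineType parameter key are assumptions:

  import dataikuapi

  client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
  recipe = client.get_project("MYPROJECT").get_recipe("sync_to_trino")

  settings = recipe.get_settings()
  # "engineType" is an assumed key; print get_recipe_params() to confirm
  settings.get_recipe_params()["engineType"] = "DSS"
  settings.save()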

Unloading data from Trino to Cloud

Unloading data from Trino directly to DSS using JDBC is reasonably fast. However, if you need to unload data from Trino to S3, GCS or Azure Blob Storage, the sync recipe has a “Trino to Cloud” engine that implements a faster path.

In order to use the Trino to Cloud sync, the following conditions must be met:

  • The source dataset must be stored on Trino

  • The destination dataset must be stored on Amazon S3, GCS or Azure Blob Storage

Advanced install of the JDBC driver

Note

This feature is not available on Dataiku Cloud.

The Trino JDBC driver comes preinstalled in DSS and does not usually need to be installed manually. If you need to customize the JDBC driver, follow these instructions:

The Trino JDBC driver can be downloaded from Maven repositories (such as https://mvnrepository.com/artifact/io.trino/trino-jdbc)

The driver consists of a single JAR file, trino-jdbc-xxx.jar.

To install:

  • Copy this JAR file to the lib/jdbc/trino subdirectory of the DSS data directory (create it if necessary)

  • Restart DSS

  • In each Trino connection, switch the driver mode to “User provided” and enter “lib/jdbc/trino” as the Trino driver directory
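
As an illustration, the copy step can be scripted. A minimal sketch in Python, assuming the DSS data directory is /data/dss (adjust to your install) and the downloaded JAR sits in the current directory; restarting DSS and switching the connection to “User provided” remain manual steps:

  import shutil
  from pathlib import Path

  DATA_DIR = Path("/data/dss")  # assumed DSS data directory

  # Create lib/jdbc/trino under the data directory if necessary
  driver_dir = DATA_DIR / "lib" / "jdbc" / "trino"
  driver_dir.mkdir(parents=True, exist_ok=True)

  # Copy the downloaded driver; replace xxx with the actual version
  shutil.copy("trino-jdbc-xxx.jar", driver_dir)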