Snowflake

Note

You might want to start with our detailed tutorial for your first steps with SQL databases in DSS.

The rest of this page is reference information for Snowflake connections.

DSS supports the full range of features on Snowflake:

  • Reading and writing datasets
  • Executing SQL recipes
  • Performing visual recipes in-database
  • Using live engine for charts
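
For example, the first two items can be exercised from a Python recipe or notebook using the Dataiku Python API. The following is a minimal sketch; the dataset and table names are illustrative:

    # Minimal sketch of reading/writing Snowflake datasets from a Python recipe.
    # "snowflake_input", "snowflake_output" and "my_table" are illustrative names.
    import dataiku
    from dataiku import SQLExecutor2

    # Read a Snowflake dataset into a pandas DataFrame
    input_ds = dataiku.Dataset("snowflake_input")
    df = input_ds.get_dataframe()

    # Run SQL directly against the dataset's Snowflake connection
    executor = SQLExecutor2(dataset=input_ds)
    counts = executor.query_to_df("SELECT COUNT(*) AS n FROM my_table")

    # Write the result back to another dataset
    output_ds = dataiku.Dataset("snowflake_output")
    output_ds.write_with_schema(df)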

Installing the JDBC driver

The Snowflake JDBC driver can be downloaded from the Snowflake website (https://docs.snowflake.net/manuals/user-guide/jdbc.html)

The driver consists of a single JAR file, snowflake-jdbc-VERSION.jar

To install:

  • Copy this JAR file to the lib/jdbc subdirectory of the DSS data directory
  • Restart DSS
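
A minimal sketch of the copy step, assuming the DSS data directory is at /home/dataiku/dss_data and the driver JAR was downloaded to /tmp (both paths and the file name are illustrative):

    # Copy the Snowflake JDBC driver into the DSS data directory.
    # Paths and file name below are illustrative; adjust to your installation.
    import shutil
    from pathlib import Path

    data_dir = Path("/home/dataiku/dss_data")             # assumed DSS data directory
    driver_jar = Path("/tmp/snowflake-jdbc-VERSION.jar")  # replace VERSION with the downloaded version

    target_dir = data_dir / "lib" / "jdbc"
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(driver_jar, target_dir / driver_jar.name)
    # Then restart DSS (for example with DATA_DIR/bin/dss restart) so the driver is picked up.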

Writing data into Snowflake

Loading data into a Snowflake database using the regular SQL INSERT or COPY statements is very inefficient and should only be used for extremely small datasets.

The recommended way to load data into a Snowflake table is through a bulk COPY from files stored in Amazon S3 or Azure Blob Storage. DSS will automatically use optimal S3-to-Snowflake or AzureBlob-to-Snowflake copy mechanisms when using a Sync recipe. For more information, see below.

The exception to this rule is when executing a recipe from a Snowflake dataset to a Snowflake dataset using the in-database (SQL) engine.

In other words:

  • You should never have a Flow with a recipe that writes from a non-Snowflake source to a Snowflake dataset, except Sync recipes from S3 or Azure Blob Storage
  • The only recipe that should write from S3 or Azure Blob Storage into Snowflake is the “Sync” recipe
  • Snowflake to Snowflake recipes will be fast if and only if the “In-database (SQL)” engine is selected.

For example, if you have a table in Snowflake and want to use a recipe that does not have an “In-database (SQL)” engine, you should instead use two steps:

  • A first Snowflake-to-cloud-storage (S3 or Azure Blob) recipe
  • A cloud-storage-to-Snowflake Sync recipe
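
For reference, the efficient path ultimately relies on Snowflake's bulk COPY from cloud storage. The sketch below shows what such a statement looks like when issued manually with the snowflake-connector-python package; every account parameter, table, bucket and credential is a placeholder, and in DSS the Sync recipe issues the equivalent operation for you:

    # Illustration only: the kind of bulk COPY an S3-to-Snowflake load relies on.
    # Every identifier, path and credential below is a placeholder.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="***",
        warehouse="MY_WH",
        database="MY_DB",
        schema="PUBLIC",
    )
    try:
        conn.cursor().execute("""
            COPY INTO my_table
            FROM 's3://my-bucket/path/'
            CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
            FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)
        """)
    finally:
        conn.close()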

Syncing to/from S3

When running a Sync recipe where the source is S3 and the target is Snowflake, DSS will automatically use a bulk COPY from files stored in Amazon S3, which is the efficient way to load data into Snowflake.

Due to Snowflake restrictions, the following set of conditions must be met for a Sync recipe to be executed as direct copy:

  • The source dataset must be stored on Amazon S3 and the destination dataset must be stored on Snowflake, or vice-versa
  • The S3 bucket and the Snowflake cluster must be in the same Amazon AWS region
  • The S3 side must be stored with a CSV format, using the UTF-8 charset
  • The CSV separator must be a simple ASCII character
  • For the S3 side, the “Unix” CSV quoting style is not supported
  • For the S3 side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Snowflake column type.

Note that an intermediate Prepare recipe can be used to normalize an input S3 dataset to the above conditions, if the dataset one wants to load into Snowflake does not natively fulfill them.
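
If you want to verify these conditions programmatically before running the Sync, the Dataiku public API exposes the dataset definition. A rough sketch, assuming a reachable DSS instance and an API key; the host, project and dataset names, as well as the exact setting keys shown, are assumptions:

    # Rough sketch: inspect an S3 dataset's CSV format settings before syncing to Snowflake.
    # Host, API key, project and dataset names are placeholders; the setting keys
    # ("formatType", "formatParams", "charset", "separator") are assumptions.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    dataset = client.get_project("MYPROJECT").get_dataset("my_s3_dataset")
    definition = dataset.get_definition()

    fmt = definition.get("formatType")
    params = definition.get("formatParams", {})
    print("format:", fmt)                         # expected: csv
    print("charset:", params.get("charset"))      # expected: utf8
    print("separator:", params.get("separator"))  # must be a simple ASCII character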

Syncing to/from Azure Blob Storage

When running a Sync recipe where the source is Azure Blob Storage and the target is Snowflake, DSS will automatically use a bulk COPY from files stored in Azure Blob Storage, which is the efficient way to load data into Snowflake.

Due to Snowflake restrictions, the following set of conditions must be met for a Sync recipe to be executed as direct copy:

  • The source dataset must be stored on Azure Blob Storage and the destination dataset must be stored on Snowflake, or vice-versa
  • The storage account and the Snowflake cluster must be in the same Azure region
  • The blob storage side must be stored with a CSV format, using the UTF-8 charset
  • The CSV separator must be a simple ASCII character
  • For the blob storage side, the “Unix” CSV quoting style is not supported
  • For the blob storage side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Snowflake column type.

Note that an intermediate Prepare recipe can be used to normalize an input Azure Blob Storage dataset to the above conditions, if the dataset one wants to load into Snowflake does not natively fulfill them.

Installation and configuration

In order to take advantage of the native integration, generate a SAS Token, then save the token in the Azure Blob Storage connection settings in the “SAS Token” field.
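
If you manage connections programmatically, this can also be scripted through the public API. A hedged sketch, assuming administrator credentials; the parameter key holding the SAS token is an assumption and may differ in your DSS version (in the UI it is simply the “SAS Token” field):

    # Hedged sketch: store a SAS token in an Azure Blob Storage connection via the public API.
    # Host, API key and connection name are placeholders; the "sasToken" parameter key
    # is an assumption -- in the UI this is the "SAS Token" field.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "ADMIN_API_KEY")
    connection = client.get_connection("my_azure_blob_connection")

    definition = connection.get_definition()
    definition["params"]["sasToken"] = "<generated SAS token>"  # assumed key name
    connection.set_definition(definition)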

Spark native integration

The native integration with Spark allows Spark recipes reading from and/or writing to Snowflake datasets to directly exchange data with a Snowflake database. This results in a large increase in performance compared to the default method where data read from or written to Snowflake must be streamed through DSS first.

All Spark recipes which have a Snowflake dataset as an input or an output will automatically take advantage of the native integration with Snowflake once the installation is completed.
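
For example, a PySpark recipe that reads one Snowflake dataset and writes another needs no Snowflake-specific code. A minimal sketch, with illustrative dataset and column names:

    # Minimal PySpark recipe sketch. With the native integration enabled, reads and
    # writes on Snowflake datasets go through the Snowflake Connector for Spark
    # automatically. Dataset and column names are illustrative.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    input_ds = dataiku.Dataset("snowflake_input")
    df = dkuspark.get_dataframe(sqlContext, input_ds)

    result = df.groupBy("some_column").count()

    output_ds = dataiku.Dataset("snowflake_output")
    dkuspark.write_with_schema(output_ds, result)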

Installation and configuration

Two external JAR files must be installed in DSS:

  • The same Snowflake JDBC driver mentioned above
  • The Snowflake Connector for Spark (provided by Snowflake). It can be downloaded directly from Maven under the spark-snowflake_2.11 artifact ID (https://search.maven.org/classic/#search%7Cga%7C1%7Cg%3A%22net.snowflake%22). Make sure to choose the correct version of the connector for the version of Spark installed (only versions 2.2-2.4 are supported).

Note

The version of the Snowflake Connector JAR must be compatible with the version of the Snowflake JDBC driver. For example, an older version of the JDBC driver combined with the latest version of the connector may cause errors with confusing exception messages when running jobs. We recommend using the latest version of each JAR.

  • Copy both of these JAR files to the lib/java subdirectory of the DSS data directory (even if the JDBC driver has already been placed in lib/jdbc as described above)
  • Restart DSS
  • Check the “Use Spark native integration” box in the Snowflake connection settings. This checkbox can also be used to toggle the integration on and off without having to remove the JARs.

Limitations and known issues

Datasets

  • When creating the connection, you need to specify one database. Access to tables from other databases is not possible

Importing

  • When importing tables from a Snowflake connection, you’ll see all schemas from all databases. However, you can only import tables from the schemas of the same database as the one defined in the connection

Visual recipes

  • Sample/Filter: “contains” operator is not supported
  • Group: aggregating on booleans is not supported
  • Window: cumulative average is not supported (Snowflake limitation)

Coding recipes

  • Execution plans are not supported

Spark native integration

  • Writing to Snowflake datasets with column names that contain periods is not supported (Snowflake Connector for Spark limitation)
  • The Snowflake Connector for Spark only supports Spark versions 2.2, 2.3, and 2.4