# Snowflake¶

Note

You might want to start with our detailed tutorial for your first steps with SQL databases in DSS.

DSS supports the full range of features on Snowflake:

• Executing SQL recipes
• Performing visual recipes in-database
• Using live engine for charts

## Installing the JDBC driver¶

The driver consists of a single JAR file, snowflake-jdbc-VERSION.jar.

To install:

• Copy this JAR file to the lib/jdbc subdirectory of the DSS data directory
• Restart DSS

## Writing data into Snowflake¶

Loading data into a Snowflake database using the regular SQL INSERT or COPY statements is very inefficient and should only be used for extremely small datasets.

The recommended way to load data into a Snowflake table is through a bulk COPY from files stored in Amazon S3 or Azure Blob Storage. DSS will automatically use optimal S3-to-Snowflake or AzureBlob-to-Snowflake copy mechanisms when using a Sync recipe. For more information, see below.

The exception to this rule is when executing a recipe from a Snowflake dataset to a Snowflake dataset using the in-database (SQL) engine.

In other words:

• You should never have a Flow with a recipe that writes from a non-Snowflake source to a Snowflake dataset, except sync recipes from S3 or Azure Blob Storage
• The only recipe that should write from S3 or Azure Blob Storage to Snowflake is the “Sync” recipe
• Snowflake to Snowflake recipes will be fast if and only if the “In-database (SQL)” engine is selected.

For example, if you have a table in Snowflake and want to use a recipe that does not have an “In-database (SQL)” engine, you should instead use two steps:

• A first Snowflake-to-cloud-storage (S3 or Azure Blob) recipe
• A cloud-storage-to-Snowflake Sync recipe

## Syncing to/from S3¶

When running a Sync recipe where the source is S3 and the target is Snowflake, DSS will automatically use a bulk COPY from files stored in Amazon S3, which is the efficient way to load data into Snowflake.
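Under the hood, this direct copy corresponds to a Snowflake bulk COPY statement. A minimal sketch of what such a statement looks like (the table, bucket path, and credential names are hypothetical; DSS generates the actual statement automatically):

```sql
-- Hypothetical names: DSS generates an equivalent statement for you
COPY INTO my_schema.my_table
FROM 's3://my-bucket/path/to/csv/files/'
CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
FILE_FORMAT = (TYPE = CSV
               FIELD_DELIMITER = ','
               SKIP_HEADER = 1
               FIELD_OPTIONALLY_ENCLOSED_BY = '"');
```

The CSV-related conditions listed below exist because the files on S3 must be parseable by Snowflake's CSV file format options, as in this statement.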

Due to Snowflake restrictions, the following set of conditions must be met for a Sync recipe to be executed as direct copy:

• The source dataset must be stored on Amazon S3 and the destination dataset must be stored on Snowflake, or vice-versa
• The S3 bucket and the Snowflake cluster must be in the same Amazon AWS region
• The S3 side must be stored with a CSV format, using the UTF-8 charset
• The CSV separator must be a simple ASCII character
• For the S3 side, the “Unix” CSV quoting style is not supported
• For the S3 side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Snowflake column type.

Note that an intermediate Prepare recipe can be used to normalize an input S3 dataset to the above conditions, if the dataset one wants to load into Snowflake does not natively fulfill them.

## Syncing to/from Azure Blob Storage¶

When running a Sync recipe where the source is Azure Blob Storage and the target is Snowflake, DSS will automatically use a bulk COPY from files stored in Azure Blob Storage, which is the efficient way to load data into Snowflake.

Due to Snowflake restrictions, the following set of conditions must be met for a Sync recipe to be executed as direct copy:

• The source dataset must be stored on Azure Blob Storage and the destination dataset must be stored on Snowflake, or vice-versa
• The storage account and the Snowflake cluster must be in the same Azure region
• The blob storage side must be stored with a CSV format, using the UTF-8 charset
• The CSV separator must be a simple ASCII character
• For the blob storage side, the “Unix” CSV quoting style is not supported
• For the blob storage side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Snowflake column type.

Note that an intermediate Prepare recipe can be used to normalize an input Azure Blob Storage dataset to the above conditions, if the dataset one wants to load into Snowflake does not natively fulfill them.

### Installation and configuration¶

In order to take advantage of the native integration, generate a SAS Token, then save the token in the Azure Blob Storage connection settings in the “SAS Token” field.
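With the SAS token configured, the direct copy DSS performs corresponds to a Snowflake bulk COPY from an azure:// location. A hedged sketch with hypothetical account, container, and table names (DSS generates the actual statement automatically):

```sql
-- Hypothetical names: DSS generates an equivalent statement for you
COPY INTO my_schema.my_table
FROM 'azure://myaccount.blob.core.windows.net/mycontainer/path/'
CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);
```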

## Spark native integration¶

The native integration with Spark allows Spark recipes reading from and/or writing to Snowflake datasets to directly exchange data with a Snowflake database. This results in a large increase in performance compared to the default method where data read from or written to Snowflake must be streamed through DSS first.

All Spark recipes which have a Snowflake dataset as an input or an output will automatically take advantage of the native integration with Snowflake once the installation is completed.

### Installation and configuration¶

Two external JAR files need to be installed in DSS:

• The same Snowflake JDBC driver mentioned above
• The Snowflake Connector for Spark (provided by Snowflake). It can be downloaded directly from Maven under the spark-snowflake_2.11 artifact ID (https://search.maven.org/classic/#search%7Cga%7C1%7Cg%3A%22net.snowflake%22). Make sure to choose the correct version of the connector for the version of Spark installed (only versions 2.2-2.4 are supported).

Note

The version of the Snowflake Connector JAR must be compatible with the version of the Snowflake JDBC driver. For example, an older version of the JDBC driver combined with the latest version of the connector may cause errors with confusing exception messages when running jobs. We recommend using the latest version of each JAR.

• Copy both of these JAR files to the lib/java subdirectory of the DSS data directory (even if the JDBC driver has already been placed in lib/jdbc as described above)
• Restart DSS
• Check the “Use Spark native integration” box in the Snowflake connection settings. This checkbox can also be used to toggle the integration on and off without having to remove the JARs.

## Limitations and known issues¶

### Datasets¶

• On DSS 8.0.2 and below, when creating the connection, you need to specify one database. Access to tables from other databases is not possible.

### Importing¶

• On DSS 8.0.2 and below, when importing tables from a Snowflake connection, you’ll see all schemas from all databases. However, you can only import tables from the schemas of the same database as the one defined in the connection

### Visual recipes¶

• Sample/Filter: “contains” operator is not supported
• Group: aggregating on booleans is not supported
• Window: cumulative average is not supported (Snowflake limitation)

### Coding recipes¶

• Execution plans are not supported

### Spark native integration¶

• Writing to Snowflake datasets with column names that contain periods is not supported (Snowflake Connector for Spark limitation)
• The Snowflake Connector for Spark only supports Spark versions 2.2, 2.3, and 2.4

### Breaking changes¶

8.0.2 -> 8.0.3:

• All columns of type TIMESTAMP_NTZ are now stored in DSS as String, unless users have checked the option Read SQL timestamp without timezone as DSS dates
• When the Snowflake session parameter TIMESTAMP_TYPE_MAPPING maps TIMESTAMP to TIMESTAMP_NTZ, all columns of type TIMESTAMP are now stored in DSS as String, unless users have checked the option Read SQL timestamp without timezone as DSS dates
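To determine which of the two cases above applies to a connection, you can inspect and, if needed, change the session parameter in any Snowflake SQL client; a sketch:

```sql
-- Show how this session currently maps the TIMESTAMP alias
SHOW PARAMETERS LIKE 'TIMESTAMP_TYPE_MAPPING' IN SESSION;

-- If needed, remap TIMESTAMP to a timezone-aware type for the session
ALTER SESSION SET TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ';
```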