Amazon Redshift

Note

You might want to start with our detailed tutorial for your first steps with SQL databases in DSS. The rest of this page is primarily reference information for Redshift.

DSS supports the full range of features on Redshift:

  • Reading and writing datasets

  • Executing SQL recipes

  • Performing visual recipes in-database

  • Using live engine for charts

Setting up (Dataiku Custom or Dataiku Cloud Stacks)

Selecting the JDBC driver

There are two JDBC drivers for Redshift:

  • Redshift can use the PostgreSQL driver. This driver is pre-installed in DSS. You don’t need any further installation.

  • Redshift can use a dedicated Redshift driver

When setting up the connection, you can choose which driver to use.

The dedicated driver is required for the following capabilities:

  • Reading external tables (also known as “Redshift Spectrum”)

  • Using an IAM role for connecting

  • Reading or writing more than 2 billion records from/to a Redshift dataset (apart from using the In-database SQL engine)

Installing the dedicated driver

  • Download the JDBC driver from AWS

  • Create a new folder under DATA_DIR/lib/jdbc, such as DATA_DIR/lib/jdbc/redshift-dedicated

  • Copy the redshift-jdbc42-X.Y.Z.T.jar file to DATA_DIR/lib/jdbc/redshift-dedicated

  • Restart DSS

In the connection settings, set “Redshift driver (user-provided)” as “Driver to use”, and enter lib/jdbc/redshift-dedicated as “Driver jars directory”.

Setting up (Dataiku Online)

  • In your launchpad, select “Add Feature”, then “Redshift” (either read-only or read-write)

  • Fill the settings

Limitations

  • Per-user credentials are not supported

  • Using an IAM role for connecting is not supported

  • For read-write connections, a single output schema must be selected

  • Reading external tables (also known as Redshift Spectrum) is not supported

  • “Automatic fast-write” write to Redshift (see below for details) is not supported

Writing data into Redshift

Loading data into Redshift using the regular SQL “INSERT” or “COPY” statements is extremely inefficient (a few dozens of records per second) and should only be used for extremely small datasets.

The recommended way to load data into Redshift is through a bulk COPY from files stored in Amazon S3.

DSS can automatically use this fast load method. For that, you require a S3 connection. Then, in the settings of the Redshift connection:

  • Enable “Automatic fast-write”

  • In “Auto fast write connection”, enter the name of the S3 connection to use

  • In “Path in connection”, enter a relative path to the root of the S3 connection, such as “redshift-tmp”. This is a temporary path that will be used in order to put temporary upload files. This should not be a path containing datasets.

DSS will now automatically use the optimal S3-to-Redshift copy mechanism when executing a recipe that needs to load data “from the outside” into Redshift, such as a code recipe.

Note that when running visual recipes directly in-database, this does not apply, as the data does not move outside of the database.

Constraints on the S3 connection

In order for automatic fast-write to work, the following are needed:

  • The S3 bucket and the Redshift cluster must be in the same Amazon AWS region

Explicit sync from S3

In addition to the automatic fast-write that happens transparently each time a recipe must write into Redshift, the Sync recipe also has an explicit “S3 to Redshift” engine. This is faster than automatic fast-write because it does not copy to the temporary location in S3 first.

It will be used automatically if the following constraints are met:

  • The source dataset is stored on S3

  • The destination dataset is stored on Redshift

  • The S3 side is stored with a CSV format, using the UTF-8 charset

  • For the S3 side, the “Unix” CSV quoting style is not used

  • For the S3 side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

  • For the S3 side, the files must be all stored uncompressed, or all stored using the GZip compression format

In addition:

  • The S3 bucket and the Redshift cluster must be in the same Amazon AWS region

  • The schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Redshift column type.

Technical details about implementation

S3-to-Redshift sync and automatic fast-write are implemented by first saving a manifest file under a temporary directory in the “default path for managed datasets” of the EC2 connection corresponding to the input S3 dataset, then sending the appropriate COPY command to the Redshift database, causing it to load the files referenced in the manifest. Note that this manifest file is not deleted when the copy is complete, for debug and history purposes. Past manifest files are normally small enough not to cause any space issue, and can be manually deleted at any time after the copy.

Additional details about the bulk loading process of a Redshift database from S3 files can be found in the Redshift documentation at http://docs.aws.amazon.com/redshift/latest/dg/t_Loading_tables_with_the_COPY_command.html and http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html.

Unloading data from Redshift to S3

Unloading data from Redshift directly to DSS using JDBC is reasonably fast. However, if you need to unload data from Redshift to S3, the sync recipe has a “Redshift to S3” engine that implements a faster path.

In order to use Redshift to S3 sync, the following conditions are required:

  • The source dataset must be stored on Amazon Redshift

  • The destination dataset must be stored on Amazon S3

  • Buckets that mandate SSE-KMS encryption are not supported (only SSE-S3 is supported)

  • The S3 bucket and the Redshift cluster must be in the same Amazon AWS region

  • The S3 side must be stored with a CSV format, using the UTF-8 charset

  • For the S3 side, the “Unix” CSV quoting style is not supported

  • For the S3 side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

  • For the S3 side, the files must be all stored uncompressed, or all stored using the GZip compression format

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Redshift column type.

Reading external tables

DSS can read external tables (also known as “Redshift Spectrum”) starting in DSS 10.0.6

This capability requires using the Redshift driver

Controlling distribution and sort clauses

In Redshift, you can control:

  • How the rows are distributed between the nodes of the Redshift database

  • How the rows are sorted among one slice of the database

For each managed Redshift dataset, you can configure both distribution style and sort key in the dataset settings (Redshift settings)

For distribution, you can select:

  • “Auto”, which will assign an optimal distribution based on the size of the table data.

  • “Distribute evenly” which distributes rows evenly between nodes

  • “Copy table to all nodes”, which makes a copy of all rows in all nodes (use only for very small tables)

  • “Select a distribution key” to select manually a distribution column

For sort, you can choose between non sorted, or compound or interleaved sort. In the latter cases, you can select the sort columns.

For more details, please see the Redshift documentation.

When you create a new dataset, it is always created with “Auto” distribution and no sort key (it is not possible to set a default sort key at the connection level, because the columns depend on the dataset).

Limitations

  • The fast data load from S3 to Redshift ignores the timezone information of dates. Only dates expressed in UTC timezone are supported.