Amazon Redshift

DSS supports the full range of features on Redshift:

  • Reading and writing datasets
  • Executing SQL recipes
  • Performing visual recipes in-database
  • Using the live engine for charts
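
For example, reading and writing a Redshift dataset from a Python recipe works like for any other SQL dataset. The following is a minimal sketch using the dataiku Python API; the dataset names are hypothetical placeholders.

    # Minimal sketch: reading and writing a Redshift-backed dataset from a
    # DSS Python recipe. "my_redshift_input" and "my_redshift_output" are
    # hypothetical dataset names.
    import dataiku

    # Read the input dataset (backed by a Redshift table) into a DataFrame
    input_ds = dataiku.Dataset("my_redshift_input")
    df = input_ds.get_dataframe()

    # ... transform df ...

    # Write the result back. Note that, as explained below, writing to
    # Redshift through regular INSERT statements is slow for large data;
    # prefer writing to S3 and syncing into Redshift.
    output_ds = dataiku.Dataset("my_redshift_output")
    output_ds.write_with_schema(df)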

Note

You might want to start with our detailed tutorial for your first steps with SQL databases in DSS.

The rest of this page is reference information for Redshift.

Installing the JDBC driver

The Redshift driver is pre-installed in DSS. You don’t need any further installation.

Writing data into Redshift

Loading data into a Redshift database using the regular SQL “INSERT” or “COPY” statements is extremely inefficient (a few dozen records per second) and should only be used for very small datasets.

The recommended way to load data into a Redshift table is through a bulk COPY from files stored in Amazon S3. DSS will automatically use this optimal S3-to-Redshift copy mechanism when using a Sync recipe. For more information, see below.

The exception to this rule is when executing a recipe from a Redshift dataset to another Redshift dataset using the in-database (SQL) engine.

In other words:

  • You should never have a Flow with a recipe that writes from a non-Redshift, non-S3 source to a Redshift dataset.
  • S3-to-Redshift recipes should only be “Sync” recipes.
  • Redshift-to-S3 recipes should only be “Sync” recipes.
  • Redshift-to-Redshift recipes will be fast if and only if the “In-database (SQL)” engine is selected.

For example, if you have a table in Redshift and want to use a recipe that does not have an “In-database (SQL)” engine, you should instead use two steps:

  • A first Redshift-to-S3 recipe
  • An S3-to-Redshift sync recipe

Syncing to/from S3

Loading a Redshift database using SQL INSERT statements is inefficient, and should be avoided except for small datasets. The recommended way to load data into a Redshift table is through a bulk COPY from files stored in Amazon S3.

DSS uses this optimal path for S3-to-Redshift and Redshift-to-S3 sync recipes whenever possible.

Due to Redshift restrictions, the following set of conditions must be met for a sync recipe to be executed as direct copy:

  • S3 to Redshift:

    • The source dataset must be stored on Amazon S3
    • The destination dataset must be stored on Amazon Redshift
    • SSE-KMS and SSE-S3 encryption are supported
  • Redshift to S3:

    • The source dataset must be stored on Amazon Redshift
    • The destination dataset must be stored on Amazon S3
    • Buckets that mandate SSE-KMS encryption are not supported (only SSE-S3 is supported)
  • Both ways

    • The S3 bucket and the Redshift cluster must be in the same Amazon AWS region
    • The S3 side must be stored in CSV format, using the UTF-8 charset
    • For the S3 side, the “Unix” CSV quoting style is not supported
    • For the S3 side, the “Escaping only” CSV quoting style is only supported if the quoting character is \
    • For the S3 side, the files must be all stored uncompressed, or all stored using the GZip compression format

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Redshift column type.

Note that if the dataset you want to load into Redshift does not natively fulfill the above conditions, an intermediate preparation recipe can be used to normalize the input S3 dataset.

When the above conditions are met, S3-to-Redshift sync recipes are implemented by first saving a manifest file under a temporary directory in the “default path for managed datasets” of the EC2 connection corresponding to the input S3 dataset, then sending the appropriate COPY command to the Redshift database, causing it to load the files referenced in the manifest. Note that this manifest file is not deleted when the copy is complete, for debugging and history purposes. Past manifest files are normally small enough not to cause any space issues, and can be manually deleted at any time after the copy.
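
To make this mechanism concrete, here is a rough sketch of the manifest-plus-COPY pattern, written outside of DSS with boto3 and psycopg2. The bucket, paths, table, IAM role and connection parameters are placeholders, and the exact manifest location and COPY options generated by DSS may differ.

    # Illustrative sketch of a manifest-based bulk load into Redshift.
    # All names, paths and credentials below are placeholders.
    import json
    import boto3
    import psycopg2

    # 1. Write a manifest file listing the CSV files to load
    manifest = {
        "entries": [
            {"url": "s3://my-bucket/managed/out.part0.csv.gz", "mandatory": True},
            {"url": "s3://my-bucket/managed/out.part1.csv.gz", "mandatory": True},
        ]
    }
    boto3.client("s3").put_object(
        Bucket="my-bucket",
        Key="managed/tmp/load.manifest",
        Body=json.dumps(manifest).encode("utf-8"),
    )

    # 2. Send a COPY command referencing the manifest to Redshift
    copy_sql = """
        COPY my_schema.my_table
        FROM 's3://my-bucket/managed/tmp/load.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
        MANIFEST
        FORMAT AS CSV
        GZIP;
    """
    with psycopg2.connect(host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
                          port=5439, dbname="mydb",
                          user="myuser", password="...") as conn:
        with conn.cursor() as cur:
            cur.execute(copy_sql)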

Additional details about the bulk loading process of a Redshift database from S3 files can be found in the Redshift documentation at http://docs.aws.amazon.com/redshift/latest/dg/t_Loading_tables_with_the_COPY_command.html and http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html.

Controlling distribution and sort clauses

In Redshift, you can control:

  • How the rows are distributed between the nodes of the Redshift database
  • How the rows are sorted within each slice of the database

For each managed Redshift dataset, you can configure both the distribution style and the sort key in the dataset settings (Redshift settings).

For distribution, you can select:

  • “Auto”, which will assign an optimal distribution based on the size of the table data
  • “Distribute evenly”, which distributes rows evenly between nodes
  • “Copy table to all nodes”, which makes a copy of all rows on all nodes (use only for very small tables)
  • “Select a distribution key”, to manually select a distribution column

For sorting, you can choose between no sort, a compound sort, or an interleaved sort. In the latter two cases, you can select the sort columns.

For more details, please see the Redshift documentation.

When you create a new dataset, it is always created with “Auto” distribution and no sort key (it is not possible to set a default sort key at the connection level, because the columns depend on the dataset).
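
For reference, these settings correspond to standard clauses of the Redshift CREATE TABLE statement. The sketch below is purely illustrative, with hypothetical table and column names; DSS generates the actual DDL when it creates the managed dataset.

    # Illustration of the Redshift DDL clauses the settings above map to.
    # Table and column names are hypothetical.
    create_sql = """
        CREATE TABLE my_schema.events (
            event_id  BIGINT,
            user_id   BIGINT,
            event_ts  TIMESTAMP,
            payload   VARCHAR(4096)
        )
        DISTSTYLE KEY                          -- "Select a distribution key"
        DISTKEY (user_id)
        COMPOUND SORTKEY (user_id, event_ts);  -- compound sort on two columns
    """
    # DISTSTYLE AUTO, EVEN and ALL correspond to the "Auto", "Distribute evenly"
    # and "Copy table to all nodes" options; INTERLEAVED SORTKEY (...) is the
    # interleaved alternative to COMPOUND SORTKEY.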

Limitations

  • DSS uses the PostgreSQL driver for connecting to Redshift. This driver limits the size of result sets to 2 billion records. You cannot read or write more than 2 billion records from/to a Redshift dataset (except when using the in-database SQL engine).
  • SSL support is not tested by Dataiku