Sync: Copying datasets

The “sync” recipe allows you to synchronize one dataset to another. Synchronization can be global or per-partition.

One major use case of the sync recipe is to copy a dataset between storage backends on which different computations are possible. For example:

  • copying a SQL dataset to HDFS to be able to perform Pig recipes on it
  • copying an HDFS dataset to Elasticsearch to be able to query it
  • copying a file dataset to a SQL database for efficient querying (a sketch of this case follows the list).
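As a conceptual illustration of the last case (copying a file dataset to a SQL database), the sketch below shows the kind of bulk load the sync recipe automates. It is not DSS code, and the file name, connection URL and table name are hypothetical.

    # Conceptual sketch only (not DSS code): bulk-load a CSV file into a SQL
    # table so it can be queried efficiently. File name, connection URL and
    # table name are hypothetical.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@localhost:5432/analytics")

    # Load the file dataset in chunks and append each chunk to the SQL table,
    # so large files do not need to fit in memory at once.
    for chunk in pd.read_csv("orders.csv", chunksize=50_000):
        chunk.to_sql("orders", engine, if_exists="append", index=False)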

Depending on the types and parameters of the input and output datasets, DSS automatically chooses between several engines to execute the synchronization.

Schema handling

By default, when you create the sync recipe, DSS also creates the output dataset. In that case, DSS automatically copies the schema of the input dataset to the output dataset.

If you modify the schema of the input dataset, you should go to the edit screen for the sync recipe and click the “Resync Schema” button.

Schema mismatch

Generally speaking, the Sync recipe allows the input and output schema to be different. In that case, DSS uses the names of the columns to perform the matching between the input and output columns.

Therefore, you can use the Sync recipe to remove or reorder some columns of a dataset. You cannot use the Sync recipe to “rename” a column. To do this, use a Preparation recipe.
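To make the name-based matching concrete, the sketch below (plain Python, not DSS internals) projects an input row onto an output schema by column name: columns can be dropped or reordered, but a renamed column simply finds no match.

    # Illustration only: project an input row onto an output schema by name.
    def project_row(row: dict, output_columns: list) -> dict:
        return {col: row.get(col) for col in output_columns}

    input_row = {"id": 1, "first_name": "Ada", "last_name": "Lovelace", "age": 36}

    # Dropping and reordering columns works:
    print(project_row(input_row, ["last_name", "id"]))
    # {'last_name': 'Lovelace', 'id': 1}

    # A renamed column ("surname") matches nothing and stays empty:
    print(project_row(input_row, ["surname", "id"]))
    # {'surname': None, 'id': 1}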

Partitioning handling

By default, when syncing a partitioned dataset, DSS creates the output dataset with the same partitioning and adds an “equals” dependency. You can also sync to a non-partitioned dataset or change the dependency.

When creating the recipe or clicking “Resync schema”, DSS automatically adds, if needed, the partitioning columns to the output schema.
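To illustrate what an “equals” dependency means (this is a simplified sketch, not the DSS dependency engine), the function mapping an output partition to the input partitions it reads is simply the identity; other dependency types, such as a sliding time range, would return several input partitions.

    # Simplified sketch of partition dependency functions (not DSS internals).
    from datetime import date, timedelta

    def equals_dependency(output_partition: str) -> list:
        # Building output partition "2024-06-01" reads input partition "2024-06-01".
        return [output_partition]

    def last_n_days_dependency(output_partition: str, n: int = 3) -> list:
        # A time-range-style dependency maps one output partition to several inputs.
        day = date.fromisoformat(output_partition)
        return [(day - timedelta(days=i)).isoformat() for i in range(n)]

    print(equals_dependency("2024-06-01"))       # ['2024-06-01']
    print(last_n_days_dependency("2024-06-01"))  # ['2024-06-01', '2024-05-31', '2024-05-30']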

Specific case: S3 to Redshift

Loading a Redshift database using SQL INSERT statements is inefficient, and should be avoided except for small datasets.

The recommended way to load data into a Redshift table is through a bulk COPY from files stored in Amazon S3. The Data Science Studio uses this optimal path for S3-to-Redshift sync recipes whenever possible.

Due to Redshift restrictions, the following set of conditions must be met for a sync recipe to be implemented through direct copy:

  • the source dataset must be stored on Amazon S3
  • the destination dataset must be stored on Amazon Redshift
  • the S3 bucket and the Redshift cluster must be in the same Amazon AWS region
  • the source dataset must be stored with a CSV format, using the UTF-8 charset
  • the “Unix” CSV quoting style is not supported
  • the “Escaping only” CSV quoting style is only supported if the quoting character is " (double quote)
  • the input files must all be stored uncompressed, or all stored using the GZip compression format

Additionally, the schema of the input dataset must match the schema of the Redshift dataset, and values stored in input fields must be valid with respect to the declared Redshift column type.

Note that if the dataset you want to load into Redshift does not natively fulfill the above conditions, an intermediate preparation recipe can be used to normalize it.
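The conditions above can be restated as a simple checklist. The sketch below is only an illustration: the descriptor dictionaries and their fields are hypothetical and do not correspond to actual DSS metadata.

    # Illustrative checklist of the direct-copy conditions (hypothetical fields).
    def direct_copy_possible(source: dict, destination: dict) -> bool:
        compressions = set(source.get("file_compressions", []))
        return (
            source.get("backend") == "S3"
            and destination.get("backend") == "Redshift"
            and source.get("region") == destination.get("region")
            and source.get("format") == "csv"
            and source.get("charset", "").lower() == "utf-8"
            and source.get("quoting_style") != "unix"
            and compressions in ({"none"}, {"gzip"})  # all plain or all GZip
        )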

When the above conditions are met, sync recipes are implemented by first saving a manifest file under a temporary directory in the “default path for managed datasets” of the EC2 connection corresponding to the input S3 dataset, then sending the appropriate COPY command to the Redshift database, which loads the files referenced in the manifest. For debugging and history purposes, the manifest file is not deleted when the copy completes. Past manifest files are normally small enough not to cause any space issue, and can be manually deleted at any time after the copy.
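The manifest-then-COPY sequence can be sketched outside DSS roughly as follows. This is not how DSS itself runs the copy; the bucket, paths, table name and IAM role are hypothetical, and credential handling is simplified.

    # Rough sketch of the manifest + COPY sequence (hypothetical names).
    import json
    import boto3
    import psycopg2

    # 1. Write a manifest listing the input files under a temporary S3 location.
    manifest = {
        "entries": [
            {"url": "s3://my-bucket/datasets/orders/part-0000.csv.gz", "mandatory": True},
            {"url": "s3://my-bucket/datasets/orders/part-0001.csv.gz", "mandatory": True},
        ]
    }
    boto3.client("s3").put_object(
        Bucket="my-bucket",
        Key="tmp/manifests/orders.manifest",
        Body=json.dumps(manifest).encode("utf-8"),
    )

    # 2. Ask Redshift to bulk-load every file referenced in the manifest.
    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="analytics", user="loader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY orders
            FROM 's3://my-bucket/tmp/manifests/orders.manifest'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            MANIFEST
            CSV
            GZIP;
        """)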

Additional details about the bulk loading process of a Redshift database from S3 files can be found in the Redshift documentation at http://docs.aws.amazon.com/redshift/latest/dg/t_Loading_tables_with_the_COPY_command.html and http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html.