Making relocatable managed datasets

When you create a managed dataset, you choose the connection in which it is created.

DSS automatically chooses the settings of this new managed dataset within its connection. For example, by default, if you create a managed dataset ds1 in a SQL connection myconn, DSS will configure ds1 with ds1 (i.e. the dataset name) as the table name. Similarly, DSS will use the name of the dataset in the path when creating a managed Filesystem or HDFS dataset.

It is good practice to make sure that the storage settings of managed datasets (SQL table names, paths) are relocatable. This means that we want them to have the following properties:

  • If we create two datasets with the same name in different projects, their storage settings don’t overlap.
  • If we duplicate a project within a DSS instance, the storage settings of the copy don’t overlap with those of the original project.
  • If we import a project into an existing DSS instance, the storage settings of the new project don’t overlap with those of existing projects.

The main instrument for ensuring relocatability of datasets is the use of variables within storage settings. Variables are defined at the project level, so they can take a different value in each project. In particular, the ${projectKey} variable is automatically set to the project key of the current project and is thus guaranteed to be different for each project.

For example, if the default path for a dataset named ds1 is configured to be /${projectKey}_ds1, then if this dataset is copied to another project, its storage path is guaranteed not to overlap with the original.
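
The effect of ${projectKey} can be illustrated with a small Python sketch. It is not DSS code: the expand helper is hypothetical and relies on Python's string.Template, whose ${name} placeholder syntax happens to match the one used here. The project keys P1 and P2 are example values.

```python
from string import Template

def expand(setting, variables):
    """Hypothetical helper: replace ${name} placeholders with variable values."""
    return Template(setting).substitute(variables)

# Default path from the example above, shared by every project.
default_path = "/${projectKey}_ds1"

# The same setting expands to a different path in each project.
path_p1 = expand(default_path, {"projectKey": "P1"})
path_p2 = expand(default_path, {"projectKey": "P2"})

print(path_p1)  # /P1_ds1
print(path_p2)  # /P2_ds1
```

Because project keys are unique within an instance, the two expanded paths can never be identical.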

Relocatability settings are configured in each connection. These settings only apply to newly created managed datasets. Once a dataset has been created, changes to the relocatability settings no longer affect it.

Relocation of SQL datasets

For SQL datasets, in the settings of the connection, you can configure (with variables):

  • For the table name, a prefix and a suffix to the dataset name
  • The database schema name

For example, with:

  • Schema: ${projectKey}
  • Table name prefix: ${myvar1}_
  • Table name suffix: _dss

If you go to project P1 (where myvar1 = a2) and create a managed dataset called ds1 in this connection, it will be stored in schema P1 and the table will be called a2_ds1_dss.
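
The naming rule in this example can be sketched in Python as follows. This is only an illustration, not how DSS is implemented: resolve_sql_location and its parameter names are hypothetical, and the sketch assumes that the table name is the plain concatenation of expanded prefix, dataset name and expanded suffix.

```python
from string import Template

def expand(setting, variables):
    """Hypothetical helper: replace ${name} placeholders with variable values."""
    return Template(setting).substitute(variables)

def resolve_sql_location(dataset_name, variables, schema, prefix, suffix):
    """Return (schema, table) for a new managed SQL dataset (illustrative only)."""
    table = expand(prefix, variables) + dataset_name + expand(suffix, variables)
    return expand(schema, variables), table

# Variables as seen from project P1, where myvar1 = a2.
p1_variables = {"projectKey": "P1", "myvar1": "a2"}

schema, table = resolve_sql_location(
    "ds1", p1_variables,
    schema="${projectKey}", prefix="${myvar1}_", suffix="_dss",
)
print(schema, table)  # P1 a2_ds1_dss
```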

Relocation of HDFS datasets

For HDFS datasets, in the settings of the connection, you can configure (with variables):

  • For the path (within the connection), a prefix and a suffix to the dataset name

  • For the associated Hive table (see DSS and Hive):
    • for the table name, a prefix and a suffix to the dataset name
    • the Hive database

For example, with:

  • Path prefix: ${projectKey}/
  • Path suffix: test
  • Table name prefix: ${myvar1}_
  • Table name suffix: _dss
  • Hive database: ${projectKey}_dss

If you go to project P1 (where myvar1 = a2) and create a managed dataset called ds1 in this connection, it will be stored in path P1/ds1test (path prefix, dataset name, then path suffix) and the associated Hive table will be in database P1_dss with table name a2_ds1_dss.
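
As a rough Python sketch of the HDFS example, under the assumption that both the path and the Hive table name are plain concatenations of expanded prefix, dataset name and expanded suffix. The resolve_hdfs_location function and the keys of the settings dictionary are hypothetical names chosen for this illustration; they are not DSS identifiers.

```python
from string import Template

def expand(setting, variables):
    """Hypothetical helper: replace ${name} placeholders with variable values."""
    return Template(setting).substitute(variables)

def resolve_hdfs_location(dataset_name, variables, settings):
    """Return (path, hive_database, hive_table) for a new dataset (illustrative only)."""
    path = (expand(settings["path_prefix"], variables)
            + dataset_name
            + expand(settings["path_suffix"], variables))
    hive_table = (expand(settings["table_prefix"], variables)
                  + dataset_name
                  + expand(settings["table_suffix"], variables))
    return path, expand(settings["hive_database"], variables), hive_table

# Connection settings from the example above (key names are assumptions).
settings = {
    "path_prefix": "${projectKey}/",
    "path_suffix": "test",
    "table_prefix": "${myvar1}_",
    "table_suffix": "_dss",
    "hive_database": "${projectKey}_dss",
}

path, hive_db, hive_table = resolve_hdfs_location(
    "ds1", {"projectKey": "P1", "myvar1": "a2"}, settings
)
print(path)                 # P1/ds1test, under the concatenation assumption
print(hive_db, hive_table)  # P1_dss a2_ds1_dss
```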