Making relocatable managed datasets
When you create a managed dataset, you choose the connection in which to create it.
DSS automatically chooses the settings of this new managed dataset within its connection. For example, by default, if you create a managed dataset ds1 in a SQL connection myconn, DSS will configure ds1 with ds1 (i.e. the dataset name) as the table name. Similarly, DSS will use the name of the dataset in the path when creating a managed Filesystem or HDFS dataset.
It is good practice to make sure that the storage settings of managed datasets (SQL table names, paths) are relocatable. This means that they should have the following properties:
If we create two datasets with the same name in different projects, their storage settings don’t overlap.
If we duplicate a project within a DSS instance, the storage settings of the copy don’t overlap with those of the original.
If we import a project into an existing DSS instance, the storage settings of the new project don’t overlap with those of existing projects.
The main instrument for ensuring relocatability of datasets is the use of variables within storage settings. Variables are defined at the project level and can take a different value in each project. In particular, the ${projectKey} variable is automatically defined as the project key of the current project and is thus guaranteed to be different for each project.
For example, if the default path for a dataset named ds1 is configured to be /${projectKey}_ds1, it guarantees that if this dataset is copied to another project, its storage path won’t overlap.
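As an illustration only, the expansion of the /${projectKey}_ds1 example can be sketched in Python. This is not how DSS itself expands variables: the sketch relies on Python's string.Template, which happens to use the same ${name} syntax, and the function name expand_path and the second project key P2 are hypothetical.

```python
from string import Template

# Path template taken from the example in the text.
PATH_TEMPLATE = Template("/${projectKey}_ds1")


def expand_path(project_key: str) -> str:
    """Substitute a project key into the path template (illustrative helper)."""
    return PATH_TEMPLATE.substitute(projectKey=project_key)


# P1 stands for the original project, P2 for a hypothetical copy of it.
original = expand_path("P1")
copy = expand_path("P2")

print(original)  # /P1_ds1
print(copy)      # /P2_ds1
```

Because the project key is part of the expanded path, the same dataset name yields a distinct path in each project.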
Relocatability settings are configured in each connection. These settings only apply to newly created managed datasets. Once a dataset has been created, changes to the relocatability settings no longer affect it.
Relocation of SQL datasets
For SQL datasets, in the settings of the connection, you can configure (with variables):
For the table name, a prefix and a suffix to the dataset name
The database schema name
For example, with:
Schema: ${projectKey} (please note: DSS can’t create missing schemas, so make sure the schemas are created beforehand)
Table name prefix: ${myvar1}_
Table name suffix: _dss
If you go to project P1 (where myvar1 = a2) and create a managed dataset called ds1 in this connection, it will be stored in schema P1 and the table will be called a2_ds1_dss.
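The naming rule in this example (expanded prefix, then dataset name, then expanded suffix, inside the expanded schema) can be sketched as follows. This is a minimal sketch under stated assumptions, not the DSS implementation: resolve_sql_location is a hypothetical helper, and Python's string.Template stands in for DSS variable expansion.

```python
from string import Template


def resolve_sql_location(dataset_name, variables, schema, prefix, suffix):
    """Return (schema, table) for a new managed SQL dataset (illustrative helper).

    `variables` maps variable names to their values for the current project.
    """

    def expand(text):
        # Replace every ${name} placeholder with its value.
        return Template(text).substitute(variables)

    table = expand(prefix) + dataset_name + expand(suffix)
    return expand(schema), table


# Connection settings and project variables from the example in the text.
schema, table = resolve_sql_location(
    dataset_name="ds1",
    variables={"projectKey": "P1", "myvar1": "a2"},
    schema="${projectKey}",
    prefix="${myvar1}_",
    suffix="_dss",
)

print(schema)  # P1
print(table)   # a2_ds1_dss
```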
Relocation of HDFS datasets
For HDFS datasets, in the settings of the connection, you can configure (with variables):
For the path (within the connection), a prefix and a suffix to the dataset name
For the associated Hive table (see Hive):
    For the table name, a prefix and a suffix to the dataset name
    The Hive database
For example, with:
Path prefix: ${projectKey}/
Path suffix: test
Table name prefix: ${myvar1}_
Table name suffix: _dss
Hive database: ${projectKey}_dss
If you go to project P1 (where myvar1 = a2) and create a managed dataset called ds1 in this connection, it will be stored in path P1/ds1test (path prefix, dataset name, then path suffix) and the associated Hive table will be in database P1_dss with table name a2_ds1_dss.