Partitioned SQL recipes¶
This page deals with the specific case of partitioned datasets in SQL recipes. For general information about SQL recipes, see SQL recipes
There are specificities both for reading partitioned SQL datasets in SQL recipes, and writing to them.
In SQL recipes (both “query” and “script”), reading partitioned datasets require that you manually restrict what is being read in your query.
For example, if you have a dataset “inp1” partitioned by “country” (a “discrete values” partitioning dimension), you want to write a query like that:
SELECT col1, COUNT(*) AS count FROM inp1 WHERE country = 'the partition I want to read' GROUP BY col1;
The partition(s) that you want to read is determined by the partition dependencies system and should not be hard-coded in your recipe. Instead, you should use Partitioning variables substitutions. In our previous example, you would actually write your query as:
SELECT col1, COUNT(*) AS count FROM inp1 WHERE country = '$DKU_SRC_country' GROUP BY col1;
This case applies if:
- You are writing a SQL Query recipe.
- You are NOT writing to a SQL table dataset in the same connection
Remember (as explained in SQL recipes) that in that specific case, DSS retrieves the rows from the query and writes them in the output dataset. In that case, DSS “controls” how data is written and handles all partitioning issues. (see Recipes for partitioned datasets for more details).
As you are writing to a files-based partitioned dataset, you do not need to do anything specific in the SQL query about the partitioning values. DSS will write to the correct folder automatically.
This case applies if:
You are writing a SQL Query recipe.
You are writing to, either:
- a SQL table dataset
- another column-based-partitioning dataset (like Cassandra or MongoDB)
In this case, the partitioning column must appear in the output data. Note that it must appear in the SQL query at the correct position wrt. the output schema, and with the correct name.
So for example, if we reuse our previous example. We are going to write the result of the query to a dataset “out1”, which is partitioned by “country2”.
The schema of out1 looks like:
The query could look like:
SELECT country as country2, col1, COUNT(*) as countFROM inp1 WHERE country = '$DKU_SRC_country' GROUP BY col1;
However, this only works in the case where we have a “equals” dependency, because we are actually writing the country of the input dataset as the country2 of the output dataset.
If we wanted to write in another partition, it would not work. To do this properly, we need to write the following:
SELECT '$DKU_DST_country2' as country2, col1, COUNT(*) as countFROM inp1 WHERE country = '$DKU_SRC_country' GROUP BY col1;
The DKU_DST variable is replaced by the value of the “country2” dimension for the output dataset. For more information, see Partitioning variables substitutions.
When you write with a SQL script recipe, you are responsible for:
- ensuring idempotence
- inserting records with the correct partitioning values.
This generally involves:
- Performing a DELETE query with a restriction on the target partitioning value (or, if you are using native partitioning, using the correct database-specific commands to drop a partition)
- Making sure that inserted records have their partitioning column value set to the target partitioning value.
To help you, Data Science Studio provides you with many variables that you can substitute in your SQL script. See Partitioning variables substitutions.