SQL pipelines in DSS¶
Concept¶
In DSS, each recipe reads some datasets and writes some datasets. For example:
A grouping recipe will read the input dataset from storage, perform the grouping, and write the grouped dataset to its storage.
A SQL recipe will either:
If input and output connections are different, execute the SQL query in the source database, fetch results and write the results in the output database
If input and output connections are the same, transform the SQL query into an “INSERT … SELECT” query so that the data remains in the database
When using a chain of visual and/or code recipes, DSS executes each recipe independently. For example, if you have a grouping recipe followed by a prepare recipe then a join recipe then a grouping recipe, and so forth, each recipe will be executed independently, and the SQL engine will read and write the datasets in each recipe.
SQL pipelines are a Dataiku capability that combines several consecutive recipes (each using the same SQL engine) in a DSS workflow. These combined recipes can be either:
Visual recipes running on the SQL engine
SQL query recipes
These combined recipes then run as a single job activity, without writing intermediate datasets
Using a SQL pipeline can strongly boost performance by avoiding unnecessary writes and reads of intermediate datasets. SQL pipelines also allow you to optimize the data storage capacity without having to manually re-factor the Dataiku flow (for example, by reducing the number of datasets).
Using SQL pipelines¶
SQL pipelines are not enabled by default in DSS but can be enabled on a per-project basis.
Go to the project Settings
Go to Pipelines
Select Enable SQL pipelines
Save settings
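If you manage many projects, the same setting can also be toggled programmatically. The sketch below uses the Dataiku public Python API (dataikuapi); the field name holding the pipeline options in the project settings payload (“sqlPipelineSettings” / “enabled” below) is an assumption, so compare with get_raw() on a project that was configured through the UI before relying on it:

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MY_PROJECT")

    settings = project.get_settings()
    raw = settings.get_raw()

    # Hypothetical field names -- verify against a project where SQL pipelines
    # were enabled through the UI.
    raw.setdefault("sqlPipelineSettings", {})["enabled"] = True

    settings.save()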
Configuring behavior for intermediate datasets¶
A SQL pipeline covers one or more intermediate datasets that are part of the pipeline. For each of these intermediate datasets, you can configure whether it is virtualized or not.
When a dataset is virtualized, Dataiku will not write it at all while executing the pipeline.
If the dataset is not useful by itself, but is only required as an intermediate step to feed recipes down the Flow, virtualization improves performance by preventing DSS from writing the data of this intermediate dataset when executing the SQL pipeline.
Some intermediate datasets are however useful by themselves. For example, if the dataset is used by charts, enabling virtualization can prevent DSS from creating required charts, as the data needed to create the charts would not be available.
Although writing intermediate datasets reduces the performance gain of using a SQL pipeline, the pipeline still provides the benefit that the datasets do not have to be read again once they’ve been written. There are thus performance benefits to using pipelines even without virtualizing datasets, or only virtualizing some of them.
To enable virtualization for a dataset:
Open the dataset and go to the Settings tab at the top of the page
Go to the Advanced tab
Check “Virtualizable in build”
You can also enable virtualization for one or more datasets at once by performing these steps:
Select one or more datasets in the Flow
Locate the “Other actions” section in the right panel and select Allow build virtualization (for pipelines)
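When many datasets are involved, the same flag can be set from code. A minimal sketch with dataikuapi follows; the key name in the raw dataset definition (“virtualizable” below) and the dataset names are assumptions, so check get_raw() on a dataset where “Virtualizable in build” was already ticked in the UI to confirm the real field name:

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MY_PROJECT")

    # Hypothetical dataset names, used only for illustration
    for name in ["customers_prepared", "customers_joined"]:
        ds = project.get_dataset(name)
        settings = ds.get_settings()
        settings.get_raw()["virtualizable"] = True  # hypothetical field name
        settings.save()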
Configuring behavior for recipes¶
A SQL pipeline covers multiple recipes, and you can configure the behavior of the pipeline for each recipe.
Open the recipe and go to the Advanced tab at the top of the page
Check the options for “Pipelining”:
“Can this recipe be merged in an existing recipes pipeline?”
“Can this recipe be the target of a recipes pipeline?”
The first setting determines whether a recipe can be concatenated inside an existing SQL pipeline. The second setting determines whether running the recipe can trigger a new SQL pipeline.
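These two options can also be adjusted from code. The sketch below assumes the dataikuapi client; the two flag names are mapped from the UI labels and are assumptions, so verify them against the raw definition of a recipe that was configured in the UI:

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MY_PROJECT")

    recipe = project.get_recipe("compute_customers_joined")  # hypothetical recipe name
    settings = recipe.get_settings()
    raw = settings.get_recipe_raw_definition()

    raw["canMergeInPipeline"] = True    # hypothetical: "Can this recipe be merged..."
    raw["canBePipelineTarget"] = True   # hypothetical: "Can this recipe be the target..."

    settings.save()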
How SQL pipelines run¶
When you run a build job, the Dataiku Flow dependencies engine automatically detects if there are SQL pipelines based on the settings of the datasets and recipes. The engine then creates separate job activities for each of them.
The details of the SQL pipelines that have run can be visualized in the job results page. On the left part of the screen, “SQL pipeline (xx activities)” will appear, indicating how many recipes or recipe executions were merged together.
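The same behavior applies to builds triggered from the API: the dependencies engine decides on pipelining by itself, and nothing special is needed in the job request. A minimal sketch with dataikuapi, using a hypothetical output dataset name:

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MY_PROJECT")

    # Build a downstream dataset; if SQL pipelines are enabled, eligible upstream
    # recipes are merged into a single "SQL pipeline" job activity.
    job_builder = project.new_job("RECURSIVE_BUILD")
    job_builder.with_output("customers_by_segment")  # hypothetical dataset name
    job = job_builder.start_and_wait()

    # Inspect the job status / activities to see which recipes were merged
    print(job.get_status())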
Supported databases¶
The SQL pipelines feature is supported for databases that are compatible with SQL views. These include:
Snowflake
Databricks
Redshift
BigQuery
Synapse
SQL Server
PostgreSQL
MySQL
Oracle
Greenplum
Teradata
Vertica
Views management¶
DSS uses temporary SQL views to represent virtualized datasets in a SQL pipeline.
During the execution of the pipeline, DSS references the view (instead of the table backing the virtualized dataset) by using the view name. The view name contains these components:
[tableName]: name of the table from which the view is derived
[partitionID]: ID of the partition corresponding to the view (if working with a partitioned SQL table dataset)
[randomString]: randomly-generated 5-character alphanumeric string
Using these components, views are named as follows:
prefix: DSSVIEW_
middle: [tableName]_[partitionID]
suffix: _[randomString]
Because some databases have strict limits on the length of view names, the middle characters in the view name [tableName]_[partitionID] may be truncated to ensure that the prefix and suffix fit.
DSS automatically cleans up all temporary views at the end of a pipeline. In rare cases, however, views may be left behind. DSS includes a macro, available alongside the other macros under the name “Drop pipeline views”, that attempts to clean up any leftover views; you must have full DSS Administrator privileges to run it. If you run into a pipeline execution error linked to leftover views, you can use the naming convention DSSVIEW_[tableName]_[partitionID]_[randomString] to find the views that DSS created and drop them manually.
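If you need to locate leftover views by hand, a name-based catalog query is usually enough. The sketch below assumes a PostgreSQL-style information_schema and a hypothetical connection name, and runs the query through the dataikuapi client; adapt the catalog query to your database, and prefer the “Drop pipeline views” macro when it is available:

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

    # PostgreSQL-style catalog lookup; other databases expose views differently
    query = """
        SELECT table_schema, table_name
        FROM information_schema.views
        WHERE table_name ILIKE 'DSSVIEW%'
    """
    result = client.sql_query(query, connection="my_postgres_connection")  # hypothetical connection name
    for row in result.iter_rows():
        print(row)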
Limitations¶
All SQL datasets in a pipeline must be part of the same database connection
The following are not supported:
SQL script recipes
SQL query recipes with multiple statements
TopN visual recipes with a “Remaining rows” output dataset
Pivot visual recipes with the “Recompute schema at each run” option enabled
Split visual recipes using “Dispatch percentiles of sorted data” or “Full random” mode
Generate features visual recipes
In some cases, even if a dataset is configured as virtualizable, DSS may still write it during the execution of the SQL pipeline. This happens when technical constraints on the dataset prevent it from being virtualized.
While the SQL pipeline feature is supported, you may encounter some edge cases. In that case, note that pipelines can still be disabled on a per-project basis, and Dataiku Support may request that you do so.