SQL recipes

For introductory information about SQL datasets, please see SQL-based datasets.

Data Science Studio allows you to build datasets by executing SQL statements. Two variants are provided:

SQL “query” recipe
  • Takes as input one or several SQL datasets and produces a single output dataset
  • Executes a SELECT query
  • Results of the SQL query are written in the target dataset (can be SQL or non-SQL)
  • Schema of the output dataset is automatically inferred from the columns returned by the query

The “query” recipe only allows you to provide a single SELECT.

In some cases, you want to write more complex SQL “scripts” instead of simple queries. For example:

  • Using temporary tables. Your script will contain CREATE, SELECT and INSERT statements
  • Defining and using stored procedures. Your script will contain the procedure declaration and a query, and the procedure can perform inserts of its own

SQL “script” recipe

  • Takes as input one or several SQL datasets
  • Takes as output one or several SQL datasets
  • You must write the INSERT yourself
  • You must manage the schema of the target datasets

SQL query recipe

To write a SQL query recipe:

  • Create the recipe
  • Select the input dataset(s). All input datasets must be SQL table datasets (either external or managed), and must all be in the same database connection.
  • Select or create the output dataset. At this point, it’s recommended that you leave the schema on “Define manually”, as the SQL query recipe can automatically manage the schema of the output dataset.
  • Save and start writing your SQL query. Your query must consist of one top-level SELECT statement.
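
For illustration, a valid query body could look like the following (the table and column names here are purely hypothetical):

    SELECT customer_id,
           COUNT(*) AS nb_orders,
           SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id

The columns of the output dataset (customer_id, nb_orders, total_amount) are then inferred from the columns returned by this query.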

Note

You cannot write a SQL recipe based on a “SQL query” dataset, as it would be meaningless. Only “SQL table” datasets are supported.

Testing and schema handling

At any point, you can click the Validate button. This does the following:

  • Check that the query is valid (syntactically and semantically)
  • Fetch the names and types of columns created by the query
  • Compare them to the schema of the output dataset.

If the schemas don’t match (which will always be the case when you validate for the first time), DSS will explain the incompatibilities and propose to automatically adjust the output dataset schema.

You also get details if there is a discrepancy.

In this case, the name of the second column has changed. It was “b” in the schema, but the second column of the query is now named “count”. You can either fix the query, or click the “overwrite output schema” button to replace the schema of the dataset with the schema of the query output.
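
For example, such a discrepancy could come from a query of the following shape (the table name is hypothetical), whose second column is named “count” rather than “b”:

    SELECT a,
           COUNT(*) AS count
    FROM   some_table
    GROUP  BY a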

Note that if you overwrite the output schema while the output already contains data, the existing data may become invalid. Go to the dataset edition page to fix up the schema or remove the existing data.

Note

The Validate button does not execute the query; it only asks the database to parse it. As such, validation is always cheap, regardless of the complexity of the query and the size of the database.

Previewing results

Clicking the “Display first rows” button executes the query and displays the first rows. If the query is complex, this test can be costly.

Execution method

When the output dataset is a SQL table and is in the same connection as the input datasets, DSS will execute the query fully in the target database. DSS automatically rewrites your SELECT query into an “INSERT INTO ... SELECT” statement.
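
As a rough sketch (the table names are hypothetical, and the exact statement generated by DSS may differ), a recipe query such as:

    SELECT customer_id, SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id

would be executed in-database along the lines of:

    INSERT INTO output_table (customer_id, total_amount)
    SELECT customer_id, SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id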

In other cases, DSS will stream the SELECT results from the source database to the DSS server and write them back to the target dataset.

SQL script recipe

To write a SQL script recipe:

  • Create the recipe
  • Select the input dataset(s). All input datasets must be SQL table datasets (either external or managed), and must all be in the same database connection.
  • Select or create the output dataset(s). All output datasets must be SQL table datasets and must all be in the same database connection as the input datasets.
  • Save and start writing your SQL script. The script must perform the insertions in the output tables.

In a SQL script recipe, Data Science Studio cannot perform the same level of query analysis as in a SQL query recipe. Therefore, there is no “Display first rows” button, and the Validate button only checks the validity of the configuration.

Only running the recipe will actually execute the SQL script.

Recipes in Data Science Studio should generally be idempotent (i.e., running them several times should not impact the results). Therefore, you should always have a TRUNCATE or DELETE statement in your SQL script.
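
As a minimal sketch of such a script (the table and column names are hypothetical), the usual shape is to empty the output table before repopulating it:

    -- Make the recipe idempotent: empty the output table first
    TRUNCATE TABLE customer_totals;

    -- Then repopulate it from the input table
    INSERT INTO customer_totals (customer_id, total_amount)
    SELECT customer_id, SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id;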

Note

The previous recommendation does not apply in exactly the same way to partitioned SQL recipes. See Partitioned SQL recipes for more details.

../_images/sql-script-example-1.png

Managing schema or creating tables

In a SQL script recipe, Data Science Studio cannot automatically detect the schema of the output datasets. The “standard” workflow is therefore to:

  • Create the output datasets
  • Write their schema in the datasets settings page
  • In the datasets settings page, click on “Create table” to have DSS create the table with the defined schema.
  • Write the SQL script with INSERT statements.

This way, discrepancies in schema are detected by the database when the query is run.

However, manually writing the schema for a non-trivial table is a cumbersome task. There is a more efficient way: since the SQL script recipe gives you full SQL capabilities, you can also issue DROP TABLE / CREATE TABLE IF NOT EXISTS statements in your SQL scripts.

This way, the “schema” of the output table is only in the SQL script.
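
A sketch of such a script, with hypothetical table and column names (the exact DDL syntax depends on your database):

    -- The table definition lives only in the script
    DROP TABLE IF EXISTS customer_totals;
    CREATE TABLE IF NOT EXISTS customer_totals (
        customer_id  INTEGER,
        total_amount NUMERIC
    );

    -- Populate the freshly created table
    INSERT INTO customer_totals (customer_id, total_amount)
    SELECT customer_id, SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id;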

After running the recipe for the first time, you then need to tell DSS about the schema of the table.

  • Click on the “Edit schema” button next to the dataset
../_images/sql-script-edit-schema.png
  • Click on “Check now”
  • DSS will tell you that the current schema of the dataset does not match the table
  • Click “Reload schema from table”
  • DSS fills the schema of the dataset, which is now consistent.
../_images/sql-script-check-consistency-1.png ../_images/sql-script-check-consistency-2.png

If you later modify the CREATE TABLE statement, don’t forget to go back to the dataset edition page to fix up the schema. Failing to do so could prevent the table from being used properly in a later step of the Flow.

Other use cases for SQL scripts

In addition to the use cases already covered (using stored procedures, using temporary tables), there are other use cases for SQL scripts:

  • When you manipulate a data type that is not natively handled by DSS, for instance the PostGIS geometry types. With a SQL query recipe, DSS would write the output columns as “varchar”, losing the ability to perform geospatial manipulations.
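
As a sketch of this case (assuming a PostgreSQL connection with PostGIS installed; the table and column names are hypothetical), a SQL script recipe can keep the native geometry type end to end:

    DROP TABLE IF EXISTS store_locations_out;
    CREATE TABLE store_locations_out (
        store_id INTEGER,
        -- Native PostGIS type, preserved because the script writes the table itself
        location GEOMETRY(Point, 4326)
    );

    INSERT INTO store_locations_out (store_id, location)
    SELECT store_id,
           ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
    FROM   store_locations_in;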