SQL-based datasets

Data Science Studio can both read and write datasets in SQL databases.

You can:

  • Create datasets representing SQL tables (and read and write in them)
  • Create datasets representing the results of a SQL query (and read them)
  • Write recipes that create datasets using the results of a SQL query on existing SQL datasets. See SQL recipes for more information.

Supported databases

Data Science Studio has advanced support for the following databases:

  • MySQL
  • PostgreSQL 9.x
  • HP Vertica
  • EMC Greenplum
  • Amazon Redshift
  • Teradata
  • Oracle
  • Microsoft SQL Server

In addition, Data Science Studio can connect to any database that provides a JDBC driver.

Warning

For databases not listed here, some features might not work, especially writing to the database.

Defining a connection

Note

Before you try to connect to a database, make sure that the proper JDBC driver for it is installed. For information on how to install JDBC drivers, see Installing database drivers.

The first step in working with SQL databases is to create a connection to your database.

  • Go to the Administration > Connection page.
  • Click « New connection » and select your database type.
  • Enter the requested connection parameters.
  • Once the parameters are filled in, Data Science Studio automatically attempts to connect to the database, and gives you feedback.
  • Give a name to your connection, and save it.
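As an illustration of what the connection parameters describe, the sketch below assembles a JDBC URL from a host, port, and database name for two of the supported databases. The helper `jdbc_url` and the host and database names are hypothetical and are not part of Data Science Studio, which may build its connection string differently; only the `jdbc:postgresql://host:port/database` and `jdbc:mysql://host:port/database` URL shapes come from the respective drivers.

```python
# Hypothetical helper for illustration only; not a Data Science Studio API.
JDBC_PREFIXES = {
    "postgresql": "jdbc:postgresql",
    "mysql": "jdbc:mysql",
}


def jdbc_url(db_type: str, host: str, port: int, database: str) -> str:
    """Build a basic JDBC URL of the form prefix://host:port/database."""
    if db_type not in JDBC_PREFIXES:
        raise ValueError(f"No URL template known for database type: {db_type}")
    return f"{JDBC_PREFIXES[db_type]}://{host}:{port}/{database}"


# Example values are assumptions, not defaults taken from the product.
print(jdbc_url("postgresql", "db.example.com", 5432, "analytics"))
```

Other databases use different URL layouts, so a template per database type is needed rather than a single format string.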

External table datasets

SQL table datasets are the simplest form of interaction with SQL databases. To create an external SQL table dataset, you only need to choose the connection and the table. The content of the table then becomes a dataset.

  • Go to Datasets, click New > Your database type.
  • Select a connection.
  • Make sure the « Read a database table » radio button is selected.
  • Click on “Get tables list”.
  • DSS connects to your database and retrieves the available tables.
  • Select your table.
  • Click the “Test table” button.
  • DSS shows a preview of the contents.
  • You can now save your dataset.
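The steps above amount to listing the tables visible through a connection and then previewing one of them. A minimal sketch of the same idea, assuming an in-memory SQLite database as a stand-in for your own database; the functions `list_tables` and `preview_table` and the `customers` table are hypothetical and only mimic what the “Get tables list” and “Test table” buttons do.

```python
import sqlite3

# Stand-in database with one small table (assumed sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Linus")],
)


def list_tables(conn):
    """Return the names of the tables available through the connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


def preview_table(conn, table, limit=2):
    """Return (column names, first rows) of a table found in the catalog."""
    # Table names cannot be bound as parameters, so check the catalog first.
    if table not in list_tables(conn):
        raise ValueError(f"Unknown table: {table}")
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


print(list_tables(conn))
print(preview_table(conn, "customers"))
```

Note that the column names in the preview come from the database itself, not from anything typed by the user.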

Warning

You cannot edit the schema of an external SQL table dataset. The names of the columns are provided by the database engine.

On an external dataset, Data Science Studio prefers to trust the content of the data.

If you need to edit the names of the columns for further processing, you can, for example, use a data preparation recipe.

External query datasets

A SQL dataset can also be defined by a custom query. The results of the query become the rows of the dataset. This allows you to create a « virtual dataset », without having to materialize the rows (for example, if the query joins several tables).

A SQL query dataset is read-only. You cannot « write » to a SQL query.
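To make the idea of a query-defined dataset concrete, here is a minimal sketch, again assuming an in-memory SQLite database as a stand-in. The tables, the query, and the function `read_query_dataset` are hypothetical. The rows are computed by the join each time the query is read, and no result table is created.

```python
import sqlite3

# Stand-in database with two tables to join (assumed sample data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 30.0);
    """
)

# The custom query that defines the dataset.
QUERY = """
SELECT c.name AS customer, SUM(o.amount) AS total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY c.name
"""


def read_query_dataset(conn, query):
    """Run the query and return (column names, rows); nothing is written."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


columns, rows = read_query_dataset(conn, QUERY)
print(columns)
print(rows)
```

Because the result exists only as the output of the query, there is no table to write rows back into, which is why such a dataset can only be read.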

Note

Data Science Studio does not automatically test SQL queries, as they can be very expensive. You need to manually click the « Test query » button.

Managed SQL datasets

Managed datasets can be created on SQL databases. Only “table” datasets can be managed (it makes no sense to « write » to a SQL query dataset).

You can create a managed SQL dataset:

  • By clicking on the « Managed dataset » button in the New Dataset page.
  • By creating a new managed dataset as output of a recipe.

When you create a managed SQL dataset, you start by selecting the connection to which it is written. A table name is proposed automatically, based on the name of the SQL dataset; you can change it. A managed SQL dataset can target either an existing table or one that does not exist yet.

When you click the « Test » button, Data Science Studio checks if the table exists in the database:

  • If it does not exist, you can create it. Creating the table at this point is generally not required, because recipes that need it create it automatically if it does not exist.

  • If the table exists, Data Science Studio automatically checks its schema. If the schema of the table and the schema of the dataset do not match, Data Science Studio emits a warning and proposes the following fixes:
    • Drop the table (so it will be recreated with the dataset schema)
    • Override the dataset schema with the current schema of the table.
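The check described above can be sketched as follows, assuming an in-memory SQLite database as a stand-in. The functions `table_schema` and `check_managed_table`, the `sales` table, and the representation of a schema as a list of (column, type) pairs are all hypothetical; this is not how Data Science Studio is implemented, only an illustration of the three possible outcomes and the two proposed fixes.

```python
import sqlite3


def table_schema(conn, table):
    """Return [(column, declared type)] for a table, or None if it is absent."""
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    if not rows:
        return None
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2].upper()) for row in rows]


def check_managed_table(conn, table, dataset_schema):
    """Return 'missing', 'match', or 'mismatch' for the target table."""
    actual = table_schema(conn, table)
    if actual is None:
        return "missing"
    return "match" if actual == dataset_schema else "mismatch"


conn = sqlite3.connect(":memory:")
dataset_schema = [("id", "INTEGER"), ("name", "TEXT")]

# The table does not exist yet: it can be created now or later by a recipe.
print(check_managed_table(conn, "sales", dataset_schema))

# An existing table whose columns differ from the dataset schema.
conn.execute("CREATE TABLE sales (id INTEGER, label TEXT)")
print(check_managed_table(conn, "sales", dataset_schema))

# Fix 1 (not run here): drop the table so it is recreated from the dataset schema.
#   conn.execute('DROP TABLE "sales"')
# Fix 2: override the dataset schema with the current schema of the table.
dataset_schema = table_schema(conn, "sales")
print(check_managed_table(conn, "sales", dataset_schema))
```

The first fix discards the existing table and its rows, while the second keeps the table and changes the dataset definition instead, so the choice depends on which side holds the schema you want to keep.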

Partitioning

All SQL datasets can be partitioned. Details can be found in Partitioned SQL datasets.

Writing in SQL table datasets

SQL table datasets are writable.