Data Science Studio can both read and write datasets in SQL databases.
You can:
- Create datasets representing SQL tables (and read from and write to them)
- Create datasets representing the results of a SQL query (and read them)
- Write code recipes that create datasets from the results of a SQL query on existing SQL datasets. See SQL recipes for more information.
In addition, on most supported databases, DSS is able to:
- Execute visual recipes directly in-database (i.e., for a visual recipe whose input and output are both in the database, the data never leaves the database)
- Execute Data Visualization directly in-database
For more information on the range of support for each of these features, please refer to the detailed page for your specific database below.
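To illustrate what "in-database" means, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a real SQL database. The `run_in_database` helper and the table names are hypothetical and are not part of DSS; the point is only that the output table is built by a single SQL statement, so no rows are fetched by the client.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a real SQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 4.0), ("north", 2.5)],
)


def run_in_database(conn, output_table, select_sql):
    """Materialize the result of select_sql as output_table inside the database.

    Hypothetical helper: the database engine reads the input and writes the
    output itself, so the rows never travel to the Python process.
    """
    conn.execute(f'DROP TABLE IF EXISTS "{output_table}"')
    conn.execute(f'CREATE TABLE "{output_table}" AS {select_sql}')


run_in_database(
    conn,
    "sales_by_region",
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
)

# Only now do we fetch rows, and only to inspect the result.
result = sorted(conn.execute("SELECT region, total FROM sales_by_region"))
print(result)
```

The alternative, reading all input rows into the client, transforming them there, and writing them back, gives the same result but moves every row across the network twice.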
Data Science Studio has advanced support for the following databases:
- PostgreSQL 9.x
- HP Vertica
- EMC Greenplum
- Amazon Redshift
- Microsoft SQL Server
- SAP HANA
- IBM Netezza
- Google BigQuery
In addition, Data Science Studio can connect to any database that provides a JDBC driver.
For databases not listed here, some features might not work, especially writing to the database.
Before you try to connect to a database, make sure that the proper JDBC driver for it is installed. For information on how to install JDBC drivers, see Installing database drivers.
The first step in working with SQL databases is to create a connection to your SQL database.
- Go to the Administration > Connection page.
- Click “New connection” and select your database type.
- Enter the requested connection parameters. For more information about the required parameters, see Connecting to data.
- Enter a name for your connection.
- Once the parameters are filled in, Data Science Studio automatically attempts to connect to the database, and gives you feedback on whether the attempt was successful.
- Save your connection.
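The automatic connection test in the steps above can be pictured with the following minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for a JDBC connection, and `check_connection` is a hypothetical helper, not a DSS function: it opens a connection, runs a trivial query, and reports whether the attempt succeeded.

```python
import sqlite3
from typing import Tuple


def check_connection(database_path: str) -> Tuple[bool, str]:
    """Try to connect and run a trivial query; return (success, feedback).

    Hypothetical helper for illustration. SQLite stands in for the real
    database, so the only "connection parameter" is a file path.
    """
    try:
        conn = sqlite3.connect(database_path)
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return False, f"Connection failed: {exc}"
    return True, "Connection successful"


print(check_connection(":memory:"))
print(check_connection("/nonexistent-directory-for-sketch/db.sqlite"))
```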
SQL table datasets are the simplest form of interaction with SQL databases. To create an external SQL table dataset, you only need to choose the connection and the table. The content of the table then becomes a dataset.
- Go to Datasets and click New > your database type.
- Select a connection.
- Make sure the “Read a database table” radio button is selected.
- Click “Get tables list”.
- DSS connects to your database and retrieves the available tables.
- Select your table.
- Click the “Test table” button.
- DSS shows a preview of the contents.
- You can now save your dataset.
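The "Get tables list" and "Test table" steps above roughly correspond to listing the tables of a connection and reading the first few rows of one of them. A minimal sketch, assuming SQLite (via Python's built-in `sqlite3`) as a stand-in database; `list_tables` and `preview_table` are hypothetical names, and other databases expose their catalog differently:

```python
import sqlite3


def list_tables(conn):
    """Return the names of the tables available on this connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def preview_table(conn, table, limit=5):
    """Return (column names, first rows) for a table that exists."""
    if table not in list_tables(conn):
        raise ValueError(f"Unknown table: {table}")
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


# Demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)

print(list_tables(conn))
print(preview_table(conn, "customers"))
```

Note that the column names in the preview come from the database itself, not from the caller.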
You cannot edit the schema of an external SQL table dataset. The names of the columns are provided by the database engine.
On an external dataset, Data Science Studio prefers to trust what the data source itself reports.
If you need to edit the names of the columns for further processing, you can for example use a data preparation recipe.
When creating an external MySQL table dataset, upcast columns with unsigned integer types in the dataset schema so that DSS’s representation covers the full range of values in these columns: use ‘smallint’ for ‘tinyint unsigned’, ‘int’ for ‘smallint unsigned’, and ‘bigint’ for ‘int unsigned’. Because MySQL silently casts unsigned values to signed ones in queries, and DSS treats integer types as signed, it is best to avoid unsigned integers altogether.
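The upcast rule above can be written down as a small lookup table. This is a sketch only: `upcast_type` is a hypothetical helper, and it does not handle declarations that carry a display width such as `int(10) unsigned`. The range constants are the standard MySQL integer ranges, and the final loop checks that each unsigned type fits in the signed type it is mapped to.

```python
# Mapping from MySQL unsigned types to the wider signed type to use in the
# dataset schema, as listed in the text above.
UNSIGNED_UPCAST = {
    "tinyint unsigned": "smallint",
    "smallint unsigned": "int",
    "int unsigned": "bigint",
}

# Largest value each type can hold.
UNSIGNED_MAX = {
    "tinyint unsigned": 2**8 - 1,
    "smallint unsigned": 2**16 - 1,
    "int unsigned": 2**32 - 1,
}
SIGNED_MAX = {
    "smallint": 2**15 - 1,
    "int": 2**31 - 1,
    "bigint": 2**63 - 1,
}


def upcast_type(mysql_type):
    """Return the schema type to use for a MySQL column type.

    Unsigned types listed in UNSIGNED_UPCAST are widened; any other type is
    returned unchanged (lowercased, with whitespace normalized).
    """
    normalized = " ".join(mysql_type.lower().split())
    return UNSIGNED_UPCAST.get(normalized, normalized)


# Every unsigned range fits in the signed type it is upcast to.
for unsigned_type, max_value in UNSIGNED_MAX.items():
    assert max_value <= SIGNED_MAX[upcast_type(unsigned_type)]

print(upcast_type("INT UNSIGNED"))
print(upcast_type("varchar"))
```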
A SQL dataset can also be defined by a custom query. The results of the query become the rows of the dataset. This allows you to create a “virtual dataset” without having to materialize the rows (for example, if the query joins several tables).
A SQL query dataset is read-only. You cannot “write” to a SQL query.
Data Science Studio does not automatically test SQL queries, as they can be very expensive. You need to manually click the “Test query” button.
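As an illustration of a query-defined, read-only dataset, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in database. The `QueryDataset` class is hypothetical and does not reflect how DSS is implemented: the query joins two tables, nothing runs until `test` or `rows` is called, and writing is rejected.

```python
import sqlite3


class QueryDataset:
    """Read-only "virtual dataset" whose rows are the results of a query."""

    def __init__(self, conn, query):
        # Storing the query does not execute it: queries can be expensive.
        self._conn = conn
        self._query = query

    def test(self, limit=10):
        """Run the query on demand; return column names and a few rows."""
        cursor = self._conn.execute(self._query)
        columns = [description[0] for description in cursor.description]
        return columns, cursor.fetchmany(limit)

    def rows(self):
        """Iterate over all result rows without materializing a table."""
        yield from self._conn.execute(self._query)

    def write(self, rows):
        raise NotImplementedError("A SQL query dataset is read-only")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5)],
)

dataset = QueryDataset(
    conn,
    "SELECT c.name AS customer, SUM(o.amount) AS total "
    "FROM orders o JOIN customers c ON c.id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name",
)
print(dataset.test())
```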
Managed datasets can be created on SQL databases. Only “table” datasets can be managed (it makes no sense to “write” to a SQL query dataset).
You can create a managed SQL dataset:
- By clicking on the “Managed dataset” button in the New Dataset page.
- By creating a new managed dataset as output of a recipe.
When you create a managed SQL dataset, you start by selecting the connection to which it is written. A table name is automatically selected based on the name of the SQL dataset. You can change it. A managed SQL dataset can target either an existing table or one that does not exist yet.
When you click the “Test” button, Data Science Studio checks whether the table exists in the database:
- If it does not exist, you can create it. Creating the table at this point is generally not mandatory, as the recipes that might require it automatically create it if it does not exist.
- If the table exists, Data Science Studio automatically checks its schema. If the schema of the table and the schema of the dataset do not match, Data Science Studio emits a warning and proposes some fixes:
  - Drop the table (so that it is recreated with the dataset schema).
  - Override the dataset schema with the current schema of the table.
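The check described above (does the table exist, and does its schema match the dataset schema?) can be sketched as follows. SQLite, via Python's built-in `sqlite3`, stands in for the real database, and all function names are hypothetical. A schema is represented here as a list of `(column name, declared type)` pairs, which is an assumption made for the example.

```python
import sqlite3


def table_schema(conn, table):
    """Return [(column, declared type)] or None if the table does not exist."""
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    if not rows:
        return None
    # Each row is (cid, name, type, notnull, default, pk).
    return [(row[1], row[2].upper()) for row in rows]


def check_managed_table(conn, table, dataset_schema):
    """Return ("missing" | "match" | "mismatch", current table schema)."""
    current = table_schema(conn, table)
    if current is None:
        return "missing", None
    if current == dataset_schema:
        return "match", current
    return "mismatch", current


def recreate_table(conn, table, dataset_schema):
    """Fix 1: drop the table and recreate it with the dataset schema."""
    columns = ", ".join(f'"{name}" {sql_type}' for name, sql_type in dataset_schema)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, label TEXT)")
dataset_schema = [("id", "INTEGER"), ("score", "REAL")]

status, current = check_managed_table(conn, "scores", dataset_schema)
print(status, current)

# Fix 2 would instead override the dataset schema: dataset_schema = current
recreate_table(conn, "scores", dataset_schema)
print(check_managed_table(conn, "scores", dataset_schema)[0])
```

Dropping the table discards its rows, while overriding the dataset schema keeps the table untouched, which is why the two fixes are offered as a choice.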
- You can write code recipes that create datasets from the results of a SQL query on existing SQL datasets. See SQL recipes for more information.
- You can also use the SQL Notebook for interactive querying.