DSS can both read and write datasets in SQL databases.
- Supported databases
- Defining a connection
- External table datasets
- External query datasets
- Managed SQL datasets
- Writing in SQL table datasets
- Executing SQL queries
- Making managed SQL datasets relocatable
- Detailed databases support information
- Advanced connection settings
- Create datasets representing SQL tables (and read and write in them)
- Create datasets representing the results of a SQL query (and read them)
- Write code recipes that create datasets using the results of a SQL query on existing SQL datasets. See SQL recipes for more information about SQL recipes.
In addition, on most supported databases, DSS is able to:
- Execute Visual recipes directly in-database (ie: for a visual recipe from the database to the database, the data never moves out of the database)
- Execute Data Visualization directly in-database
For more information on the range of support for each of these features, please refer to the detailed page for your specific database below.
We have a detailed howto for your first steps with SQL databases in DSS.
You might want to start with that Howto. The rest of this section is reference information.
DSS has detailed support for the following databases:
DSS has experimental support for the following databases:
Support for these databases is provided as a “best-effort”, we make no guarantees as to which features precisely work
In addition, DSS can connect to any database that provides a JDBC driver.
For databases not listed previously, we cannot guarantee that anything will work. Reading datasets often works, but it is rare that writing works out of the box.
You might be able to get a better behavior by selecting a specific dialect from the dropdown in the JDBC connection screen
Before you try to connect to a database, make sure that the proper JDBC driver for it is installed. For information on how to install JDBC drivers, see Installing database drivers.
The first step to work with SQL databases is to create a connection to your SQL database.
- Go to the Administration > Connection page.
- Click “New connection” and select your database type.
- Enter a name for your connection.
- Enter the requested connection parameters. See the page of your database for more information, if needed
- Click on Test. DSS attempts to connect to the database, and gives you feedback on whether the attempt was successful.
- Save your connection.
SQL table datasets are the simplest form of interaction with SQL databases. To create an external SQL table dataset, you simply need to choose the connection, the table, and you’re all set. The content of the table is now a dataset.
- Go to Datasets, click New > Your database type
- Select a connection.
- Make sure the « Read a database table » radio is selected.
- Click on “Get tables list”
- DSS connects to your database and retrieves the available tables.
- Select your table.
- Click the “Test table” button.
- DSS shows a preview of the contents.
- You can now save your dataset
You cannot edit the schema of an external SQL table dataset. The names of the columns are provided by the database engine.
On an external dataset, DSS chooses to preferably trust the content of the data.
If you need to edit the names of the columns for further processing, you can for example use a data preparation recipe.
When creating an external MySQL table dataset, upcast the types of the columns with unsigned integer types in the dataset schema, so that DSS’s representation covers the full range of the values in these columns (use ‘smallint’ for ‘tinyint unsigned’, ‘int’ for ‘smallint unsigned’, ‘bigint’ for ‘int unsigned’). As MySQL silently casts unsigned values to signed ones in queries, and DSS treats integer types as signed, it is advised to avoid unsigned integers.
A SQL dataset can also be defined by a custom query. The results of the query become the rows of the dataset. This allows you to create a « virtual dataset », without having to materialize the rows (for example, if the query joins several tables).
A SQL query database is read-only. You cannot « write » to a SQL query.
Data Science Studio does not automatically test SQL queries, as they can be very expensive. You need to manually click the « Test query » button
Managed datasets can be created on SQL databases. Only “table” datasets can be managed (it makes no sense to « write » on a SQL query dataset).
You can create a managed SQL dataset :
- By clicking on the « Managed dataset » button in the New Dataset page.
- By creating a new managed dataset as output of a recipe.
When you create a managed SQL dataset, you start by selecting the connection in which it gets written. A table name is automatically selected based on the name of the SQL dataset. You can change it. A managed SQL dataset can target either an existing table or a non-existing one.
When you click the « Test » button, Data Science Studio checks if the table exists in the database :
- If it does not exist, you have the ability to create it. It is generally not mandatory to create the table at this point, as the recipes that might require it will automatically create it if it does not exist.
- If the table exists, Data Science Studio automatically checks its schema. If the schema of the table and the schema of the dataset do not match, Data Science Studio emits a warning and proposes some fixes
- Drop the table (so it will be recreated with the dataset schema)
- Override the dataset schema with the current schema of the table.
- You can write code recipes that create datasets using the results of a SQL query on existing SQL datasets. See SQL recipes for more information about SQL recipes.
- You can also use the SQL Notebook for interactive querying.
For all databases, you can pass arbitrary key/value properties that are passed as-is to the database’s JDBC driver. The possible properties depend on each JDBC driver. Please refer to the documentation of your JDBC driver for more information
For all databases for which DSS has a specific connection kind, DSS automatically constructs the JDBC URL from the structured settings. For advanced use cases, you can enable the “Custom JDBC URL” mode and enter your own JDBC URL
When DSS reads records from the database, it fetches them by batches for improved performance. The “fetch size” parameter lets you select the size of this batch. If you leave this parameter blank, DSS uses a reasonable default. Setting the fetch size to high values (above a few thousands) can improve performance, especially if your network connection to the database has high latency, at the expense of increased memory usage.
By default, when writing non-partitioned managed datasets, DSS drops the table and recreates it (which avoid schema discrepancy problems). You can enable this option to TRUNCATE the table instead.