Connecting to data

The first task when using Data Science Studio (DSS) is to define datasets that connect to your data sources.

A dataset is a series of records that share the same schema. It is analogous to a table in the SQL world.
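
As a rough illustration of "records with the same schema", the following plain-Python sketch checks that every record carries the same columns with the same types. It does not use the DSS API; the `Column` class, the `conforms` helper, and the column names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One column of a schema (hypothetical helper, not a DSS class)."""
    name: str
    py_type: type


# Illustrative schema: column names and types are assumptions for this sketch.
SCHEMA = [
    Column("customer_id", int),
    Column("country", str),
    Column("revenue", float),
]


def conforms(record: dict, schema: list) -> bool:
    """Return True when the record has exactly the schema's columns, with matching types."""
    if set(record) != {col.name for col in schema}:
        return False
    return all(isinstance(record[col.name], col.py_type) for col in schema)


# Every record shares the schema, much like rows of a SQL table.
records = [
    {"customer_id": 1, "country": "FR", "revenue": 120.5},
    {"customer_id": 2, "country": "US", "revenue": 80.0},
]

dataset_is_valid = all(conforms(record, SCHEMA) for record in records)
```

A record that lacks a column or carries a value of another type would fail the check, just as a row that does not match a table definition would be rejected by a SQL database.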

For a broader explanation of the different kinds of datasets, see the DSS concepts page.

See also

For more information, see the Concept | Data connections article in the Knowledge Base.

Supported connections
- Connectors
- File formats
  - Standard formats
  - Hadoop/Spark specific formats
File formats
- Delimiter-separated values (CSV / TSV)
- Excel
- Fixed width
- Parquet
- Avro
  - Applicability
- Hive ORCFile
  - Compatibility
  - Limitations
- XML
  - Handling the structure
    - Selection of the data to load
    - JSON representation
      - Example
  - Using XPath to select data
    - Limitations
    - Selecting values explicitly
      - Example
- JSON
  - Example
- ESRI Shapefiles
  - Vecmath library
- Delta Lake
- YXDB
SQL databases
- Introduction
  - Supported databases
  - Defining a connection
  - Advanced connection settings
- Snowflake
  - Connection setup (Dataiku Custom, Dataiku Cloud or Dataiku Cloud Stacks)
  - Authenticate using OAuth2
    - Common errors
  - Using Key-pair authentication
  - Writing data into Snowflake
    - Requirements on the cloud storage connection
    - Explicit sync from cloud
  - Unloading data from Snowflake to Cloud
  - Extended push-down
  - Spark native integration
  - Snowpark integration
  - Switching Role and Warehouse
    - How to set it up
  - Limitations and known issues
  - Advanced install of the JDBC driver
    - Spark integration
- Databricks
  - Connection setup (Dataiku Cloud Stacks or Dataiku Custom)
    - To set up the connection with per-user credentials
  - Authenticate using OAuth2
    - With per-user credentials
    - With global credentials
  - Access to Databricks Volumes
  - Writing data into Databricks
    - Requirements on the cloud storage connection
    - Explicit sync from cloud
  - Unloading data from Databricks to Cloud
  - Databricks Connect integration
  - Advanced install of the JDBC driver
- Azure Synapse
  - Installing the JDBC driver
  - Write into Azure Synapse
    - Explicit sync from Azure Blob Storage
  - Unload data from Synapse to Azure Blob
  - Login using OAuth
- Microsoft Fabric Warehouse
  - Installing the JDBC driver
  - Write into Microsoft Fabric Warehouse
    - Explicit sync from Azure Blob Storage
  - Login using OAuth
    - Login as a single service account
    - Login with per-user OAuth tokens
- Google BigQuery
  - Supported and unsupported features
  - The two drivers
  - Installing the JDBC driver
    - Built-in driver
    - Google-provided driver
  - Connecting to BigQuery
  - Writing data into BigQuery
    - Explicit sync from GCS
  - BigQuery native partitioning and clustering
    - Partitioning consistency
    - External datasets
  - Bigframes integration
- Amazon Redshift
  - Setting up (Dataiku Custom or Dataiku Cloud Stacks)
    - Selecting the JDBC driver
    - Installing the dedicated driver
  - Writing data into Redshift
  - Unloading data from Redshift to S3
  - Reading external tables
  - Controlling distribution and sort clauses
  - Limitations
- PostgreSQL
  - Installing the JDBC driver
  - Secure connections (SSL / TLS) support
- MySQL
  - Caveats
  - Installing the driver
  - Secure connections (SSL / TLS) support
    - Importing the server certificate and creating the client certificate
    - Setting up the MySQL connection
- Microsoft SQL Server
  - Installing the JDBC driver
  - Requirements
  - Azure SQL Data Warehouse / Synapse support
  - Kerberos authentication
  - User impersonation with Kerberos
  - Login using OAuth on Azure SQL Server
- Oracle
  - Installing the JDBC driver
  - Advanced connection properties
- Teradata
  - Installing the JDBC driver
  - Connecting using TD2 (default) authentication
    - Using per-user-credentials with TD2 authentication
  - Connecting using LDAP authentication
    - Using per-user-credentials with LDAP authentication
  - Connecting using Kerberos authentication
    - Using per-user-credentials with Kerberos authentication
  - Impersonation
  - Controlling the primary index
  - Tracing additional query information
  - Autocommit Mode
  - Limitations
  - Fast sync using TDCH
  - Notes
- Pivotal Greenplum
  - Installing the JDBC driver
  - Controlling distribution
  - Setting distribute and sort clauses
  - Secure connections
  - Limitations
- Google AlloyDB
  - Installing the JDBC driver
  - Secure connections (SSL / TLS) support
- Google Cloud SQL
- AWS Athena
  - Supported
  - Not supported
  - Installing the JDBC driver
  - Connecting to Athena
- Trino/Starburst
  - Connection setup (Dataiku Cloud Stacks or Dataiku Custom)
  - Authenticate using JWT tokens
  - Writing data into Trino
    - Explicit sync from cloud
  - Unloading data from Trino to Cloud
  - Advanced install of the JDBC driver
- Treasure Data
  - Connection setup (Dataiku Cloud Stacks or Dataiku Custom)
  - Authentication
  - Advanced install of the JDBC driver
- Vertica
  - Installing the JDBC driver
  - Timezones support
- SAP HANA
  - Caveats
- Dremio
  - Setting up
  - Importing tables
  - Limitations
- IBM Netezza
  - Caveats
- Exasol
  - Limitations
- IBM DB2
  - Installing the JDBC driver
  - Creating a DB2 connection
  - Creating DB2 datasets
- kdb+
  - Installing support
  - Creating a kdb+ connection
Amazon S3
- Create an S3 connection
  - Required S3 permissions
  - Transfer ownership to the bucket owner
- Creating S3 datasets
- Connections path handling
- Location of managed datasets and folders
  - For a “free selection” connection
  - For a “path restriction” connection
- Server-side encryption of files
  - Encryption Mode
- Custom Object Storage
Azure Blob Storage
- Creating an Azure connection
- Connecting to Azure using OAuth2
- Creating Azure Blob Storage datasets
- Connections path handling
- Location of managed datasets and folders
  - For a “free selection” connection
  - For a “path restriction” connection
Google Cloud Storage
- Create a GCS connection
  - Using Service Account
  - Using OAuth2
- Creating GCS datasets
- Connections path handling
- Location of managed datasets and folders
  - For a “free selection” connection
  - For a “path restriction” connection
Upload your files
- Storage location
- Size limitations
SharePoint Online
- Creating a SharePoint Online connection
- Connecting to SharePoint Online using OAuth2
- Connecting to SharePoint Online using user name and password
- Connecting to SharePoint Online using certificates
- Advanced connection properties
- Creating SharePoint Online datasets
  - From a SharePoint document
  - From a SharePoint list
- Location of managed datasets and folders
HDFS
- Compatible filesystems
MongoDB
- Setting up the MongoDB connection
Elasticsearch and OpenSearch
- Define an Elasticsearch connection
- Managed Elasticsearch datasets
- External Elasticsearch datasets
  - Partitioning
    - Field-based
    - Indices-based
- Search view
- Amazon OpenSearch Service
  - Limitations of OpenSearch Serverless
Managed folders
- Creating a managed folder
- Using a managed folder
  - Merge Folder Recipe
- Local vs non-local
- Usage in Python
- Usage in R
- Usage of a folder as a dataset
- Clearing
“Files in folder” dataset
Metrics dataset
Internal stats dataset
“Editable” dataset
kdb+
- Installing support
- Creating a kdb+ connection
FTP
- Creating an FTP connection
  - FTP connection parameters
- Creating FTP datasets
- Use the FTP dataset for writing
SCP / SFTP (aka SSH)
- Defining the SCP/SFTP connection
  - SCP/SFTP connection parameters
- Creating SCP or SFTP datasets
HTTP
- Creating an HTTP dataset
  - Remote URL definition
- Partitioned HTTP dataset
  - Example
HTTP (with cache)
Server filesystem
- Filesystem connection
- Create a filesystem dataset
Dataset plugins
Making relocatable managed datasets
- Relocation of SQL datasets
- Relocation of HDFS datasets
Clearing non-managed Datasets
Data ordering
- Write ordering
- Read-time ordering
Dynamic dataset repeat
- SQL table dataset
  - Example use case
- File-based dataset
  - Example use case
- Configuring
PI System / PIWebAPI server
- Set up the authentication preset
- Set up authentication per user
- Attribute search Dataset
- Event frames search Dataset
- PIWebAPI Toolbox Dataset
- Assets metrics downloader Recipe
- Event frames downloader Recipe
- Transpose & Synchronize Recipe
- Advanced parameters
- Data types
- Time formats
Google Drive
- Create a Google Drive preset
  - Using Service Account
  - Using OAuth2
- Usage
Google Sheets
- Create a Google Sheets preset
  - Using Service Account
  - Using OAuth2
- Usage
Google Analytics
- Create a Google Analytics preset
- Usage
  - Creating Google Analytics datasets
Data transfer on Dataiku Cloud
Sample Dataset
Cassandra
- Requirements
- Configuring Cassandra cluster connections
- Configuring Cassandra datasets
  - Data Science Studio managed datasets
  - External datasets
- Dataset configuration parameters
- Dataset and table schemas
- Restrictions and caveats