You are viewing the documentation for version 11 of DSS.

Connecting to data

The first task when using Data Science Studio is to define datasets to connect to your data sources.

A dataset is a series of records that share the same schema, much like a table in a SQL database.

For a broader explanation of the different kinds of datasets, see the DSS concepts page.
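
To make the table analogy concrete, here is a minimal sketch in plain Python with the standard-library `sqlite3` module. It does not use any DSS API; the `orders` table, its columns, and the sample values are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical schema: every record exposes the same three columns.
schema = ("customer_id", "country", "amount")

# Hypothetical records; each one follows the schema above.
records = [
    (1, "FR", 120.0),
    (2, "US", 75.5),
    (3, "FR", 42.0),
]

# A dataset only makes sense if all records share the schema.
assert all(len(record) == len(schema) for record in records)

# The same records stored as a SQL table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, country TEXT, amount REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)

# Because the schema is shared, the records can be queried uniformly.
total_by_country = dict(
    conn.execute(
        "SELECT country, SUM(amount) FROM orders "
        "GROUP BY country ORDER BY country"
    )
)
conn.close()
```

Whatever the underlying storage (files, a SQL database, cloud object storage), the idea is the same: a fixed set of columns and a series of records that conform to it.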

Supported connections
- Connectors
- File formats
  - Standard formats
  - Hadoop/Spark specific formats
SQL databases
- Introduction
- Snowflake
- Azure Synapse
- Google BigQuery
- Amazon Redshift
- PostgreSQL
  - Installing the JDBC driver
  - Secure connections (SSL / TLS) support
- MySQL
- Microsoft SQL Server
- Oracle
  - Installing the JDBC driver
  - Advanced connection properties
- Teradata
- Pivotal Greenplum
- Google AlloyDB
  - Installing the JDBC driver
  - Secure connections (SSL / TLS) support
- AWS Athena
- Vertica
  - Installing the JDBC driver
  - Timezones support
- SAP HANA
  - Caveats
- IBM Netezza
  - Caveats
- Exasol
  - Limitations
- IBM DB2
- kdb+
  - Installing support
  - Creating a kdb+ connection
Amazon S3
- Create an S3 connection
  - Required S3 permissions
  - Transfer ownership to the bucket owner
- Creating S3 datasets
- Connections path handling
- Location of managed datasets and folders
  - For a “free selection” connection
  - For a “path restriction” connection
- Server-side encryption of files
  - Encryption Mode
Azure Blob Storage
- Creating an Azure connection
- Connecting to Azure using OAuth2
- Creating Azure Blob Storage datasets
- Connections path handling
- Location of managed datasets and folders
  - For a “free selection” connection
  - For a “path restriction” connection
Google Cloud Storage
- Create a GCS connection
  - Using Service Account
  - Using OAuth2
- Creating GCS datasets
- Connections path handling
- Location of managed datasets and folders
  - For a “free selection” connection
  - For a “path restriction” connection
Upload your files
- Storage location
- Size limitations
HDFS
- Compatible filesystems
Cassandra
- Requirements
- Configuring Cassandra cluster connections
- Configuring Cassandra datasets
  - Data Science Studio managed datasets
  - External datasets
- Dataset configuration parameters
- Dataset and table schemas
- Restrictions and caveats
MongoDB
- Setting up the MongoDB connection
Elasticsearch
- Define an Elasticsearch connection
- Managed Elasticsearch datasets
- External Elasticsearch datasets
  - Partitioning
    - Field-based
    - Indices-based
- Search view
File formats
- Delimiter-separated values (CSV / TSV)
- Fixed width
- Parquet
- Avro
  - Applicability
- Hive ORCFile
  - Compatibility
  - Limitations
- XML
  - Handling the structure
    - Selection of the data to load
    - JSON representation
      - Example
  - Using XPath to select data
    - Limitations
    - Selecting values explicitly
      - Example
- JSON
  - Example
- Excel
- ESRI Shapefiles
  - Vecmath library
- Delta Lake
Managed folders
- Creating a managed folder
- Using a managed folder
  - Merge Folder Recipe
- Local vs non-local
- Usage in Python
- Usage in R
- Usage of a folder as a dataset
- Clearing
“Files in folder” dataset
Metrics dataset
Internal stats dataset
“Editable” dataset
kdb+
- Installing support
- Creating a kdb+ connection
FTP
- Creating an FTP connection
  - FTP connection parameters
- Creating FTP datasets
- Use the FTP dataset for writing
SCP / SFTP (aka SSH)
- Defining the SSH connection
  - SSH connection parameters
- Creating SCP or SFTP datasets
HTTP
- Creating an HTTP dataset
  - Remote URL definition
- Partitioned HTTP dataset
  - Example
HTTP (with cache)
Server filesystem
- Filesystem connection
- Create a filesystem dataset
Dataset plugins
Making relocatable managed datasets
- Relocation of SQL datasets
- Relocation of HDFS datasets
Clearing non-managed Datasets
Data ordering
- Write ordering
- Read-time ordering
PI System / PIWebAPI server
- Set up the authentication preset
- Set up authentication per user
- Attribute search Dataset
- Event frames search Dataset
- PIWebAPI Toolbox Dataset
- Assets metrics downloader Recipe
- Event frames downloader Recipe
- Transpose & Synchronize Recipe
- Advanced parameters
- Data types
- Time formats
Data transfer on Dataiku Cloud