You are viewing the documentation for version 8.0 of DSS, which is outdated.
An up-to-date version of this documentation might be available for the latest release of DSS.

Dataiku DSS

Welcome to the Product Documentation for Dataiku Data Science Studio (DSS). This site explains how to install and configure Dataiku DSS in your environment, use the tool through the browser interface, and drive it through the API.

Is This the Help You’re Looking For?

You might also find these other resources useful:

The Knowledge Base covers a variety of topics that can help you learn more about Dataiku DSS, or find solutions to problems without having to ask for help.
Dataiku Academy provides guided learning paths for you to follow, upskill, and gain certification on Dataiku DSS.
Dataiku Community is a place where you can join the discussion, get support, share best practices, and engage with other Dataiku users.

Reference Doc Contents

Installing DSS
- Requirements
- Installing a new DSS instance
- Upgrading a DSS instance
- Updating a DSS license
- Other installation options
- Setting up Hadoop and Spark integration
- Setting up Dashboards and Flow export to PDF or images
- R integration
- Customizing DSS installation
- Installing database drivers
- Java runtime environment
- Python integration
- Installing a DSS plugin
- Configuring LDAP authentication
- Working with proxies
- Migration operations
DSS concepts
- Data
- Datasets
- Recipes
- Building datasets
- Managed and external datasets
- Partitioning
Homepage
- Navigating the Homepage
Projects
- How to Copy a Dataiku Project
- Creating projects through macros
Connecting to data
- Supported connections
- Upload your files
- Server filesystem
- HDFS
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
- FTP
- SCP / SFTP (aka SSH)
- HTTP
- SQL databases
- Cassandra
- MongoDB
- Elasticsearch
- Managed folders
- “Files in folder” dataset
- Metrics dataset
- Internal stats dataset
- HTTP (with cache)
- Dataset plugins
- Data connectivity macros
- Making relocatable managed datasets
- Data ordering
Exploring your data
- Sampling
- Analyze
Schemas, storage types and meanings
- Definitions
- Basic usage
- Schema for data preparation
- Creating schemas of datasets
- Handling of schemas by recipes
- List of recognized meanings
- User-defined meanings
- Handling and display of dates
Data preparation
- How to Copy Prepare Recipe Steps
- Sampling
- Execution engines
- Processors reference
- Filtering and flagging rows
- Managing dates
- Reshaping
- Geographic processing
Charts
- The Charts Interface
- Sampling & Engine
- Basic Charts
- Tables
- Scatter Charts
- Map Charts
- Other Charts
- Common Chart Elements
- Color palettes
Interactive statistics
- The Worksheet Interface
- Univariate Analysis
- Bivariate Analysis
- Fit curves and distributions
- Correlation matrix
- Statistical Tests
- Principal Component Analysis (PCA)
Machine learning
- Prediction (Supervised ML)
- Clustering (Unsupervised ML)
- Automated machine learning
- Model Settings Reusability
- Features handling
- Algorithms reference
- Advanced models optimization
- Models ensembling
- Model Document Generator
- Deep Learning
- Models lifecycle
- Scoring engines
- Writing custom models
- Exporting models
- Partitioned Models
The Flow
- Visual Grammar
- Flow zones
- Rebuilding Datasets
- Limiting Concurrent Executions
- Exporting the Flow to PDF or images
- How to Manage Large Flows with Flow Folding
Visual recipes
- Prepare: Cleanse, Normalize, and Enrich
- Sync: copying datasets
- Grouping: aggregating data
- Window: analytics functions
- Distinct: get unique rows
- Join: joining datasets
- Splitting datasets
- Top N: retrieve first N rows
- Stacking datasets
- Sampling datasets
- Sort: order values
- Pivot recipe
- Download recipe
Recipes based on code
- The common editor layout
- Python recipes
- R recipes
- SQL recipes
- Hive recipes
- Pig recipes
- Impala recipes
- Spark-Scala recipes
- PySpark recipes
- Spark / R recipes
- SparkSQL recipes
- Shell recipes
- Variables expansion in code recipes
Code notebooks
- SQL notebook
- Python notebooks
- Predefined notebooks
- Containerized notebooks
Webapps
- “Standard” web apps
- Shiny web apps
- Bokeh web apps
- Publishing webapps on the dashboard
- Public webapps
- Webapps and security
- Scaling webapps on Kubernetes
- Introduction to DSS webapps
- Example use cases
Code reports
- R Markdown reports
Dashboards
- Dashboard concepts
- Display settings
- Exporting dashboards to PDF or images
- Insights reference
Dataiku Applications
- Introduction
- Using a Dataiku application
- Developing a Dataiku application
- Application-as-recipe
- Sharing a Dataiku application
DSS in the cloud
- DSS in AWS
- DSS in Azure
- DSS in GCP
Working with partitions
- Partitioning files-based datasets
- Partitioned SQL datasets
- Specifying partition dependencies
- Partition identifiers
- Recipes for partitioned datasets
- Partitioned Hive recipes
- Partitioned SQL recipes
- Partitioning variables substitutions
- Partitioned Models
- The two partitioning models
DSS and Hadoop
- Setting up Hadoop integration
- Connecting to secure clusters
- Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS)
- DSS and Hive
- DSS and Impala
- Hive datasets
- Multiple Hadoop clusters
- Dynamic AWS EMR clusters
- Hadoop user isolation
- Distribution-specific notes
- Teradata Connector For Hadoop
- Dynamic Google Dataproc clusters
DSS and Spark
- Usage of Spark in DSS
- Spark on Kubernetes
- Setting up (without Kubernetes)
- Spark configurations
- Interacting with DSS datasets
- Spark pipelines
- Limitations and attention points
DSS and SQL
- SQL datasets
- SQL write and execution
- Partitioning
- SQL pipelines in DSS
DSS and Python
- Installing Python packages
- Reusing Python code
- Using Matplotlib
- Using SpaCy
- Using Bokeh
- Using Plot.ly
- Using Ggplot
- Using Jupyter Widgets
DSS and R
- Installing R packages
- Reusing R code
- Using ggplot2
- Using Dygraphs
- Using googleVis
- Using ggvis
- Installing STAN or Prophet
- RStudio integration
Metastore catalog
- Hive metastore (through HiveServer2)
- Glue metastore
- DSS as virtual metastore
Code environments
- Operations (Python)
- Operations (R)
- Base packages
- Using Conda
- Automation nodes
- Non-managed code environments
- Plugins’ code environments
- Custom options and environment
- Troubleshooting
- Code env permissions
Running in containers
- Concepts
- Setting up (Kubernetes)
- Unmanaged Kubernetes clusters
- Managed Kubernetes clusters
- Using Amazon Elastic Kubernetes Service (EKS)
- Using Microsoft Azure Kubernetes Service (AKS)
- Using Google Kubernetes Engine (GKE)
- Using Openshift
- Using code envs with containerized execution
- Dynamic namespace management
- Customization of base images
- Troubleshooting
- Using Docker instead of Kubernetes
Collaboration
- Wikis
- Discussions
- Markdown
- Tags
- Working with Git
- Version control of projects
- Importing code from Git in project libraries
Automation scenarios, metrics, and checks
- Definitions
- Scenario steps
- Launching a scenario
- Reporting on scenario runs
- Custom scenarios
- Variables in scenarios
- Step-based execution control
- Metrics
- Checks
- Custom probes and checks
Automation node and bundles
- Installing the Automation node
- Creating a bundle
- Importing a bundle
API Node & API Deployer: Real-time APIs
- Introduction
- Concepts
- Installing an API node
- Installing the API Deployer
- First API (without API Deployer)
- First API (with API Deployer)
- Types of Endpoints
- Enriching prediction queries
- Security
- Managing versions of your endpoint
- Deploying on Kubernetes
- APINode APIs reference
- Operations reference
Time Series
- Topics
Unstructured data
- Text
- Images
- Video
- Graph
- Audio
Plugins
- Installing plugins
- Managing installed plugins
- Developing plugins
Python APIs
- Using the APIs inside of DSS
- Using the APIs outside of DSS
- Datasets (introduction)
- Datasets (reading and writing data)
- Datasets (other operations)
- Datasets (reference)
- Managed folders
- Interaction with Pyspark
- The main DSSClient class
- Projects
- Project folders
- Recipes
- Interaction with saved models
- Scenarios
- Scenarios (in a scenario)
- Flow creation and management
- Machine learning
- Statistics worksheets
- API Designer & Deployer
- Static insights
- Jobs
- Authentication information and impersonation
- Importing tables as datasets
- Wikis
- Discussions
- Performing SQL, Hive and Impala queries
- SQL Query
- Meanings
- Users and groups
- Connections
- Code envs
- Plugins
- Dataiku applications
- Metrics and checks
- Other administration tasks
- Reference API documentation of dataiku
- Reference API documentation of dataikuapi
- API for plugin components
- Clusters
R API
- Using the R API inside of DSS
- Using the R API outside of DSS
- Reference documentation
- Authentication information
- Creating static insights
Public REST API
- Features
- Public API Keys
- The REST API
Additional APIs
- The Javascript API
- The Scala API
File formats
- Delimiter-separated values (CSV / TSV)
- Fixed width
- Parquet
- Avro
- Hive SequenceFile
- Hive RCFile
- Hive ORCFile
- XML
- JSON
- Excel
- ESRI Shapefiles
Security
- Main project permissions
- Connections security
- User profiles
- Exposed objects
- Dashboard authorizations
- User secrets
- Audit Trail
- Advanced security options
- Single Sign-On
- Multi-Factor Authentication
- Passwords security
User Isolation
- Capabilities of User Isolation Framework
- Concepts
- Prerequisites and limitations
- Initial Setup
- Reference architectures
- Details of UIF capabilities
- Advanced topics
Operating DSS
- dsscli tool
- The data directory
- Backing up
- Audit trail
- The runtime databases
- Logging in DSS
- DSS Macros
- Managing DSS disk usage
- Understanding and tracking DSS processes
- Tuning and controlling memory usage
- Using cgroups for resource control
- Monitoring DSS
- Compute resource usage reporting
Advanced topics
- Sampling methods
- Formula language
- Custom variables expansion
Accessibility
- Global Shortcuts
- Project Navigation
- Within the Flow
- Within a Dataset
- Within a Prepare Recipe
- Within any Recipe
- Within any Code Editor (Excluding Notebooks)
- Within any Flow Object
- Within Plugins Development
Troubleshooting
- Diagnosing and debugging issues
- Obtaining support
- Support tiers
- Common issues
- Error codes
Release notes
- DSS 8.0 Release notes
- DSS 7.0 Release notes
- DSS 6.0 Release notes
- DSS 5.1 Release notes
- DSS 5.0 Release notes
- DSS 4.3 Release notes
- DSS 4.2 Release notes
- DSS 4.1 Release notes
- DSS 4.0 Release notes
- DSS 3.1 Release notes
- DSS 3.0 Release notes
- DSS 2.3 Release notes
- DSS 2.2 Release notes
- DSS 2.1 Release notes
- DSS 2.0 Release notes
- DSS 1.4 Release notes
- DSS 1.3 Release notes
- DSS 1.2 Release notes
- DSS 1.1 Release notes
- DSS 1.0 Release notes
- Pre versions
Other Documentation
- Older DSS versions
- Other Dataiku products
Third-party acknowledgements
- Java libraries
- Python libraries
- Scala libraries
- Native libraries
- R libraries
- Frontend libraries
- Data
- For macOS version only