Dataiku DSS
Welcome to the Product Documentation for Dataiku Data Science Studio (DSS). This site covers installing and configuring Dataiku DSS in your environment, using the tool through the browser interface, and driving it through the API.
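For example, driving DSS through its Python API typically starts by connecting with the dataikuapi client from outside the DSS instance. A minimal sketch, assuming a locally running instance; the host URL and API key below are placeholders you would replace with your own:

```python
import dataikuapi

# Placeholder connection details: substitute your instance URL and a
# personal API key generated from your DSS user profile.
host = "http://localhost:11200"
api_key = "YOUR_API_KEY"

client = dataikuapi.DSSClient(host, api_key)

# List the keys of the projects this API key can access
print(client.list_project_keys())
```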
Is This the Help You’re Looking For?
You might also find these other resources useful:
The Knowledge Base covers a variety of topics that can help you learn more about Dataiku DSS, or find solutions to problems without having to ask for help.
Dataiku Academy provides guided learning paths you can follow to upskill and gain certification on Dataiku DSS.
Dataiku Community is a place where you can join the discussion, get support, share best practices, and engage with other Dataiku users.
Reference Doc Contents
- DSS concepts
- Connecting to data
- Supported connections
- SQL databases
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- Upload your files
- HDFS
- Cassandra
- MongoDB
- Elasticsearch
- File formats
- Managed folders
- “Files in folder” dataset
- Metrics dataset
- Internal stats dataset
- “Editable” dataset
- kdb+
- FTP
- SCP / SFTP (aka SSH)
- HTTP
- HTTP (with cache)
- Server filesystem
- Dataset plugins
- Making relocatable managed datasets
- Clearing non-managed Datasets
- Data ordering
- PI System / PIWebAPI server
- Data transfer on Dataiku Cloud
- Exploring your data
- Schemas, storage types and meanings
- Data preparation
- Charts
- Interactive statistics
- Machine learning
- Prediction (Supervised ML)
- Clustering (Unsupervised ML)
- Automated machine learning
- Model Settings Reusability
- Features handling
- Algorithms reference
- Advanced models optimization
- Models ensembling
- Model Document Generator
- Time Series Forecasting
- Deep Learning
- Models lifecycle
- Scoring engines
- Writing custom models
- Exporting models
- Partitioned Models
- ML Diagnostics
- ML Assertions
- Computer vision
- Image labeling
- The Flow
- Visual recipes
- Prepare: Cleanse, Normalize, and Enrich
- Sync: copying datasets
- Grouping: aggregating data
- Window: analytics functions
- Distinct: get unique rows
- Join: joining datasets
- Fuzzy join: joining two datasets
- Geo join: joining datasets based on geospatial features
- Splitting datasets
- Top N: retrieve first N rows
- Stacking datasets
- Sampling datasets
- Sort: order values
- Pivot recipe
- Push to editable recipe
- Download recipe
- List Folder Contents
- Recipes based on code
- Code notebooks
- MLOps
- Webapps
- Code Studios
- Code reports
- Dashboards
- Workspaces
- Dataiku Applications
- Working with partitions
- DSS and SQL
- DSS and Python
- DSS and R
- DSS and Spark
- Code environments
- Collaboration
- Time Series
- Geographic data
- Text & Natural Language Processing
- Language Detection
- Named Entities Extraction
- Sentiment Analysis
- Translation
- Text summarization
- Key phrase extraction
- Ontology Tagging
- Spell checking
- OpenAI GPT
- Machine Learning with Text features
- OCR (Optical Character Recognition)
- Speech-to-Text
- Text cleaning
- Text Embedding
- NLP using AWS APIs
- NLP using Azure APIs
- NLP with Crowlingo API
- NLP using Deepl API
- NLP using Google APIs
- NLP with MeaningCloud API
- Images
- Audio
- Video
- Automation scenarios, metrics, and checks
- Production deployments and bundles
- API Node & API Deployer: Real-time APIs
- Introduction
- Concepts
- Installing API nodes
- Setting up the API Deployer and deployment infrastructures
- First API (with API Deployer)
- First API (without API Deployer)
- Types of Endpoints
- Enriching prediction queries
- Security
- Managing versions of your endpoint
- Deploying on Kubernetes
- APINode APIs reference
- Operations reference
- Governance
- Python APIs
- Using the APIs inside of DSS
- Using the APIs outside of DSS
- Datasets (introduction)
- Datasets (reading and writing data)
- Datasets (other operations)
- Datasets (reference)
- Feature Store
- Managed folders
- Streaming Endpoints
- Interaction with Pyspark
- The main DSSClient class
- Projects
- Project folders
- Project libraries
- Recipes
- Interaction with saved models
- Scenarios
- Scenarios (in a scenario)
- Flow creation and management
- Machine learning
- Experiment Tracking
- Statistics worksheets
- Code studios
- API Designer & Deployer
- Project Deployer
- Static insights
- Jobs
- Authentication information and impersonation
- Importing tables as datasets
- Wikis
- Discussions
- Performing SQL, Hive and Impala queries
- SQL Query
- Meanings
- Users and groups
- Connections
- Code envs
- Plugins
- Macros
- Dataiku applications
- Metrics and checks
- Model Evaluation Stores
- Other administration tasks
- Utilities
- Reference API documentation of dataiku
- Reference API documentation of dataikuapi
- API for plugin components
- Clusters
- Code studios
- API for Fleet Manager
- API for Dataiku Govern
- Workspaces
- Webapps
- R API
- Public REST API
- Additional APIs
- Installing and setting up
- Elastic AI computation
- Concepts
- Initial setup
- Managed Kubernetes clusters
- Using Amazon Elastic Kubernetes Service (EKS)
- Using Microsoft Azure Kubernetes Service (AKS)
- Using Google Kubernetes Engine (GKE)
- Using code envs with containerized execution
- Dynamic namespace management
- Customization of base images
- Unmanaged Kubernetes clusters
- Using Openshift
- Troubleshooting
- Using Docker instead of Kubernetes
- DSS in the cloud
- DSS and Hadoop
- Setting up Hadoop integration
- Connecting to secure clusters
- Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS)
- Hive
- Impala
- Spark
- Hive datasets
- Hadoop user isolation
- Distribution-specific notes
- Teradata Connector For Hadoop
- Multiple Hadoop clusters
- Dynamic AWS EMR clusters
- Dynamic Google Dataproc clusters
- Metastore catalog
- Operating DSS
- dsscli tool
- The data directory
- Backing up
- Audit trail
- The runtime databases
- Logging in DSS
- DSS Macros
- Managing DSS disk usage
- Understanding and tracking DSS processes
- Tuning and controlling memory usage
- Using cgroups for resource control
- Monitoring DSS
- HTTP proxies
- DSS license
- Compute resource usage reporting
- Security
- Project Access
- Main project permissions
- Connections security
- User profiles
- Shared objects
- Workspaces & dashboards authorizations
- User secrets
- Audit Trail
- Govern Security: Roles and Permissions
- Configuring LDAP authentication
- Single Sign-On
- Multi-Factor Authentication
- Passwords security
- Advanced security options
- User Isolation
- Plugins
- Streaming data
- Formula language
- Basic usage
- Reading column values
- Variables typing and autotyping
- Boolean values
- Operators
- Array and object operations
- Object notations
- DSS variables
- Array functions
- Boolean functions
- Date functions
- Math functions
- Object functions
- String functions
- Geometry functions
- Value access functions
- Control structures
- Tests
- Custom variables expansion
- Sampling methods
- Accessibility
- Troubleshooting
- Release notes
- DSS 11 Release notes
- DSS 10.0 Release notes
- DSS 9.0 Release notes
- DSS 8.0 Release notes
- DSS 7.0 Release notes
- DSS 6.0 Release notes
- DSS 5.1 Release notes
- DSS 5.0 Release notes
- DSS 4.3 Release notes
- DSS 4.2 Release notes
- DSS 4.1 Release notes
- DSS 4.0 Release notes
- DSS 3.1 Release notes
- DSS 3.0 Release notes
- DSS 2.3 Release notes
- DSS 2.2 Release notes
- DSS 2.1 Release notes
- DSS 2.0 Release notes
- DSS 1.4 Release notes
- DSS 1.3 Release notes
- DSS 1.2 Release notes
- DSS 1.1 Release notes
- DSS 1.0 Release notes
- Pre versions
- Other Documentation
- Third-party acknowledgements