Dataiku DSS
Welcome to the Product Documentation for Dataiku Data Science Studio (DSS). This site covers installing and configuring Dataiku DSS in your environment, using the tool through the browser interface, and driving it through the API.
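For example, driving DSS through its Python API typically starts by connecting with the dataikuapi client from outside the DSS instance. A minimal sketch, assuming a locally running instance; the host URL and API key below are placeholders you would replace with your own:

```python
import dataikuapi

# Placeholder connection details: substitute your instance URL and a
# personal API key generated from your DSS user profile.
host = "http://localhost:11200"
api_key = "YOUR_API_KEY"

client = dataikuapi.DSSClient(host, api_key)

# List the keys of the projects this API key can access
print(client.list_project_keys())
```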
Is This the Help You’re Looking For?
You might also find these other resources useful:
The Knowledge Base covers a variety of topics that can help you learn more about Dataiku DSS, or find solutions to problems without having to ask for help.
Dataiku Academy provides guided learning paths you can follow to upskill and gain certification on Dataiku DSS.
Dataiku Community is a place where you can join the discussion, get support, share best practices, and engage with other Dataiku users.
Reference Doc Contents
- DSS concepts
- Connecting to data
- Supported connections
- SQL databases
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- Upload your files
- HDFS
- Cassandra
- MongoDB
- Elasticsearch
- File formats
- Managed folders
- “Files in folder” dataset
- Metrics dataset
- Internal stats dataset
- “Editable” dataset
- kdb+
- FTP
- SCP / SFTP (aka SSH)
- HTTP
- HTTP (with cache)
- Server filesystem
- Dataset plugins
- Making relocatable managed datasets
- Clearing non-managed Datasets
- Data ordering
- PI System / PIWebAPI server
- Data transfer on Dataiku Cloud
- Exploring your data
- Schemas, storage types and meanings
- Data preparation
- Charts
- Interactive statistics
- Machine learning
- Prediction (Supervised ML)
- Clustering (Unsupervised ML)
- Automated machine learning
- Model Settings Reusability
- Features handling
- Algorithms reference
- Advanced models optimization
- Models ensembling
- Model Document Generator
- Time Series Forecasting
- Deep Learning
- Models lifecycle
- Scoring engines
- Writing custom models
- Exporting models
- Partitioned Models
- ML Diagnostics
- ML Assertions
- Computer vision
- Image labeling
- The Flow
- Visual recipes
- Prepare: Cleanse, Normalize, and Enrich
- Sync: copying datasets
- Grouping: aggregating data
- Window: analytics functions
- Distinct: get unique rows
- Join: joining datasets
- Fuzzy join: joining two datasets
- Geo join: joining datasets based on geospatial features
- Splitting datasets
- Top N: retrieve first N rows
- Stacking datasets
- Sampling datasets
- Sort: order values
- Pivot recipe
- Push to editable recipe
- Download recipe
- List Folder Contents
- Recipes based on code
- Code notebooks
- MLOps
- Webapps
- Code Studios
- Code reports
- Dashboards
- Workspaces
- Dataiku Applications
- Working with partitions
- DSS and SQL
- DSS and Python
- DSS and R
- DSS and Spark
- Code environments
- Collaboration
- Time Series
- Geographic data
- Text & Natural Language Processing
- Language Detection
- Named Entities Extraction
- Sentiment Analysis
- Translation
- Text summarization
- Key phrase extraction
- Ontology Tagging
- Spell checking
- OpenAI GPT
- Machine Learning with Text features
- OCR (Optical Character Recognition)
- Speech-to-Text
- Text cleaning
- Text Embedding
- NLP using AWS APIs
- NLP using Azure APIs
- NLP with Crowlingo API
- NLP using Deepl API
- NLP using Google APIs
- NLP with MeaningCloud API
- Images
- Audio
- Video
- Automation scenarios, metrics, and checks
- Production deployments and bundles
- API Node & API Deployer: Real-time APIs
- Introduction
- Concepts
- Installing API nodes
- Setting up the API Deployer and deployment infrastructures
- First API (with API Deployer)
- First API (without API Deployer)
- Types of Endpoints
- Enriching prediction queries
- Security
- Managing versions of your endpoint
- Deploying on Kubernetes
- APINode APIs reference
- Operations reference
- Governance
- Python APIs
- Using the APIs inside of DSS
- Using the APIs outside of DSS
- Datasets (introduction)
- Datasets (reading and writing data)
- Datasets (other operations)
- Datasets (reference)
- Feature Store
- Managed folders
- Streaming Endpoints
- Interaction with Pyspark
- The main DSSClient class
- Projects
- Project folders
- Project libraries
- Recipes
- Interaction with saved models
- Scenarios
- Scenarios (in a scenario)
- Flow creation and management
- Machine learning
- Experiment Tracking
- Statistics worksheets
- Code studios
- API Designer & Deployer
- Project Deployer
- Static insights
- Jobs
- Authentication information and impersonation
- Importing tables as datasets
- Wikis
- Discussions
- Performing SQL, Hive and Impala queries
- SQL Query
- Meanings
- Users and groups
- Connections
- Code envs
- Plugins
- Macros
- Dataiku applications
- Metrics and checks
- Model Evaluation Stores
- Other administration tasks
- Utilities
- Reference API documentation of dataiku
- Reference API documentation of dataikuapi
- API for plugin components
- Clusters
- Code studios
- API for Fleet Manager
- API for Dataiku Govern
- Workspaces
- Webapps
- R API
- Public REST API
- Additional APIs
- Installing and setting up
- Elastic AI computation
- Concepts
- Initial setup
- Managed Kubernetes clusters
- Using Amazon Elastic Kubernetes Service (EKS)
- Using Microsoft Azure Kubernetes Service (AKS)
- Using Google Kubernetes Engine (GKE)
- Using code envs with containerized execution
- Dynamic namespace management
- Customization of base images
- Unmanaged Kubernetes clusters
- Using Openshift
- Troubleshooting
- Using Docker instead of Kubernetes
- DSS in the cloud
- DSS and Hadoop
- Setting up Hadoop integration
- Connecting to secure clusters
- Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS)
- Hive
- Impala
- Spark
- Hive datasets
- Hadoop user isolation
- Distribution-specific notes
- Teradata Connector For Hadoop
- Multiple Hadoop clusters
- Dynamic AWS EMR clusters
- Dynamic Google Dataproc clusters
- Metastore catalog
- Operating DSS
- dsscli tool
- The data directory
- Backing up
- Audit trail
- The runtime databases
- Logging in DSS
- DSS Macros
- Managing DSS disk usage
- Understanding and tracking DSS processes
- Tuning and controlling memory usage
- Using cgroups for resource control
- Monitoring DSS
- HTTP proxies
- DSS license
- Compute resource usage reporting
- Security
- Project Access
- Main project permissions
- Connections security
- User profiles
- Shared objects
- Workspaces & dashboards authorizations
- User secrets
- Audit Trail
- Govern Security: Roles and Permissions
- Configuring LDAP authentication
- Single Sign-On
- Multi-Factor Authentication
- Passwords security
- Advanced security options
- User Isolation
- Plugins
- Streaming data
- Formula language
- Basic usage
- Reading column values
- Variables typing and autotyping
- Boolean values
- Operators
- Array and object operations
- Object notations
- DSS variables
- Array functions
- Boolean functions
- Date functions
- Math functions
- Object functions
- String functions
- Geometry functions
- Value access functions
- Control structures
- Tests
- Custom variables expansion
- Sampling methods
- Accessibility
- Troubleshooting
- Release notes
- DSS 11 Release notes
- DSS 10.0 Release notes
- DSS 9.0 Release notes
- DSS 8.0 Release notes
- DSS 7.0 Release notes
- DSS 6.0 Release notes
- DSS 5.1 Release notes
- DSS 5.0 Release notes
- DSS 4.3 Release notes
- DSS 4.2 Release notes
- DSS 4.1 Release notes
- DSS 4.0 Release notes
- DSS 3.1 Release notes
- DSS 3.0 Release notes
- DSS 2.3 Release notes
- DSS 2.2 Release notes
- DSS 2.1 Release notes
- DSS 2.0 Release notes
- DSS 1.4 Release notes
- DSS 1.3 Release notes
- DSS 1.2 Release notes
- DSS 1.1 Release notes
- DSS 1.0 Release notes
- Pre versions
- Other Documentation
- Third-party acknowledgements