Dataiku DSS
You are viewing the documentation for version 8.0 of DSS.

Machine learning

For an overview of machine learning with DSS, please see our courses on machine learning.

This reference documentation contains additional details on the algorithms and methods used by DSS.

  • Prediction (Supervised ML)
  • Clustering (Unsupervised ML)
  • Automated machine learning
  • Model Settings Reusability
  • Features handling
  • Algorithms reference
  • Advanced models optimization
  • Models ensembling
  • Model Document Generator
  • Deep Learning
  • Models lifecycle
  • Scoring engines
  • Writing custom models
  • Exporting models
  • Partitioned Models

© Copyright 2021, Dataiku.
