DSS 11 Release notes

Migration notes

Migration paths to DSS 11

How to upgrade

It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

Automatic migration from previous versions (see above) is supported. Please pay attention to the following removal and deprecation notices.

Support removal

Some features that were previously announced as deprecated are now removed or unsupported.

  • Support for MapR

  • Support for ElasticSearch 1.x and 2.x

Deprecation notice

DSS 11 deprecates support for some features and versions. Support for these will be removed in a later release.

  • Support for SuSE 15 and SuSE 15 SP1 is deprecated

  • Support for CentOS 7.3 to 7.8, RedHat 7.3 to 7.8 and Oracle Linux 7.3 to 7.8 is deprecated

  • As a reminder from DSS 10.0, the “Build missing datasets” build mode is deprecated and will be removed in a future release. This mode only worked in very specific cases and was never fully operational.

  • As a reminder from DSS 10.0, support for training Machine Learning models with H2O Sparkling Water is deprecated and will be removed in a future release.

  • As a reminder from DSS 9.0, support for EMR below 5.30 is deprecated and will be removed in a future release.

  • As a reminder from DSS 7.0, support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.

Version 11.1.3 - November 29th, 2022

DSS 11.1.3 is a bugfix release.

Cloud Stacks

  • Added the ability to use cloud-level tags longer than 255 characters

  • Fixed creation of instances for which the label is not set

Datasets

  • S3: Automatically disable “switch to bucket region” when a custom S3 endpoint is specified, since it will not work in that case

Visual recipes

  • Join recipe: Fixed a UI issue with post-join computed columns

  • Prepare recipe: Fixed ‘Remove rows on empty’ processor not filtering out empty strings coming from SQL datasets with the DSS engine

Scenarios

  • Fixed error when running a scenario as a user who has the “Read project content” & “Run scenario” permissions, when there is at least one workspace on the instance

Dashboards

  • Removed unnecessary vertical scrollbar on charts insights

Spark and Kubernetes

  • Fixed Spark-on-Kubernetes for Kubernetes versions >= 1.24 when the target namespace is not the default namespace

API Node

  • Fixed migration of very old API nodes

Version 11.1.2 - November 15th, 2022

DSS 11.1.2 is a bugfix and security release.

Visual recipes

  • Prepare: Fixed various issues in French vacation flagging

Charts

  • Made the chart switcher suggestions more consistent

  • Fixed loading of KPI chart on dashboard

  • Fixed numerical formatting options not being saved

Elastic AI

  • Fixed notebooks on Kubernetes not starting with Elastic AI clusters

Cloud Stacks

  • Fixed reprovisioning of instances on GCP after many previous reprovisionings

Models export

  • Fixed numpy warnings when scoring

  • Removed dependency on old version of numpy

Performance and scalability

  • Fixed missing protection against memory overrun for boxplot charts

  • Fixed possible instance hang related to Hive support

Misc

  • Added support for macOS Ventura in the macOS application

Version 11.1.1 - October 25th, 2022

DSS 11.1.1 is a bugfix release.

Cloud Stacks

  • Fixed instances provisioning failing after upgrade in some circumstances

Version 11.1.0 - October 21st, 2022

DSS 11.1.0 is a very significant new release, with new features, performance enhancements and bugfixes.

Compatibility note

The version of one of the libraries used by Visual Time Series Forecasting, gluonts, has been upgraded. Time Series Forecasting models may need to be retrained.

Major new features and enhancements

New chart types

  • Added a Treemap chart, ideal for representing data where dimensions form a hierarchy

  • Added a KPI chart, to display individual aggregated features as single numbers (such as global sum of sales)

Python export of models

It is now possible to export DSS models directly to Python code, for use in any Python application outside of DSS. This comes in addition to the pre-existing Java export, for use in any Java code outside of DSS, and to the PMML export, for use in any PMML-compatible scoring system.

For more details, please see Exporting models
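
As an illustration, here is a minimal scoring sketch. It assumes the export was unpacked locally and exposes a load_model() helper in the dataikuscoring style; the model path and column names are placeholders.

    # Minimal sketch: score new records with a Python-exported DSS model.
    # Assumptions: the export lives in ./my_exported_model and exposes a
    # dataikuscoring-style load_model() helper; columns are placeholders.
    import pandas as pd
    from dataikuscoring import load_model

    model = load_model("./my_exported_model")

    new_records = pd.DataFrame([
        {"age": 42, "income": 51000, "country": "FR"},
        {"age": 29, "income": 38000, "country": "DE"},
    ])

    print(model.predict(new_records))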

MLflow export of models

It is now possible to export DSS models directly to the MLflow format, for use in any scoring engine that supports the “python_function” flavor of MLflow.

For more details, please see Exporting models
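
For example, an exported model can be reloaded with the standard MLflow “python_function” API. The sketch below is illustrative only; the export path and input columns are placeholders.

    # Minimal sketch: score a DSS model exported in MLflow format using the
    # standard "python_function" flavor. Path and columns are placeholders.
    import mlflow.pyfunc
    import pandas as pd

    model = mlflow.pyfunc.load_model("./dss_model_mlflow")

    df = pd.DataFrame([{"age": 42, "income": 51000, "country": "FR"}])
    print(model.predict(df))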

Enhancement of Excel exports

  • Exporting to Excel now properly respects string fields with leading zeros, and no longer removes them (more generally, Excel exports now properly respect storage types)

  • Exporting to Excel now also renders dates as valid Excel dates

Deployment of clustering models to API node

It is now possible to deploy clustering models to the API node, for direct assignment of clusters to previously-unseen records.
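
As a hedged illustration, the sketch below queries such an endpoint with the dataikuapi API node client. The URL, service id and endpoint id are placeholders, and it assumes a clustering endpoint is queried like other prediction endpoints.

    # Minimal sketch: assign a cluster to a new record through the API node.
    # Assumptions: placeholder URL, service id and endpoint id; the clustering
    # endpoint is assumed to accept predict_record() like prediction endpoints.
    import dataikuapi

    client = dataikuapi.APINodeClient("https://apinode.example.com:12000",
                                      "customer_segments")

    record = {"age": 42, "income": 51000, "country": "FR"}
    print(client.predict_record("cluster_customers", record))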

Model explainability for MLflow models

Imported MLflow models can now benefit from a large panel of model explainability capabilities, just like DSS-trained models.

Support for R 4

DSS can now use R 4. In order to use R 4, you need to run the R integration procedure with “R” in the PATH pointing to R 4. All code environments then need to be rebuilt.

Cloud Stacks setups are still on R 3.6, and will switch to R 4 in DSS 12.

Performance & Scalability

  • Much faster (up to thousands of times faster) computation of dependencies for extremely complex flow graphs (notably flows with multiple successive “branch-out / branch-in” patterns)

  • Global performance enhancement for all visual recipes running on DSS engine (up to 50% faster for sync and prepare recipes)

  • Significantly reduced overall memory consumption of the DSS backend on very large instances (many projects, datasets, …)

Charts

  • New, more efficient and clearer chart type switcher

Datasets

  • New feature: Support for Google AlloyDB

  • New feature: ElasticSearch: Added support for ElasticSearch 8

  • New feature: ElasticSearch: Added ability to list and import ElasticSearch indices from the connection explorer

  • New feature: S3: Added ability to set the bucket owner ACL when uploading to S3

  • ElasticSearch: Added the list of matching indices when importing a dataset with an index pattern

  • ElasticSearch: DSS now relies on ElasticSearch mapping for better schema inference

  • Clearer indication of whether you are viewing a sample or the whole dataset

Machine Learning

  • New feature: Computer vision: Added interactive scoring for Image classification and Object detection

  • New feature: Time series: Added Hyperparameter search for time series models

  • New feature: Time series: Added support for comparing time series models

  • New feature: Stratified sampling for Machine Learning models

Elastic AI

  • New feature: Ability to view internal details of Spark-based recipes execution (through managed Spark History Server)

  • New feature: GKE: added support for regional clusters

  • New feature: Added support for Kubernetes 1.24

  • New feature: Added support for custom image pull secrets (primarily for non-cloud Kubernetes setups)

Scenarios, metrics, checks

  • New feature: Added variable expansion in SQL probes

Code envs

  • New feature: Added ability to use conda for code envs with Python 3.8 and Python 3.9

Fixes

Datasets

ElasticSearch
  • ElasticSearch: Fixed support of non-managed datasets with a non-lowercase mapping type

  • ElasticSearch: Fixed “empty” dataset error when creating a non-managed ElasticSearch dataset without testing the index

  • ElasticSearch: Improved ElasticSearch dataset partitioning UI

  • ElasticSearch: Improved detection of OpenSearch

  • ElasticSearch: Fixed usage of global proxy

  • ElasticSearch: Fixed clearing of datasets on ElasticSearch 6 and above

  • ElasticSearch: Added support for variable expansion for external ElasticSearch datasets

  • ElasticSearch: Fixed schema consistency check when settings contain variables

  • ElasticSearch: Fixed schema consistency on managed datasets when first rows have empty values

  • ElasticSearch: Fixed hourly partition redispatch

  • ElasticSearch: Automatically suggest an appropriate dataset name

Snowflake
  • Snowflake: Added ability to fetch table descriptions in connections explorer

  • Snowflake: Fixed auto-fast-write with append mode

Google Cloud
  • BigQuery: Fixed reading of BigQuery views with DSS built-in driver

  • BigQuery: Fixed hang in case of permission failure on the “Storage API” when using the built-in driver

  • BigQuery: Fixed failure of long jobs (> 1 hour)

  • BigQuery: Added ability to fetch table descriptions in connections explorer

  • Google Cloud Storage: Added ability to use Application Default Credentials (ADC) to access Google Cloud Storage

  • Google Cloud Storage: Fixed display issue in dataset Browse

Azure
  • Synapse and Azure SQLServer: Added per-user OAuth login using Authorization Code flow in addition to the previous Device Code flow

  • Azure Blob: Added ability to use non-standard Azure Blob endpoints for Azure Government compatibility

  • Azure Blob: Fixed issue with creation of managed folders when based on a gen2 storage account with hierarchical namespaces

  • Azure Blob: Fixed magic markers not being properly cleaned up, which could cause Spark jobs to fail

  • SQLServer: Added support for multiple catalogs in the SQLServer connection

Other
  • Teradata: Fixed wrong parsing of type DATE in Teradata if the session time zone is different from GMT

  • Oracle: Fixed listing of partitions on Oracle tables with more than 500 000 rows

  • S3: Fixed display of the bucket name in the settings tab of the dataset

  • SQL: Added support for multiple catalogs for “Other databases (JDBC)” datasets

  • Improved user experience and fixed several issues with moving and renaming files for cloud storages

  • Fixed error when manually overwriting a file in a managed folder by uploading it again

  • Fixed variables[“xxx”] syntax in dataset sampling settings

  • Fixed “Allow managed folder” flag on Filesystem-based connections not being properly enforced

  • Fixed last partition actions not being accessible in dataset metrics screen

  • Fixed UI layout overflow when using nested filters in dataset status tab

  • Added a warning message when trying to delete a dataset that is shared and used in other projects

  • Fixed “Change tracking” file not saved in the UI

  • Added dataset column meanings and descriptions in catalog

  • Added option in Explore’s “Display” menu to increase the range of decimal numbers that get displayed in natural form instead of scientific notation

Machine learning

  • Performance improvement for computation of performance metrics and evaluation recipes on binary classification models

  • Performance improvement for fetching result pages for saved models

  • Fixed issue switching from one sample weight variable to another

  • Fixed rare case of failure computing individual explanations

  • Fixed display issue in the hyperparameter optimization chart

  • Fixed training of Lasso-Lars models with K-Fold cross-test

  • Fixed possible failure computing lift curve with K-fold cross-test

  • Fixed evaluation of models with target encoding & feature selection enabled

  • Fixed cases where a code env that was not suitable for bayesian search could be detected as suitable

  • Fixed an issue where a single broken model could cause inability to compute drift for all related models

  • Don’t suggest the “Explore Neighborhood” or “Optimize outcome” actions when the required train-time computations have been disabled by the user

  • Added display of the Python version used to train a Python-based model

  • Removed the uninformative ‘No hyperparameter search’ message when the search space limit is changed

  • Fixed the threshold bar on confusion matrix and assertions when the optimal threshold is 0

  • Fixed hyperparameter widget for integer field not ignoring wrong values

  • Hyperparameter search on Kubernetes: Improved the heuristic used to determine the number of available CPUs

  • Prevented exporting a model to a Snowflake function when it is not supported

  • Fixed a frontend error on the partial dependence plot when selecting a variable with special characters

  • Dropped infinite values in the target for regression algorithms to prevent training from failing

  • Fixed wrongful ability to enable pairwise feature interactions with rejected features that led to failure

  • Added What-If analysis capability on dashboards

  • Fixed Optimized scoring for multiclass partitioned models when some partitions are missing some classes

  • Fixed display of plugin provided algorithms when duplicating a ML task

  • Fixed training and scoring with python engine when date columns have values beyond year 2200

  • Fixed display of calibration curve tab for non probabilistic models

  • Fixed not-yet-scored item unexpectedly showing up in What-If comparator

  • Fixed confusion matrix for multiclass partitioned models

  • Fixed missing data in model evaluation stores when evaluating models trained with K-Fold cross-test

  • Fixed UI glitch on custom metric in model evaluation store

  • Model comparator: Fixed display of the champion icon when there is no data

  • Model comparator: Fixed display of count and TF/IDF vectorization when comparing feature processing

  • Fixed UI issue with nested filters in ML assertions

  • Fixed renaming of model evaluations

  • Fixed various small UI issues with model evaluation store

  • Fixed evaluation on models with a custom metric when “don’t compute perf” is enabled

Computer vision
  • Computer vision: Added diagnostics on computer vision models when training on multiple GPUs

  • Computer vision: Fixed error handling in computer vision interactive scoring

  • Computer vision: Fixed performance issue with Python 2.7 (deprecated)

  • Computer vision: Fixed clicking on the “Edit” button for hyperparameters

  • Computer vision: Fixed deployment of computer vision models with a managed folder coming from another project

  • Computer vision: Fixed support for Python 3.7 code envs

  • Computer vision: Improved confusion matrix for a low number of classes

Clustering
  • Clustering: Fixed column mismatch in clustering heatmap export

  • Clustering: Fixed changing clusters in interactive clustering

Code-based deep learning
  • Code-based deep learning: Added support for ML diagnostics

  • Code-based deep learning: Removed irrelevant display of hyperparameters edit button

Time series
  • Time series: Fixed evaluation recipe failures mentioning “not enough observations”

  • Time series: Fixed possible error in computation of MASE and MSIS metrics

  • Time series: Improved user experience when changing settings

  • Time series: Added gaps between the folds in the forecast graph

Visual recipes

Prepare
  • New feature: Prepare: Added a “case insensitive contains” operator

  • Prepare: Improved boolean type detection when a column contains only a single value

  • Prepare: Fixed SQL engine when applying 7 or more IF blocks on the same column in an if-then-else processor

  • Prepare: Prevented selection of SQL engine when a formula cannot be translated

  • Prepare: Improved formula validation consistency and enhanced validation performance

  • Prepare: Fixed issue on Spark engine when adding then removing “cast output” option on a formula processor

  • Prepare: Highlight invalid steps in red when they are part of a group

  • Prepare: Fixed issue with the “enrich with context information” processor with Parquet datasets

  • Prepare: Fixed possible issue with “Impute missing values” processor on SQL engine

Other
  • Window & Group: Fixed display of settings of aggregation types near the bottom of the screen

  • Window: Fixed silent switching from the SQL engine to the DSS engine when removing an unused column from the input without forcing a save

  • Join: Fixed broken “outer join” icon

  • Sync: Fixed SQL engine wrongly claiming to be unable to append

  • Stack: Fixed filter containing variables

  • Fuzzy join: Fixed output when joining PostgreSQL datasets

  • Fuzzy Join: Fixed possible failure

  • Push to editable: Fixed layout of nested filters

  • App-as-recipe: Fixed “Add” button of input/output page in app-as-recipe when the recipe has many inputs

  • Fixed link to recipe input when it is a shared managed folder

  • Fixed UI of filter conditions on columns of geopoint type

  • Redispatch partitioning: Fixed some memory errors when redispatching with a very large number of partitions

  • Fixed issue with date types coming from BigQuery

  • Fixed permissions issues when running Merge Folder and List Folder content recipes on foreign folders

  • Fixed support of SQL pipelines on Athena-based SQL recipes targeting an S3 connection with Athena configured

  • Fixed issue when trying to use Snowflake UDFs on JDBC connections using the Snowflake dialect

Flow

  • Fixed copy of managed folders using a custom Filesystem provider

Charts, Dashboards & Workspaces

  • Added various sampling panel UX/UI enhancements in dataset explore and insights

  • Added animation dropdown to charts when viewed from the insight

  • Fixed a non-blocking error when adding a filter tile

  • Fixed display of filter in the insight creation modal

  • Fixed positioning issue with “force axes to use the same scale” on scatter plot

  • Fixed issue with filters refresh

  • Fixed ability to select engine for filter tile in dashboard

  • Fixed AVG aggregation in DSS engine when there are missing values in the column

  • Fixed “Continue without saving” action on chart insight

  • Improved legend display to limit overlapping

  • Fixed issue in workspace dataset viewer when using “highlight whitespaces” option

  • Fixed computation of dataset-level metrics from a workspace

  • Fixed display of foreign datasets in dashboards when used in workspaces

Coding and API

  • Added support for Snowflake connections using OAuth authentication for Snowpark

  • Improved polling in Python client, which will now detect job completion faster

  • SQL notebook: Fixed refresh of SQL notebook cells when modified by another user (in another browser)

  • Fixed error handling when reading datasets, which will now correctly cause the read call to fail in all situations

  • Added support for time series models in ML API

  • Added project libraries management in the Python client

  • Fixed error when calling the DSSUserActivity properties

  • Fixed Python and SQL code recipe editor on a shared dataset if you have no permission on the source project

  • Fixed SQL query recipe when selecting a column name containing a question mark ‘?’

  • Added ability to import indices from ElasticSearch in the dataset import API

  • Fixed various issues with plugin installation API

Code Studios

  • Fixed Code Studios behind an Apache reverse proxy

  • Upgraded node.js in VSCode code studios

  • Added sync of files when publishing a Code Studio as a webapp

  • Added public webapp support for Code-Studio-based webapps

  • Added Code-Studio-based webapps in the “Usage” tab of Code Studio templates

  • Fixed Code Studios in projects with numeric-only project key

Desktop IDE integrations

  • Pycharm: Added support for editing project libraries

  • VS Code: Added support for editing project libraries

Deployer & MLOps

Deployer
  • API Deployer: Display more information about the original project and model in the API Deployer

  • API Deployer: Fixed wrong Python sample code when booleans are used

  • Project Deployer: Added a warning in the deployer if a bundle uses a shared object that does not exist on the target infrastructure

  • Project Deployer: Automatically add permissions to new projects published to the project deployer

  • Project Deployer: Fixed failure with webapps deployed on automation node

MLflow
  • MLflow import: Changed default value for container_exec_config_name parameter of import_mlflow_version_from_path

  • MLflow import: import_mlflow_version_from_path and import_mlflow_version_from_managed_folder methods now activate the imported model by default

  • MLflow import: Fixed failure while importing an MLflow model from a managed folder if the path of the managed folder starts with a ‘/’

  • MLflow import: Fixed import of model versions on automation node

  • MLflow import: Fixed issue with passing a dataiku.Folder object to the setup_mlflow method

  • MLflow import: Fixed failure of evaluation recipe when no model evaluation store was used

Other
  • Drift: Fixed data drift computation not performed by evaluation recipes for MLflow models with containerized execution

  • Automation node: added progress bar for manual bundle import

  • Fixed search for Model Evaluation Store in Flow when a project filter is defined

Interactive statistics

  • Added resampling capability for time series

  • Improved support of “TopN time” with missing timestamps

Labeling

  • Labeling: Used a dedicated set for validation

  • Added an option to auto-validate answers given by reviewers

Experiment tracking

  • Fixed UI display when some metrics had NaN or Infinity values

  • Fixed usage of custom step values in log_batch

  • Added ability to select the threshold when deploying a model from a run

Feature store

  • Fixed case-sensitivity issues in search

  • Added the ability to add a feature group to a project through the “+ DATASET” menu of the flow

  • Added the ability to send sharing requests from the feature store

Govern

  • Added ability to send mails through TLS-enabled SMTP servers

  • Fixed issue with signoff workflows

  • Fixed governance of projects from automation node

  • Fixed various issues with sorting fields

  • When errors happen while syncing from DSS to Govern, report the encountered errors

  • Fixed the logic of custom hooks, so that they can run independently from the user profile of the user performing changes

  • Fixed various UI issues

Formula

  • New feature: Added the geoMakeValid function to formula language

Collaboration

  • Added ability to request sharing on objects that are themselves shared from another project

  • Avoid creating an empty dashboard authorization rule when sharing an object

  • Allowed importing Dataiku applications with a custom UI without requiring development plugin permissions

  • Fixed error when moving a project from the “Home > Projects” screen

  • Allowed users to remove/unshare shared objects from their project

  • Fixed ‘Change image’ on imported projects

  • Fixed global wiki screen search in list mode

  • Fixed possible failure of the “graph” view of projects

Performance & Scalability

  • Fixed a performance problem for the creation of bundles on projects with extremely large Git histories

  • Fixed a memory leak when reading a vast number of Parquet files from notebooks or webapps

  • Fixed a memory leak with a large number of Kubernetes-hosted webapps that could ultimately lead to a crash

  • Fixed a possible failure causing jobs to hang and datasets to become unbuildable until a restart

  • Load-time performance enhancements for charts

  • Various UI-side performance enhancements

Cloud Stacks

  • New feature: Python 3.7, 3.8, 3.9, 3.10 are now fully usable out of the box

  • New feature: Added a setup action for setting environment variables

  • New feature: AWS: Added m6i, m6a, c6i, c6a, r6i, r6a instance types

  • New feature: GCP: Allowed configuration of static private IP for FM and DSS instances

  • Highlight in DSS the settings which are automatically managed through Fleet Manager

  • Added a warning in Fleet Manager to prevent downgrading DSS versions

  • Provided an external URL option for Govern node and remote Deployer node

  • All links to various nodes can now use the external URL

  • Prevented duplicated label/node ids for instances

  • Fixed loss of SSO settings on Fleet Manager when rebooting Fleet Manager instance (Major)

  • Fixed error when trying to display agent logs after instance reprovisioning

  • Don’t show disabled users in licensing summary

  • AWS: Ask for SSH key name at fleet creation time

  • Azure: Fixed handling of tags with empty value

  • Don’t incorrectly suggest default password, since passwords are automatically generated in Cloud Stacks

  • Fixed upgrade procedure of Govern nodes

  • Fixed UI issue saving virtual networks with inline SSL certificate

  • Fixed issue resetting user password with special characters

Elastic AI

  • Automatically retry more errors from Kubernetes (notably “tls: internal error”)

  • Fixed pod monitoring misreading certain cpuRequest/cpuLimit values

  • Fixed environment variables set in code environments not exposed correctly in notebooks executed in Kubernetes

  • Fixed occasional Spark on Kubernetes failure when clusters are under heavy load

  • GKE: Fixed error on “Add node pool” action

  • GKE: Fixed the default value for “inherit from DSS host” setting

  • EKS: Fixed bad error reporting under some eksctl failure conditions

  • Fixed some failures with special characters in custom labels and annotations

  • Fixed potential failure of SparkSQL recipes validation system

  • Fixed non-fast-path read/write when using Spark in notebooks

  • Fixed cases where configuration error in a single S3 connection could cause all Spark jobs to fail

  • Added ability to use multiple S3 credentials (for multiple buckets) in a single Spark job

  • Fixed possible failure of webapps on Kubernetes due to Python dependencies

  • Fixed possible failure of Kubernetes workloads when the node id contains spaces

Hadoop & Spark

  • Added support for CDP 7.1.7.p1XXX above p1000 (tested specifically on p1029 and p1035)

  • Fixed Spark recipes with Java 11 when the metastore is managed by DSS

  • Fixed Hive validation on CDH 6.3 and 7 when “hive.aux.jars.path” is not empty

  • Avoided failure if fallback db is unset and synchronization is disabled

  • Fixed ACLs not being set for impersonated notebooks if the “Configuration for PySpark/SparkR/Scala notebook” is missing in spark settings

Setup and administration

  • Prevented failure of monitoring summary in cases of broken recipes

  • Fixed SPNEGO authentication

  • Disabled license expiration warnings for non-admin users

  • Added a filter by type of connection in the connection list screen

  • Added a setting to globally disable the code env resources feature

  • Fixed ability to use project-level presets in plugin recipes

  • More clearly marked Python 2.7 as deprecated in the UI

  • (Custom install) Added support for Graphics exports on most recent supported OSes (such as Ubuntu 20.04 LTS)

  • (Custom install) Do not accept installing a new DSS with Python 2.7 as the base env anymore

  • (Custom install) Display a warning when upgrading a DSS that still has Python 2.7 as the base env

Plugins

  • Added the ability for custom datasets to use more of the Dataiku API (notably, accessing user secrets)

  • Set Python 3.6 and Pandas 1.0 as default when adding a code env to a plugin

  • Fixed bug when there are multiple scenario step plugins using a multiselect field

  • Added an error message if a plugin recipe cannot be retrieved anymore

  • Prevented uploading/updating development-mode plugins

  • Convert to plugin recipe modal: displayed clear indications when the submit button is disabled

  • Custom model views: added a ‘backendTypes’ property in webapp.json to define supported ml backends

  • Custom model views: Fixed custom views for models trained with Python 3.7

  • Fixed History tab in plugins editor not listing all plugins

  • Fixed JSON_OBJECT type for custom macros

Security

Misc

  • Dataiku Apps: Fixed variable display tile not automatically refreshed with the latest value of the variables

Version 11.0.3 - September 9th, 2022

DSS 11.0.3 is a security release. All users are strongly encouraged to update to this release.

Version 11.0.2 - August 25th, 2022

DSS 11.0.2 is a security and bugfix release. All users are strongly encouraged to update to this release.

Snowflake

  • Fixed type mapping for Snowpark Python

Cloud Stacks

  • Fixed upgrade issue for Govern node

Version 11.0.1 - August 3rd, 2022

DSS 11.0.1 is a bugfix release.

Recipes

  • Fixed “IsEmpty” on a geometry column on existing visual filters

  • Fixed invalid selection when opening the “smart pattern extractor” from selected text in explore table

  • Prepare recipe: fixed the position of the column generated by the visual if processor

  • Fixed a concurrency issue with SQL recipes using the Redshift driver

Spark

  • Fixed Avro support with standalone Spark 3.2

  • Upgraded the Snowflake driver and Spark driver for standalone Spark

Machine Learning

  • Fixed display of trained models for partitioned time series models

  • Image labeling: Fixed possible metadata table name collision when using externally hosted runtime databases and long project keys

  • Image labeling: Fixed support of externally hosted runtime databases with a non-default schema or prefix

MLOps

  • Fixed drift computation for MLflow regression models

  • Handled drift computation of categorical features when chi2 test fails

  • Evaluation Recipe: Fixed “Don’t compute perf” option for a MLflow imported model with no ground truth in the evaluation dataset

Dataiku Applications

  • Improved display of scenario with a WARNING/FAILURE outcome in Dataiku application instances

  • Fixed plugin-provided Dataiku Applications

  • Fixed WARNING icon not displayed when scenario finishes with warning status

Code Studios

  • Fixed project libraries not added in PYTHONPATH when code studio is started on a blank project

Administration

  • Govern: Fixed display of LDAP default profile and user group/profile mapping

  • Fixed DSS not starting when using externally hosted runtime databases with non-default schema

  • Fixed DSS not starting if two instances are using the same externally hosted runtime database with different schemas

Misc

  • Feature store: Fixed display of a feature group that has been shared to a now-deleted project

Version 11.0.0 - July 12th, 2022

DSS 11.0.0 is a major upgrade to DSS with major new features.

Major new features

Visual Time Series Forecasting

Time Series Forecasting is now natively available in DSS Visual ML. Visual Time Series Forecasting features many capabilities:

  • Single or multiple series

  • Multiple horizon forecasting

  • Multiple algorithms, including deep learning algorithms

Time Series Forecasting models are fully deployable and governable, like other DSS Visual Models.

For more details, please see Time Series Forecasting

Code Studios, including Visual Studio Code, JupyterLab and RStudio

Code Studios allow DSS users to harness the power and versatility of many Web-based IDEs and web application building frameworks.

Code Studios allow you, for example, to:

  • Edit and debug Python, R, SQL, … recipes and libraries in Visual Studio Code

  • Edit and debug Python or R recipes, notebooks, libraries, … in JupyterLab

  • Edit and debug R recipes and libraries in RStudio Server

For more details, please see Code Studios

Image Labeling

In order to create and fine-tune image models (classification and object detection), you first need labeled images. Labeling is often a tedious task.

DSS now features a native Image Labeling capability, with the following features:

  • Support for image classification and object detection use cases

  • Ability to invite annotators (people who label the images)

  • Efficient interface for annotators with keyboard shortcuts

  • Ability to request annotations from multiple annotators

  • Annotation review process with management of conflicts between annotators

This new capability allows you to perform even more of the entire Machine Learning cycle for computer vision in DSS.

MLOps: Experiment Tracking

DSS now includes an experiment tracker for logging parameters, performance metrics, models, and other metadata when running your machine learning code, and for visualizing results of such experiments.

The DSS Experiment Tracker leverages the well-known MLflow Tracking API, which allows you to seamlessly port existing or 3rd party experiment tracking code and get all DSS benefits.

For more details, please see Experiment Tracking
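
Because the tracker speaks the MLflow Tracking API, existing logging code carries over largely unchanged. The sketch below is a minimal example under stated assumptions: the project key, managed folder id and the exact setup_mlflow() signature are assumptions, while the mlflow.* calls are standard MLflow Tracking API.

    # Minimal sketch: log an experiment run against the DSS Experiment Tracker.
    # Assumptions: placeholder project key and managed folder id; the
    # setup_mlflow() call (pointing MLflow tracking at DSS) is an assumed signature.
    import dataiku
    import mlflow

    client = dataiku.api_client()
    project = client.get_project("MY_PROJECT")
    folder = project.get_managed_folder("experiments_folder_id")

    project.setup_mlflow(managed_folder=folder)  # assumed signature

    with mlflow.start_run(run_name="baseline"):
        mlflow.log_param("n_estimators", 100)
        mlflow.log_metric("auc", 0.87)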

MLOps: Feature Store

A Feature Store helps Data Scientists build, find and use relevant data in order to build efficient models faster.

Most key components of a Feature Store are already native capabilities of DSS.

DSS 11 adds a new Feature Store section, which acts as the central registry of all Feature Groups, a Feature Group being a curated and promoted Dataset containing valuable Features.

For more details, please see Feature Store

Data Visualization: New Pivot Table

The Pivot Table has been significantly overhauled. It now supports:

  • Multiple dimensions on rows and columns, with subtotal support

  • Excel Export of multiple dimensions and multiple measures

For more details, please see Charts

Quick Sharing

Project administrators can now enable “Quick Sharing”, which allows any user who has read access to the project to share a dataset to their own project, without having to ask the project administrator first.

Quick Sharing can be globally disabled by instance administrators.

For more details, please see Shared objects

Access & Sharing requests

Project administrators can now choose to make their project “discoverable”, which allows users who don’t have access to the project to still discover its existence and basic information about it (name, description, …), and then to request access to it.

Project administrators receive notifications about access requests, and can manage them, grant them or reject them.

Similarly, users who have access to a project can now request that datasets be shared with their own projects, and project administrators can manage these sharing requests (if they don’t have Quick Sharing enabled).

These mechanisms can be globally disabled by instance administrators.

For more details, please see Requests

Create if, then, else processor

This new visual data preparation processor performs actions or calculations based on conditional statements defined using an “if, then, else” syntax.

It can be used notably to create new columns based on conditions on the values of other columns. While this was previously feasible using formulas or the Switch case processor, the new Create if, then, else statements processor can provide much more flexibility, without having to write complex formulas.

For more details, please see Create if, then, else statements

Flow Document Generator

In regulated industries, it is often required to document flows at creation and after every change, for traceability. This is often tedious. DSS now features the ability to automatically generate a DOCX document from a Flow, which documents the whole flow, including dataset and recipe details.

For more details, please see Flow Document Generator.

Govern: Projects and bundles governance

The Govern Node now supports managing, governing, and controlling deployment of Project Bundles in the Deployer

Dataiku Cloud Stacks on GCP

Dataiku Cloud Stacks is now available on GCP.

For more details, please see Dataiku Cloud Stacks for GCP

Other notable enhancements and features

Outcome Optimization for regression

The “What-If” feature now supports Outcome Optimization for regression problems. Outcome Optimization allows you to start from a given record, and to explore the neighborhood of this record to find the changes to input features that would lead to changes in the predicted value, towards either the largest, smallest, or a specific value. You can select which features can be modified and which can’t.

Nested filters

In locations where visual filters can be used, it is now possible to nest complex boolean conditions, such as:

  • If col1 is 2

  • AND
    • col2 is 3

    • OR col3 is 4

This applies to:

  • The Filter visual recipe

  • The “Create-if-then-else” prepare processor

  • The “Pre/Post filters” of all visual recipes

  • Filters in Explore and Charts sampling

  • Filters in Visual ML

OIDC authentication

In addition to SAMLv2, OIDC can now be used as the SSO protocol for logging in to DSS

For more details, please see Single Sign-On

SSO support for Fleet Manager

It is now possible to log in through SSO on Fleet Manager

For more details, please see Installing and setting up

“List folder content” recipe

This new visual recipe takes a managed folder as input, a dataset as output, and writes in the dataset the listing of files in the managed folder.

This recipe is especially useful for image labeling and computer vision use cases.

For more details, please see List Folder Contents

Workspace discussions

Discussions are now available on workspaces

Data Visualization: Count Distinct and Count Not Null aggregations

All aggregated charts (columns, bars, pies, lines, areas, pivot table, …) now support the “Count Distinct” and “Count Not Null” aggregation functions for measures.

This also now makes it possible to have non-numerical measures

For more details, please see Charts

Data Visualization: multiple layers on Geo Map

It is now possible to draw multiple layers with different geometries on the Geo Map chart

For more details, please see Geographic data

Data Visualization: additional customization options

The following can now be customized:

  • Ability to change the name of a measure in the legend and tooltip

  • Ability to change the name of a dimension in the legend and tooltip

  • Ability to reformat numbers on axis and in cells of the pivot table

For more details, please see Charts

Georouting and Isochrones

DSS now has capabilities for computing itineraries between geopoints and isochrones around geopoints.

For more details, please see Geographic data

Machine Learning: multiple custom metrics

You can now define multiple custom metrics for a single Visual ML model.

Streamlit webapps through Code Studios

Through the Code Studios mechanism, you can now create and run Streamlit applications in DSS.

For more details, please see Code Studios

Govern: new permissions experience

A new permissions editor for Govern was introduced

Govern: History

You can now view the history and timeline of individual govern objects

Govern: Sign off editor

Sign-off processes for Govern can now be edited for more sign-off flexibility

Other enhancements and fixes

Elastic AI

  • Spark version has been upgraded to 3.2.1

Machine Learning

  • Added Traditional Chinese stop words

  • Code-based Deep Learning: Tensorflow 2 can now be used

  • Fixed display on some screens when sample weights are used

  • Fixed display of the “customize code” box for text features

  • Fixed potential model display failure for models trained with K-fold-cross-test and sample weights

  • Fixed bad behavior when trying to use custom metrics without code writing permissions

  • Fixed display issue for axis legend on the partial dependence distribution chart

  • Fixed training failure with MLLib engine when “cumulative lift” metric is used

  • Properly ask users to rebuild the train/test set if the number of folds changed

  • Various small UI fixes

  • Code-based Deep Learning: made unused columns optional in scoring recipe

  • Fixed display issues with blue information boxes in result screens

  • Removed display of sample weights options when unsupported

  • Fixed “Needs probabilities” checkbox for custom metrics

  • Fixed estimated number of estimators to train when using time ordering

  • Computer Vision: Fixed training failures when number of epochs is 2

  • Fixed evaluation of ensemble models with text features

  • Code-based Deep Learning: added ability to use a custom text preprocessor returning a tensor with more than 3 dimensions

MLOps

  • Added support for partitioning in model evaluations

  • Prevented non-functional usage of a foreign model evaluation store in evaluation recipe

  • Added ability to use a foreign model for an evaluation recipe

  • Small UI fixes

Govern

  • Fixed various issues in DSS/Govern sync

  • Fixed redirect to URL after login

  • Fixed various UI issues

  • Fixed filtering by project on model registry

  • Fixed display of archived artifacts

Visual Statistics

  • Fixed display issue for dataset selector in “duplicate worksheet” modal

  • Univariate card: Added placeholder instead of empty chart when the histogram is empty

  • Small UI fixes

Explore & Datasets

  • Fixed flickering error that could appear on Explore screen

  • Fixed inability to explore when a bad regular expression was entered in a filter

  • Fixed multiple issues in listing of buckets and containers for S3, Azure Blob and Google Blob datasets

  • BigQuery: Added ability to read external tables and materialized views with the native driver

  • BigQuery: Enabled fast read of tables by default with the native driver

  • BigQuery: Fixed flooding of logs with Simba driver 1.2.22.1026 and above

  • Snowflake to cloud: disabled broken ability to use fast path when input is a SQL query dataset

  • Fixed ability to resize columns in foreign dataset explore

Dataiku Applications

  • New user experience for the “Edit SQL datasets” action, with ability to browse very large databases

  • Added ability to restrict connection type in the CONNECTION parameter type

Flow & Jobs

  • Improved wrapping of long dataset names

  • Fixed display of “Python only” logs for containerized recipes

  • The “Tags” flow view now shows tags from foreign datasets

  • Added link to parent recipes on managed folders

Visual recipes

  • Fixed autocompletion of formula with non-ASCII column names

  • Fixed storage of date filters when day is the 31st

  • Fixed “Increment date” processor in SQL mode when using the “Increment by: value in column” mode

  • Added automatic regrouping of multiple “clear cells with this value” steps from the Analyze box

  • Fixed handling of variables in formula editor

  • Prepare recipe: Improved searching for processors

  • Fixed ability to use variables in computed columns with DSS engine

  • Prepare recipe: fixed “filter rows on date” processor on Oracle

  • Prepare recipe: fixed “concat columns” step failure on Spark 3

Data Visualization

  • Pivot Table: Excel export now exports multiple measures

  • Pivot Table: Excel export now respects coloring

  • Fixed issues when reordering charts via drag & drop

  • Fixed “one tick per bin” wrongfully applying to hexagon charts

  • Fixed log scale on binned scatter plots

  • Fixed UI issue when manually editing the axis range

Dashboards

  • Improved UI for filter tiles with filter summary and ability to reset filters

  • Fixed search for existing insights

  • Added ability to change the dataset of a filters tile

  • Fixed various issues with filter tiles

API

  • Fixed ability to write chunks of more than 2 gigabytes when using ManagedFolderWriter.write()

  • Fixed inability to edit some code env parameters through API

Scenarios

  • Propagate warnings from steps to the outcome of the scenario

  • Added missing timezones in the temporal trigger timezone selector

Collaboration

  • Fixed sending of “you have been granted access to project” notifications when the grant does not actually give access to the project

  • Fixed download of .ipynb attached files in Wiki

Cloud Stacks

  • Upgraded kubectl version in order to deploy the latest Kubernetes versions

  • Fixed renaming of automation node breaking the deployer

  • Added display of DSS URL directly in Fleet Manager

Plugins & Extensibility

  • Allowed custom model views to be restricted to some prediction types

  • Forbidden presets are now hidden

Performance & Scalability

  • Fixed API node memory overconsumption when passing huge payloads as inputs or outputs of API services

  • Made project deletion much faster, especially with a large number of datasets

  • Improved performance of home page with many projects

Security

  • Added encryption for SAML keystore password

Misc

  • Added better categorization for admin settings page

  • Fixed wrong navigation bar when going to the Deployer

  • Direct webapp access will properly redirect back to the webapp after login

  • Fixed Streaming Scala recipes with Avro on Kafka

  • Added API key id in the API node audit log

  • Improved Industry Solutions creation modal

  • Fixed ability to modify or delete an empty todo list

  • Fixed custom requests and limits in containerized execution

  • Fixed “Certification” link on home page with Safari

  • Fixed missing cleanup of Kubernetes objects for containerized continuous Python recipes

Known issues

  • When using Elastic AI / “standalone” mode for Spark, writing Avro files does not work. We advise you to use Parquet or ORC. Please get in touch with Dataiku Support for workarounds.