DSS 11.0 Release notes

Migration notes

Migration paths to DSS 11.0

How to upgrade

It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

Automatic migration from previous versions (see above) is supported. Please pay attention to the following removal and deprecation notices.

Support removal

Some features that were previously announced as deprecated are now removed or unsupported.

  • Support for MapR

  • Support for ElasticSearch 1.x and 2.x

Deprecation notice

DSS 11.0 deprecates support for some features and versions. Support for these will be removed in a later release.

  • Support for SuSE 15 and SuSE 15 SP1 is deprecated

  • Support for CentOS 7.3 to 7.8, RedHat 7.3 to 7.8 and Oracle Linux 7.3 to 7.8 is deprecated

  • As a reminder from DSS 10.0, the “Build missing datasets” build mode is deprecated and will be removed in a future release. This mode only worked in very specific cases and was never fully operational.

  • As a reminder from DSS 10.0, support for training Machine Learning models with H2O Sparkling Water is deprecated and will be removed in a future release.

  • As a reminder from DSS 9.0, support for EMR below 5.30 is deprecated and will be removed in a future release.

  • As a reminder from DSS 7.0, support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.

Version 11.0.0 - July 12th, 2022

DSS 11.0.0 is a major upgrade to DSS with major new features.

Major new features

Visual Time Series Forecasting

Time Series Forecasting is now natively available in DSS Visual ML. Visual Time Series Forecasting features many capabilities:

  • Single or multiple series

  • Multiple horizon forecasting

  • Multiple algorithms, including deep learning algorithms

Time Series Forecasting models are fully deployable and governable like other DSS Visual Models.

For more details, please see Time Series Forecasting

Code Studios, including Visual Studio Code, JupyterLab and RStudio

Code Studios allow DSS users to harness the power and versatility of many Web-based IDEs and web application building frameworks.

Code Studios allow you, for example, to:

  • Edit and debug Python, R, SQL, … recipes and libraries in Visual Studio Code

  • Edit and debug Python or R recipes, notebooks, libraries, … in JupyterLab

  • Edit and debug R recipes and libraries in RStudio Server

For more details, please see Code Studios

Image Labeling

In order to create and fine-tune image models (classification and object detection), you first need labeled images. Labeling is often a tedious task.

DSS now features a native Image Labeling capability, with the following features:

  • Support for image classification and object detection use cases

  • Ability to invite annotators (people who label the images)

  • Efficient interface for annotators with keyboard shortcuts

  • Ability to request annotations from multiple annotators

  • Annotations review process with management of conflicts between annotators

This new capability allows you to perform even more of the entire Machine Learning cycle for computer vision in DSS.

MLOps: Experiment Tracking

DSS now includes an experiment tracker for logging parameters, performance metrics, models, and other metadata when running your machine learning code, and for visualizing results of such experiments.

The DSS Experiment Tracker leverages the well-known MLflow Tracking API, which allows you to seamlessly port existing or 3rd party experiment tracking code and get all DSS benefits.

For more details, please see Experiment Tracking

MLOps: Feature Store

A Feature Store helps Data Scientists build, find and use relevant data for models, in order to build efficient models faster.

Most key components of a Feature Store are native capabilities of DSS.

DSS 11 adds a new Feature Store section, which acts as the central registry of all Feature Groups, a Feature Group being a curated and promoted Dataset containing valuable Features.

For more details, please see Feature Store

Data Visualization: New Pivot Table

The Pivot Table has been strongly overhauled. It now supports:

  • Multiple dimensions on rows and columns, with subtotal support

  • Excel Export of multiple dimensions and multiple measures
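
To make the notion of subtotals over multiple row dimensions concrete, here is a minimal sketch in plain Python. It does not reflect how DSS computes or renders the Pivot Table; the function name, the column names and the sample rows are hypothetical, and only row dimensions with a sum aggregation are covered.

```python
from collections import defaultdict

# Illustrative sketch only: not the DSS implementation.
# Column names and data are hypothetical.

def pivot_with_subtotals(rows, row_dims, measure):
    """Sum `measure` for every prefix of the row dimensions.

    The empty tuple holds the grand total, a 1-tuple holds the subtotal
    for the first dimension, and the full tuple holds the cell value.
    """
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in row_dims)
        for depth in range(len(key) + 1):
            totals[key[:depth]] += row[measure]
    return dict(totals)

rows = [
    {"region": "EU", "product": "books", "sales": 10},
    {"region": "EU", "product": "games", "sales": 20},
    {"region": "US", "product": "books", "sales": 5},
]
table = pivot_with_subtotals(rows, ["region", "product"], "sales")
```

In this sketch, `table[("EU",)]` is the subtotal for the EU region and `table[()]` is the grand total.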

For more details, please see Charts

Quick Sharing

Project administrators can now enable “Quick Sharing”, which allows any user who has read access to the project to share a dataset with their own project, without having to ask the project administrator first.

Quick Sharing can be globally disabled by instance administrators.

For more details, please see Shared objects

Access & Sharing requests

Project administrators can now choose to make their project “discoverable”, which allows users who don’t have access to the project to still discover its existence and basic information about it (name, description, …), and then to request access to it.

Project administrators receive notifications about access requests, and can manage them, grant them or reject them.

Similarly, users who have access to a project can now request that datasets be shared with their own projects, and project administrators can manage these sharing requests (if they don’t have Quick Sharing enabled).

These mechanisms can be globally disabled by instance administrators.

For more details, please see Requests

Create if, then, else processor

This new visual data preparation processor performs actions or calculations based on conditional statements defined using an “if, then, else” syntax.

It can be used notably to create new columns based on conditions on the values of other columns. While this was previously feasible using formulas or the Switch case processor, the new Create if, then, else statements processor can provide much more flexibility, without having to write complex formulas.
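
As a rough illustration of the kind of logic this processor expresses, the sketch below derives a new column from conditions on other columns. It is plain Python, not the processor's configuration format, and the column names, thresholds and segment labels are hypothetical.

```python
# Illustrative sketch only: column names and thresholds are hypothetical,
# and this is not how DSS stores or executes the processor.

def add_segment(row):
    """Return a copy of the row with a derived 'segment' column."""
    out = dict(row)
    if row["total_spend"] >= 1000 and row["orders"] >= 10:
        out["segment"] = "loyal"
    elif row["total_spend"] >= 1000:
        out["segment"] = "big_spender"
    else:
        out["segment"] = "standard"
    return out

rows = [
    {"customer": "a", "total_spend": 1500, "orders": 12},
    {"customer": "b", "total_spend": 1200, "orders": 2},
    {"customer": "c", "total_spend": 80, "orders": 1},
]
segmented = [add_segment(r) for r in rows]
```

Expressing the same rule as a single formula would require nested conditional expressions, which is the complexity the processor is meant to avoid.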

For more details, please see Create if, then, else statements

Flow Document Generator

In regulated industries, flows often must be documented at creation and after every change, for traceability. This is often tedious. DSS can now automatically generate a DOCX document from a Flow, which documents the whole flow, including dataset and recipe details.

For more details, please see Flow Document Generator.

Govern: Projects and bundles governance

The Govern Node now supports managing, governing, and controlling deployment of Project Bundles in the Deployer

Dataiku Cloud Stacks on GCP

Dataiku Cloud Stacks is now available on GCP.

For more details, please see Dataiku Cloud Stacks for GCP

Other notable enhancements and features

Outcome Optimization for regression

The “What-If” feature now supports Outcome Optimization for regression problems. Outcome Optimization allows you to start from a given record, and to explore the neighborhood of this record to find the changes to input features that would lead to changes in the predicted value, towards either the largest, smallest, or a specific value. You can select which features can be modified and which can’t.
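
The idea of exploring the neighborhood of a record can be sketched as follows. This is a toy example under strong assumptions: the `predict` model, the feature names and the greedy coordinate search are all hypothetical, and these notes do not describe the algorithm DSS actually uses.

```python
# Illustrative sketch only: a toy model and a naive coordinate search.
# The algorithm used by DSS is not described in these notes.

def predict(record):
    """Hypothetical regression model with two input features."""
    return 50.0 + 3.0 * record["surface"] - 2.0 * record["age"]

def explore(record, modifiable, step, n_steps, objective):
    """Greedy search over the neighborhood of `record`.

    Only features listed in `modifiable` are changed, each by one
    `step[feature]` at most per round. `objective` maps a prediction
    to a score that the search tries to maximize.
    """
    best = dict(record)
    best_score = objective(predict(best))
    for _ in range(n_steps):
        improved = False
        for feature in modifiable:
            for delta in (-step[feature], step[feature]):
                candidate = dict(best)
                candidate[feature] += delta
                score = objective(predict(candidate))
                if score > best_score:
                    best, best_score, improved = candidate, score, True
        if not improved:
            break
    return best

record = {"surface": 40.0, "age": 10.0}

# Push the prediction as high as possible; "age" is locked.
maximized = explore(record, ["surface"], {"surface": 5.0}, 3, lambda p: p)

# Move the prediction towards a specific value (160); both features are free.
targeted = explore(
    record,
    ["surface", "age"],
    {"surface": 5.0, "age": 1.0},
    5,
    lambda p: -abs(p - 160.0),
)
```

The three optimization directions mentioned above map to three objectives in this sketch: the prediction itself (largest), its negation (smallest), or the negated distance to a target (specific value).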

Nested filters

In locations where visual filters can be used, it is now possible to nest complex boolean conditions, such as:

  • If col1 is 2

  • AND
    • col2 is 3

    • OR col3 is 4

This applies to:

  • The Filter visual recipe

  • The “Create if, then, else statements” prepare processor

  • The “Pre/Post filters” of all visual recipes

  • Filters in Explore and Charts sampling

  • Filters in Visual ML
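
The nested condition shown above (col1 is 2 AND (col2 is 3 OR col3 is 4)) can be sketched as a small condition tree evaluated against each row. The tuple-based representation below is made up for illustration and is not the format DSS uses to store visual filters.

```python
# Illustrative sketch only: the tuple-based condition tree is a made-up
# representation, not the format DSS uses to store visual filters.

def matches(condition, row):
    """Evaluate a nested condition tree against a row (a dict)."""
    kind = condition[0]
    if kind == "eq":
        _, column, value = condition
        return row.get(column) == value
    if kind == "and":
        return all(matches(child, row) for child in condition[1:])
    if kind == "or":
        return any(matches(child, row) for child in condition[1:])
    raise ValueError(f"unknown condition type: {kind!r}")

# col1 is 2 AND (col2 is 3 OR col3 is 4)
nested = (
    "and",
    ("eq", "col1", 2),
    ("or", ("eq", "col2", 3), ("eq", "col3", 4)),
)

rows = [
    {"col1": 2, "col2": 3, "col3": 0},
    {"col1": 2, "col2": 0, "col3": 4},
    {"col1": 2, "col2": 0, "col3": 0},
    {"col1": 1, "col2": 3, "col3": 4},
]
kept = [row for row in rows if matches(nested, row)]
```

Only the first two rows are kept: the third fails the inner OR group, and the fourth fails the condition on col1.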

OIDC authentication

In addition to SAMLv2, OIDC can now be used as an SSO protocol for logging in to DSS.

For more details, please see Single Sign-On

SSO support for Fleet Manager

It is now possible to log in through SSO on Fleet Manager

For more details, please see Installing and setting up

“List folder content” recipe

This new visual recipe takes a managed folder as input and a dataset as output, and writes the listing of the files in the managed folder to the dataset.

This recipe is especially useful for image labeling and computer vision use cases.
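
The sketch below shows the general idea on a local directory, using only the Python standard library. It does not use the DSS API, and the output columns (`path`, `basename`, `extension`, `size`) are assumptions chosen for illustration; the columns actually produced by the recipe may differ.

```python
import os
import tempfile

# Illustrative sketch only: a local directory stands in for a managed
# folder, and the output columns are hypothetical.

def list_folder_content(root):
    """Return one row per file under `root`, sorted by relative path."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel = os.path.relpath(full_path, root).replace(os.sep, "/")
            rows.append({
                "path": rel,
                "basename": name,
                "extension": os.path.splitext(name)[1].lstrip("."),
                "size": os.path.getsize(full_path),
            })
    return sorted(rows, key=lambda r: r["path"])

# Small demo on a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    os.makedirs(os.path.join(folder, "cats"))
    with open(os.path.join(folder, "cats", "001.jpg"), "wb") as f:
        f.write(b"\x00" * 10)
    with open(os.path.join(folder, "readme.txt"), "wb") as f:
        f.write(b"hello")
    listing = list_folder_content(folder)
```

A listing of this shape is a convenient starting point for labeling: each row identifies one image, and further columns (such as a label) can be joined to it later.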

Workspace discussions

Discussions are now available on workspaces

Data Visualization: Count Distinct and Count Not Null aggregations

All aggregated charts (columns, bars, pies, lines, areas, pivot table, …) now support the “Count Distinct” and “Count Not Null” aggregation functions for measures.

This also now makes it possible to have non-numerical measures
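
The following sketch illustrates what these two aggregations compute on a non-numerical measure. It is plain Python with hypothetical column names, and it assumes that null values are excluded from the distinct count (as SQL's COUNT(DISTINCT ...) does); these notes do not state how DSS charts treat nulls in that case.

```python
from collections import defaultdict

# Illustrative sketch only: column names are hypothetical, and ignoring
# nulls in the distinct count is an assumption.

def aggregate(rows, dimension, measure):
    """Per value of `dimension`, count non-null and distinct `measure` values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    result = {}
    for key, values in groups.items():
        non_null = [v for v in values if v is not None]
        result[key] = {
            "count_not_null": len(non_null),
            "count_distinct": len(set(non_null)),
        }
    return result

rows = [
    {"country": "FR", "customer": "alice"},
    {"country": "FR", "customer": "alice"},
    {"country": "FR", "customer": None},
    {"country": "US", "customer": "bob"},
    {"country": "US", "customer": "carol"},
]
stats = aggregate(rows, "country", "customer")
```

Here the measure is a text column, which neither a sum nor an average could aggregate.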

For more details, please see Charts

Data Visualization: multiple layers on Geo Map

It is now possible to draw multiple layers with different geometries on the Geo Map chart

For more details, please see Geographic data

Data Visualization: additional customization options

The following can now be customized:

  • Ability to change the name of a measure in the legend and tooltip

  • Ability to change the name of a dimension in the legend and tooltip

  • Ability to reformat numbers on axis and in cells of the pivot table

For more details, please see Charts

Georouting and Isochrones

DSS now has capabilities for computing itineraries between geopoints and isochrones around geopoints.

For more details, please see Geographic data

Machine Learning: multiple custom metrics

You can now define multiple custom metrics for a single Visual ML model.
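
As a loose illustration, the sketch below evaluates two custom metrics on the same predictions. It assumes that each metric is a function taking the true values and the predictions and returning a float; the function names, the signature and the data are hypothetical and do not reproduce the code template that DSS provides.

```python
# Illustrative sketch only: names, signature and data are hypothetical.

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def within_tolerance_rate(y_true, y_pred, tolerance=5.0):
    """Share of predictions that fall within `tolerance` of the truth."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)

custom_metrics = {
    "mae": mean_absolute_error,
    "within_5": within_tolerance_rate,
}

y_true = [100, 110, 120, 130]
y_pred = [98, 115, 131, 130]
scores = {name: fn(y_true, y_pred) for name, fn in custom_metrics.items()}
```

Keeping several metrics side by side lets a model be judged on more than one business criterion, for example average error and the share of acceptable predictions.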

Streamlit webapps through Code Studios

Through the Code Studios mechanism, you can now create and run Streamlit applications in DSS.

For more details, please see Code Studios

Govern: new permissions experience

A new editor for permissions for Govern was introduced

Govern: History

You can now view the history and timeline of individual Govern objects

Govern: Sign off editor

Sign-off processes for Govern can now be edited for more sign-off flexibility

Other enhancements and fixes

Machine Learning

  • Added Traditional Chinese stop words

  • Code-based Deep Learning: TensorFlow 2 can now be used

  • Fixed display on some screens when sample weights are used

  • Fixed display of the “customize code” box for text features

  • Fixed potential model display failure for models trained with K-fold-cross-test and sample weights

  • Fixed bad behavior when trying to use custom metrics without code writing permissions

  • Fixed display issue for axis legend on the partial dependence distribution chart

  • Fixed training failure with MLLib engine when “cumulative lift” metric is used

  • Properly ask users to rebuild train/test set if number of folds changed

  • Various small UI fixes

  • Code-based Deep Learning: made unused columns optional in scoring recipe

  • Fixed display issues with blue information boxes in result screens

  • Removed display of sample weights options when unsupported

  • Fixed “Needs probabilities” checkbox for custom metrics

  • Fixed estimated number of estimators to train when using time ordering

  • Computer Vision: Fixed training failures when number of epochs is 2

  • Fixed evaluation of ensemble models with text features

  • Code-based Deep Learning: added ability to use a custom text preprocessor returning a tensor with more than 3 dimensions

MLOps

  • Added support for partitioning in model evaluations

  • Prevented non-functional usage of a foreign model evaluation store in evaluation recipe

  • Added ability to use a foreign model for an evaluation recipe

  • Small UI fixes

Govern

  • Fixed various issues in DSS/Govern sync

  • Fixed redirect to URL after login

  • Fixed various UI issues

  • Fixed filtering by project on model registry

  • Fixed display of archived artifacts

Visual Statistics

  • Fixed display issue for dataset selector in “duplicate worksheet” modal

  • Univariate card: Added placeholder instead of empty chart when the histogram is empty

  • Small UI fixes

Explore & Datasets

  • Fixed flickering error that could appear on Explore screen

  • Fixed inability to explore when a bad regular expression was entered in a filter

  • Fixed multiple issues in listing of buckets and containers for S3, Azure Blob and Google Blob datasets

  • BigQuery: Added ability to read external tables and materialized views with the native driver

  • BigQuery: Enabled fast read of tables by default with the native driver

  • BigQuery: Fixed flooding of logs with Simba driver 1.2.22.1026 and above

  • Snowflake to cloud: disabled broken ability to use fast path when input is a SQL query dataset

  • Fixed ability to resize columns in foreign dataset explore

Dataiku Applications

  • New user experience for the “Edit SQL datasets” action, with ability to browse very large databases

  • Added ability to restrict connection type in the CONNECTION parameter type

Flow & Jobs

  • Improved wrapping of long dataset names

  • Fixed display of “Python only” logs for containerized recipes

  • The “Tags” flow view now shows tags from foreign datasets

  • Added link to parent recipes on managed folders

Visual recipes

  • Fixed autocompletion of formula with non-ASCII column names

  • Fixed storage of date filters when day is the 31st

  • Fixed “Increment date” processor in SQL mode when using the “Increment by: value in column” mode

  • Added automatic regrouping of multiple “clear cells with this value” steps from the Analyze box

  • Fixed handling of variables in formula editor

  • Prepare recipe: Improved searching for processors

  • Fixed ability to use variables in computed columns with DSS engine

  • Prepare recipe: fixed “filter rows on date” processor on Oracle

  • Prepare recipe: fixed “concat columns” step failure on Spark 3

Data Visualization

  • Pivot Table: Excel export now exports multiple measures

  • Pivot Table: Excel export now respects coloring

  • Fixed issues when reordering charts via drag & drop

  • Fixed “one tick per bin” wrongfully applying to hexagon charts

  • Fixed log scale on binned scatter plots

  • Fixed UI issue on manual axis range edition

Dashboards

  • Improved UI for filter tiles with filter summary and ability to reset filters

  • Fixed search for existing insights

  • Added ability to change the dataset of a filters tile

  • Fixed various issues with filter tiles

API

  • Fixed ability to write chunks of more than 2 gigabytes when using ManagedFolderWriter.write()

  • Fixed inability to edit some code env parameters through API

Scenarios

  • Propagate warnings from steps to the outcome of the scenario

  • Added missing timezones in the temporal trigger timezone selector

Collaboration

  • Fixed sending of “you have been granted access to project” notifications when the grant does not actually give you access to the project

  • Fixed download of .ipynb attached files in Wiki

Cloud Stacks

  • Upgraded kubectl version in order to deploy the latest Kubernetes versions

  • Fixed renaming of automation node breaking the deployer

  • Added display of DSS URL directly in Fleet Manager

Plugins & Extensibility

  • Allowed custom model views to be restricted to some prediction types

  • Forbidden presets are now hidden

Performance & Scalability

  • Fixed API node memory overconsumption when passing huge payloads as inputs or outputs of API services

  • Made project deletion much faster, especially with a large number of datasets

  • Improved performance of home page with many projects

Security

  • Added encryption for SAML keystore password

Misc

  • Added better categorization for admin settings page

  • Fixed wrong navigation bar when going to the Deployer

  • Direct webapp access will properly redirect back to the webapp after login

  • Fixed Streaming Scala recipes with Avro on Kafka

  • Added API key id in the API node audit log

  • Improved Industry Solutions creation modal

  • Fixed ability to modify or delete empty todo list

  • Fixed custom requests and limits in containerized execution

  • Fixed “Certification” link on home page with Safari

  • Fixed missing cleanup of Kubernetes objects for containerized continuous Python recipes

Known issues

  • When using Elastic AI / “standalone” mode for Spark, writing Avro files does not work. We advise you to use Parquet or ORC. Please get in touch with Dataiku Support for workarounds.