DSS 10.0 Release notes

Migration notes

Migration paths to DSS 10.0

How to upgrade

It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
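As an illustration only, a backup of this kind can be sketched as a timestamped archive of the data directory. The function name and paths below are hypothetical and not part of DSS, and the sketch assumes the instance is stopped while the archive is written so that files are not modified mid-copy.

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir, backup_root):
    """Archive data_dir into a timestamped .tar.gz under backup_root.

    Hypothetical helper: both arguments are paths chosen by the caller.
    Returns the path of the archive that was written.
    """
    data_dir = Path(data_dir)
    backup_root = Path(backup_root)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data directory not found: {data_dir}")

    # Refuse to write the archive inside the directory being archived,
    # otherwise the archive would end up containing itself.
    src = data_dir.resolve()
    dest = backup_root.resolve()
    if dest == src or src in dest.parents:
        raise ValueError("backup_root must be outside data_dir")

    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_root / f"{data_dir.name}-backup-{stamp}.tar.gz"

    # arcname keeps paths in the archive relative to the directory name.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(data_dir, arcname=data_dir.name)
    return archive
```

A call such as `backup_data_dir("/path/to/dss_data", "/path/to/backups")` (both paths are placeholders) would produce one archive per run, which can be kept until the upgrade has been validated.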

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

Automatic migration from previous versions (see above) is supported. Please pay attention to the following removal and deprecation notices.

Support removal

Some features that were previously announced as deprecated are now removed or unsupported.

  • Support for Ubuntu 16.04 LTS is now removed

  • Support for Debian 9 is now removed

  • Support for SuSE 12 SP2, SP3 and SP4 is now removed. SuSE 12 SP5 remains supported

  • Support for AmazonLinux 1 is now removed

  • Support for Hortonworks HDP 2 is now removed

  • Support for Cloudera CDH 5 is now removed

  • Support for HDInsight is now removed

Deprecation notice

DSS 10.0 deprecates support for some features and versions. Support for these will be removed in a later release.

  • The “Build missing datasets” build mode is deprecated and will be removed in a future release. This mode only worked in very specific cases and was never fully operational.

  • Support for MapR is deprecated and will be removed in a future release.

  • Support for training Machine Learning models with H2O Sparkling Water is deprecated and will be removed in a future release.

  • As a reminder from DSS 9.0, support for EMR below 5.30 is deprecated and will be removed in a future release.

  • As a reminder from DSS 9.0, support for Elasticsearch 1.x and 2.x is deprecated and will be removed in a future release.

  • As a reminder from DSS 7.0, support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.

  • As a reminder from DSS 7.0, support for Microsoft HDInsight is now deprecated and will be removed in a future release. We recommend that users plan a migration toward a Kubernetes-based infrastructure.

Version 10.0.2 - December 13th, 2021

DSS 10.0.2 is a significant new release with new features, performance enhancements and bugfixes.

Items marked with (9.0.6) are also present in DSS 9.0.6

Datasets

  • New feature: Added per-user login for Google Cloud Storage (OAuth) (9.0.6)

  • New feature: Added per-user login for BigQuery (OAuth) (9.0.6)

  • When creating a dataset from file names with Unicode characters (including CJK), an equivalent ASCII dataset name is automatically generated (9.0.6)

  • Fixed possible UI overlapping between different custom exporters (9.0.6)

  • Fixed creation of managed SQL datasets from “New Dataset > Internal > Managed”

Machine Learning

  • Fixed creation of cluster recipes on foreign datasets (9.0.6)

  • Fixed creation of scoring recipes from MLFlow models

  • Fixed import of MLFlow models on UIF-enabled DSS

Hadoop, Spark, Elastic AI

  • New feature: Added support for CDP Private Cloud Base 7.1.7 (9.0.6)

  • Added the ability to import EMR-created tables from Glue as S3 datasets when not using EMR with DSS (9.0.6)

  • Fixed failure of Spark recipes when project variables contain Unicode characters (including CJK) (9.0.6)

  • Fixed SparkSQL recipe validation failure when the code contains Unicode characters (9.0.6)

  • Fixed issue with Kubernetes namespace policies (9.0.6)

  • Fixed direct write to Snowflake from Spark with OAuth authentication and variables (9.0.6)

Dashboards

  • Fixed truncation of large dashboard exports (9.0.6)

  • Fixed opening of insights when clicking their title

Cloud Stacks

  • New feature: Azure: Added ability to create a subnet that does not cover the entire vnet (9.0.6)

  • New feature: Azure: Support for static private IP for Fleet Manager (9.0.6)

  • New feature: Azure: Support for static private IP for DSS instances (9.0.6)

  • New feature: Azure: Added ability to create resources in a specific resource group instead of always using the vnet resource group (9.0.6)

  • New feature: Azure: Added ability to fully control the name of created resources (machines, disks, network interface, …) (9.0.6)

  • New feature: AWS: Added support for Hong Kong, Osaka, Milan and Bahrain regions (9.0.6)

Flow

  • Fixed Flow filtering with flow zones and exposed objects (9.0.6)

Recipes

  • Prepare recipe: “Simplify column names” now automatically translates Unicode characters (including CJK) to equivalent ASCII (9.0.6)

  • Prepare recipe: Snowflake: Fixed date parsing with timezone being sensitive to the JDBC session timezone (9.0.6)

  • Code recipes: When creating the recipe with input or output managed folder with Unicode names (including CJK), generate an equivalent ASCII variable name for the starter code (9.0.6)

  • Join recipe: Improved input preview

  • Join recipe: Better warning at recipe validation when there are unusable characters in column names (9.0.6)

  • SQL recipe: Fixed usage of explicit DKU_END_STATEMENT (9.0.6)

  • Fixed possible failure with Snowflake/Synapse/BigQuery auto-fast-paths with date columns (9.0.6)

  • Fixed failure with Snowflake auto-fast-path and incomplete configuration (9.0.6)

API

  • Added ability to modify containerization settings of code envs (9.0.6)

  • Fixed creation of prepare recipe with existing outputs from the Python public API (9.0.6)

  • Fixed the direction argument of the SelectQuery.order_by method (9.0.6)

  • Fixed invalid removal of default Flow zone through the API (9.0.6)

Notebooks and webapps

  • Fixed renaming of SQL notebooks created from the side panel (9.0.6)

  • Fixed possible issue when saving standard webapps (9.0.6)

  • Fixed write to Snowflake/Synapse/BigQuery auto-fast-path from Jupyter notebooks and webapps (9.0.6)

  • Fixed failure of webapps when the project variables contain Unicode characters (including CJK) (9.0.6)

Performance and scalability

  • Improved performance of flow zones listing (9.0.6)

  • Improved performance on home page with large number of project folders (9.0.6)

  • Fixed leak of Python processes from custom filesystem providers such as Sharepoint (9.0.6)

  • Fixed memory leak in Cloud Stacks for Azure (9.0.6)

  • Fixed failure on dashboards for datasets with large number of charts (9.0.6)

  • Added pagination on users list and UIF rules screens (9.0.6)

  • Improved CPU consumption of eventserver reporting (9.0.6)

Misc

  • Dataiku Applications: Added an option to hide the “Switch to project view” button (9.0.6)

  • Added ability for non-admins to create plugin code envs if they have plugin development rights (9.0.6)

  • Fixed bug when duplicating a plugin component

Version 10.0.0 - November 15th, 2021

This release is dedicated to the memory of our dear colleague Mark Treveil.

DSS 10.0.0 is a major upgrade to DSS with major new features.

New features

MLOps: Models Comparison and Drift Analysis

Model evaluations now allow you to capture the performance and behavior of a model after it has been trained, in order to analyze how its behavior evolves over time. This enables Drift analysis.

Visual model comparisons allow you to quickly compare different models or different versions of a model. They can be used both during the Machine Learning design phase and to compare behavior and performance over time.

For more details, please see MLOps
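To make the idea of drift concrete, here is a minimal sketch that compares the distribution of one numerical feature at training time with its distribution at evaluation time. It uses the Population Stability Index (PSI), a common drift measure; this is an assumption chosen for illustration, not necessarily the metric that DSS computes, and the function name is hypothetical.

```python
import math


def population_stability_index(reference, current, n_bins=10):
    """PSI between two samples of one numerical feature.

    Bins are equal-width over the range of the reference sample.
    0.0 means identical binned distributions; larger means more drift.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / n_bins

    def binned_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range fall in the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Additive smoothing avoids log(0) for empty bins.
        total = len(values) + 0.5 * n_bins
        return [(c + 0.5) / total for c in counts]

    ref = binned_shares(reference)
    cur = binned_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Illustrative data: the evaluation sample is shifted upward.
training_values = [i / 100 for i in range(100)]
evaluation_values = [v + 0.5 for v in training_values]
print(population_stability_index(training_values, evaluation_values))
```

A value of 0.0 means the two binned distributions are identical. Any threshold for flagging a feature as drifted is a convention to be chosen per use case.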

MLOps: Centralized Models registry

Part of the new Govern Node, the centralized models registry provides a single place to see all models (whether developed in Dataiku or externally), versioned and with performance metrics and project summaries for leaders and project managers. This includes Drift analysis metrics.

MLOps: Models deployment signoff workflows

Part of the new Govern Node, you can now have mandatory sign-off and approval of models before they can be deployed in production. Models signoff can include multiple and customizable reviewers and approvers.

MLOps: MLFlow Models import

DSS can now import models from the MLFlow Models framework. MLFlow Models imported into DSS benefit from all the capabilities of DSS-trained models, including:

For more details, please see MLFlow Models

Governance: Projects governance, risk & value assessments

Part of the new Govern Node, the centralized projects governance framework allows leaders and project managers to keep an eye on the lifecycle of all AI initiatives, with clear steps and gates, in order to keep proper oversight of your business initiatives.

Risk and value assessment matrices provide a standardized framework to compare initiatives for investment and determine the appropriate oversight level.

For more details, please see Governance

Data consumers: Workspaces, a new home for data consumers

Outputs of complex data projects are often scattered across multiple projects and locations, making it challenging for business stakeholders and data consumers to quickly gain access to the needed data.

Workspaces provide dedicated, secure landing pages where data consumers can easily browse Dataiku dashboards, webapps, datasets, applications, wikis, etc. to get direct access to the most relevant insight or to take direct action using applications and webapps.

For more details, please see Workspaces

Data consumers: cross-chart filters on dashboards

You can now add cross-chart filters on dashboards. A filter can affect all charts on a slide.

For more details, please see Dashboard concepts

Geospatial analytics: Geo-join recipe

The new geo-join recipe allows you to visually match and enrich geospatial datasets.

For more details, please see Geo join: joining datasets based on geospatial features

Geospatial analytics: Density chart

The Geo heatmap chart provides density-based analytics in order to quickly visualize the most important locations on a map.

Geospatial analytics: preparation tools

New tools in the prepare recipe facilitate Geospatial analytics:

  • New processor and formula function: Create an area around a geopoint

  • Formula function: Simplify a geometry (including SQL support for PostGIS and Snowflake)

  • Formula function: Get the bounding box of a geometry

  • Formula function: Compute distance between geometries

  • Formula function: Check for intersection between geometries

  • The Change CRS processor can now run in SQL (with PostGIS)
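As a rough illustration of what the bounding box, distance and intersection functions listed above compute, here is a sketch on plain (longitude, latitude) tuples. The function names are hypothetical and are not the DSS formula functions. Distance is only shown between two points (great-circle distance on a spherical Earth), and the intersection check is done on bounding boxes, which is a necessary but not sufficient condition for the geometries themselves to intersect.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, spherical approximation


def bounding_box(points):
    """Return (min_lon, min_lat, max_lon, max_lat) for (lon, lat) points."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return (min(lons), min(lats), max(lons), max(lats))


def haversine_km(a, b):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def boxes_intersect(box_a, box_b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return (box_a[0] <= box_b[2] and box_b[0] <= box_a[2]
            and box_a[1] <= box_b[3] and box_b[1] <= box_a[3])


# Illustrative coordinates only.
triangle = [(2.29, 48.85), (2.35, 48.86), (2.33, 48.89)]
print(bounding_box(triangle))
print(haversine_km((2.29, 48.85), (2.35, 48.86)))
```

Checking bounding boxes first is a common way to discard pairs of geometries that cannot intersect before running an exact test.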

Machine Learning: Object detection

Object detection is now a top-level task in DSS. You can now easily leverage leading, pre-trained deep learning models for detecting objects, and fine-tune them to your specific labeled datasets.

Like all models trained visually in DSS, object detection models provide detailed results screens, built-in scoring ability, versioning and governance.

For more details, please see Computer vision

Machine Learning: Counterfactuals and Actionable Recourse

Counterfactuals and Actionable Recourse analysis enhance Interactive scoring with insights about the behavior of the model in the vicinity of a reference example.

Counterfactuals generate various records similar to the reference example and that lead to a different predicted class.

Actionable recourse generates the records with the smallest possible perturbations compared to the reference example that lead to a specific predicted class, different from the one of the reference example.

Interactive scoring is a simulator that enables any AI builder or consumer to run “what-if” analyses (i.e., qualitative sensitivity analyses).
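The difference between the two analyses can be sketched on a toy example. Everything below is hypothetical: the `predict` function stands in for a trained model, the feature names are invented, and the exhaustive grid search is a simplification chosen for readability, not the method used by DSS.

```python
from itertools import product


def predict(record):
    """Hypothetical stand-in for a trained classifier."""
    score = record["income"] - 0.5 * record["debt"]
    return "approved" if score >= 50 else "rejected"


def candidates(reference, steps):
    """Yield records obtained by applying every combination of steps."""
    keys = list(steps)
    for deltas in product(*(steps[k] for k in keys)):
        changed = {k: reference[k] + d for k, d in zip(keys, deltas)}
        yield {**reference, **changed}


def distance(a, b, scales):
    """Scaled L1 distance between two records."""
    return sum(abs(a[k] - b[k]) / scales[k] for k in scales)


def counterfactuals(reference, steps):
    """All perturbed records whose predicted class differs."""
    base = predict(reference)
    return [c for c in candidates(reference, steps) if predict(c) != base]


def recourse(reference, steps, scales, target):
    """Closest perturbed record predicted as target, or None."""
    hits = [c for c in candidates(reference, steps) if predict(c) == target]
    if not hits:
        return None
    return min(hits, key=lambda c: distance(reference, c, scales))


reference = {"income": 40, "debt": 10}            # predicted "rejected"
steps = {"income": [0, 5, 10, 15, 20], "debt": [0, -5, -10]}
scales = {"income": 10, "debt": 10}
print(len(counterfactuals(reference, steps)))
print(recourse(reference, steps, scales, "approved"))
```

Counterfactuals return every nearby record that flips the prediction, while recourse keeps only the one closest to the reference example for a chosen target class.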

Machine Learning: LightGBM

The fast and powerful LightGBM algorithm joins the family of algorithms that can be trained by the DSS AutoML component.

Machine Learning: expanded feature encodings

Several new feature encodings are now available in AutoML:

  • Enhanced impact (target) encoding

  • Rank encoding

  • Frequency encoding

  • Cyclical encodings for date/time

For more details, please see Features handling
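To illustrate two of the encodings listed above, here is a minimal sketch of frequency encoding and cyclical encoding. The function names are hypothetical, and the exact variants and options used by DSS are not described in these notes, so this shows only the textbook form of each technique.

```python
import math
from collections import Counter


def frequency_encode(values):
    """Replace each category by its share of rows in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


def cyclical_encode(value, period):
    """Map a periodic value (hour, weekday, month) to (sin, cos).

    Values at both ends of the period, such as hour 23 and hour 0,
    end up close to each other on the unit circle.
    """
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


print(frequency_encode(["a", "b", "a", "c"]))
print(cyclical_encode(23, 24), cyclical_encode(0, 24))
```

The cyclical form needs two output columns per input column, since a single sine or cosine value cannot distinguish all positions in the period.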

Machine Learning: Queues

While training machine learning models, you can now enqueue several trainings that will all execute without further intervention. This allows you to schedule many experiments at the end of the day, and come back the next day with all your models trained and ready to be compared in the new Models Comparison.

Statistics: Augmented Exploratory Data Analysis

When performing exploratory data analysis on wide or complex datasets, it can be challenging and overwhelming for users to understand which columns might be most important to their analysis, how the columns relate to each other, and to identify patterns and insights.

Within Statistics, a new wizard interactively suggests statistical analyses that may be interesting, along with new advanced charting capabilities such as 3-D scatter plots and parallel coordinates plots.

Other notable enhancements

Charts: Customizable axis ranges

Ranges on both the X and Y axes of charts can now be customized.

Charts: Color assignments

It is now easier to manually control color assignments on charts in order to have consistent colors between charts.

Charts: numerical formatting

New numerical formatting options are available for charts (for values displayed in the chart and in the tooltips).

Git push and pull for libraries

In addition to the existing capability to fetch project libraries from existing Git repositories, it is now possible to push them back to their origin.

For more details, please see Importing code from Git in project libraries

Code env resources

When installing some packages in code envs, such as NLTK or Spacy, you frequently need to download additional resources, such as pretrained models. Previously, each user had to download the resource in a specific folder, and sometimes tweak options of the packages in order to point to the downloaded resources.

Code env resources allow you to download resources directly to the code env folder, making them available for all users.

For more details, please see Operations (Python)

Data preparation: Easy extraction with Grok

You can now leverage the “Grok” pattern extraction mechanism that allows you to easily parse logs using predefined patterns. A visual editor makes it easy to view what your expression matches and to troubleshoot it.

For more details, please see Extract with grok
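For readers unfamiliar with Grok, an expression is a regular expression in which `%{PATTERN:field}` tokens stand for predefined patterns. The sketch below shows how such an expression can be expanded into a regular expression with named groups. The pattern definitions are simplified and hypothetical, not the ones shipped with DSS, and the function names are invented for the example.

```python
import re

# Simplified, illustrative pattern definitions (assumptions for this sketch).
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"-?\d+(?:\.\d+)?",
    "URIPATH": r"/\S*",
}

# Matches %{NAME} or %{NAME:field}.
GROK_TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def grok_to_regex(expression):
    """Expand %{NAME:field} tokens into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        body = PATTERNS[name]  # KeyError if the pattern is unknown
        if field:
            return f"(?P<{field}>{body})"
        return f"(?:{body})"

    return re.compile(GROK_TOKEN.sub(expand, expression))


def grok_extract(expression, line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = grok_to_regex(expression).search(line)
    return match.groupdict() if match else None


expression = "%{IP:client} %{WORD:method} %{URIPATH:path} %{NUMBER:status}"
print(grok_extract(expression, "203.0.113.7 GET /index.html 200"))
```

Each named token becomes one extracted field, while tokens without a field name are matched but not captured.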

Wiki: quality-of-life enhancements

It is now possible to attach images in the wiki by directly dragging and dropping them.

Adding attachments no longer requires saving edits first.

Other enhancements and fixes

Visual recipes

  • Prepare: Fixed invalid JSON in “shift+V” on a cell

  • Prepare: Fixed issue with the Nest processor on Spark

  • Grouping: Fixed UI issue with CJK characters in column names

  • Grouping: Improved discoverability of “First/Last”

  • Distinct, Pivot, Grouping: Fixed error on partitioned SQL datasets when the partition column was also used as a key

Machine Learning

  • Fixed possible permissions issues with UIF enabled

  • Variables importance and partial dependencies can now be exported (CSV, Excel, Tableau, dataset, …)

  • Fixed failure when copying feature handling between clustering tasks

  • Fixed score discrepancy with partitioned models in SQL mode with “redispatch”

  • Fixed UI issue with mass actions on features handling

  • Fixed clustering recipe failure when a column is fully empty

  • Fixed faulty ability to remove models while they were training

  • Fixed performance issue with distributed hyperparameters search

  • Updated the computation of individual explanations to improve their correctness

Snowflake

  • Preparation: URL parser can now be pushed down to Snowflake

  • Preparation: Email parser can now be pushed down to Snowflake

Datasets

  • Fixed issues with autodetection of Parquet on S3/Azure/GCS datasets

  • Faster datetime-based partitioning on PostgreSQL

Flow

  • The “Schema changes” modal is no longer displayed when modifying the last dataset in the Flow. Schema changes are auto-accepted.

  • Added ability to select zone when copying a subflow

  • Added connection information on dataset right panel

  • Better error handling when using invalid values in a Time Range partitioning dependency

  • Fixed various issues with managed folders from foreign projects

  • Fixed navigation bar when using the catalog from a project

Charts

  • Fixed color and size on “Binned XY” chart

  • Fixed possible misalignment on date axis for column charts

Dashboards

  • Fullscreen mode is now preserved after a redirection to SSO login

API

  • Added ability to create evaluation recipes in the API

Administration

  • It is now possible to view all usages of a code env

  • Fixed possible hang in airgapped environments

  • Fixed browser window title in administration pages

Security

  • Removed plain-text credentials from the Twitter connector

Misc

  • Fixed wiki search when using “:” in the searched term

  • Performance enhancements for instances with large number of users

  • Fixed issue with “Test” button for containerized execution config with multiple clusters