DSS 9.0 Release notes

Migration notes

Migration paths to DSS 9.0

How to upgrade

It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

Automatic migration from previous versions (see above) is supported. Please pay attention to the following removal and deprecation notices.

Support removal

Some features that were previously announced as deprecated are now removed or unsupported.

  • Support for RedHat 6, CentOS 6 and Oracle Linux 6 is removed

  • Support for Amazon Linux 2017.XX is removed

  • Support for Spark 1 (1.6) is removed. We strongly advise you to migrate to Spark 2. All supported Hadoop distributions can use Spark 2.

  • Support for Pig is removed

  • Support for Machine Learning through Vertica Advanced Analytics is removed. We recommend that you switch to in-memory machine learning models. In-database scoring of in-memory-trained machine learning models will remain available.

  • Support for Hive SequenceFile and RCFile formats is removed

Deprecation notice

DSS 9.0 deprecates support for some features and versions. Support for these will be removed in a later release.

  • Support for Ubuntu 16.04 LTS is deprecated and will be removed in a future release

  • Support for Debian 9 is deprecated and will be removed in a future release

  • Support for SuSE 12 SP2, SP3 and SP4 is deprecated and will be removed in a future release. SuSE 12 SP5 remains supported

  • Support for Amazon Linux 1 is deprecated and will be removed in a future release.

  • Support for Hortonworks HDP 2.5 and 2.6 is deprecated and will be removed in a future release. These platforms are not supported anymore by Cloudera.

  • Support for Cloudera CDH 5 is deprecated and will be removed in a future release. These platforms are not supported anymore by Cloudera.

  • Support for EMR below 5.30 is deprecated and will be removed in a future release.

  • As a reminder from DSS 7.0, support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.

  • As a reminder from DSS 7.0, support for Microsoft HDInsight is deprecated and will be removed in a future release. We recommend that users plan a migration toward a Kubernetes-based infrastructure.

Version 9.0.1 - April 6th, 2021

Datasets and connections

  • Azure Synapse: Fixed “contains” formula function and visual operator

  • Snowflake: Added support for explain plans

  • Azure Blob: Fixed issue with restrictive ACLs on parent folders of datasets

  • Delta Lake: Fixed preview of large Delta datasets

Deployer

  • Fixed failure when re-deploying API services from pre-existing infrastructures and deployments from DSS 7.0

  • Fixed project list search in Project Deployer

  • When bundle preload fails, keep the failure logs visible

  • Improved error message readability in the health status of a deployment

  • Improved Deployer integration in the Global Finder

  • Fixed inability to import projects in case of failure during code env remapping

Machine learning

  • Regression: Fixed scatter plot when model is trained in Python 3

Performance and scalability

  • Improved performance when a very large number of scenarios start at the same time

  • Improved performance of automation home page with high number of projects and scenario runs

  • Improved performance of project home page with high number of scenario runs

  • Improved performance of scenario page with high number of runs

  • Improved performance of automation monitoring pages with high number of runs

  • Reduced resource consumption of backend with very large number of triggers

  • Reduced resource consumption of backend with very large number of “Build” scenario steps

  • Reduced resource consumption of backend with very large number of connected users

Prepare recipe

  • UI and UX improvements on the smart pattern generator

  • UI and UX improvements on the smart date modal

Coding

  • New feature: Added an API for listing and managing Jupyter notebooks

Cloud stacks

  • Made public IP optional on Fleet Manager CloudFormation template

  • Added EBS encryption for the Fleet Manager EBS

Notebooks

  • SparkSQL notebook: fixed issues with very large Parquet datasets

Misc

  • Added support for Microsoft Edge browser

  • Fixed possible failures of the “clear scenario logs” macro

  • Fixed possible upgrade failure when time-based triggers contained invalid settings

Version 9.0.0 - March 1st, 2021

DSS 9.0.0 is a major upgrade to DSS with major new features.

New features

Unified Deployer

The DSS Deployer provides a unified environment for fully-managed production deployments of both projects and API services. It allows you to have a central view of all of your production assets, to manage CI/CD pipelines with testing/preproduction/production stages, and is fully API-drivable.

For more details, please see Production deployments and bundles.

Interactive scoring and What-if

Interactive scoring is a simulator that enables any AI builder or consumer to run “what-if” analyses (i.e., qualitative sensitivity analyses). It displays in real time the resulting prediction and the individual prediction explanations, giving a better understanding of the impact that changing a given feature value has on the prediction.

For more details, please see Interactive scoring.

Dash Webapps

Dash by Plotly is a framework for easily building rich web applications. DSS now includes the ability to write, deploy and manage Dash webapps. Dash joins Flask, Bokeh and Shiny as webapps building frameworks to help data scientists go much further than simple dashboards and provide full interactivity to users.

For more details, please see Dash web apps.

Fuzzy join recipe

A very frequent data wrangling use case is to join datasets with “almost equal” data. The new “fuzzy join” recipe is dedicated to joins between two datasets when join keys don’t match exactly. It handles inner, left, right and outer fuzzy joins, and handles text, numerical and geographic fuzziness.

For more details, please see Fuzzy join: joining two datasets
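
The idea of a fuzzy left join on text keys can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the algorithm used by the DSS recipe: the similarity measure (difflib's ratio), the 0.8 threshold, the function names and the sample rows are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_left_join(left, right, key, threshold=0.8):
    """Pair each left row with its most similar right row, or None.

    Rows are plain dicts; `key` is the join column present on both sides.
    The threshold is an arbitrary value chosen for this illustration.
    """
    joined = []
    for l_row in left:
        best, best_score = None, 0.0
        for r_row in right:
            score = similarity(l_row[key], r_row[key])
            if score > best_score:
                best, best_score = r_row, score
        # Keep the match only if it is close enough; a left join keeps
        # the left row even when nothing matches.
        match = best if best_score >= threshold else None
        joined.append((l_row, match))
    return joined


# Hypothetical sample data: the keys differ by case and punctuation.
customers = [{"name": "Dataiku Inc."}, {"name": "Acme Corp"}]
accounts = [{"name": "dataiku inc", "plan": "enterprise"}]
result = fuzzy_left_join(customers, accounts, "name")

for left_row, right_row in result:
    print(left_row["name"], "->", right_row)
```

Here the first customer is matched despite the differences in case and punctuation, while the second has no sufficiently close key and is kept with no match, as in a regular left join.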

Smart Pattern Builder

In Data Preparation, you can now highlight a part of a cell in order to automatically generate suggestions to extract information “similar” to the one you highlighted. You can then add other examples to guide the automated pattern builder of DSS, and choose the pattern that provides you with the best results.

Visual ML Diagnostics

ML Diagnostics help you detect common pitfalls while training models, such as overfitting, leakage and insufficient learning. They can also suggest possible improvements.

For more details, please see ML Diagnostics

Model assertions

Model assertions streamline and accelerate the model evaluation process, by automatically checking that predictions for specified subpopulations meet certain conditions. You can automatically compare “expected predictions” on segments of your test data with the model’s output. DSS will check that the model’s predictions are aligned with your business judgment.

For more details, please see ML Assertions
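
As a rough illustration of the principle, an assertion can be thought of as a condition selecting a subpopulation, an expected prediction, and a minimum share of rows that must match it. The sketch below is plain Python and does not use any DSS API; the function, its parameters and the sample data are hypothetical.

```python
def check_assertion(rows, predictions, condition, expected_class, min_valid_ratio):
    """Check that enough predictions on a subpopulation match the expected class.

    All parameter names are hypothetical and chosen for this example only.
    Returns None when the subpopulation is empty.
    """
    # Select the predictions of the rows belonging to the subpopulation.
    selected = [p for row, p in zip(rows, predictions) if condition(row)]
    if not selected:
        return None  # the assertion cannot be evaluated on an empty segment
    valid_ratio = sum(p == expected_class for p in selected) / len(selected)
    return valid_ratio >= min_valid_ratio


# Hypothetical test rows and model output.
rows = [{"income": 120000}, {"income": 95000}, {"income": 20000}, {"income": 150000}]
predictions = ["approved", "approved", "rejected", "rejected"]


def high_income(row):
    return row["income"] >= 90000


# Two of the three high-income rows are predicted "approved".
lenient = check_assertion(rows, predictions, high_income, "approved", 0.6)
strict = check_assertion(rows, predictions, high_income, "approved", 0.9)
print(lenient, strict)
```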

Git push and pull for notebooks

It is now possible to fetch Jupyter notebooks from existing Git repositories, and to push them back to their origin. Pulls and pushes can be made notebook-per-notebook or for a group of notebooks.

For more details, please see Importing Jupyter Notebooks from Git

Wiki Export

Wikis can now be exported to PDF, either on a per-article basis or globally.

For more details, please see Wikis

Model Fairness report

Evaluating the fairness of machine learning models has been a topic of both academic and business interest in recent years. Before prescribing any resolution to the problem of model bias, it is crucial to learn more about how biased a model is, by measuring some fairness metrics. The model fairness report provides you with assistance in this measurement task.

For more details, please see Model fairness report
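
To make "measuring a fairness metric" concrete, the sketch below computes a demographic parity difference, i.e. the gap in positive prediction rates between two groups. The metrics computed by the report are not listed in these notes, so this metric is used only as a well-known example, and the function names and data are assumptions.

```python
def positive_rate(predictions, groups, group):
    """Share of positive predictions (1) among the members of `group`."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)


def demographic_parity_difference(predictions, groups, group_a, group_b):
    """Difference in positive prediction rates between two groups.

    A value of 0 means both groups receive positive predictions at the same rate.
    """
    return positive_rate(predictions, groups, group_a) - positive_rate(
        predictions, groups, group_b
    )


# Hypothetical binary predictions and the sensitive group of each row.
predictions = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap = demographic_parity_difference(predictions, groups, "a", "b")
print(gap)
```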

Streaming (experimental)

DSS now features an experimental real-time processing framework, notably targeting Kafka and Spark Structured Streaming.

For more details, please see Streaming data

Delta Lake reading (experimental)

DSS now features experimental support for directly reading the latest version of Delta Lake datasets.

For more details, please see Delta Lake

Other notable enhancements

Azure Synapse support

DSS now officially supports Azure Synapse (dedicated SQL pools)

For more details, please see Azure Synapse

Date Preparation

DSS brings a lot of new capabilities for date preparation:

  • New visual prepare processors for incrementing or truncating dates, and for finding differences between dates

  • New ability to delete, keep or flag rows based on various time intervals

  • Better date filtering capabilities for Explore view

For more details, please see Managing dates

Formula editor

The formula editor has been strongly enhanced with better code completion, inline help for all functions and features, and better examples.

For more details, please see Formula language

Spark 3

DSS now supports Spark 3.

If using Dataiku Cloud Stacks for AWS or Elastic AI for Spark, Spark 3 is builtin.

It is also now possible to use SparkSession in PySpark code.

Python 3.7

DSS now supports Python 3.7

You can now create Python 3.7 code envs. In addition, on Linux distributions where Python 3.7 is the default, DSS will automatically use it.

In addition, new DSS setups will now use Python 3.6 or Python 3.7 as the default builtin environment.

In Python 3.7, async is promoted to a reserved keyword and thus can no longer be used as a keyword argument in a method or a function. As a consequence, the DSS Scenario API replaces the async keyword argument, formerly used in some methods, with the asynchronous keyword argument. Please make sure to update uses of the Scenario class accordingly if running Python scenarios or Python scenario steps with Python 3.7.

Impacted methods are: run_scenario, run_step, build_dataset, build_folder, train_model, invalidate_dataset_cache, clear_dataset, clear_folder, run_dataset_checks, compute_dataset_metrics, synchronize_hive_metastore, update_from_hive_metastore, execute_sql, set_project_variables, set_global_variables, run_global_variables_update, create_jupyter_export, package_api_service.
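
The snippet below illustrates why the rename is needed: on Python 3.7 and later, a call that passes async as a keyword argument is rejected at parse time, before any code runs. It only compiles two call strings and does not import or call the DSS API; the `scenario` variable and the "my_dataset" name are placeholders.

```python
def compiles(source):
    """Return True if `source` is syntactically valid for the running interpreter."""
    try:
        compile(source, "<scenario-step>", "exec")
        return True
    except SyntaxError:
        return False


# `scenario` and "my_dataset" are placeholders: the strings are only compiled,
# never executed, so no DSS instance is needed.
old_call = 'scenario.build_dataset("my_dataset", async=True)'
new_call = 'scenario.build_dataset("my_dataset", asynchronous=True)'

print("old call is valid syntax:", compiles(old_call))
print("new call is valid syntax:", compiles(new_call))
```

Because the failure is a syntax error, a single leftover async=... argument prevents the whole script or step from loading on Python 3.7, not only the line that contains it.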

Builtin Snowflake driver

DSS now comes with the Snowflake JDBC driver and native Spark connector builtin. You do not need to install JDBC drivers for Snowflake anymore.

Enhanced “time-based” trigger

The time-based trigger in scenarios has been strongly enhanced with the following capabilities:

  • Ability to show and handle triggering times in all timezones, not only server timezone

  • Ability to run every X hours instead of only every hour

  • Ability to run every X days instead of only every day

  • Ability to run every X weeks instead of only every week

  • Ability to run every X months instead of only every month

  • For triggers that run once every X months, ability to run on “last Monday” or “third Tuesday”

  • Ability to set a starting date for a trigger
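
To show what a rule such as “last Monday” or “third Tuesday” resolves to, the sketch below computes the matching date for a given month with the Python standard library. It is not how DSS evaluates triggers; the function name and its convention (n = -1 for "last") are assumptions made for this example.

```python
import calendar
from datetime import date


def nth_weekday(year, month, weekday, n):
    """Return the date of the n-th given weekday in a month.

    `weekday` uses the calendar constants (calendar.MONDAY == 0).
    n counts from 1; n == -1 means "last". Hypothetical helper, not a DSS API.
    """
    # monthcalendar() returns weeks starting on Monday by default and
    # uses 0 for the days that fall outside the month.
    days = [
        week[weekday]
        for week in calendar.monthcalendar(year, month)
        if week[weekday] != 0
    ]
    day = days[-1] if n == -1 else days[n - 1]
    return date(year, month, day)


last_monday = nth_weekday(2021, 3, calendar.MONDAY, -1)
third_tuesday = nth_weekday(2021, 3, calendar.TUESDAY, 3)
print(last_monday, third_tuesday)
```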

Enhanced cross-connection and no-input SQL recipes

SQL recipes can now work without an input dataset. The recipe will run in the connection of the output dataset.

For SQL recipes with both inputs and outputs, it is now possible to enable “cross-connection” handling while using the connection of the output (previously, only inputs could be selected).

Addition of individual users to projects

You can now grant access to projects to individual users, in addition to groups.

Pan/Zoom control in Flow

You can now zoom and pan on the flow with the keyboard, and zoom and reset the zoom with dedicated buttons.

Variables expansion support in “Build”

The “Build” dialog now supports variables expansion for partitioned datasets

Variables expansion support in “Explicit values”

The “Explicit Values” partition dependency function now supports variables expansion

Schema reload and propagation as scenario steps

In many situations, it is expected that the schema of a Flow input dataset will change frequently, and that these changes should be accepted and their impacts propagated without further manual intervention.

To ease these situations, DSS 9 introduces two new scenario steps:

  • “Reload dataset schema” to reload the schema of an input dataset from the underlying data source

  • “Propagate schema” to perform an automated schema propagation across the Flow.

These steps should usually be used before a recursive Build step.

Experimental read support for kdb+

Dataiku now features experimental support for reading from kdb+

Other enhancements and fixes

Datasets

  • Snowflake: the JDBC driver and Spark connector are now preinstalled and do not need manual installation anymore

  • Snowflake: added post-connect statements

  • Snowflake: added support for Snowflake -> S3 fast-path when the target bucket mandates encryption

  • Vertica: fixed partitioning outside of the default schema

  • PostgreSQL: the builtin PostgreSQL has been updated to a more recent version, which notably fixes issues with importing tables on PostgreSQL 12

  • S3: It is now possible to force “path-style” rather than “virtualhost-style” S3 access. This is mainly useful for “S3-compatible” storages.

  • BigQuery: fixed ability to use “high throughput” mode for the JDBC driver
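
For the S3 item above, the difference between the two access styles is where the bucket name appears in the request URL. The sketch below builds both forms; it illustrates the addressing schemes only and is not a DSS setting or API. The function name, bucket, key and the "storage.example.com" endpoint are hypothetical.

```python
def s3_url(bucket, key, endpoint="s3.amazonaws.com", path_style=False):
    """Build the URL of an object using either S3 addressing style.

    path_style=True  -> https://<endpoint>/<bucket>/<key>
    path_style=False -> https://<bucket>.<endpoint>/<key>  (virtual-host style)
    """
    if path_style:
        return "https://{}/{}/{}".format(endpoint, bucket, key)
    return "https://{}.{}/{}".format(bucket, endpoint, key)


# Virtual-host style: the bucket name is part of the host name.
virtual_host_url = s3_url("my-bucket", "data/file.csv")

# Path style against a hypothetical S3-compatible endpoint.
path_style_url = s3_url(
    "my-bucket", "data/file.csv", endpoint="storage.example.com", path_style=True
)

print(virtual_host_url)
print(path_style_url)
```

Path-style access is often needed with S3-compatible storage because virtual-host style requires each bucket name to resolve as a DNS subdomain of the endpoint.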

Flow

  • Added detection of changes in editable datasets, which will now properly trigger rebuilds

  • Fixed missing refresh of “Building” indicator with flow zones

  • Fixed wrong “current” flow zone remembered when browsing

Visual recipes

  • Prepare on Snowflake: fixed handling of accented column names

  • Fixed handling of “contains” formula operator on Impala when the string to match contains _

  • Fixed “Use an existing folder” on download recipe

  • Added variables expansion on “Flag rows where formula matches” processor

Machine Learning

  • The Evaluation recipe can now output the cost matrix gain

  • PMML export now supports dummy-encoded variables

  • Custom models can now access the list of feature names

  • Fixed failure scoring on SQL with numerical features stored as text

  • Text features: fixed stop words when training in containers

  • Fixed warning in Jupyter when exporting a model to a Jupyter notebook

  • Added ability to define a class inline for a custom model

  • Switched XGBoost feature importances to use the “gain” method (library default since version 0.82)

Elastic AI and Kubernetes

  • AKS: fixed node pool creation with a zero minimum number of nodes

  • AKS: added ability to select the system node pool

  • Disabling an already-disabled Kubernetes-based API deployment will not fail anymore

  • Fixed webapps on Kubernetes leaking “Deployment” objects in Kubernetes

  • Fixed possible failures deploying webapps due to invalid Kubernetes labels

  • Fixed possible failures running Spark pipelines due to invalid Kubernetes labels

  • Added support for CUDA 11 when building base images

  • Fixed validation of Hive recipes containing “UNION ALL” on HDP 3 and EMR

Collaboration

  • Fixed “Back” button when going to the catalog

  • Fixed tags filtering with spaces in tag names

  • Fixed links to DSS items when putting a wiki page on the home page

  • Fixed display of Scala notebooks in Catalog

Automation

  • Display in project home page when triggers are disabled

  • Added ability for administrators to force the SMTP sender, preventing users from setting it

  • Performance improvements on “Automation monitoring” pages

Coding

  • Fixed handling of records containing \r in Python when using write_dataframe

  • Fixed code env rebuilding if the code env folder had been removed

  • Fixed “with_default_env” on project settings class

  • Fixed ability to delete a code env if a broken dataset exists

Charts

  • Added a safety against potential memory overruns when requesting too high a number of bins

  • Fixed sort with null values on PostgreSQL

Notebooks

  • SQL notebooks: added explain plans directly in SQL notebooks

  • Jupyter: Fixed “File > New” and “File > Copy” actions

  • Fixed renamed notebooks not appearing in “recent elements”

  • Fixed icon of SQL notebooks in “recent elements”

Misc

  • RMarkdown: fixed support for project libraries

  • Fixed erroneous behavior of the browser’s “Back” button when going to the catalog

  • Small UI improvements in multiple locations