DSS 9.0 Release notes¶
Migration notes¶
Migration paths to DSS 9.0¶
From DSS 8.0: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
From DSS 7.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 7.0 -> 8.0
From DSS 6.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 6.0 -> 7.0, 7.0 -> 8.0
From DSS 5.1: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0
From DSS 5.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0
From DSS 4.3: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0
From DSS 4.2: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0
From DSS 4.1: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0
From DSS 4.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.0 -> 4.1, 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0
Migration from DSS 3.1 and below is not supported. You must first upgrade to 5.0. See DSS 5.0 Release notes
How to upgrade¶
It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
Limitations and warnings¶
Automatic migration from previous versions (see above) is supported. Please pay attention to the following removal and deprecation notices.
API change: dataikuapi.dss.apideployer.DSSAPIDeployerService.import_version() no longer takes version_id as a parameter.
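For scripts that called this method, a minimal hedged sketch of the new call follows (the host, API key, service id and archive path are hypothetical):

    import dataikuapi

    # Hypothetical connection details and service id, for illustration only
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "my-api-key")
    deployer = client.get_apideployer()
    service = deployer.get_service("my-service-id")

    # Previously: service.import_version(fp, version_id)
    # From DSS 9.0, only the version archive is passed:
    with open("/path/to/my-service-version.zip", "rb") as fp:
        service.import_version(fp)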
Support removal¶
Some features that were previously announced as deprecated are now removed or unsupported.
Support for RedHat 6, CentOS 6 and Oracle Linux 6 is removed
Support for Amazon Linux 2017.XX is removed
Support for Spark 1 (1.6) is removed. We strongly advise you to migrate to Spark 2. All supported Hadoop distributions can use Spark 2.
Support for Pig is removed
Support for Machine Learning through Vertica Advanced Analytics is removed. We recommend that you switch to in-memory machine learning models. In-database scoring of in-memory-trained machine learning models will remain available
Support for Hive SequenceFile and RCFile formats is removed
Deprecation notice¶
DSS 9.0 deprecates support for some features and versions. Support for these will be removed in a later release.
Support for Ubuntu 16.04 LTS is deprecated and will be removed in a future release
Support for Debian 9 is deprecated and will be removed in a future release
Support for SuSE 12 SP2, SP3 and SP4 is deprecated and will be removed in a future release. SuSE 12 SP5 remains supported
Support for Amazon Linux 1 is deprecated and will be removed in a future release.
Support for Hortonworks HDP 2.5 and 2.6 is deprecated and will be removed in a future release. These platforms are no longer supported by Cloudera.
Support for Cloudera CDH 5 is deprecated and will be removed in a future release. This platform is no longer supported by Cloudera.
Support for EMR below 5.30 is deprecated and will be removed in a future release.
As a reminder from DSS 7.0, support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with the User Isolation Framework.
As a reminder from DSS 7.0, support for Microsoft HDInsight is deprecated and will be removed in a future release. We recommend that users plan a migration toward a Kubernetes-based infrastructure.
Version 9.0.1 - April 6th, 2021¶
Datasets and connections¶
Azure Synapse: Fixed “contains” formula function and visual operator
Snowflake: Added support for explain plans
Azure Blob: Fixed issue with restrictive ACLs on parent folders of datasets
Delta Lake: Fixed preview of large Delta datasets
Deployer¶
Fixed failure when re-deploying API services from pre-existing infrastructures and deployments from DSS 7.0
Fixed project list search in Project Deployer
When bundle preload fails, keep the failure logs visible
Improved error message readability in the health status of a deployment
Improved Deployer integration in the Global Finder
Fixed inability to import projects in case of failure during code env remapping
Machine learning¶
Regression: Fixed scatter plot when model is trained in Python 3
Performance and scalability¶
Improved performance when a very large number of scenarios start at the same time
Improved performance of automation home page with high number of projects and scenario runs
Improved performance of project home page with high number of scenario runs
Improved performance of scenario page with high number of runs
Improved performance of automation monitoring pages with high number of runs
Reduced resource consumption of backend with very large number of triggers
Reduced resource consumption of backend with very large number of “Build” scenario steps
Reduced resource consumption of backend with very large number of connected users
Prepare recipe¶
UI and UX improvements on the smart pattern generator
UI and UX improvements on the smart date modal
Cloud stacks¶
Made public IP optional on Fleet Manager CloudFormation template
Added EBS encryption for the Fleet Manager EBS
Version 9.0.0 - March 1st, 2021¶
DSS 9.0.0 is a major upgrade to DSS, bringing significant new features.
New features¶
Unified Deployer¶
The DSS Deployer provides a unified environment for fully-managed production deployments of both projects and API services. It gives you a central view of all of your production assets, lets you manage CI/CD pipelines with testing/preproduction/production stages, and is fully API-drivable.
For more details, please see Production deployments and bundles.
Interactive scoring and What-if¶
Interactive scoring is a simulator that enables any AI builder or consumer to run “what-if” analyses (i.e., qualitative sensitivity analyses) to better understand the impact that changing a given feature value has on the prediction: the resulting prediction and the individual prediction explanations are displayed in real time.
For more details, please see Interactive scoring.
Dash Webapps¶
Dash by Plotly is a framework for easily building rich web applications. DSS now includes the ability to write, deploy and manage Dash webapps. Dash joins Flask, Bokeh and Shiny as webapp-building frameworks that help data scientists go much further than simple dashboards and provide full interactivity to users.
For more details, please see Dash web apps.
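As an illustration, here is a minimal sketch of the body of a Dash webapp in DSS. It assumes that DSS provides the app object in Dash webapps, so the code only defines the layout and callbacks; the imports use the Dash 1.x style current at this release, and the ids and greeting logic are hypothetical:

    # Assumption: in a DSS Dash webapp, the `app` object is provided by DSS.
    import dash_html_components as html
    import dash_core_components as dcc
    from dash.dependencies import Input, Output

    app.layout = html.Div([
        dcc.Input(id="name", value="world", type="text"),
        html.H1(id="greeting"),
    ])

    @app.callback(Output("greeting", "children"), [Input("name", "value")])
    def update_greeting(name):
        # Re-renders the heading whenever the text input changes
        return "Hello, {}!".format(name)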
Fuzzy join recipe¶
A very frequent data wrangling use case is to join datasets with “almost equal” data. The new “fuzzy join” recipe is dedicated to joins between two datasets when join keys don’t match exactly. It handles inner, left, right and outer fuzzy joins, with text, numerical and geographic fuzziness.
For more details, please see Fuzzy join: joining two datasets
Smart Pattern Builder¶
In Data Preparation, you can now highlight a part of a cell in order to automatically generate suggestions to extract information “similar” to the one you highlighted. You can then add other examples to guide the automated pattern builder of DSS, and choose the pattern that provides you with the best results.
Visual ML Diagnostics¶
ML Diagnostics help you detect common pitfalls while training models, such as overfitting, data leakage, or insufficient learning, and can suggest possible improvements.
For more details, please see ML Diagnostics
Model assertions¶
Model assertions streamline and accelerate the model evaluation process, by automatically checking that predictions for specified subpopulations meet certain conditions. You can automatically compare “expected predictions” on segments of your test data with the model’s output. DSS will check that the model’s predictions are aligned with your business judgment.
For more details, please see ML Assertions
Distributed Hyperparameters Search¶
It is now possible to distribute the training of a single model over multiple containers. Dataiku will automatically distribute all the points of the hyperparameter search. The distribution happens transparently, leveraging Kubernetes. No additional setup is required.
Distributed hyperparameter search permits vastly increased depth and precision of hyperparameter search while keeping training time acceptable.
Git push and pull for notebooks¶
It is now possible to fetch Jupyter notebooks from existing Git repositories, and to push them back to their origin. Pulls and pushes can be made notebook-per-notebook or for a group of notebooks.
For more details, please see Importing Jupyter Notebooks from Git
Wiki Export¶
Wikis can now be exported to PDF, either on a per-article basis or globally.
For more details, please see Wikis
Model Fairness report¶
Evaluating the fairness of machine learning models has been a topic of both academic and business interest in recent years. Before prescribing any resolution to the problem of model bias, it is crucial to learn more about how biased a model is, by measuring some fairness metrics. The model fairness report provides you with assistance in this measurement task.
For more details, please see Model fairness report
Streaming (experimental)¶
DSS now features an experimental real-time processing framework, notably targeting Kafka and Spark Structured Streaming.
For more details, please see Streaming data
Delta Lake reading (experimental)¶
DSS now features experimental support for directly reading the latest version of Delta Lake datasets.
For more details, please see Delta Lake
Other notable enhancements¶
Azure Synapse support¶
DSS now officially supports Azure Synapse (dedicated SQL pools)
For more details, please see Azure Synapse
Date Preparation¶
DSS brings a lot of new capabilities for date preparation:
New visual prepare processors for incrementing or truncating dates, and for finding differences between dates
New ability to delete, keep or flag rows based on various time intervals
Better date filtering capabilities for Explore view
For more details, please see Managing dates
Formula editor¶
The formula editor has been strongly enhanced with better code completion, inline help for all functions and features, and better examples.
For more details, please see Formula language
Spark 3¶
DSS now supports Spark 3.
If using Dataiku Cloud Stacks for AWS or Elastic AI for Spark, Spark 3 is builtin.
It is also now possible to use SparkSession in PySpark code.
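A minimal sketch of a PySpark recipe built around a SparkSession (the dataset names are hypothetical, and passing the SparkSession directly to get_dataframe is an assumption based on this note):

    import dataiku
    import dataiku.spark as dkuspark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # "my_input" and "my_output" are hypothetical dataset names
    input_ds = dataiku.Dataset("my_input")
    df = dkuspark.get_dataframe(spark, input_ds)  # assumption: a SparkSession is accepted here

    output_ds = dataiku.Dataset("my_output")
    dkuspark.write_with_schema(output_ds, df)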
Python 3.7¶
DSS now supports Python 3.7
You can now create Python 3.7 code envs. In addition, on Linux distributions where Python 3.7 is the default, DSS will automatically use it.
New DSS setups will now use Python 3.6 or Python 3.7 as the default builtin environment.
In Python 3.7, async is promoted to a reserved keyword and can thus no longer be used as a keyword argument in a method or a function. As a consequence, the DSS Scenario API replaces the async keyword argument, formerly used in some methods, with the asynchronous keyword argument. Please make sure to update uses of the Scenario class accordingly if running Python scenarios or Python scenario steps with Python 3.7. Impacted methods are: run_scenario, run_step, build_dataset, build_folder, train_model, invalidate_dataset_cache, clear_dataset, clear_folder, run_dataset_checks, compute_dataset_metrics, synchronize_hive_metastore, update_from_hive_metastore, execute_sql, set_project_variables, set_global_variables, run_global_variables_update, create_jupyter_export, package_api_service.
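For instance, a scenario step that used to pass async=True would now be written as follows (a minimal sketch; "my_dataset" is a hypothetical dataset name):

    from dataiku.scenario import Scenario

    scenario = Scenario()

    # Formerly: scenario.build_dataset("my_dataset", async=True)
    # With Python 3.7, the keyword argument is now named "asynchronous":
    scenario.build_dataset("my_dataset", asynchronous=True)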
Builtin Snowflake driver¶
DSS now comes with the Snowflake JDBC driver and native Spark connector builtin. You do not need to install JDBC drivers for Snowflake anymore.
Enhanced “time-based” trigger¶
The time-based trigger in scenarios has been strongly enhanced with the following capabilities:
Ability to show and handle triggering times in all timezones, not only server timezone
Ability to run every X hours instead of only every hour
Ability to run every X days instead of only every day
Ability to run every X weeks instead of only every week
Ability to run every X months instead of only every month
For “once every X months” triggers, ability to run on “last Monday” or “third Tuesday”
Ability to set a starting date for a trigger
Enhanced cross-connection and no-input SQL recipes¶
SQL recipes can now work without an input dataset. The recipe will run in the connection of the output dataset.
For SQL recipes with both inputs and outputs, it is now possible to enable “cross-connection” handling while using the connection of the output (previously, only inputs could be selected).
Addition of individual users to projects¶
You can now grant access to projects to individual users, in addition to groups.
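Through the public API, granting project access to a single user could look like the sketch below ("MYPROJECT" and "jdoe" are hypothetical, and the exact keys of the permission entry are illustrative assumptions):

    import dataikuapi

    # Hypothetical connection details, project key and user name
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "my-api-key")
    project = client.get_project("MYPROJECT")

    # Read current permissions, append a user-level entry, write them back.
    # The "user" key and the permission flag shown are assumptions.
    permissions = project.get_permissions()
    permissions["permissions"].append({
        "user": "jdoe",
        "readProjectContent": True
    })
    project.set_permissions(permissions)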
Pan/Zoom control in Flow¶
You can now zoom and pan on the flow with the keyboard, and zoom and reset the zoom with dedicated buttons.
Variables expansion support in “Build”¶
The “Build” dialog now supports variables expansion for partitioned datasets
Variables expansion support in “Explicit values”¶
The “Explicit Values” partition dependency function now supports variables expansion
Schema reload and propagation as scenario steps¶
In many situations, it is expected that the schema of a Flow input dataset will change frequently, and that these changes should be accepted and their impacts propagated without further manual intervention.
In order to ease these situations, DSS 9 introduces two new scenario steps:
“Reload dataset schema” to reload the schema of an input dataset from the underlying data source
“Propagate schema” to perform an automated schema propagation across the Flow.
These steps should usually be used before a recursive Build step.
Experimental read support for kdb+¶
Dataiku now features experimental support for reading from kdb+
Other enhancements and fixes¶
Datasets¶
Snowflake: the JDBC driver and Spark connector are now preinstalled and do not need manual installation anymore
Snowflake: added post-connect statements
Snowflake: added support for Snowflake -> S3 fast-path when the target bucket mandates encryption
Vertica: fixed partitioning outside of the default schema
PostgreSQL: the builtin PostgreSQL driver has been updated to a more recent version, which notably fixes issues with importing tables on PostgreSQL 12
S3: It is now possible to force “path-style” rather than “virtualhost-style” S3 access. This is mainly useful for “S3-compatible” storages.
BigQuery: fixed ability to use “high throughput” mode for the JDBC driver
Flow¶
Added detection of changes in editable datasets, which will now properly trigger rebuilds
Fixed missing refresh of “Building” indicator with flow zones
Fixed wrong “current” flow zone remembered when browsing
Visual recipes¶
Prepare on Snowflake: fixed handling of accented column names
Fixed handling of “contains” formula operator on Impala when the string to match contains _
Fixed “Use an existing folder” on download recipe
Added variables expansion on “Flag rows where formula matches” processor
Machine Learning¶
The Evaluation recipe can now output the cost matrix gain
PMML export now supports dummy-encoded variables
Custom models can now access the list of feature names
Fixed failure scoring on SQL with numerical features stored as text
Text features: fixed stop words when training in containers
Fixed warning in Jupyter when exporting a model to a Jupyter notebook
Added ability to define a class inline for a custom model (see the sketch after this list)
Switched XGBoost feature importances to use the “gain” method (library default since version 0.82)
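As an illustration of an inline custom model class, here is a hedged sketch. It assumes the usual DSS convention that custom Python models expose a scikit-learn compatible estimator in a variable named clf; the toy classifier itself is hypothetical:

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class MajorityClassifier(BaseEstimator, ClassifierMixin):
        """Toy classifier that always predicts the majority class."""

        def fit(self, X, y):
            values, counts = np.unique(y, return_counts=True)
            self.classes_ = values
            self.majority_ = values[np.argmax(counts)]
            return self

        def predict(self, X):
            return np.full(len(X), self.majority_)

    # Assumption: DSS custom Python models expect the estimator
    # to be assigned to a variable named `clf`
    clf = MajorityClassifier()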
Elastic AI and Kubernetes¶
AKS: fixed node pool creation with a zero minimum number of nodes
AKS: added ability to select the system node pool
Disabling an already-disabled Kubernetes-based API deployment will not fail anymore
Fixed webapps on Kubernetes leaking “Deployment” objects in Kubernetes
Fixed possible failures deploying webapps due to invalid Kubernetes labels
Fixed possible failures running Spark pipelines due to invalid Kubernetes labels
Added support for CUDA 11 when building base images
Fixed validation of Hive recipes containing “UNION ALL” on HDP 3 and EMR
Collaboration¶
Fixed “Back” button when going to the catalog
Fixed tags filtering with spaces in tag names
Fixed links to DSS items when putting a wiki page on the home page
Fixed display of Scala notebooks in Catalog
Automation¶
The project home page now indicates when triggers are disabled
Added ability for administrators to force the SMTP sender, preventing users from setting it
Performance improvements on “Automation monitoring” pages
Coding¶
Fixed handling of records containing \r in Python when using write_dataframe
Fixed code env rebuilding if the code env folder had been removed
Fixed “with_default_env” on project settings class
Fixed ability to delete a code env if a broken dataset exists
Charts¶
Added a safety against potential memory overruns when requesting too many bins
Fixed sort with null values on PostgreSQL