DSS 7.0 Release notes

Migration notes

Migration paths to DSS 7.0

How to upgrade

It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

Automatic migration from previous versions (see above) is supported, but there are a few points that need manual attention.

Fix for typed variables in Python

In DSS 5.1 and 6.0, a regression affected dataiku.get_custom_variables(typed=True). This regression is fixed in DSS 7.0, so variable typing is restored. This may affect any workarounds that you had put in place to compensate for the regression.
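
For reference, the restored behavior can be checked with a short snippet like the one below (the variable name is illustrative):

    import dataiku

    # With a project variable defined e.g. as {"my_threshold": 0.75} (illustrative name),
    # typed=True now returns it with its original type instead of a string.
    variables = dataiku.get_custom_variables(typed=True)
    threshold = variables.get("my_threshold")  # a float again as of DSS 7.0

    # Explicit casts such as float(...) that were added to work around the
    # regression are now redundant (though harmless).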

“origin” as remote name

DSS 7.0 introduces a new Git integration for projects, with vastly enhanced features like multiple branches and pulling from Git remotes.

In order to introduce this, DSS 7.0 also introduces a unified name for Git remotes. DSS will now only consider the remote named “origin” (the “standard” Git naming). As a result, if you had already added Git remotes with a different name, you may need to re-add them to your projects, following the instructions in Version control of projects.

Deprecation notice

DSS 7.0 deprecates support for some features and versions. Support for these will be removed in a later release.

  • Support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.
  • Support for Microsoft HDInsight is now deprecated and will be removed in a future release. We recommend that users plan a migration toward a Kubernetes-based infrastructure.
  • Support for Machine Learning through Vertica Advanced Analytics is now deprecated and will be removed in a future release. We recommend that you switch to in-memory machine learning models. In-database scoring of in-memory-trained machine learning models will remain available.
  • Support for Hive SequenceFile and RCFile formats is deprecated and will be removed in a future release.
  • As a reminder from 6.0, support for Spark 1 (1.6) is deprecated. We strongly advise you to migrate to Spark 2. All Hadoop distributions can use Spark 2. Support for Spark 1 will be removed in DSS 8.
  • As a reminder from 6.0, support for Pig is deprecated. We strongly advise you to migrate to Spark.

Version 7.0.1 - March 13th, 2020

DSS 7.0.1 is a bugfix release. For a summary of the major changes in 7.0, see below.

Datasets

  • Fixed ‘Export Table’ option of dataset metrics in ‘column view’ display mode
  • Fixed column width resizing in dataset explore tab

Recipes

  • Fixed the translation of the ‘log’ DSS formula when run on SQL databases
  • Fixed the dkuReadDataset R function that could, in case of error, hide the real error message
  • Fixed support for S3 to Redshift fast-path with S3 connections having restrictions on writable paths

Statistics

  • Fixed statistics computation on Kubernetes
  • Fixed UI issues with statistics on migrated DSS instances

Kubernetes

  • Better validation of the cluster name when creating a Kubernetes cluster from a plugin

Machine learning

  • Added computation of the aggregated score on partitioned models when a custom score is used
  • Added computation of the aggregated score on multiclass partitioned models when the ‘Log loss’ metric is used
  • Fixed usage of the native Python processor when defined in the script section of an analysis
  • Fixed display of the starting time when training partitioned models

Flow

  • Improved display of unbuilt datasets when using flow filters
  • Improved display of partitioned models when using flow views
  • Improved display of plugin names in the right panel
  • Fixed preview of folder content in the right panel

Misc

  • Fixed creation of links to DSS objects in DSS object descriptions on Firefox
  • Various fixes around multi-selection of list items
  • Fixed issue when moving a project to a folder by drag and drop
  • Fixed the ‘send report’ scenario step when targeting a dataset
  • Fixed abort of SQL notebook query when using the ‘regular statement’ option

Plugins

  • Fixed language selection when creating a plugin component
  • Made chart filters available for custom charts

Version 7.0.0 - March 2nd, 2020

DSS 7.0.0 is a major upgrade to DSS, bringing significant new features.

New features

Interactive statistics

Dataiku DSS now features a dedicated interface for performing exploratory data analysis (EDA) on datasets. EDA is useful for analyzing datasets and summarizing their main characteristics. Common tasks in EDA include visual data exploration, statistical testing, detecting correlations, and dimensionality reduction.

Some of the features of interactive statistics in Dataiku DSS are:

  • Univariate analysis (descriptive statistics, histograms, boxplots, quantile tables, frequency tables, cross-filter, …)
  • Bivariate analysis (scatter plots, correlation analysis, bivariate frequency tables, …)
  • Statistical tests (mean tests, distribution tests, two-sample tests, Anova, Chi-Square, …)
  • Distribution fitting (normal, beta, exponential, mixtures, …)
  • Kernel density estimation
  • Curve fitting
  • Multi-variable correlation matrix
  • Principal component analysis
  • Arbitrary grouping and filtering

For more details, please see Interactive statistics
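
These analyses are all performed from the DSS interface. Purely as a point of comparison (this is plain Python with scipy and scikit-learn, not the DSS statistics API), a couple of the analyses listed above look like this in code:

    import numpy as np
    from scipy import stats
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    data = rng.normal(loc=10.0, scale=2.0, size=(500, 3))

    # Univariate analysis: descriptive statistics and a normal distribution fit
    print(stats.describe(data[:, 0]))
    mu, sigma = stats.norm.fit(data[:, 0])

    # Statistical test: two-sample t-test between two columns
    t_stat, p_value = stats.ttest_ind(data[:, 0], data[:, 1])

    # Bivariate analysis: Pearson correlation between two columns
    r, r_pvalue = stats.pearsonr(data[:, 0], data[:, 2])

    # Principal component analysis
    components = PCA(n_components=2).fit_transform(data)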

Row-level interpretability

Dataiku DSS now includes row-level interpretability for Machine Learning models. This allows you to get a detailed explanation of why a Dataiku model made a given prediction, even when said model is a “black-box” model.

Dataiku DSS features two computation methods for row-level interpretability:

  • ICE (individual conditional explanations)
  • Shapley values

In the model results screen, you can directly view explanations for the “most extreme” predictions on the test set. You can also compute explanations on a complete dataset in the scoring recipe.

For more details, please see Individual prediction explanations
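
Purely as a conceptual illustration of what row-level explanations compute (this is not the DSS implementation or its API), a Monte-Carlo approximation of Shapley-style contributions for a single row can be sketched as follows:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    def explain_row(model, X_background, row, n_samples=200, seed=0):
        """Monte-Carlo estimate of each feature's contribution to the prediction for `row`."""
        rng = np.random.default_rng(seed)
        n_features = len(row)
        contributions = np.zeros(n_features)
        for _ in range(n_samples):
            # Hybrid rows mix the row to explain with a random background row
            hybrid = X_background[rng.integers(len(X_background))].copy()
            for j in rng.permutation(n_features):
                without_j = model.predict_proba([hybrid])[0, 1]
                hybrid[j] = row[j]
                with_j = model.predict_proba([hybrid])[0, 1]
                contributions[j] += with_j - without_j
        return contributions / n_samples

    print(explain_row(model, X, X[0]))  # one contribution per feature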

Git integration of projects: pulling and branching

The per-project Git integration now includes several key additional features:

  • Pulling changes from a remote repository
  • Creating branches and switching branches
  • Creating new branches as new projects to work on multiple branches simultaneously

For more details, please see Version control of projects

Fetch path and partition information in prepare recipe

The prepare recipe now includes a new processor “Enrich with context information” that can be used to add, for each row, information about the source file and source partition.

This processor is especially useful with files-based partitioned datasets, where the file path may contain important semantic information that was previously not retrievable.

This processor only works with the “DSS” engine of the prepare recipe (i.e. it cannot be used with Spark).

For more details, please see Enrich with record context
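
Outside of the prepare recipe, the same kind of enrichment can be emulated in code. As an illustration only (this is not the processor itself, and the file layout is hypothetical), reading a set of partitioned files with pandas while keeping the source path and partition:

    import glob
    import os
    import pandas as pd

    # Hypothetical layout: one CSV file per daily partition, e.g. data/day=2020-03-02/part-0.csv
    frames = []
    for path in glob.glob("data/day=*/*.csv"):
        df = pd.read_csv(path)
        df["source_file_path"] = path  # per-row source file information
        df["source_partition"] = os.path.basename(os.path.dirname(path)).split("=", 1)[1]
        frames.append(df)

    enriched = pd.concat(frames, ignore_index=True)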

Project creation macros

Many administrators wish to have more control over how projects are created. Examples of use cases include forcing a default code env, container runtime config, automatically creating a new code env, setting up authorizations, setting up UIF settings, creating a Hive database, …

Previously, this often led administrators to deny project creation to users, which in turn increased their administrative burden.

With project creation macros, administrators can delegate project creation to users, while the project itself is created by administrator-controlled code that can perform additional actions or setup.

For more details, please see Creating projects through macros
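
A project creation macro is delivered as a plugin component. As a minimal sketch only (the parameter names, group name and exact contract expected from project creation macros are assumptions; see Creating projects through macros for the actual interface), the administrator-controlled code might look like this:

    import dataiku
    from dataiku.runnables import Runnable

    class CreateTeamProject(Runnable):
        """Hypothetical project creation macro body."""

        def __init__(self, project_key, config, plugin_config):
            self.config = config

        def get_progress_target(self):
            return None

        def run(self, progress_callback):
            client = dataiku.api_client()
            # Create the project on behalf of the requesting user (assumed macro parameters)
            project = client.create_project(
                self.config["new_project_key"],
                self.config["new_project_name"],
                owner=self.config["requesting_user"],
            )
            # Administrator-controlled setup, e.g. granting a group access (assumed group name)
            permissions = project.get_permissions()
            permissions["permissions"].append({"group": "data_team",
                                               "readProjectContent": True})
            project.set_permissions(permissions)
            return project.project_key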

Other notable enhancements

Resize columns

It is now possible to resize columns in the Explore and Prepare views.

Retry in scenarios

It is now possible to configure each scenario step to retry a given number of times, with a configurable delay between retries.

Signing of SAML requests

Dataiku DSS now supports signing SAML requests, for the cases where the SAML IdP requires it.

OAuth flow and credentials for plugins

Plugins can now leverage a new infrastructure that allows their users to store per-user credentials and to perform OAuth flows.

This is particularly useful for plugins that need to connect to OAuth-protected data sources. With this new infrastructure, your plugin can allow each user to access their own data after performing the OAuth authentication flow through DSS.

For more details, please see Parameters
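
Purely as background on what “performing the OAuth flow” involves (this is generic OAuth2 with the requests library, not the DSS plugin API; all endpoints and values are illustrative), the token exchange step of an authorization-code flow looks like this:

    import requests

    # In a plugin, DSS drives this flow for each user and stores the resulting
    # credentials; the plugin code then simply reads the token from its parameters.
    token_response = requests.post(
        "https://auth.example.com/oauth2/token",  # hypothetical provider endpoint
        data={
            "grant_type": "authorization_code",
            "code": "CODE_RECEIVED_ON_REDIRECT",
            "redirect_uri": "https://dss.example.com/callback",  # illustrative
            "client_id": "CLIENT_ID",
            "client_secret": "CLIENT_SECRET",
        },
    )
    access_token = token_response.json()["access_token"]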

Merge folders recipe

A new visual recipe merges the content of multiple managed folders into one “stacked” managed folder.

Reload button on notebooks

The Jupyter notebook UI now features a “Force reload” button that performs a full unload and reload of the notebook. This is needed:

  • If the project libraries were modified and need to be reloaded
  • If the DSS backend had restarted and the notebook can’t authenticate anymore
  • If the Hadoop delegation tokens had expired

Scalable webapps on Kubernetes

Webapps can now be deployed on Kubernetes. This allows having multiple backends serving a webapp.

Advanced Kubernetes exposition

Exposing API services and webapps on Kubernetes now supports more advanced exposition options and custom YAML for expositions, allowing for more flexibility in advanced Kubernetes deployments.

Other enhancements and fixes

Hadoop, Spark, Kubernetes

  • Fixed “inherit from host” network on AKS
  • Added ability to set Kubernetes version on EKS
  • Fixed potential generation of too long Kubernetes namespaces
  • Automatically set spark.master when using Managed-Spark-on-Kubernetes on a non-managed Kubernetes cluster
  • Added support for Hortonworks HDP 3.1.4
  • Fixed potential infinite loop when building Spark pipelines
  • Automatically clean up pods generated when using interactive SparkSQL on Kubernetes
  • Added variable expansion in Spark configuration
  • Test of container execution configuration now properly uses the active cluster

Datasets

  • BigQuery: Added support for “append”
  • GCS: Fixed slow read
  • GCS: Added proxy support
  • PostgreSQL: Fixed ability to use custom JDBC URL
  • FTP: Fixed file format detection
  • MySQL: Fixed duplicate column names in SQL notebook table list

Webapps

  • Flask webapp backends can now be multithreaded and multiprocessed. This greatly increases concurrency when the webapp performs blocking API calls but does not consume CPU (for example, while waiting for a scenario to complete)
  • Fixed History tab
  • Fixed restart of Bokeh webapps in dashboards

Data preparation

  • Fixed possible wrongful detection of the “bigint” storage type instead of “string” in the presence of 0-leading values
  • Fixed SQL translation for column renamer when doing renames like A->B, B->C

Visual recipes

  • Sync recipe: GCS to BigQuery fast-path: added support for data stored in mono-regional locations
  • Sync recipe: Redshift to S3 fast-path: fixed support for @ in column names

Coding recipes

  • Fixed Hive->Impala and Impala->Hive conversion actions

Machine learning

  • Fixed strict conformance of generated PMML models
  • Fixed impact coding when “impute missing” is set to “drop rows”
  • Fixed ability to run Evaluation recipe with Keras Deep Learning models on Kubernetes
  • Added “revert design to this session” for clustering models
  • Fixed XGBoost early stopping when the best iteration is the first one
  • Fixed support for Tensorboard with Tensorflow >= 1.10

Python API

  • Fixed regression on dataiku.get_custom_variables(typed=True) - type will now be preserved
  • Added dataiku.Project().get_variables and dataiku.Project().set_variables to get/set project variables in a recipe in a way that will be directly reflected (see the sketch after this list)
  • Fixed insights.save_plotly, insights.save_bokeh, … in Python 3
  • Added API to obtain credentials for a connection directly in Python code (if authorized)
  • Added API to delete a scenario
  • Added API to delete a file from a managed folder
  • Made it possible to work on developing plugin recipes and clusters outside of DSS
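
A short illustration of the new project variables calls inside a recipe (the variable name is illustrative, and the "standard" section shown below is an assumption; see the Python API documentation for the exact structure of the returned dictionary):

    import dataiku

    # Read, modify and write back project variables so that the change is
    # directly reflected in the running flow.
    project = dataiku.Project()
    variables = project.get_variables()
    variables["standard"]["last_processed_date"] = "2020-03-02"  # illustrative variable
    project.set_variables(variables)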

R API

  • Added dkuGetProjectVariables and dkuSetProjectVariables to get/set project variables in a recipe in a way that will be directly reflected
  • Added API to delete a file from a managed folder

API node & API deployer

  • Fixed adding test queries from a dataset on a custom prediction endpoint

Cloud

  • Fixed generation of role-assumed STS tokens with overly long login names or from APIs

Performance & Scalability

  • Various performance enhancements, especially for instances with high concurrency of users

Automation

  • Fixed wrongful date displayed in report mail when aborting a scenario
  • Fixed ability to clear old job logs from the UI

Administration

  • Added mass actions on the Users screen

Misc

  • Fixed issues where data would not be reloaded after installing a new plugin
  • Fixed adding insight from content of a managed folder
  • Enabled “drop data” by default when deleting datasets