DSS 5.1 Release notes

Migration notes

Migration paths to DSS 5.1

How to upgrade

It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

Automatic migration from previous versions (see above) is supported, but there are a few points that need manual attention.

Upgrade of Python packages

The following Python packages have been upgraded in the builtin environment:

  • pandas (0.20 -> 0.23)
  • numpy (1.13 -> 1.15)
  • scikit-learn (0.19 -> 0.20)
  • xgboost (0.72 -> 0.80)

The pandas dependency is also upgraded in code environments.

Importantly, the dataiku Python package is no longer compatible with pandas 0.20. You must upgrade to pandas 0.23.
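
A quick way to confirm which pandas version an environment actually provides is to check it from a notebook or recipe running in that environment. This is a minimal sketch, not DSS-specific:

    # Run in a Python notebook or recipe using the environment to verify the upgrade.
    import pandas as pd

    print(pd.__version__)  # should report 0.23.x once the environment has been rebuilt
    assert pd.__version__.startswith("0.23"), "environment still uses an older pandas"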

Rebuild of code environments

Due to the upgraded pandas dependency, all existing Python code environments must be updated.

In most cases, for each code environment, you simply need to go to the code environment page and click the “Update” button (since the pandas 0.23 requirement is part of the base packages).
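
If you have many code environments, the update can also be scripted through the public API. This is a minimal sketch, assuming your dataikuapi version exposes list_code_envs(), get_code_env() and update_packages(), and that list_code_envs() entries carry envLang / envName keys; the host and API key below are placeholders, and the “Update” button in the UI remains the simplest path:

    # Hypothetical maintenance script: update packages of every Python code environment.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")  # adjust host and key

    for env in client.list_code_envs():
        if env["envLang"] == "PYTHON":
            code_env = client.get_code_env(env["envLang"], env["envName"])
            print("Updating packages for", env["envName"])
            code_env.update_packages()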

Retraining of machine-learning models

  • XGBoost models trained with prior versions of DSS must be retrained when upgrading to 5.1. This includes models deployed to the flow (re-run the training recipe), models in analysis (retrain them before deploying), and API package models (retrain the flow saved model and build a new package).
  • “Isolation forest” models trained with prior versions of DSS using the “In-memory” engine must be retrained when upgrading to 5.1. This includes models deployed to the flow (re-run the training recipe), models in analysis (retrain them before deploying), and API package models (retrain the flow saved model and build a new package).

Multi-user-security configuration file move

For improved security, the security module configuration file for Multi-User-Security has been moved from DATADIR/security/security-config.ini to /etc/dataiku-security/INSTALL_ID/security-config.ini.

DSS will automatically move the file upon upgrade, so you don’t need to perform any operation. However, any further updates must be made to /etc/dataiku-security/INSTALL_ID/security-config.ini. For more information, and for details about INSTALL_ID, see the MUS setup documentation.

Dashboard exports

As with any upgrade, the Dashboards export feature must be reinstalled after upgrading. For more details on how to reinstall this feature, please see Setting up Dashboards and Flow export to PDF or images.

Deprecation notice

DSS 5.1 deprecates support for some features and versions. Support for these will be removed in a later release.

  • The prepare recipe running on the Hadoop MapReduce engine is deprecated. We strongly advise you to use the Spark engine instead.

Version 5.1.1 - February 13th, 2019

DSS 5.1.1 is a minor release. For a summary of major changes in 5.1, see below.

Machine learning

  • Fixed error in Isolation forest when no anomaly was found
  • Fixed support for calibration in K-Fold cross-test mode
  • Fixed training recipes in “train on 100% of the data” mode
  • Fixed possible error when training on containers

Datasets

  • Fixed a display issue in metrics
  • A metrics dataset showing checks will now be named “_checks”
  • Fixed computation of percentile metrics on Spark

Webapps

  • Improved the default code sample for standard webapps
  • Fixed Shiny plugin webapps
  • Fixed permissions when copying a webapp

Hadoop & Spark

  • Added support for CDH 6.1
  • Added experimental support for Spark 2.4
  • Added a special option to handle cases where the Hive staging directory is in a non-standard location
  • Fixed incorrect permission logic in multi-cluster setups that led to valid actions being rejected

Wiki

  • Added automatic generation of a table of contents to Wiki articles
  • Fixed contributor tooltips
  • Improved Git commit messages for Wiki actions
  • Improved notifications for article renamings
  • Added ability to remove attachments in folder view
  • Added automatic scroll in the taxonomy when opening a Wiki page
  • Fixed update of the timeline after a save
  • Fixed links to items when changing the key of a project (through export/import)

API node

  • Fixed issue with enrichments in custom R prediction endpoints

Code

  • Fixed container execution that could fail depending on the number of cores on the machine and the number of recipes being run
  • Fixed container execution of recipes when running on a code environment without Jupyter support
  • Fixed container execution with code environments on automation node
  • Fixed warnings when reading datasets in R
  • Fixed notebooks when importing libraries from other projects
  • Improved default package sets for conda, no longer requiring external repositories
  • Added missing exported functions in the R package

Collaboration

  • Added default template to the Slack reporter
  • Fixed error appearing after pushing a project to Git remotes
  • Improved highlighting in the new Jobs UI
  • Fixed timer in Jobs UI

Security

  • Enforced connection permissions on “Execute SQL” scenario steps
  • Fixed XSS vulnerabilities
  • Fixed possible file tampering through visual recipes
  • Fixed SQL injection in Metrics datasets
  • Fixed vulnerability in license registration workflow

Misc

  • Improved auto-generated titles on some charts
  • Fixed version number on login page
  • Fixed dead documentation links

Version 5.1.0 - January 29th, 2019

DSS 5.1.0 is a major upgrade to DSS, bringing many new features.

New features

Git integration for plugins editor

The plugin editor now features full Git integration, allowing you to view the history of a plugin, revert changes, and push and pull changes to and from a remote Git repository.

For more details, please see Git integration in the plugin editor.

Import code libraries from Git

In the library editor of each project, you can now import code from external Git repositories. For example, if you have code that was developed outside of DSS and is available in a Git repository (such as a library created by another team), you can import this repository (or part of it) into the project libraries and use it in any code capability of DSS (recipes, notebooks, webapps, …).

This code can then be updated from the external Git repository, either manually or automatically.
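
As an illustration, once a repository has been imported under the project libraries’ python/ folder, its modules can be imported directly in the project’s recipes and notebooks. The package, function and dataset names below are hypothetical:

    import dataiku

    # "team_toolbox" is a hypothetical package imported from the external Git
    # repository into the project libraries; "clean_customer_names" is a
    # hypothetical function it provides.
    from team_toolbox.cleaning import clean_customer_names

    df = dataiku.Dataset("customers").get_dataframe()  # hypothetical dataset name
    df = clean_customer_names(df)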

For more details, please see Importing code from Git in project libraries.

More code reuse capabilities

Combined with the ability to import code libraries from Git, new features for code reuse have been added:

  • R code can now use per-project libraries, just like Python code.
  • For both Python and R code, you can now have multiple library folders per project.
  • For both Python and R code, you can now use the libraries of one project in another project.

For more details, please see reusing Python code and reusing R code.

Prepare recipe in-database (SQL)

A subset of preparation processors can now be translated to SQL queries. When a prepare recipe contains translatable processors, it can be executed fully in-database, which can provide speed-ups of up to hundreds of times.

For more details, please see Execution engines.

Lightning-fast prepare recipe on Spark

DSS now includes a new engine for data preparation on Spark that can provide significant performance boosts.

A subset of preparation processors is compatible with the optimized Spark engine, which is used automatically whenever possible. When non-compatible processors are present, DSS automatically falls back to the previous engine.

For more details, please see Execution engines.

Containerized execution of notebooks

Notebooks (Python and R) can now be run in Docker and Kubernetes.

For more details, please see Containerized notebooks.

GDPR capabilities

A new plugin allows you to enforce a number of GDPR-related rules on projects:

  • Track which datasets and projects contain personal data
  • Enforce rules on how datasets containing personal data can be used (exported, used for machine learning, shared, …)
  • Propagate “personal data” flags when creating new datasets
  • Track purpose and consent for datasets

For more details, please see the plugin page.

Databricks

DSS now features an experimental integration with Databricks to leverage Databricks as a Spark execution engine.

For more details, please see Databricks integration.

Webapps as plugins

Webapps can now be turned into plugins. This allows you to have reusable and instantiable webapps.

Notable use cases include building custom visualizations for datasets.

For more details, please see Component: Webapps.

Use Dataiku libs and develop code outside of DSS

You can now use the Dataiku Python and R libraries outside of DSS, in order to develop code for DSS (recipes, webapps, …) in your favorite IDE.
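
For example, from a local Python environment with the dataiku package installed, you can point it at a remote DSS instance and work with project datasets from your IDE. This is a minimal sketch; the host, API key, project and dataset names are placeholders:

    import dataiku

    # Point the dataiku package at a remote DSS instance (placeholders below).
    dataiku.set_remote_dss("https://dss.example.com:11200", "YOUR_API_KEY")
    dataiku.set_default_project_key("MYPROJECT")

    # Read a dataset of that project as a pandas DataFrame, exactly as in a recipe.
    df = dataiku.Dataset("customers").get_dataframe()
    print(df.head())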

For more details, please see using Python API outside of DSS and using R API outside of DSS.

Folding the Flow view

You can now hide parts of the Flow in order to improve the readability of very large flows. You can easily hide all parts of a flow upstream/downstream of a single node.

External hosting of runtime databases

DSS maintains a number of databases, called the “runtime databases”, which store additional, mostly “non-primary” information (i.e. information that can be rebuilt), such as job history, metrics, dataset states, timelines, discussions, …

By default, the runtime databases are hosted internally by DSS, using an embedded database engine (called H2). You can also move the runtime databases to an external PostgreSQL server. Moving the runtime databases to an external PostgreSQL server improves resilience, scalability and backup capabilities.

For more details, please see The runtime databases.

Exporting the Flow as an image

You can now export the Flow of a project as an image or a PDF.

For more details, please see Exporting the Flow to PDF or images.

Probability calibration

When training a classification model, you can now choose to apply a calibration of the predicted probabilities.

The purpose of calibrating probabilities is to bring the actual frequency of class occurrence as close as possible to the predicted probability of that occurrence.
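
As an illustration of the idea (using plain scikit-learn, not DSS internals; in DSS this is simply a setting in the prediction design), calibration wraps a classifier so that its predicted probabilities better match observed class frequencies:

    # Conceptual illustration with scikit-learn; no code is needed for this in DSS.
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Wrap the classifier so its probability outputs are recalibrated
    # (isotonic regression here; sigmoid/Platt scaling is the other option).
    model = CalibratedClassifierCV(RandomForestClassifier(n_estimators=50, random_state=0),
                                   method="isotonic", cv=3)
    model.fit(X_train, y_train)
    calibrated_probas = model.predict_proba(X_test)[:, 1]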

For more details, please see Prediction settings.

Models export as PMML and POJO

You can now export a trained model as a PMML file for scoring with any PMML-compatible scorer.

You can also export trained models as a set of Java classes for extremely efficient scoring in any JVM application.

For more details, please see Exporting models.

Duplicate projects

You can now easily duplicate a DSS project, optionally duplicating the content of some datasets.

RStudio integration

In addition to the ability to use the DSS R API outside of DSS, DSS now features several integration points with RStudio:

  • Ability to develop code for DSS (recipes, …) directly in RStudio
  • RStudio Desktop/Server addins for easy connection to DSS and download/upload of recipes
  • Embedding of the RStudio Server UI in DSS
  • Easy configuration of RStudio Server for connection with DSS

For more details, please see RStudio integration.

Other notable enhancements

Copy-paste preparation steps

You can now copy and paste preparation steps, either within a single preparation recipe or across preparation recipes, or even across DSS instances.

Copy-paste scenario steps

You can now copy and paste scenario steps, either within a single scenario or across scenarios, or even across DSS instances.

Support for CDH 6

DSS now supports CDH 6.0 and 6.1.

For more details, please see Cloudera CDH.

New capabilities for Snowflake

DSS now supports a fast-path to sync from S3 to Snowflake.

For more details, please see Snowflake.

Setting distribution / primary index on Teradata, Redshift and Greenplum

Additional table creation options are now available for these databases, such as setting the distribution or primary index.

Support for impersonation on Teradata

You can now use the “proxyuser” mechanism of Teradata to impersonate end-users for all database access.

For more details, please see Teradata.

Support for custom query banding on Teradata

To provide better auditing, it can be useful to add information about the queries being performed to the query band of your Teradata queries.

DSS now lets you easily do this, and track which users, jobs, … perform Teradata queries.

For more details, please see Teradata.

More ability to use remote Git repositories

In addition to the ability to use Git for plugin development and to import code libraries from Git (including the ability to use remotes), using remotes for project version control now works in all cases where the regular Git command line works.

More graceful handling of wide SQL tables

When reading external SQL tables, DSS will now fetch the exact size of string fields and propagate them to the table definition, in order to make for smaller downstream datasets.

With some databases like MySQL or Teradata that limit the total size of a row, DSS will now warn you more gracefully of possible incompatibilities instead of preventing the creation of some recipes.

Per-project libraries for R

Support for per-project libraries has been added for R (just like for Python).

New Jobs UI

The Jobs UI has been redesigned and now includes a greatly enhanced Flow view to help you understand at a glance what a job is doing and how it interacts with other jobs.

Java 11

DSS now supports Java 11.

Other enhancements and fixes

Visual recipes

  • The join recipe now supports < and > operators (in addition to <= and >=)

Datasets

  • A potential memory overrun when listing too many partitions has been fixed
  • GCS: Fixed issue with datasets whose size was a multiple of 4MB
  • “Cell value” metric now works properly even in the presence of other metrics
  • Reduced the number of “getBucketLocation” AWS API calls
  • Added support for XLSM files

Code

  • It is now possible to use datasets in a Python or R recipe, even if they are not declared as inputs or outputs (see the sketch after this list)
  • Various bugs in the SQL code formatter have been fixed
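
A minimal sketch of the first item above, reading a dataset that is not declared as a recipe input (the dataset name is hypothetical; note that such reads are not tracked as flow dependencies):

    import dataiku

    # "reference_countries" is a hypothetical dataset of the current project.
    # It is not declared as an input of the recipe, yet it can still be read;
    # keep in mind that the flow will not know about this dependency.
    ref_df = dataiku.Dataset("reference_countries").get_dataframe()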

Data preparation

  • CJK characters can now be used as literals in the Python processor

Dashboards

  • It is now possible to export dashboards on a machine without outgoing Internet connection (after initial setup)

Machine learning

  • Don’t try to use Optimized scoring when custom text preprocessing is in effect
  • It is now possible to tune the scoring batch size when using the Local (Python) engine for scoring

Misc

  • The “Clear job logs” macro now removes the job folders instead of just emptying them
  • It is now possible to run the DSS UI with zero connections to the outside world