DSS 6.0 Release notes
- Migration notes
- Version 6.0.1 - November 6th, 2019
- Version 6.0.0 - October 24th, 2019
- New features
- Managed Kubernetes clusters
- Managed Spark on Kubernetes
- Partitioned models
- Time series visualization
- New plugins experience
- Support for AWS Athena and Glue metastore
- SQL pipelines
- Global search toolbar
- Pluggable algorithms
- Pluggable webapps
- Pluggable chart types
- Pluggable custom view for folders and models
- Time series preparation
- Native Python processor in preparation
- Scenario reporting to Microsoft Teams
- Other notable enhancements
- Improved project folders
- Time-aware cross-validation and evaluation
- Enhanced Snowflake integration
- ADLS gen2 support in Azure dataset
- Python 3 support for base env
- New field types for plugins
- Redesigned contextual right panel
- Support for HANA Calculation views
- Managed standalone Hadoop libraries
- More native support for Amazon ECR
- Other enhancements and fixes
- From DSS 5.1: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
- From DSS 5.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.0 -> 5.1
- From DSS 4.3: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.3 -> 5.0 and 5.0 -> 5.1
- From DSS 4.2: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.2 -> 4.3, 4.3 -> 5.0 and 5.0 -> 5.1
- From DSS 4.1: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0 and 5.0 -> 5.1
- From DSS 4.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.0 -> 4.1, 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0 and 5.0 -> 5.1
- Migration from DSS 3.1 and below is not supported. You must first upgrade to 5.0. See DSS 5.0 Release notes.
It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
Automatic migration from previous versions (see above) is supported, but there are a few points that need manual attention.
As with any upgrade, the graphics export feature (exporting the Flow or dashboards) must be reinstalled after upgrading. For details on how to reinstall this feature, please see Setting up Dashboards and Flow export to PDF or images.
DSS 6.0 introduces a minor upgrade to scikit-learn which fixes a bug in the model selection feature. In some rare cases, this can cause grid searching to select a different hyperparameter value when retraining a model on the same data. For more details, please see https://scikit-learn.org/stable/whats_new/v0.20.html#sklearn-model-selection
As previously announced, DSS 6.0 removes support for running the prepare recipe on the Hadoop MapReduce engine. We strongly advise you to use the Spark engine instead.
DSS 6.0 deprecates support for some features and versions. Support for these will be removed in a later release.
- Pig support is deprecated. We strongly advise you to migrate to Spark.
- Support for Spark 1 (1.6) is deprecated. We strongly advise you to migrate to Spark 2. All Hadoop distributions can use Spark 2.
- Conditional outputs on binary classification models are deprecated.
DSS 6.0.1 is a minor release. For a summary of major changes in 6.0, see below.
- Add ability to rename columns when using SQL pipelines
- Fixed S3 to Redshift fast path on S3 partitioned datasets
- Improved support for customized metastore table names of non-HDFS datasets when using the Spark engine
- Make the dkuManagedFolderCopyToLocal R function recursive
- Fixed the dkuManagedFolderCopyFromLocal R function, which ignored the beginning of each copied file
DSS 6.0.0 is a major upgrade to DSS with major new features.
DSS can automatically start, stop, and manage multiple Kubernetes clusters for you on the major cloud providers. This lets you deploy Kubernetes clusters with very little setup and administration work.
DSS provides managed Kubernetes capabilities on:
- Amazon Web Services through EKS
- Azure through AKS
- Google Cloud Platform through GKE
DSS can now automatically manage the deployment of Spark jobs on Kubernetes. This includes automatically setting up connectivity to cloud storage, building container images, handling multiple code environments, and providing security and isolation.
Thanks to this feature, you can now deploy Spark jobs on a unified Kubernetes infrastructure, handling both Spark and non-Spark jobs. Multiple Kubernetes clusters are supported.
DSS can now build partitioned models, that is, train a separate model for each partition of an input dataset. Training separate models (also sometimes referred to as “stratified models”) is useful when you expect data to be significantly different between partitions, or when you need incrementality. For example, you may want to train one model per country, per business unit, per factory, …
Once trained, partitioned models can be used to score other partitioned data, or unpartitioned data containing partition identifiers. For more information, see Partitioned Models.
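As an illustration of the idea only, and not of how DSS implements it, the following minimal sketch trains one trivial "model" (the mean of the target) per partition value, then scores unpartitioned rows that carry a partition identifier. The function names `train_partitioned` and `score`, the column names, and the choice to return `None` for an unknown partition are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def train_partitioned(rows, partition_key, target):
    """Fit one trivial model (the target mean) per partition value."""
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row[partition_key]].append(row[target])
    # Each partition gets its own independently trained model.
    return {part: mean(values) for part, values in by_partition.items()}


def score(models, rows, partition_key):
    """Route each row to the model of its partition.

    Rows whose partition was never trained get None (a choice made
    for this sketch, not a statement about DSS behavior).
    """
    return [models.get(row[partition_key]) for row in rows]


training = [
    {"country": "FR", "sales": 10.0},
    {"country": "FR", "sales": 14.0},
    {"country": "US", "sales": 100.0},
    {"country": "US", "sales": 120.0},
]
models = train_partitioned(training, "country", "sales")

# Unpartitioned data that contains the partition identifier.
to_score = [{"country": "US"}, {"country": "FR"}, {"country": "DE"}]
predictions = score(models, to_score, "country")
print(predictions)
```

Because each partition is trained independently, retraining a single partition (for example, when only one country receives new data) does not require touching the others, which is the incrementality benefit mentioned above.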
DSS now includes a dynamically zoomable line chart for time series.
For more details, please see Time Series.
The plugins store has a brand new look, allowing you to find plugins much more easily.
We have also strongly improved the plugin installation experience, with guided steps to install plugins, code envs and other dependencies.
The plugin development experience has been overhauled for better productivity.
Plugins now feature a predefined parameters system, which allows you to reuse parameters between plugins, and to have sensitive information for plugins managed by the administrator.
For more details, please see Plugins.
DSS now supports experimental connection with AWS Athena. This connection provides the following capabilities:
- Running interactive SQL notebooks on Athena based on previously-built S3 datasets
- Using Athena as charts engine for S3 datasets
- Running SQL queries on Athena based on previously-built S3 datasets (execution and data read through Athena, write through DSS)
DSS also adds support for leveraging AWS Glue as a metastore catalog.
DSS provides pipeline functionality for a flow that uses a SQL engine and consists of consecutive recipes sharing the same connection. SQL pipelines can minimize or avoid unnecessary writes and reads of intermediate datasets in a flow, thereby boosting workflow performance.
For more details, please see SQL pipelines in DSS.
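The principle can be sketched with Python's built-in `sqlite3` module. This is only an analogy under stated assumptions: the table names, the two example queries, and the subquery-inlining approach are invented for illustration and do not describe how DSS builds its pipelines. Two consecutive steps run on the same connection, first by materializing the intermediate table, then by inlining the first query into the second so that nothing is written in between.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("alice", 80.0), ("bob", 5.0), ("bob", 200.0)],
)

# Two consecutive steps, each expressed as a query over its input.
step1 = "SELECT customer, amount FROM {src} WHERE amount >= 10"
step2 = (
    "SELECT customer, SUM(amount) AS total FROM {src} "
    "GROUP BY customer ORDER BY customer"
)

# Without a pipeline: the intermediate dataset is written, then read back.
conn.execute("CREATE TABLE filtered AS " + step1.format(src="orders"))
materialized = conn.execute(step2.format(src="filtered")).fetchall()

# With a pipeline: step 1 is inlined as a subquery of step 2,
# so no intermediate table is written or read.
inlined_source = "(" + step1.format(src="orders") + ") AS step1_out"
pipelined = conn.execute(step2.format(src=inlined_source)).fetchall()

print(pipelined)
```

Both variants return the same rows; the pipelined one simply skips the write and read of the intermediate `filtered` table.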
A new unified contextual search toolbar has been added to the DSS navigation bar. Use it for contextual search in project objects, wikis, help topics, and much more.
You can now add custom algorithms for the in-memory Visual ML component as plugins, making them available without any code.
For more details, please see Component: Prediction algorithm.
Webapps can now be packaged as plugins, shared and reused.
For more details, please see Component: Web Apps.
Managed folders and models now support a concept of pluggable custom views. Use cases can include:
- A custom view representing the content of a folder (for example, a neural network visualizer)
- A custom view on the results of a saved model (for example, to display interpretability results)
DSS provides a preparation plugin that includes visual recipes for performing the following operations on time series data:
- Resampling into equispaced time intervals
- Performing analytics functions over a moving window
- Extracting aggregations around a global extremum
- Extracting intervals where values lie within an acceptable range
This plugin is fully supported by Dataiku. For more details, please see Time series preparation.
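To make the first two operations concrete, here is a minimal pure-Python sketch of resampling onto an equispaced grid and of a moving-window aggregate. It is not the plugin's code: the function names, the use of linear interpolation, and the trailing-window convention are assumptions chosen for the example, and the plugin may offer other interpolation methods and window shapes.

```python
def resample_linear(points, step):
    """Resample (t, value) pairs onto a regular grid of spacing `step`.

    Assumes at least two points, sorted by strictly increasing t.
    Values between observations are linearly interpolated.
    """
    out = []
    t = points[0][0]
    i = 0
    while t <= points[-1][0]:
        # Advance to the segment [points[i], points[i + 1]] containing t.
        while points[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = points[i], points[i + 1]
        out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += step
    return out


def moving_average(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Irregularly spaced observations: (timestamp, value).
observations = [(0, 0.0), (3, 6.0), (4, 10.0)]
regular = resample_linear(observations, step=1)
smoothed = moving_average([v for _, v in regular], window=2)
print(regular)
print(smoothed)
```

Resampling first matters because a window defined as "the last N rows" only corresponds to a fixed time span once the rows are equally spaced.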
The Python processor in data preparation can now run in a real Python process, which allows the use of Python 3 and of any additional package through DSS code environments.
The Python processor now also supports vectorized operations using Pandas for faster processing.
For more details, please see Python function.
The project folders feature has been strongly enhanced with the following capabilities:
- Drag & drop to add folders in projects on the “projects list” page
- Ability to view project folders on the personalized home page
- Security on project folders
- Ability to have empty project folders
- Per-folder view of the graph of projects
When running prediction tasks on time-oriented datasets (for example, a daily sales dataset), it is useful to use time-aware cross-validation for optimizing and evaluating your model. This lets you verify that a model trained only on past data can adequately predict future data.
For more details, please see Advanced models optimization.
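One common way to build such splits is an expanding window, where each fold trains on everything before a cutoff and tests on the rows that immediately follow. The sketch below shows that scheme in plain Python; the function name `time_ordered_splits` and the fold-sizing rule are assumptions for illustration, and the exact splitting strategy used by DSS may differ.

```python
def time_ordered_splits(n_rows, n_folds):
    """Yield (train_indices, test_indices) for rows already sorted by time.

    Every test row comes strictly after every training row of the same
    fold, so the model is never evaluated on data older than what it saw.
    """
    fold_size = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows.
        test_end = n_rows if k == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(time_ordered_splits(n_rows=10, n_folds=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

By contrast, a standard shuffled k-fold split would let future rows leak into the training set, which tends to make evaluation metrics look better than they would be in production.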
Thanks to the new native Spark integration, you can now directly access Snowflake datasets in any Spark-powered recipe (either visual or code). This leverages the native Spark Snowflake connector for optimal performance.
In addition, the Sync recipe can now perform fast synchronization between Azure Blob and Snowflake datasets.
For more details, please see Snowflake.
It is now possible to use Python 3 as the builtin environment of DSS.
Note that we do not currently recommend migrating existing instances to this mode due to the need to ensure that all user code using the builtin environment is compatible with Python 3.
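As a rough guide to what that compatibility check involves, the snippet below lists a few well-known behaviors that differ between Python 2 and Python 3 and that commonly affect user code. It is a generic illustration, not a DSS-specific checklist, and the variable names are made up for the example.

```python
# Division: "/" is true division in Python 3.
# Use "//" wherever integer (floor) division was intended.
half = 7 // 2
ratio = 7 / 2

# Text vs bytes: binary data must be decoded explicitly to become text.
raw = b"caf\xc3\xa9"
text = raw.decode("utf-8")

# Dictionary views: keys() and items() no longer return lists,
# so wrap them (here with sorted) when indexing is needed.
counts = {"a": 1, "b": 2}
first_key = sorted(counts.keys())[0]

# print is a function in Python 3.
print(half, ratio, text, first_key)
```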
Plugin components can now declare string lists, dates, and many other new kinds of fields.
For more details, please see Parameters.
The right column panel available in Flow, objects lists and object actions has been redesigned to provide faster and more efficient access to the most common actions and information.
The HANA support can now list and read calculation views. The connection explorer can automatically filter by HANA package.
Dataiku now provides fully-managed standalone Hadoop and Spark libraries, allowing full support for Parquet, ORC, S3, ADLS gen1 and gen2, GS, … without requiring any cluster or third-party integration.
- Added ability to access shared datasets in Pyspark notebooks
- Added ability to select Hive runtime configuration for exploration and direct read through DSS
- Added support for ElasticSearch 7
- Added ability to support ElasticSearch mapping type _doc
- Added ability to rename columns when importing an Excel file
- Fixed Snowflake synchronization failure with special characters
- Fixed Excel export when running on Java 11
- Fixed reading of booleans in Excel files
- Fixed “click to configure” button on “Analyze on full data”
- Split recipe: fixed “drop data” in random dispatch mode on Spark engine
- Sort recipe: fixed on MS SQL Server
- Sync recipe: improved S3 to Redshift fast-path on partitioned datasets
- Automatically install by default Jupyter kernels for containerized execution when updating a code env
- Fixed UI of prediction and clustering recipes when running on HDFS datasets
- Better variables ordering for Partial Dependencies Plot
- Added subsampling on Partial Dependencies Plot and Subpopulation Analysis for faster results
- Improved performance of Deep Learning training
- Added support for Partial dependencies and subpopulation analysis on containers
- Fixed possible non-stability across trainings when using Python 3
- Added error percentage as a metric that can be output as part of the evaluation recipe
- Added support for variables expansion in SQL triggers
- Added ability to choose whether to execute Jupyter notebooks and whether to create new exports when attaching them to mails
- Fixed sending of Slack notifications on job builds
- Added back “description icon” on Flow
- Improved Oracle insertion performance in presence of NULL values
- Fixed potential issues while reading enormous log files