DSS 7.0 Release notes
- Migration notes
- Version 7.0.1 - March 13th, 2020
- Version 7.0.0 - March 2nd, 2020
- New features
- Other notable enhancements
- Other enhancements and fixes
- From DSS 6.0: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
- From DSS 5.1: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.0 -> 5.1, 5.1 -> 6.0
- From DSS 5.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.0 -> 5.1, 5.1 -> 6.0
- From DSS 4.3: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0
- From DSS 4.2: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0
- From DSS 4.1: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0
- From DSS 4.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.0 -> 4.1, 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0
- Migration from DSS 3.1 and below is not supported. You must first upgrade to 5.0. See DSS 5.0 Release notes
It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
Automatic migration from previous versions (see above) is supported, but there are a few points that need manual attention.
In DSS 5.1 and 6.0, a regression affected dataiku.get_custom_variables(typed=True). This regression is fixed in DSS 7.0, so variable typing is restored. This may affect any workarounds that you set up to compensate for the regression.
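If a recipe carries such a workaround, one defensive option is a small helper that accepts both the string form and the already-typed form of a value. The following is a minimal sketch and not part of the DSS API: `coerce_variable` is a hypothetical helper, and it assumes the workaround consisted of parsing string values as JSON.

```python
import json


def coerce_variable(value):
    """Return a typed value from a JSON-encoded string or an already-typed object.

    Hypothetical helper, not part of the dataiku package.
    """
    if isinstance(value, str):
        try:
            return json.loads(value)
        except ValueError:
            # Not valid JSON: keep the plain string as-is.
            return value
    # Already typed (int, float, bool, list, dict, ...): nothing to do.
    return value


# Inside DSS, the input would come from the call named in the note above:
#   variables = dataiku.get_custom_variables(typed=True)
# A plain dict stands in for it here so that the sketch runs anywhere.
variables = {"threshold": "0.75", "max_rows": 1000, "country": "FR"}
typed = {name: coerce_variable(value) for name, value in variables.items()}
```

Note that this approach also converts a string such as "123" to a number, so it only suits variables that are not meant to remain numeric-looking strings.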
DSS 7.0 introduces a new Git integration for projects, with vastly enhanced features like multiple branches and pulling from Git remotes.
To support this, DSS 7.0 also introduces a unified name for Git remotes. DSS now only considers the remote named “origin” (the “standard” Git naming). As a result, if you had already added Git remotes with a different name, you may need to re-add them to your projects, following the instructions in Version control of projects.
DSS 7.0 deprecates support for some features and versions. Support for these will be removed in a later release.
- Support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.
- Support for Microsoft HDInsight is now deprecated and will be removed in a future release. We recommend that users plan a migration toward a Kubernetes-based infrastructure.
- Support for Machine Learning through Vertica Advanced Analytics is now deprecated and will be removed in a future release. We recommend that you switch to in-memory machine learning models. In-database scoring of models trained in memory will remain available.
- Support for Hive SequenceFile and RCFile formats is deprecated and will be removed in a future release.
- As a reminder from 6.0, support for Spark 1 (1.6) is deprecated. We strongly advise you to migrate to Spark 2. All Hadoop distributions can use Spark 2. Support for Spark 1 will be removed in DSS 8.
- As a reminder from 6.0, support for Pig is deprecated. We strongly advise you to migrate to Spark.
DSS 7.0.1 is a bugfix release. For a summary of major changes in 7.0, see below
- Fixed ‘Export Table’ option of dataset metrics in ‘column view’ display mode
- Fixed column width resizing in dataset explore tab
- Fixed the translation of the ‘log’ DSS formula when run on SQL databases
- Fixed the dkuReadDataset R function that could, in case of error, hide the real error message
- Fixed support for S3 to Redshift fast-path with S3 connections having restrictions on writable paths
- Fixed statistics computation on Kubernetes
- Fixed UI issues with statistics on migrated DSS instances
- Added computation of the aggregated score on partitioned models when a custom score is used
- Added computation of the aggregated score on multiclass partitioned models when the ‘Log loss’ metric is used
- Fixed usage of the native Python processor when defined in the script section of an analysis
- Fixed display of the starting time when training partitioned models
- Improved display of unbuilt datasets when using flow filters
- Improved display of partitioned models when using flow views
- Improved display of plugin names in the right panel
- Fixed preview of folder content in the right panel
- Fixed DSS objects link creation in DSS objects descriptions on Firefox
- Various fixes around multi selection of list items
- Fixed issue when moving project to folder by drag and drop
- Fixed the ‘send report’ scenario step when targeting a dataset
- Fixed abort of SQL notebook query when using the ‘regular statement’ option
DSS 7.0.0 is a major upgrade to DSS with major new features.
Dataiku DSS now features a dedicated interface for performing exploratory data analysis (EDA) on datasets. EDA is useful for analyzing datasets and summarizing their main characteristics. Common tasks in EDA include visual data exploration, statistical testing, detecting correlations, and dimensionality reduction.
Some of the features of interactive statistics in Dataiku DSS are:
- Univariate analysis (descriptive statistics, histograms, boxplots, quantile tables, frequency tables, cross-filter, …)
- Bivariate analysis (scatter plots, correlation analysis, bivariate frequency tables, …)
- Statistical tests (mean tests, distribution tests, two-sample tests, Anova, Chi-Square, …)
- Distribution fitting (normal, beta, exponential, mixtures, …)
- Kernel Density Estimations
- Curve fitting
- Multivariate correlation matrix
- Principal component analysis
- Arbitrary grouping and filtering
For more details, please see Interactive statistics
Dataiku DSS now includes row-level interpretability for Machine Learning models. This allows you to get a detailed explanation of why a Dataiku model made a given prediction, even when said model is a “black-box” model.
Dataiku DSS features two computation methods for row-level interpretability:
- ICE (individual conditional expectation)
- Shapley values
In the model results screen, you can directly view explanations for the “most extreme” predictions on the test set. You can also compute explanations on a complete dataset in the scoring recipe.
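To give an intuition for what a Shapley-value explanation of a single row contains, here is a minimal sketch of the exact computation on a toy model. It is not how DSS computes explanations: `shapley_values`, `toy_model` and the all-zero baseline are illustrative assumptions, and the exact enumeration below is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, row, baseline):
    """Exact Shapley values of `predict` at `row`, relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(row)
    features = range(n)

    def value(coalition):
        x = [row[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(x):
    # One additive term and one interaction term.
    return 2.0 * x[0] + x[1] * x[2]


row = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, row, baseline)
```

The contributions sum to the difference between the prediction for the row and the prediction for the baseline, which is what makes them readable as a per-feature breakdown of one prediction.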
For more details, please see Individual prediction explanations
The per-project Git integration now features several key additional features:
- Pulling changes from a remote repository
- Creating branches and switching branches
- Creating new branches as new projects to work on multiple branches simultaneously
For more details, please see Version control of projects
The prepare recipe now includes a new processor “Enrich with context information” that can be used to add, for each row, information about the source file and source partition.
This processor is especially useful with partitioned-by-files datasets, where the file path may contain important semantic information that was previously not retrievable.
This processor only works in the “DSS” engine for prepare (i.e. it cannot be used with Spark).
For more details, please see Enrich with record context
Many administrators wish to have more control over how projects are created. Example use cases include forcing a default code env or container runtime config, automatically creating a new code env, setting up authorizations, setting up UIF settings, creating a Hive database, …
This led many administrators to deny project creation to users, which in turn increased their own administrative burden.
With project creation macros, administrators can delegate the creation of projects to users, but the project will be created using administrator-controlled code, in order to perform additional actions or setup.
For more details, please see Creating projects through macros
It is now possible to configure each scenario step to retry a given number of times, with a configurable delay between retries.
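The retry semantics can be pictured with the following sketch. It is not DSS code: `run_with_retries`, `max_retries` and `delay_seconds` are illustrative names, and the exact behaviour of the scenario engine (for instance, which failures trigger a retry) is an assumption.

```python
import time


def run_with_retries(step, max_retries, delay_seconds, sleep=time.sleep):
    """Run `step`; on failure, retry up to `max_retries` more times.

    `sleep` is injectable so the waiting can be replaced in tests.
    """
    attempt = 0
    while True:
        try:
            return step()
        except Exception:
            if attempt >= max_retries:
                # Retries exhausted: let the last failure propagate.
                raise
            attempt += 1
            sleep(delay_seconds)
```

With `max_retries=2`, a step is attempted at most three times in total, with one pause before each retry.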
Dataiku DSS now supports signing SAML requests, for the cases where the SAML IdP requires it.
Plugins can now leverage a new infrastructure that allows their users to store per-user credentials, and to perform OAuth flows.
This is particularly useful for plugins that need to connect to OAuth-protected data sources. With this new infrastructure, your plugin can allow each user to access their own data after performing the OAuth authentication flow through DSS.
For more details, please see Parameters
A new visual recipe merges the content of multiple managed folders into one “stacked” managed folder.
Webapps can now be deployed on Kubernetes. This allows having multiple backends serving a webapp.
- Fixed “inherit from host” network on AKS
- Added ability to set Kubernetes version on EKS
- Fixed potential generation of too long Kubernetes namespaces
- Automatically set spark.master when using Managed-Spark-on-Kubernetes on a non-managed Kubernetes cluster
- Added support for Hortonworks HDP 3.1.4
- Fixed potential infinite loop when building Spark pipelines
- Automatically cleanup pods generated when using interactive SparkSQL on Kubernetes
- Added variables expansion in Spark configuration
- Test of container execution configuration now properly uses the active cluster
- BigQuery: Added support for “append”
- GCS: Fixed slow read
- GCS: Added proxy support
- PostgreSQL: Fixed ability to use custom JDBC URL
- FTP: Fixed file format detection
- MySQL: Fixed duplicate column names in SQL notebook table list
- Flask webapp backend can now be multithreaded and multiprocessed. This allows greatly increasing the concurrency when the webapp performs blocking API calls but does not consume CPU (for example, if the webapp is waiting for a scenario to complete running)
- Fixed History tab
- Fixed restart of Bokeh webapps in dashboards
- Fixed possible incorrect detection of “bigint” storage type instead of “string”, even in the presence of values with leading zeros
- Fixed SQL translation for column renamer when doing renames like A->B, B->C
- Sync recipe: GCS to BigQuery fast-path: added support for data stored in mono-regional locations
- Sync recipe: Redshift to S3 fast-path: fixed support for @ in column names
- Fixed strict conformance of generated PMML models
- Fixed impact coding when “impute missing” is set to “drop rows”
- Fixed ability to run Evaluation recipe with Keras Deep Learning models on Kubernetes
- Added “revert design to this session” for clustering models
- Fixed XGBoost early stopping when the best iteration is the first one
- Fixed support for Tensorboard with Tensorflow >= 1.10
- Fixed regression on dataiku.get_custom_variables(typed=True) - type will now be preserved
- Added dataiku.Project().get_variables and dataiku.Project().set_variables to get/set project variables in a recipe in a way that will be directly reflected
- Fixed insights.save_plotly, insights.save_bokeh, … in Python 3
- Added API to obtain credentials for a connection directly in Python code (if authorized)
- Added API to delete a scenario
- Added API to delete a file from a managed folder
- Made it possible to work on developing plugin recipes and clusters outside of DSS
- Added dkuGetProjectVariables and dkuSetProjectVariables to get/set project variables in a recipe in a way that will be directly reflected
- Added API to delete a file from a managed folder
- Various performance enhancements, especially for instances with high concurrency of users
- Fixed wrong date displayed in report mail when aborting a scenario
- Fixed ability to clear old job logs from the UI
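As an illustration of the `dataiku.Project().get_variables` and `dataiku.Project().set_variables` calls listed among the Python API changes above, here is a minimal sketch. `increment_counter` and the variable name `run_count` are hypothetical, and the sketch assumes that `get_variables` returns a dictionary with a `standard` entry holding the project variables.

```python
def increment_counter(project, name="run_count"):
    """Increment a numeric project variable and write it back.

    `project` is expected to expose get_variables() and set_variables(),
    as dataiku.Project() does. The "standard" key is an assumption about
    the layout of the returned dictionary.
    """
    variables = project.get_variables()
    standard = variables.setdefault("standard", {})
    standard[name] = int(standard.get(name, 0)) + 1
    project.set_variables(variables)
    return standard[name]


# Inside a DSS Python recipe, usage would look like:
#   import dataiku
#   increment_counter(dataiku.Project())
```

Passing the project object in as an argument keeps the helper testable outside DSS with a stand-in object.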