DSS 8.0 Release notes
- Migration notes
- Version 8.0.1 - July 31st, 2020
- Version 8.0.0 - July 15th, 2020
- New features
- Dataiku Applications
- Model Document Generation
- Flow Zones
- Advanced hyperparameter searching
- Programmatic usage of Row-level-interpretability
- Support for Pandas 1.0
- Centralization of audit trail
- Centralization of API node query logs
- Compute resource usage reporting
- Plugin uninstall
- Public webapps and impersonation in webapps
- Tag categories
- Other notable enhancements
- Improved Visual ML experience
- New users and authentication management APIs
- Enhanced programmatic flow building APIs
- Enhanced support for container images
- Experimental support for Openshift
- Managed Kubernetes namespaces and quotas
- Pod tolerations, affinity and node selectors
- Import notebooks
- Enhanced API node audit logging
- Disabling users
- Instance-wide default code env
- Instance-wide default containerized execution config
- Improved “Performance” ML heuristics
- Other enhancements and fixes
- New features
- From DSS 7.0: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
- From DSS 6.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 6.0 -> 7.0
- From DSS 5.1: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.1 -> 6.0, 6.0 -> 7.0
- From DSS 5.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0
- From DSS 4.3: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0
- From DSS 4.2: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0
- From DSS 4.1: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0
- From DSS 4.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.0 -> 4.1, 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0
- Migration from DSS 3.1 and below is not supported. You must first upgrade to 5.0. See DSS 5.0 Release notes
It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
Automatic migration from previous versions (see above) is supported, but there are a few points that need manual attention.
- The commands to build base images for container execution and API deployer have changed. All base images are now built using
- The legacy “Hadoop 2” standalone packages for Hadoop and Spark integration have been removed. Please use the universal
Some features that were previously deprecated are now removed or unsupported.
- Support for Spark 1 (1.6) is removed. We strongly advise you to migrate to Spark 2. All Hadoop distributions can use Spark 2.
DSS 8.0 deprecates support for some features and versions. Support for these will be removed in a later release.
- As a reminder from DSS 7.0, support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.
- As a reminder from DSS 7.0, support for Microsoft HDInsight is now deprecated and will be removed in a future release. We recommend that users plan a migration toward a Kubernetes-based infrastructure.
- As a reminder from DSS 7.0, support for Machine Learning through Vertica Advanced Analytics is now deprecated and will be removed in a future release. We recommend that you switch to in-memory machine learning models. In-database scoring of models trained in-memory will remain available.
- As a reminder from DSS 7.0, Support for Hive SequenceFile and RCFile formats is deprecated and will be removed in a future release.
- As a reminder from DSS 6.0, support for Pig is deprecated. We strongly advise you to migrate to Spark.
DSS 8.0.1 is a bugfix release. For a summary of major changes in 8.0, see below.
- Prepare recipe: Fixed autocomplete of column name when using “multiple columns” step mode
- Prepare recipe: Improved error handling of the “Rename columns” processor when the step has just been created
- Fixed display of “File in folder” dataset when using Flow zones
- Fixed display of “Metrics” dataset when using Flow zones
- Fixed behaviour of the “Create prediction model” inside an analysis.
- Fixed display of the AutoML dialog images on Chrome
- Fixed the “View original analysis” button of saved models when the analysis has been deleted
- Fixed silent failure when clicking on the ‘Lab’ button when the user does not have the right user profile
- Fixed creation of the DSS Core Designer tutorials
- Fixed remapping of code environments when importing projects
DSS 8.0.0 is a major upgrade to DSS with major new features.
Dataiku Applications allow Dataiku designers to make their projects reusable and consumable by business users. Once a designer has made a project available as an application, business users can create their own instances of the application, set parameters, upload data, run the applications, and directly obtain results.
For more details, please see Dataiku Applications.
In regulated industries, data-scientists have to document ML models, at creation and after every change for traceability. This is often tedious. DSS now features the ability to automatically generate a DOCX document from a machine learning model.
Designers can upload their own DOCX template with placeholders that will automatically be replaced by information, explanations and charts from the ML model. Model Document Generation covers the advanced result screens of DSS Visual ML extensively, allowing the creation of rich documents.
For more details, please see Model Document Generator.
Data Science projects tend to quickly become complex, with a large number of recipes and datasets in the Flow. This can make the Flow hard to read and navigate.
Flow Zones are a completely new way to organize bigger flows into more manageable sub-parts, called zones.
You can now define your zones in the Flow, and assign each dataset, recipe, … to a zone. The zones are automatically laid out in a graph, like super-sized nodes. You can work within a single zone or the whole flow, and collapse zones to create a simplified view of the flow.
For more details, please see Flow zones.
In addition to the already-existing grid searching for hyperparameters, DSS can now perform Random search and Bayesian search for faster and more thorough search for the best set of hyperparameters.
For more details, please see Advanced models optimization.
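To illustrate the difference between the search strategies, here is a minimal, generic sketch that does not use any DSS API. The objective function, parameter names and search space are hypothetical stand-ins for a cross-validated model metric. Grid search evaluates every combination, while random search samples a fixed budget of combinations. Bayesian search, which picks the next candidate based on previous results, is not shown.

```python
import itertools
import random


def grid_search(objective, space):
    """Evaluate every combination in the grid and return the best (score, params)."""
    names = list(space)
    best = None
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def random_search(objective, space, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best (score, params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


# Hypothetical objective standing in for a cross-validated metric (higher is better).
def toy_objective(params):
    return -(params["max_depth"] - 6) ** 2 - (params["learning_rate"] - 0.1) ** 2


# Hypothetical search space: 5 x 4 = 20 combinations for grid search.
SPACE = {"max_depth": [2, 4, 6, 8, 10], "learning_rate": [0.01, 0.05, 0.1, 0.3]}
```

With this space, `grid_search(toy_objective, SPACE)` runs 20 evaluations, whereas `random_search(toy_objective, SPACE, n_iter=5)` runs only 5 and may or may not find the same optimum.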
DSS 7.0 added support for row-level interpretability for Machine Learning models. This allows you to get a detailed explanation of why a Dataiku model made a given prediction, even when said model is a “black-box” model.
In DSS 7.0, Row-level interpretations were available in the UI, and as the output of the scoring recipe.
DSS 8.0 adds the ability to programmatically obtain explanations through the API node, and also through the Saved Model Python API.
For more details, please see Exposing a visual prediction model.
In addition to their “Visual re-use by business users” usage, Dataiku Applications can also be used to reuse an entire flow as if it were a single recipe. This allows designers to quickly design complex flows while making use of “building blocks” built by other designers, without having to maintain the complexity of the underlying reused flow.
For more details, please see Application-as-recipe.
Dataiku now supports Pandas 1.0 (in addition to maintained support for the legacy 0.23 version).
Support for Pandas 1.0 is only available when using a code env. Pandas 1.0 is only compatible with Python >= 3.6.1, so only code envs using Python 3.6.1 or above can use Pandas 1.0.
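The version constraint above can be expressed as a small check. This is only a sketch: the function name is hypothetical and it does not query DSS code envs; it simply compares a Python version tuple against the 3.6.1 minimum stated above.

```python
import sys

# Minimum Python version for Pandas 1.0, as stated in these release notes.
MIN_PYTHON_FOR_PANDAS_1 = (3, 6, 1)


def can_use_pandas_1(python_version=None):
    """Return True if a (major, minor, micro) Python version can run Pandas 1.0.

    Defaults to the version of the interpreter running this code.
    """
    if python_version is None:
        python_version = tuple(sys.version_info[:3])
    return tuple(python_version) >= MIN_PYTHON_FOR_PANDAS_1
```

For example, `can_use_pandas_1((3, 6, 0))` is false, so a code env on that interpreter would have to stay on the legacy Pandas version.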
There are multiple use cases for centralizing audit logs from multiple DSS nodes in a single system.
Some of these use cases include:
- Customers with multiple instances want a centralized audit log in order to grab information like “when did each user last do something”.
- Customers with multiple instances want a centralized audit log in order to have a global view of the usage of their different nodes, and of compliance with their license
- Compute Resource Usage reporting capabilities use the audit trail, and make more sense if fully centralized. You may want to cross that information with HR resources, department assignments, …
- Most MLOps use cases require centralized analysis of API node audit logs
DSS now features a complete routing dispatch mechanism for these use cases, with the ability to centralize audit logs from multiple machines to a central location, and enhanced capabilities for analyzing audit logs within DSS.
For more details, please see Audit trail.
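As an illustration of the first use case above (“when did each user last do something”), here is a minimal sketch of the kind of analysis a centralized audit log makes possible. The event structure and field names (`node`, `user`, `timestamp`) are assumptions made for this example and do not reflect the actual DSS audit log format.

```python
from datetime import datetime


def last_activity_per_user(events):
    """Map each user to the datetime of their most recent event across all nodes."""
    latest = {}
    for event in events:
        when = datetime.fromisoformat(event["timestamp"])
        user = event["user"]
        if user not in latest or when > latest[user]:
            latest[user] = when
    return latest


# Hypothetical events collected from several nodes into one central log.
EVENTS = [
    {"node": "design-1", "user": "alice", "timestamp": "2020-07-01T09:00:00"},
    {"node": "automation-1", "user": "alice", "timestamp": "2020-07-10T14:30:00"},
    {"node": "design-2", "user": "bob", "timestamp": "2020-07-03T08:15:00"},
]
```

Because the events come from all nodes, the result reflects each user's activity across the whole deployment rather than on a single instance.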
Building on audit log centralization, you can now also centralize API node query logs. This allows you to set up a feedback loop for your MLOps strategy, in order to analyze the predictions made by the API node and detect either input data drift or model performance drift.
For more details, please see Configuration for API nodes.
DSS acts as the central orchestrator of many computation resources, from SQL databases to Kubernetes. Through DSS, users can leverage these elastic computation resources and consume them. It is thus very important to be able to monitor and report on the usage of computation resources, for total governance and cost control of your Elastic AI stack.
DSS now includes a complete stack for reporting and tagging compute resources. For more details, please see Compute resource usage reporting.
It is now possible to uninstall plugins, both from the UI and API. Trying to uninstall a plugin will automatically warn you if the plugin is still in use.
Two new features reinforce the ability to serve webapps to a large number of users:
- Webapps can now be shared with people who are not DSS users and do not have a DSS account. This allows you to share webapps widely across the whole company. For more details, please see Public webapps.
- Webapp backend code can now perform API calls to the Dataiku API on behalf of the end-user viewing the webapp, with full traceability of the end-user identity. This allows better governance and traceability of actions performed on behalf of users. For more details, please see Webapps and security.
Administrators can now define tag categories. Tag categories allow you to create custom “fields” in the form of tags, each with a predefined set of values.
Categorized tags can then be set easily by the end user with validation on the values.
For example, you could create a tag category for the responsible team, one for the department, one for the brand that you’re working on, …
Tag categories can be created and managed by the administrator from Administration > Settings > Tag categories.
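Conceptually, a tag category behaves like a field with an enumerated set of allowed values. The sketch below illustrates that validation idea only; the category names, values and function name are hypothetical, and this is not how DSS stores or checks tags.

```python
# Hypothetical tag categories, each with its predefined set of allowed values.
TAG_CATEGORIES = {
    "team": {"data-engineering", "marketing-analytics"},
    "department": {"finance", "sales"},
    "brand": {"brand-a", "brand-b"},
}


def is_valid_categorized_tag(category, value, categories=TAG_CATEGORIES):
    """Return True if the value belongs to the predefined values of the category."""
    return value in categories.get(category, set())
```

Here `is_valid_categorized_tag("department", "finance")` is accepted, while a value outside the predefined set, or an unknown category, is rejected.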
The Visual ML user experience has been enhanced to streamline the creation of models and the understanding of the Dataiku Lab:
- Find the Lab associated to each dataset directly from the dataset’s right panel
- Faster creation of ML models, with a streamlined workflow. You can now create an ML model in 3 clicks from a dataset
- Ability to create ML models directly from a column in the dataset’s Explore view
- Better explanations in-product for the various cross-validation strategies
The APIs for user and authentication management have been greatly enhanced with:
- Ability to set user secrets through API, either for end users or admins
- Ability to set per-user-credentials through API, either for end users or admins
- Ability to impersonate end-users using admin credentials
- Ability to manipulate user and admin properties through API
Many APIs have seen vast improvements, especially regarding the ability to entirely build and control Flows via the API:
- Ability to detect dataset settings (See Datasets (other operations))
- Much easier ability to create recipes (See Flow creation and management)
- Ability to traverse the Flow graph (See Flow creation and management)
- Ability to compute and set output schema for recipes (See Recipes)
- Ability to propagate schema across entire flows (See Flow creation and management)
- Ability to manage Flow zones (See Flow creation and management)
And many others. Please see Python APIs for a complete index of the Python API.
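Traversing a flow graph and propagating changes across it both rely on visiting items in dependency order. The following generic sketch shows that idea on a plain dictionary. It deliberately does not use the Dataiku Python API, and the flow contents and function name are hypothetical.

```python
from collections import deque


def build_order(flow):
    """Return flow items so that each appears after everything it reads from.

    `flow` maps an item name to the list of items it depends on. Every
    dependency must itself be a key of `flow`.
    """
    remaining = {item: len(inputs) for item, inputs in flow.items()}
    dependents = {item: [] for item in flow}
    for item, inputs in flow.items():
        for upstream in inputs:
            dependents[upstream].append(item)

    # Start from the items without inputs (the sources of the flow).
    ready = deque(sorted(item for item, count in remaining.items() if count == 0))
    order = []
    while ready:
        item = ready.popleft()
        order.append(item)
        for downstream in dependents[item]:
            remaining[downstream] -= 1
            if remaining[downstream] == 0:
                ready.append(downstream)

    if len(order) != len(flow):
        raise ValueError("flow contains a cycle")
    return order


# Hypothetical flow: two source datasets, a join recipe, then a training recipe.
FLOW = {
    "customers": [],
    "orders": [],
    "join_recipe": ["customers", "orders"],
    "customers_orders": ["join_recipe"],
    "train_recipe": ["customers_orders"],
}
```

Processing items in the returned order guarantees that the inputs of a recipe have been handled before the recipe itself, which is what a schema propagation across a whole flow needs.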
All three kinds of container images (containerized execution, Spark-on-Kubernetes and API deployer) are now built on a single CentOS 7 base.
This release brings the following enhancements:
- Support for CUDA 10.0 and 10.1 in containers
- Full support for Python-3-only containers
- Greatly enhanced customization capabilities, including the ability to use a proxy
- Ability to use prebuilt images for faster image builds
DSS 8.0 adds experimental support for Openshift as a Kubernetes runtime.
For more details, please see Using Openshift.
DSS can now automatically create Kubernetes namespaces for both containerized execution and Spark-on-Kubernetes. Namespaces can be defined using variable expansion, in order to create namespaces per user/team/project/…
DSS can automatically apply policies to the dynamic namespaces, notably resource quotas (in order to limit the total amount of computation/memory available to a namespace/user/team/project/…) and limit ranges (in order to set default resource control for computations running in the dynamic namespace).
For more details, please see Dynamic namespace management.
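The sketch below illustrates the two ideas described above: expanding variables into a per-project or per-user namespace name, and attaching a resource quota to that namespace. The `${project}` and `${user}` variable names are hypothetical and are not the actual DSS variable names. The `ResourceQuota` manifest uses standard Kubernetes fields, but the way DSS generates and applies such objects is not shown here.

```python
import re
from string import Template


def expand_namespace(template, variables):
    """Expand ${name} placeholders and normalize to a valid Kubernetes namespace.

    Namespace names must be lowercase alphanumerics or '-', start and end
    with an alphanumeric character, and be at most 63 characters long.
    """
    raw = Template(template).substitute(variables)
    name = re.sub(r"[^a-z0-9-]+", "-", raw.lower()).strip("-")
    name = name[:63].rstrip("-")
    if not name:
        raise ValueError("namespace name is empty after normalization")
    return name


def resource_quota(namespace, cpu, memory):
    """Build a Kubernetes ResourceQuota manifest limiting total CPU and memory."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute-quota", "namespace": namespace},
        "spec": {"hard": {"limits.cpu": cpu, "limits.memory": memory}},
    }
```

For example, the hypothetical template `"dss-${project}-${user}"` with project `CHURN_2020` and user `Alice.Smith` yields a namespace that can then be passed to `resource_quota(namespace, "8", "32Gi")`.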
You can now add custom Kubernetes tolerations, affinity statements or node selectors in order to control more precisely the placement of your pods on Kubernetes.
For more details, please see Dynamic namespace management.
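For reference, here is what these three placement controls look like as standard Kubernetes pod-spec fields, written as a Python dictionary. The label keys, taint values and helper name are hypothetical examples, and the way DSS accepts these settings in its configuration is not shown.

```python
import copy

# Standard Kubernetes pod-spec fields; label and taint values are hypothetical.
PLACEMENT = {
    # Only schedule on nodes carrying this label.
    "nodeSelector": {"disktype": "ssd"},
    # Allow scheduling on nodes tainted with dedicated=dss:NoSchedule.
    "tolerations": [
        {"key": "dedicated", "operator": "Equal", "value": "dss", "effect": "NoSchedule"}
    ],
    # Require nodes whose architecture label is amd64.
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]}
                        ]
                    }
                ]
            }
        }
    },
}


def apply_placement(pod_spec, placement=PLACEMENT):
    """Return a copy of pod_spec with the placement fields merged in."""
    merged = copy.deepcopy(pod_spec)
    merged.update(copy.deepcopy(placement))
    return merged
```

Node selectors are the simplest mechanism, tolerations let pods run on tainted (typically dedicated) nodes, and affinity statements express richer matching rules.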
API node audit logging now includes project key / saved model id / saved model version for prediction endpoints.
In addition, you can ask DSS to dump and/or audit the post-enrichment data when using query enrichments.
For more details, please see Exposing a visual prediction model.
It is now possible to disable users instead of outright deleting them. Disabled users cannot log in, cannot run scenarios, and don’t consume licenses.
Disabling/enabling users can be done through the UI and API.
You can now select a default code env which will be applied by default across all projects.
You can now select a default containerized execution config which will be applied by default across all projects.
- Fixed wrong value in partitioning “Test dependencies” function
- Fixed navigation issue with cross-project datasets leading to loss of flow centering
- Fixed issue when copying a subflow containing HDFS datasets to a new project
- Fixed icons display issues for plugin recipes
- Fixed wrongful attempt to write BigQuery datasets when importing a project
- Project duplication will now only duplicate uploaded datasets by default
- Fixed dynamic select widget for custom exporters
- Python plugin recipes can now accept BigQuery datasets as outputs
- Fixed issue when removing values from a “Remove rows on value” processor
- Extract Date components processor: Extracting minutes, seconds and milliseconds can now run in SQL databases
- Fixed support for Kubernetes > 1.16
- Spark install can now setup better defaults tuned for Kubernetes
- Cost matrix gain was added to the list of metrics displayed in the all metrics screen
- “Max feature proportion” on tree ensemble algorithms is now hyperparameter-searchable
- PMML export now outputs probabilities and can now use the model-specified threshold
- API node: Fixed wrongful scoring of rows that were removed by the preparation script
- Add more parameters to the Isolation Forest algorithm
- Fixed issues with empty columns with unicode column names
- Fixed clustering scoring when outliers detection is enabled and dataset to score is very small
- Code of custom models is now displayed in results
- Fixed issue when DSS is installed with base Python 3.6 environment
- Properly show the Python version in the notebooks list
- Added ability to duplicate wiki articles
- Improved Slack integration with Slack Blocks
- Enhanced API for project folders - see Project folders
- Fixed API for pushing container base images