DSS 12 Release notes

Migration notes

How to upgrade

Pay attention to the warnings described in Limitations and warnings.

Migration paths to DSS 12

Limitations and warnings

Automatic migration from previous versions is supported (see above). Please pay attention to the following cautions, removal notices, and deprecation notices.

Cautions

  • The SQL engine can now be automatically selected on prepare recipes. In case of issues on prepare recipes that were working prior to the upgrade, you can revert to the DSS engine by clicking on the “Engine: In database” button in the prepare recipe settings.

  • Similarly, the Spark engine can now be automatically selected more eagerly when the storage and formats are compatible with fast Spark execution. In case of issues on recipes that were working prior to the upgrade, you can revert to the DSS engine by clicking on the “Engine: Spark” button in the recipe settings.

  • The Bokeh package has been removed from the builtin Python environment, as it provided a very old version of Bokeh. If you have Bokeh webapps, please make sure to use a code environment.

  • The Seaborn package has been removed from the builtin Python environment. If you use this package, please make sure to use a code environment.

  • For Cloud Stacks setups, the OS for the DSS nodes has been updated from CentOS 7 to AlmaLinux 8 (which is a RedHat-compatible distribution similar to CentOS). Custom setup actions may require some updates.

  • For Cloud Stacks setups, R has been upgraded from R 3 to R 4. You will need to rebuild all R code envs. Some updates to packages may be required.

  • For Cloud Stacks, the builtin Python environment has been upgraded from Python 3.6 to Python 3.9

  • The versions of some packages in the builtin Python environment have been upgraded, and your code may require updates if you are not using your own code environment. The most notable updates are:

    • Pandas 0.23 to 1.3

    • Numpy 1.15 to 1.21

    • Scikit-learn 0.20 to 1.0

    • Matplotlib 2.2 to 3.6

  • The Python packages used by Visual Machine Learning have changed, in the built-in code environment and in suggested packages. Notably, if you have KNN or SVM models trained using the built-in code environment, you will need to retrain these models to be able to use them for scoring.

Support removal

Some features that were previously announced as deprecated are now removed or unsupported.

  • Support for H2O Sparkling Water as a backend for Visual Machine Learning has been removed

Deprecation notices

DSS 12 deprecates support for some features and versions. Support for these will be removed in a later release.

  • Support for Cloudera CDH 6

  • Support for Cloudera HDP 3

  • Support for Amazon EMR 5

  • Support for Java 8

Version 12.2.2 - September 25th, 2023

DSS 12.2.2 is a bugfix release

Machine Learning

  • Fixed the metrics comparison chart for time series forecasting models in the models list

  • Fixed a rare race condition causing training failures with distributed hyperparameter search

Datasets

  • S3: Reduced memory consumption when writing multiple files on S3 in parallel

  • BigQuery: Fixed memory leak

  • Editable dataset: Fixed pressing “enter” in the “edit column” modal not closing the modal

  • Editable dataset: Fixed redo mechanism when a new row had been added

  • Fixed renaming of partitioned datasets causing downstream recipes to fail at runtime

  • Fixed inability to import Excel files containing Boolean cells computed with formulas

Recipes

  • Join: Fixed occasional job failures with DSS engine

  • Join: Fixed wrongly detected duplicate column name when 2 columns only differ by their case

  • Prepare: Fixed “Extract Date components” with SQL engine

  • Prepare: Fixed display issue when rearranging steps order

  • Sync: Fixed schema and catalog not taken into account when executing a Sync recipe from a Databricks dataset to an Azure Blob storage dataset.

  • Shell: Fixed quotes incorrectly added around variables

  • Fixed expansion of variables in partitioning when running a recipe from its edition screen

Deployer

  • API Designer: Fixed inability to run test queries with Python endpoints

  • Improved error message about deployer hooks code

  • Fixed an issue with the selection of core packages for Python 3.8 code environments on deployer and automation nodes

  • Added a “Validate” button in the Deployer Hooks’ code edition screen

Experiment Tracking

  • Added ability to ignore invalid SSL certificates in experiment tracking

  • Fixed several issues with starting runs (when no end time is specified, or when a name is specified but no tags)

Governance

  • Fixed workflow step not being displayed at creation time when there is one mandatory field defined (Advanced Govern only)

  • Fixed the filling of the signoff history on step deletion

Misc

  • ElasticSearch: Fixed invalid projectKey passed in custom headers

  • Charts: Fixed empty legend section displayed in the left pane for charts in Insight view mode

  • Fixed “Assumed time zone” not displaying the correct default value on existing connections

  • Webapps: Fixed ability to use dkuSourceLibR in Shiny webapps

  • Fixed required permissions to import and export projects using the public API (aligning to UI behavior)

Version 12.2.1 - September 12th, 2023

DSS 12.2.1 is a bugfix release

Machine Learning

  • Fixed UI issue disabling the creation of AutoML Clustering models

Cloud Stacks

  • Fixed the reprovisioning of DSS instances from Fleet Manager following a change in PostgreSQL repositories

Misc

  • Fixed a memory leak enumerating Azure Storage containers with very large number of files

Version 12.2.0 - September 1st, 2023

DSS 12.2.0 is a significant new release with new features, performance enhancements, and bugfixes.

New features and enhancements

Custom aggregations on charts

UDAFs (User-Defined Aggregation Functions) allow users to create custom aggregations, based on a powerful formula language, directly from the chart builder.

For example, you can directly create an aggregation of sum(sell_price - cost) to compute an aggregated gross margin, without having to first create that column.

Radar chart

The Radar chart is now available. Radar charts are a way of comparing multiple quantitative variables. This makes them useful for seeing which variables have similar values, or whether there are any outliers amongst them.

Radar Charts are also useful for seeing which variables are scoring high or low within a dataset, making them suited for displaying performance.

Govern Sign-off enhancements

Improvements to the sign-off feature allow you to:

  • Reset a finished sign-off

  • Reload an updated configuration from the Blueprint Designer

  • Create a sign-off on an active step if its configuration has been created afterwards

  • Set up a recurrence to automatically reset an approved sign-off

  • Have multiple feedback reviews per user

  • Edit and delete feedback reviews and approvals

  • Change the sign-off status to go back to a previous state

  • Send an email to the reviewers when the final approval is added or deleted

Additionally, a new validation option has been added in the sign-off configuration to prevent the workflow from going past an unapproved sign-off step.

It also comes with UI improvements such as:

  • Expand and collapse long feedback reviews

  • Display the sign-off description below the title

  • Show the feedback and approver groups with details info on which users are configured

Warning: Some changes have been made to the API around the sign-off feature. Pay attention to your usages of the Public API and, for Advanced Govern instances, to the logical hooks around the sign-off feature. On Advanced Govern instances, logical hooks that check the sign-off status (to prevent the workflow from going past an unapproved sign-off step) will no longer work in 12.2.0 due to the API changes. They can be replaced by the new validation option in the sign-off configuration. After enabling it, you will need to reset the corresponding sign-offs and reload their configuration.

PCA recipe

A new PCA recipe was added. The PCA recipe produces projections, eigenvectors and eigenvalues as three separate datasets.

You can create the PCA recipe from a PCA card in a dataset’s Statistics worksheets.
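For intuition, here is a minimal sketch of what the recipe computes, using scikit-learn (illustrative only; the recipe itself runs inside DSS, and the variable names are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in for the input dataset

pca = PCA().fit(X)
projections = pca.transform(X)         # -> "projections" output dataset
eigenvectors = pca.components_         # -> "eigenvectors" output dataset
eigenvalues = pca.explained_variance_  # -> "eigenvalues" output dataset
```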

External Models

External Models allow a user to surface within Dataiku a model available as an endpoint on SageMaker, Azure ML or Vertex AI. Those models can be used like other Saved Models, most notably for scoring and evaluation.

This feature is currently Experimental.

For more details, please see External Models.

Deployer Hooks

Deployer hooks allow administrators of a Project or API Deployment Infrastructure to define pre- and post-deployment hooks written in Python. For instance, a pre-deployment hook could perform some checks and prevent a deployment if they fail; a post-deployment hook could send a notification.
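A minimal sketch of the kind of check a pre-deployment hook could express; the `execute` entry point and `deployment` argument below are hypothetical placeholders, not the exact DSS hook interface:

```python
import urllib.request

def execute(deployment):
    """Hypothetical pre-deployment hook: block the deployment if a
    downstream dependency is unreachable."""
    try:
        urllib.request.urlopen("https://example.com/health", timeout=5)
    except Exception as exc:
        # Raising here surfaces a failure and prevents the deployment
        raise RuntimeError("Pre-deployment check failed: %s" % exc)
```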

Other enhancements and fixes

Flow

  • The “Records count” view now displays the exact records count under each dataset in the Flow

  • Added ability to export flow documentation when having read-only access to the project

  • Added ability to choose the name of the new zone when you copy a Flow zone

  • Added ability to copy a zone directly from the right panel

  • Fixed copying the default zone into a new zone duplicating flow objects into the original zone instead of the new zone

  • Fixed copying a zone not duplicating datasets without inputs

  • Fixed copying a zone to another project creating 2 zones into the destination project

  • Fixed “Recipe Engines” view not listing some engines such as “Snowflake to S3”.

  • Fixed creation of new datasets when creating a new recipe from “+ Recipe” button with no input selected

Datasets

  • BigQuery: Added ability to specify labels that will be applied to BigQuery jobs

  • Editable: Automatically add additional rows and columns when pasting data larger than the current table

  • Excel files: When selecting sheets by a pattern, matching sheets are now displayed

  • CSV: Fixed possible issue reading some CSV files

  • Snowflake: Fixed fast-path from cloud storage with date-partitioned datasets but non-date partitioning column

  • Snowflake: Fixed “Parse SQL dates as DSS date” setting not taken into account for Snowflake

  • Snowflake: Fixed issue with sync from non-SQL datasets with Spark engine

  • Prevented renaming datasets with the same name as a streaming endpoint

  • Fixed renaming datasets when only changing the case (from “DS1” to “ds1” for example)

Recipes

  • Generate features: Fixed failure when input dataset contains column names longer than what the output database can accept (the limit is 59 characters on PostgreSQL for example).

  • Split: Fixed adding a second input before selecting the output during creation

Data Catalog

  • Added ability to add multiple datasets to a Data Collection (either from the Flow or from a Data Collection)

Machine Learning

  • New feature: Causal Prediction now supports multiple treatments

  • New feature: Model comparisons now allow comparing feature importance between models

  • Fixed an issue where a failure to compute the feature importance of a model would cause the whole training to fail

  • Fixed failure to compute partial dependencies on features with a single value

  • Fixed missing option to use a Custom model in clustering model design settings

  • Fixed scoring of a model with Overrides using the Spark engine

  • Fixed missing dashboard model insight/tile option to show the Hyperparameter optimization report

  • Fixed incorrect aggregate computation of cost-matrix gain when using kfold cross-validation

  • Fixed possible hang of DSS when computing interactive scoring (What-if)

  • Fixed automatic selection of the code environment that could sometimes suggest an incompatible environment when creating a new modeling task

MLOps

  • When exporting a model to the MLflow format, add its required packages to the requirements.txt

  • In evaluation recipes, added ability to skip rescoring and use the prediction if provided in the evaluation dataset.

  • When computing univariate drift, better deal with missing categories by showing a very high PSI rather than having an infinite/missing value

  • With the public API, added ability to create custom model evaluation with arbitrary metrics.

  • Scoring recipes can now compute explanations for MLflow models

  • A model can now be deployed with the GUI from an Experiment Tracking run without being evaluated

  • Non classification/regression models can now be deployed with the GUI from an Experiment Tracking run

  • Monitoring wizard: only suggest the deployments that are relevant for the current project

Statistics

  • Added support for the FDR Benjamini-Hochberg method for p-values adjustment on the pairwise t-test and pairwise Mood test
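For reference, the Benjamini-Hochberg adjustment itself can be reproduced outside DSS with statsmodels (a sketch of the method, not of the DSS implementation; the p-values are made up):

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.020, 0.040, 0.300]  # p-values from pairwise tests
reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(adjusted)  # Benjamini-Hochberg (FDR) adjusted p-values
```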

Charts

  • New feature: Added ability to copy charts from one dataset to another

  • New feature: Added ability to customize tick marks

  • Scatter: Added ability to configure number of displayed records

  • Scatter: Various zoom and pan improvements

  • Scatter: Zoom and pan can now be persisted

  • Scatter: Fixed issues when there are too many colors

  • Bar charts: Improved color contrast for displayed values

  • Pivot table: Added ability to customize font size and color

  • Pie/Donut: Added option to position “others” group at the end

  • Treemap: Fixed tooltip color indicator

  • Added reset buttons for axis customization options

  • Improved zoom buttons on relevant charts (Treemap/Geometry/Grid/Scatter/Administrative filled/Administrative bubbles/Density Maps)

  • Added digit grouping formatting options

  • Fixed measure formatting update on tooltip

  • Fixed display formula for regression line

  • Increased precision for pivot table and maps tooltips

  • Improved legends display performance with many items

  • Fixed number formatting for reference lines on vertical bar and scatter charts

Workspaces and dashboards

  • Improved view/edit navigation on dashboards

  • Improved behavior of date range filter on dashboard

  • Fixed deletion of dashboard filters

  • Fixed dashboard export on air-gapped DSS instances

  • Added ability for users to override the name of workspace objects

  • Improved display of empty workspaces

  • Persist sort during a session on datasets

Coding

  • New feature: Added the ability to edit Jupyter notebooks in Visual Studio Code or JupyterLab via Code Studios

  • Project libraries: Added History tab to track, compare and revert changes.

  • Code Studios: automatically recover in case of network issues

  • Added ability to use the dataikuscoring library in the Python processor of the prepare recipe

  • Fixed ability to run a Python or R recipe from a SQL query dataset

  • Upgraded the builtin version of Visual Studio Code in Code Studios to 4.13

  • Fixed issues with uploading Jupyter notebooks from Databricks or Jupyter notebooks that do not specify a kernel

  • Code Studios: Fixed issue with Unicode characters in project libraries

  • Code Studios: Fixed ability to use Jupyter support in Visual Studio Code

Labeling

  • Input records with invalid or empty identifier / path / text are now ignored

Collaboration & Onboarding

  • Home page: Fixed an issue where clicking on a project folder after scrolling opened the wrong folder

  • Project activity > Contributors: Fixed error occurring on projects with a very large number of contributions

  • Help center: Added tutorials with progress tracking in Help > Educational Content > Onboarding

  • Project Version control: Added ability to create a tag from a commit

  • Project Version control: Added ability to push & pull tags when using a remote git repository

  • Project Version control: Fixed errors happening during force commit not being displayed

API Deployer

  • Pre-build required code environments during image build when deploying on a Kubernetes cluster, to speed up actual deployment

  • Added ability to add a commit and a tag when a bundle is created

  • Added an option to trust all certificates for infrastructures of static API nodes

  • Added support for variables for the specification of “New service id” in the “Create API service version” scenario step

  • Fixed running test queries on multi-endpoint API services

Project Deployer & Bundles

  • Added ability to add a commit and a tag when an API Service package is created

  • Better handling of bundles including a Saved Model with no active version: warn on pre-activation and activation, and raise a clearer exception when using the Saved Model

  • Static insights can now be included in bundles

Scenarios

  • Do not clear retry settings when disabling/enabling a scenario step

  • Added a new mail channel to send emails using Microsoft365 with OAuth.

Govern

  • Added new artifact admin permission that grants all permissions for a specific artifact

  • Added the ability to export an item and its content (workflow state, field values) to CSV or PDF files

  • Governed Project’s Kanban view now also includes projects using custom templates

  • Added the ability to add a project directly from a business initiative page

  • Fixed display issue with very long Blueprint name in the Blueprint Designer

  • Fixed standard deviation display issue on Model version metrics

  • Fixed display issue for field of type number and value 0

  • Improved the performance of some queries

Cloud Stacks

  • Fixed display issue of the “Please wait, your Dataiku DSS instance is getting ready” screen

  • Fixed missing display of some errors in Fleet Manager

  • Added warning when trying to set a data volume that is too small

  • Moved some temporary folders to the data volume to avoid filling the OS volume

  • Fixed default value for IOPS on EBS

  • Fixed issues making the Save button unavailable

Elastic AI

  • Fixed ability to create a SparkSQL recipe based on a SQL query dataset (it however remains a very bad idea)

  • Simplified interaction with Kubernetes for containerized execution: Kubernetes Jobs are not used anymore. DSS now creates pods directly

  • Added display of DSS user / project / … to Cluster Monitoring screens

  • GKE: Improved error message when gcloud does not have authentication credentials

  • GKE: Improved handling of pod and service IP ranges

  • GKE: Added support for spot VMs

  • Added support for using a proxy for building the API deployer base image with R enabled

Streaming

  • Fixed default code sample for Spark Scala Streaming recipe

  • Fixed default code sample for Python streaming recipe

  • Added ability to perform regular reads of datasets in a Spark Scala Streaming recipe

  • Fixed read of array subfields in Kafka+Avro

  • Fixed issue with using “recursive build downstream” in flow branches containing streaming recipes

Performance and scalability

  • Improved performance for listing jobs

  • Improved IO performance for starting up jobs

  • Improved memory usage

  • Fixed possible hang when creating an editable dataset from a large existing dataset

Security

  • Fixed credentials appearing in the logs when using Cloud-to-database fast paths

  • OpenID login: added ability to configure the “prompt” parameter of OpenID

  • User provisioning: clarified how group profile mappings are applied

  • Azure AD integration: Fixed support for users having more than 20 groups

  • OAuth2 authentication on API node: added configurable timeout for fetching the JWKs

  • Jupyter notebooks trust system is now on a per-user basis

Misc

  • Added settable random seed for pseudo random sampling methods, allowing for reproducible sampling.

  • Fixed display issue where the “Use global proxy” setting in connections was wrongfully reset

  • Analyses: Fixed adding or removing tags from the right panel

  • Improved display of code env usage in code env settings

  • Fixed cases where building a code env could silently fail

  • Fixed possible failure aborting a job

  • Fixed issue with displaying large RMarkdown reports

  • Fixed possible error in Jupyter

  • Fixed possible UNIX user race condition when starting a large number of webapps at once

  • dataiku.api_client() is now available from within exporter and fs-provider plugin components
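For example, an exporter or fs-provider component can now obtain a client to the DSS public API (a minimal sketch; the surrounding plugin scaffolding is omitted):

```python
import dataiku

# Inside an exporter or fs-provider plugin component
client = dataiku.api_client()
auth = client.get_auth_info()  # e.g. identify the user running the component
```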

Version 12.1.3 - August 17th, 2023

DSS 12.1.3 is a security, performance and bugfix release

Machine Learning

  • Fixed UI issue in model assertions

  • Fixed partial dependencies failure with sample weights

  • Fixed computation of partial dependencies when rows are dropped by processing

MLOps

  • Fixed possible failure to display model results for imported MLflow models built from recent scikit-learn versions

  • Fixed display of model results for imported MLflow models for which performance was not evaluated

  • Fixed display of API endpoint URL in API deployer

  • Fixed ability to deploy MLflow models that are neither tabular classification nor regression

  • Fixed Python requirements for exported MLflow models

Govern

  • Fixed validation error when custom templates have been deployed and standard ones have been archived

Dashboards

  • Fixed filter on “no value” when downloading dataset data from dashboards

Cloud Stacks

  • Fixed issue with authentication when upgrading Fleet Manager directly from 10 to 12.1

Performance

  • Improved performance for reading records with dates from Snowflake

  • Fixed potential slow query and failure on the “Automation monitoring” page

  • Fixed flooding of logs with bad data in Excel export

Security

Misc

  • Added the ability to embed Dataiku in another website through setting “SameSite=None” for cookies

  • Fixed Databricks sync to Azure/S3 with pass-through credentials when Unity Catalog is disabled

  • Fixed issues with display of list of scenarios in some upgrade situations

  • Fixed minor display issue in Wiki taxonomy tree

  • Fixed display of Flow in jobs page with big flows

Version 12.1.2 - July 31st, 2023

DSS 12.1.2 is a security, performance and bugfix release

Datasets

  • Explore: Fixed filtering of Decimal columns with “text facet” filtering mode

  • Editable dataset: increased display density

  • Editable dataset: fixed bad interaction with the Tab key

  • Editable dataset: improved column edition and autosizing experience

  • Editable dataset: fixed bad interaction with keyboard shortcuts while editing a column

  • Snowflake: Strongly improved performance of verifying table existence and importing tables

  • Presto/Trino: Strongly improved performance of verifying table existence and importing tables

  • Databricks: Fixed wrongful cleanup of temporary tables for auto-fast-write

Recipes

  • Prepare: Fixed a case where the formula parser would wrongfully ignore invalid formula and only execute parts of the formula

  • Prepare: removed a wrongful warning regarding dates with SQL engine

  • Prepare: fixed wrongful data loss when using “if then else” to write into an existing column with SQL engine

  • Prepare: fixed number of steps appearing in the description in the right panel of the recipe

  • Window: Fixed pre-computed columns when “always retrieve all” is selected and Spark engine is used

  • Window: Fixed display when “always retrieve all” is selected

Machine Learning

  • Removed ability to export train set if datasets export is disabled

  • Fixed wrongful binary classification threshold in evaluation recipe

  • Fixed wrongful confusion matrix not taking threshold into account in drift evaluation

  • Fixed precision-recall curve with Python 2.7

  • Fixed what-if when a feature is empty and selected to “drop row if empty”

  • Fixed SQL scoring on BigQuery

Labeling

  • Object detection: fixed an issue when a single image has more than 5 labels

Dashboards and workspaces

  • Fixed display of Dataiku applications viewed through a workspace

Webapps

  • Fixed ability to retrieve headers for Bokeh 2

Dataiku Govern

  • Fixed improper status computation on the review step when there are unvalidated signoffs in the following steps

  • Fixed display of SSO settings

Elastic AI

  • Fixed ability to run Spark History Server behind a reverse proxy

Cloud Stacks

  • Fixed issues saving forms in the Fleet Manager UI

  • Pre-create the “cpu/DSS” cgroup to make it easier to control CPU through cgroups

  • Increased system limits that were too low on some components

Performance and scalability

  • Fixed performance issue when renaming datasets on extremely large instances

  • Fixed possible instance crash when using the “compute ngrams” prepare processor with extremely large number of ngrams

  • Improved performance of the “Automation monitoring” page

Miscellaneous

  • Removed extra whitespace in logging remapping rules to avoid hard-to-investigate issues

Version 12.1.1 - July 19th, 2023

DSS 12.1.1 is a security, performance and bugfix release

Statistics

  • Fixed STL decomposition analysis when resampling is disabled

Machine Learning

  • Fixed charts on predicted data when a date filter is set

Performance and Scalability

  • Fixed performance issue when switching from recipe to notebook, when the recipe code contains a lot of spaces

  • Fixed issue with notebooks startup when kernel takes too long to start

Security

Version 12.1.0 - June 29th, 2023

DSS 12.1.0 is a significant new release with new features, performance enhancements, and bugfixes.

New features and enhancements

Dataset preview on the Flow

You can now preview the content of datasets directly from the Flow. Simply click on “Preview”.

Databricks Connect

Support for Databricks Connect was added in Python recipes.

It is now possible to push down PySpark code to Databricks clusters using a Databricks connection.
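In a Python recipe, the pushed-down code is ordinary PySpark. A minimal sketch, assuming the usual DSS Spark helpers (`dataiku.spark`) and placeholder dataset names:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()  # backed by the Databricks cluster
sqlContext = SQLContext(sc)

# "input" and "output" are placeholder dataset names
df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("input"))
result = df.groupBy("some_column").count()
dkuspark.write_with_schema(dataiku.Dataset("output"), result)
```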

More charts customization and features

Many new capabilities and customization options were added to charts and dashboards

  • Added the ability to set the position of the legend of charts on dashboard

  • Added the ability to customize font size and colors for values, legend items, reference lines, axis labels and axis values

  • Added “relative date range” filters for charts and dashboards (“last week”, “this year”, …)

  • Added ability to force displayed values to overlap

  • Bar charts: Added reference lines (horizontal lines)

  • Scatter plots: Added reference lines (horizontal lines)

  • Scatter plots: Added regression lines

  • Scatter plots: Added zoom and pan

New join types

The join recipe now supports 2 new types of joins:

  • Left anti join: keep rows of the left dataset without a match from the right

  • Right anti join: keep rows of the right dataset without a match from the left
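The semantics of the two new join types, sketched in pandas (column and dataset names are placeholders; the recipe computes this visually or in-database):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3]})
right = pd.DataFrame({"id": [2, 3, 4]})

m = left.merge(right, on="id", how="outer", indicator=True)
left_anti = m.loc[m["_merge"] == "left_only", ["id"]]    # only in left  -> 1
right_anti = m.loc[m["_merge"] == "right_only", ["id"]]  # only in right -> 4
```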

Text Labeling

In addition to image classes and object bounding boxes, Dataiku managed labeling can now label text spans in text fields.

Visual Time series decomposition

Visual Statistics now includes visual STL time series decomposition (trend and seasonality)
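The STL decomposition itself can be reproduced with statsmodels (an illustration of the method, not of the DSS implementation; the series below is a made-up monthly one):

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

series = pd.Series(
    [float(i % 12) + i / 10.0 for i in range(48)],  # seasonal + trend
    index=pd.date_range("2020-01-01", periods=48, freq="MS"),
)

res = STL(series, period=12).fit()
trend, seasonal, resid = res.trend, res.seasonal, res.resid
```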

New editable dataset UI

A new UI for the “editable” dataset adds many new capabilities:

  • Easier resizing of columns

  • Auto-sizing of columns

  • Click-and-drag to fill

  • Ability to add several rows and columns at once

  • Ability to reorder & pin columns with drag-and-drop

  • Fixed various issues with undo/redo

  • Added warning when attempting concurrent edition

Excel sheet selection enhancements

Sheet selection for Excel files was revamped. It is now possible to select sheets manually or via rules based on their names or indexes, or to always select all sheets.

In addition, it is now possible to add a column containing the source sheet name.

Enhanced user management capabilities

  • Added the ability to automatically provision users at login time from their SSO identity

  • Added Azure AD integration to provision and sync users

  • Added the ability to explicitly resync users (either from the UI or from the API) from their LDAP or Azure AD identity

  • Added the ability to browse LDAP and Azure AD directories to provision users from their LDAP or Azure AD identity at will (without them having to login first)

  • Added the ability to define and use custom authentication and provisioning mechanisms

Other enhancements and fixes

Machine learning

  • New feature: Added a Precision-Recall curve to classification model reports, as well as Average-Precision metric approximating the area under this curve

  • Added support of ML Overrides to Model Documentation Generation

  • Added indicators in What-if when a prediction was overridden

  • Now showing preprocessed features in model reports even when K-fold cross test was enabled on this model

  • Added option to export the data underlying Shapley feature importance

  • Sped up training of partitioned models

  • Added a “model training diagnosis” in Lab model trainings, to download information needed for troubleshooting technical issues

  • Fixed reproducibility of Ridge regression models

  • Fixed the computation of the multiclass ROC AUC metric in the rare case of a validation set with only 2 classes

  • Fixed a possible scoring failure of ensemble models on the API node

  • Fixed overridden threshold of binary classification model when scoring with Spark, Snowflake (with Java UDF) or SQL engines

  • Fixed a failure when an evaluation recipe was run on a spark-based model with either only a metrics output or only a scored output dataset

  • Fixed a failure to score time-based partitioned models using the python (original) backend when the partitioning column is a date or timestamp

  • Fixed a scoring failure when using time-based partitioning on year only

  • Fixed inability to delete an analysis containing a Keras / Tensorflow model

  • Fixed a condition where an erroneous user-defined metric would cause the whole training to fail

  • Fixed training failure caused by incorrect stratification of stratified group k-fold with some datasets

  • Fixed a possible hang of a containerized train when the training data is very large

  • Fixed broken Design page for modeling tasks in some rare cases

  • Fixed MLlib clustering with outlier detections

Time series forecasting

  • New feature: Added Model Documentation Generation for forecasting models

  • Added experimental support for forecasting models with more than 20000 series

  • Added option to sample the first N records sorted by a given column

  • Added ML diagnostics to the evaluation & scoring recipes, warning instead of failing when a time series is too short to be evaluated or resampled, or when a new series was not seen at train time by a statistical model

  • Added an option in multi-series forecasting models to ignore time series that are too short

  • Sped up the loading & display of multi-series forecasting models

  • Set the default thread count of forecasting models hyperparameter search to 1, to ensure full reproducibility

  • Fixed distributed hyperparameter search of time series forecasting models

  • Fixed evaluation recipe schema recomputation always using the saved model’s active version even when overridden in the recipe

  • Fixed failure when the time column contains timezone and using recent version of pandas

  • Fixed a training failure when some modalities of a categorical external feature are present in the test set but not in the train set

  • Fixed a failing train of multi-series models when an identifier column contains special characters in its name

  • Fixed a training failure when using Prophet with the growth parameter set to “logistic”

Computer Vision

  • Added support for log loss metric in Image Classification tasks

  • Added ability to publish a Computer Vision model’s What-If page to a Dashboard

  • Fixed a possible failure when coming back to the What-If screen of Computer Vision models after visiting another page

  • Fixed a possible training failure when Computer Vision models are trained in containers

  • Fixed incorrect learning rate scheduling on Computer Vision model trainings

Charts & Dashboards

  • Fixed dashboard export with filter tile

  • Fixed dashboards briefly showing dataset insights unfiltered when opening

  • Stacked bars chart: Added ability to remove totals when “displaying value”

  • Bars: Fixed Excel export

  • Horizontal bars: Fixed X axis disappearing

  • Line charts: Fixed axis scale update on line charts

  • Pivot table: remove “value” column if only one measure is displayed

  • Scatter plots: Made maximum number of displayed points configurable

  • Maps: Fixed display of legends with “in chart” option

  • Boxplots: Fixed chart display when the minimum is equal to zero

  • Boxplots: Fixed display of min and max, now that a manual range can be set

  • Added reference lines in Excel export

  • Fixed Excel export for charts with measures

  • Fixed “export insight as image” not displaying legend

  • Fixed tooltip display on each subchart

  • Improved empty state and wording for workspaces

  • Fixed issue with selecting text in chart configuration forms

  • Fixed thumbnail generation when using manual axis

  • Fixed discrepancy in filter behavior between DSS and SQL engines when data contains null values

Notebooks

  • Added “Search notebooks” to easily search within ElasticSearch datasets

Code Studios

  • Streamlit: Allowed changing the config.toml

  • Streamlit: Allowed specifying a code env for the Streamlit block, allowing a custom Streamlit version to be chosen

  • JupyterLab: Fixed block building failing on AlmaLinux

  • JupyterLab: Added warning when stopping Code Studio and some files have been written in JupyterLab’s root directory

  • JupyterLab: Fixed renaming folders whose names contain whitespaces in JupyterLab

  • Fixed unexpected visual behavior when clicking on a DSS link inside Code Studio while not authorized

  • Fixed wrongful display of old log messages

  • Fixed “popout the current tab” button not working under some circumstances

  • Set ownership of code-envs created with the “add code environment” block to dataiku user

Flow

  • Added “stop at flow zone boundary” option when building multiple datasets at once.

  • Fixed incorrect layout when a metrics dataset or a cycle is present in a flow zone

  • Fixed unbuilt datasets appearing as built after a change in an upstream recipe caused their schemas to be updated

  • Fixed zone coloring when doing rectangular selection on the Flow

  • Added support for “metrics” dataset when doing schema propagation

  • Fixed “copy subflow to another project” failing when quick sharing is enabled on the first element

  • Fixed “Drop data” option for “Change connection” action

  • Fixed update of code recipes when renaming a dataset while copying a subflow

Datasets

  • Fixed leftover file when deleting an editable dataset without checking drop data

  • Added support for direct read of JSON files from Spark

  • Fixed dataset explore view not behaving correctly if the last column is named “constructor”

  • Added support for “_source” keyword in Custom Query DSL for ElasticSearch datasets

  • Added support for Azure Blob to Synapse fast path when network restriction is enabled on the Azure Blob storage account

  • Do not propose “Append instead of overwrite” for Parquet datasets, as it’s not supported

  • Improved error reporting for various cases of invalidly-configured datasets

  • Fixed BigQuery auto-write fast path with non-string partitioning columns

  • Added support for S3/Redshift fast path when using STS tokens

Recipes

  • New feature: The Generate features recipe now supports the Spark engine

  • New feature: Added recipe summary in right panel for sample/filter, group, join and stack recipes

  • Prepare: Fixed “concat” processor on Synapse

  • Prepare: Fixed preview of Formula editor not showing anything when the formula generates null values for all input values in the sample

  • Prepare: Fixed a possible timeshift when input Snowflake datasets contain columns of type “date”.

  • Prepare: Fixed possible error when moving preparation steps when input dataset is SQL

  • Prepare: Fixed possible incorrect engine selection when input dataset is SQL

  • Prepare: Added SQL engine support for “Concatenate columns” steps on Synapse datasets.

  • Prepare: Fixed wrongful change tracking for changes made on columns that have just been added by a processor

  • Prepare: Fixed wrongful “Save” indicator when the recipe was already saved

  • Prepare: Disable Spark engine when “Enrich with context information” processor is used

  • Prepare: Fixed saving of output schema with complex types with detailed definition

  • Group and Window: Fixed an unexpected error when using an aggregation on a column that doesn’t exist in the input of a Group or Window recipe.

  • Fuzzy Join: fixed wrongful “metadata” output when using multiple join conditions

  • Window: Added “Retrieve all” checkbox to automatically retrieve all columns in the input dataset. This option is checked by default for all newly created recipes.

  • Sync: Fixed possible timeshift when input Databricks datasets contain columns of type date.

  • Sync: Fixed redispatching partitions with both a discrete and a time-based dimension

  • Sync: Fixed computing of metrics on output dataset with partition redispatch

  • Pivot: Fixed issue with BigQuery geography columns

  • Join: Fixed “match on nearest date” on Synapse

Data collections

  • Improved loading time of the various screens

  • Fixed filters being reset when refreshing data collection page

Labeling

  • New feature: Added ability to specify additional columns to be displayed next to the image or text being annotated

  • New feature: Added ability for reviewers to reject an annotated item and send it back for annotation

  • Fixed inability to delete a Labeling Task’s data when its input dataset is shared from another project

Jobs

  • New feature: Job view now displays Flow with Flow zones

  • Fixed clicking on a Job activity for a dataset that has been deleted

  • Fixed blank flow in Jobs screen on some large flows

  • Fixed Job failure incorrectly reported when building datasets with option “Stop at zone boundary” and a dependency located outside the flow zone is not built.

  • Fixed “there was nothing to do” displayed while job is still computing dependencies

Webapps

  • Webapps do not auto-start anymore at creation

Scenarios

  • Added “stop at flow zone boundary” option.

  • Fixed unexpected error generated when a scenario “Run checks” step references a non-existing dataset.

MLOps

  • Added support for MLflow 2.3

  • Added support for Transformers, LangChain and LLM flavors of MLflow

  • Added support of MLflow model outputs as lists

  • Added a project macro to delete model evaluations.

  • Create metrics and checks datasets in the same flow zone as the object they relate to.

  • Added the ability to define a seed in the evaluation recipes when using random sampling

  • In the standalone evaluation recipe, eased the setup of classes when there are many, by allowing them to be copied and pasted.

  • Fixed Python 2.7 encoding issues in the evaluation recipe when dealing with non-ASCII characters

  • Fixed support of MLflow models returning non-numeric results

  • Eased the setup of the standalone evaluation recipe for pure data drift monitoring (prediction column is now optional)

  • Fixed incorrect handling of forced threshold in a proba-aware, perfless standalone evaluation recipe

  • Fixed the computation of the confusion matrix with Python 3.7

  • Avoid creating a Saved Model when errors occur during the deployment of a model from an experiment tracking run.

  • Fixed the creation of API service endpoint from a MLflow imported model with prediction type “Other”

  • Fixed the import of a new Saved Model Version into an existing Saved model from a model from an experiment tracking run with prediction type “Other”.

  • Fixed an issue preventing the import of new MLflow model versions into an existing Saved Model from a plugin recipe.

  • Fixed import of projects exported with experiment tracking

Deployer

  • New feature: Added the ability to publish a bundle to the deployer without being project admin

  • Added historization and display of deployments logs in project and API deployers

  • Added autocompletion on connection remappings in deployments and deployer infrastructure settings

  • Added infrastructure status for the API node in API deployer

  • Prevent the creation of two bundles with the same name

  • Fixed the setup of permissions of deployer related folders when installing impersonation

  • Enhanced the ability of deployments to customize parts of the exposition settings of the infrastructure

Dataiku Govern

  • New feature: Improved the graphical structure of artifact pages and the way fields are displayed within it

  • New feature: Added the custom metrics in the Model Registry

  • New feature: Added the ability to filter on multiple business initiatives

  • New feature: Added the possibility to set a reference from a back reference field

  • Explicitly labeled default governance templates as “Dataiku Standard”

  • Improved the creation of items inside tables (do not propose already selected items, redirect back to the table after item creation).

  • More explicit message for object deletion

  • Simplified breadcrumb on object pages; it is now based only on object hierarchy, not on navigation history.

  • Fixed an issue with the selection of a Business Initiative at govern time when the govern template doesn’t have a Business Initiative

  • In all custom pages, by default, prevented the display of archived objects and added a checkbox to display them

  • Forbid the usage of an archived blueprint version when governing an object or creating a new one (Note: “auto” governance doesn’t take archived blueprint versions into account anymore either)

  • More explicit button labels for blueprint version activation and archiving

  • Fixed a refresh issue on the object breadcrumb when updating the object’s parent

  • Fixed an issue on deployment update when the govern API key is missing from the deployer’s settings

  • Fixed the application of the node size selected during installation

  • Fixed filters not being taken into account when mass selecting users in the administration menu

  • Various small UI enhancements

Elastic AI

  • Clusters monitoring: added CPU and memory usage information on nodes

  • Clusters monitoring: improved sorting

  • AKS: Added support for selecting subscription when using managed identity

  • AKS: Added support for deleting nodegroups

  • EKS: Fixed failures with some specific kubectl binaries

  • EKS: Wait for nodegroup to be deleted before giving back control, when resizing it to 0

  • EKS: Fixed “test network” macro

  • Fixed invalid labels that could be generated with some exotic project keys

Cloud Stacks

  • Added ability to resize root disk on Azure

  • Fixed handling of “sshv2” format for SSH keys

  • Added ability to enable assignment of public IP in subnets created with the network template

  • Added ability to retrieve Fleet Manager SSL certificate from Cloud’s secret manager.

Performance & Stability

  • Major performance enhancements on handling of datasets with double or date columns, especially when using CSV. Performance for reading datasets in Python recipes and notebooks can be increased by up to 50%

  • Added safety limits to CSV parsing, to avoid cases where broken or misconfigured CSV escaping can cause a job to fail or hang

  • Added safety limit on the number of garbage collection threads to DSS job processes and Spark processes, to limit the risk of runaway garbage collection overconsuming CPU

  • Added safety limit on filesystem and cloud storage enumerations to avoid crashes when enumerating folders containing dozens of millions of files

  • Fixed possible crash when computing extreme number of metrics (such as when performing analysis on all columns on all data with thousands of columns)

  • Performance enhancement when custom policy hooks (such as GDPR or Connections/Projects restrictions) are in use

  • Fixed possible instance hanging when a lot of job activities are running concurrently

  • Fixed possible instance slowdown when a custom filesystem provider / plugin uses partitioning variables

  • The startup phase of a new Jupyter notebook kernel will not cause pauses for other notebooks running at the same time anymore

Code envs

  • Made dsscli command to rebuild code envs more robust on automation node

  • Fixed ability to use manually uploaded code env resources without a script

Plugins

  • Fixed “run as local process” flag on plugin webapps

  • Fixed code environment of some plugins failing to install when using conda

Misc

  • New feature: DSS administrators can now display messages to DSS end users in their browser to alert them of some imminent event.

  • Fixed a bug where some deleted project library files would remain loaded after reloading a notebook

  • Fixed RFC822 date parsing with non-US locale

  • Fixed link to managed folders located in a different project from the global search page

  • Renamed “Drop pipeline views” macro to “Drop DSS internal views” macro as it can also be used to drop views created by the Generate features recipe.

  • Added back the ability for users to choose - in their profile page - whether they receive notifications when others run jobs/scenarios/ML tasks.

  • Projects API: New projects are now created with the new permission scheme introduced in DSS 10

  • Fixed deletion of foreign datasets in a project incorrectly warning that recipes associated with the original dataset in the source project would be deleted.

  • Fixed sort of dataset by size/records in datasets list view

  • Fixed listing Jupyter notebooks from git when some .ipynb files are invalid

  • Fixed dataset metrics/checks computed using Python probes being considered valid even when an exception is raised from the code

  • Improved search for wiki articles with words in camel case (Searching for “MachineLe” would not return articles containing “machine learning”)

  • Formula: Some invalid expressions are no longer accepted and now yield errors. These expressions were previously incorrectly considered valid and accepted. An example of such an expression is “Age * 10 (-#invalid”: it is invalid, yet was previously accepted and evaluated as “Age * 10”.

  • Streaming: Fixed various issues with containerized continuous Python recipe

  • Fixed deletion of secrets from connection settings

  • Fixed wrongful caching of Git repositories with experimental caching modes

Version 12.0.1 - June 23rd, 2023

DSS 12.0.1 is a security, performance and bugfix release

Datasets

  • Fixed format preview when creating dataset from folder with XML files

  • Fixed error when reading a Snowflake dataset with a DATE column containing nulls

Streaming

  • Fixed continuous Python recipe in function mode when dataframe is empty

Machine Learning

  • Fixed scoring recipe when the treatment column is missing in the input dataset

  • Cloudstack: Fixed usage of Snowflake UDF in scoring recipe

Spark

  • Fixed support of INT type with parquet files in Spark 3

Notebooks

  • Fixed notebooks export when DSS Python base env is Python 3.7 or Python 3.9

Performance

  • Fixed run comparison charts of experiment tracking when there are > 100k steps (11.4.4)

API

  • Allowed read-only users to retrieve, through the REST API, the metadata of a project they have access to (11.4.4)

Security

Version 12.0.0 - May 26th, 2023

DSS 12.0.0 is a major upgrade to DSS, bringing major new features.

Major new features

Machine Learning overrides

ML models today can achieve very high levels of performance and reliability, but unfortunately this is not the general case, and often they cannot be fully trusted for critical processes. There are many known reasons for this, including overfitting, incomplete training data, outdated models, and differences between the testing environment and the real world.

Model overrides allow you to add an extra layer of human control over the models’ predictions, to ensure that they:

  • don’t predict outlandish values on critical systems,

  • comply with regulations,

  • enforce ethical boundaries.

By defining Overrides, you ensure that the model behaves in an expected manner under specific conditions.

Please see Prediction Overrides for more details.

Universal Feature Importance

While some models are interpretable by design, many advanced algorithms appear as black boxes to decision-makers, or even to data scientists themselves. The new model-agnostic global feature importance capability helps you:

  • explain models that could not be explained until now

  • explain models in an agnostic, comparable way (rather than only using algorithm specific methods)

  • aggregate importance across categories of a single column

  • assess relative direction (in addition to magnitude of importance)

This new feature extends and enhances the existing feature importance and individual explanation capabilities. It is fully based on Shapley values and enriched with state-of-the-art visualisation.

This capability is even available for MLflow models imported into DSS.

Causal Prediction

The most common Data Science projects in Machine Learning involve predicting outcomes. However, in many cases, the focus shifts towards optimizing outcomes based on actionable variables rather than just predicting them. For example, you may desire to improve business results by identifying customers who will respond best to certain actions, rather than simply predicting which customers will churn.

Traditional prediction models are built with the assumption that their predictions will remain valid when actionable variables are manipulated. However, this assumption is often false, as there can be various reasons why acting on an actionable variable doesn’t have the expected outcome. For example, acting on one variable may have unforeseen consequences on other variables, or the actionable variable may be unevenly distributed in the population, making it difficult to compare individuals with different values of the variable.

To address these challenges, the field of Causal Machine Learning (Causal ML) has emerged, incorporating econometric techniques into the Data Science toolbox. In Causal ML, a Data Scientist selects a treatment variable (such as a discount or an ad) and a control value to tag rows where the treatment was not received. Causal ML then performs additional steps to identify individuals who are likely to benefit the most from the treatment. This information can then be used for treatment allocation optimization, such as determining which customers are expected to respond most positively to a discount.

The Causal Prediction analysis available in the Lab provides a ready-to-use solution for training Causal models and using them to predict the effects of actionable variables, optimize interventions, and improve business outcomes.

Please see Causal Prediction for more details.

Auto feature generation

The new “Generate Features” recipe makes it easy to enrich a dataset with new columns in order to improve the results of machine learning and analytics projects. You can define relationships between datasets in your project.

DSS will then automatically join, transform, and aggregate this data, ultimately creating new features.

Please see Generate features for more details.

Data Collections and Data Catalog

Data collections allow you to gather key datasets by team or use case, so that users can easily find and share quality datasets to use in their projects.

Data Collections, Data Sources search and Connections explorer now live together as the new Data Catalog in DSS.

Run subsequent recipes and on-the-fly schema propagation

For all intermediate recipes in a flow, when you click “run” from within the recipe, you now have an option to either:

  • Run just that recipe

  • Or run that recipe and all subsequent ones in the Flow, with the effect of making the whole “downstream” branch of the Flow up-to-date.

“Run this recipe and all subsequent ones” also applies schema changes on the fly to the output datasets, until the end of the Flow

It is now also possible, from the Flow, to build “downstream” (from left to right) all datasets that are after a given starting point. This also includes the ability to perform on-the-fly schema propagation

Help Center

Dataiku now includes a brand new integrated Help Center that provides comprehensive support, including a searchable database, onboarding materials, and step-by-step tutorials. It offers contextually relevant information based on the page you’re viewing, aiding in feature discovery and keeping you updated with the latest additions.

This Help Center serves as a one-stop solution for all user needs, ensuring a seamless and efficient user experience.

Other notable enhancements and features

Build Flow Zones

It is now possible to build an entire Flow zone. This builds all “final” datasets of this zone, and does not go beyond the boundary of the zone.

Deployer permissions management upgrades

When deploying projects from the Deployer, it is now possible to choose the “Run as” user for scenarios and webapps in the deployed project on the automation node. This change can only be performed by the infrastructure administrator on the Deployer.

In addition, the infrastructure administrator on the Deployer can also configure:

  • Under which identity projects are deployed to the automation node

  • Whether to propagate the permissions from the project in the design node to the automation node

Engine selection enhancements

Various enhancements were made to engine selection, so that users need to care much less often about which engines to select. In the vast majority of cases, we recommend leaving engine selection to DSS, without manually selecting engines or setting preferred or forbidden engines.

The most notable changes are:

  • Automatically select SQL engine for prepare recipes when possible and efficient (i.e. when both input and output are in the same database)

  • Do not automatically select Spark engine when it is certain to be inefficient (when the input or output cannot use fast Spark access)

Prophet algorithm for Time Series Forecasting

Visual Time Series Forecasting now includes the popular Prophet algorithm.
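DSS wires Prophet into the visual interface; for intuition, plain Prophet usage looks roughly like this (a sketch with made-up data; recent versions of the package are imported as `prophet`):

```python
import pandas as pd
from prophet import Prophet

# Prophet expects a "ds" (timestamp) column and a "y" (target) column
history = pd.DataFrame({
    "ds": pd.date_range("2022-01-01", periods=100, freq="D"),
    "y": [float(i % 7) for i in range(100)],
})

model = Prophet().fit(history)
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)  # yhat, yhat_lower, yhat_upper per date
```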

API service monitoring wizard

A new wizard makes it much easier to setup a full API service monitoring loop that gathers the query logs from the API nodes in order to automate drift computation.

Govern: Management of deployments

Added the synchronization of deployments and infrastructure information from the deployer node into the govern node, providing more information in the Model and Bundle registries about how and where those objects are used.

Govern: Kanban View

A new Kanban view allows you to easily get a view of all your governed projects

Charts: Reference lines

It is now possible to define horizontal reference lines on Line charts and Mixed charts.

Request plugin installation

Users who are not admin can now request installation of a plugin from the plugin store. The request is then sent to administrators, and the user is notified when the request is processed.

Request code env setup

Users who do not have the permission to create code envs can now request the setup of a code env from the code envs list. The request is then sent to administrators, and the user is notified when the request is processed.

Model Document Generation for imported MLflow models

The automatic Model Document Generator now supports MLflow imported models.

Other enhancements and fixes

Datasets

  • Added settings to enable the Image View for a dataset as the default view

  • Added the time, in addition to the date, in the “Last modification” column of folder content listings

  • Fixed “copy row as JSON” on filtered datasets

  • Explore: Fixed issue when using relative range and alphanum values filters together

  • Fixed “Edit” tab incorrectly displayed on shared editable datasets

  • S3: increased the default max size for S3 created files to 100 GB

  • Snowflake: Added support for custom JDBC properties when using the Spark-Snowflake connector

  • Snowflake: Fixed timezone issues on fields of type DATE when parsed as a DSS date

  • Snowflake: Added support for privatekey in advanced JDBC properties when using Snowpark

  • BigQuery: Fixed internal error happening if user has access to 0 GCP projects

  • BigQuery: Fixed syncing of RECORD and JSON columns containing NULL values

  • BigQuery: Fixed missing error message when table listing is denied by BigQuery

  • BigQuery: Fixed date issues on Pivot, Sort and Split recipes

Visual recipes

  • Prepare: Stricter default behavior of column type inference at creation time. The column types of strongly typed datasets (e.g. SQL, Parquet) are kept. This behavior can be changed in Administration > Settings > Misc.

  • Prepare: Improved summary section in the right panel to quickly assess what the recipe is doing.

  • Join: Added a new mode to automatically select columns if they do not cause name conflicts

  • Join: Fixed second dataset’s columns selection being reset when opening a recipe with a cross-join

  • Join: Fixed the ability to define a Join recipe that uses one of its input datasets as its output dataset

  • Pivot: Fixed empty screen for “Other columns” step displayed when switching tabs

  • Group: Fixed concat distinct option being disabled even for SQL databases that support it

  • Formula language: Fixed the now() function generating a result that could not be compared to other dates using the >, >=, < or <= operators

Flow

  • Fixed running job icons in Flow not always correctly displayed

  • Fixed Flow zoom incorrectly reset when navigating between projects with and without zones

Visual Machine Learning

  • Added support for Python 3.8 and 3.9 to Visual Machine Learning, including Visual Time Series Forecasting and Computer Vision tasks.

  • Added support for Scikit-learn 1.0 and above for Visual Machine Learning. Note that existing models previously trained with scikit-learn below 1.0 and using the following algorithms need to be retrained when switching to scikit-learn 1.0 (which may happen if the DSS builtin env is upgraded to Python 3.7 or Python 3.9): KNN, SVM, Plugin algorithms, Custom Python algorithms

  • Updated the default versions of scikit-learn and scipy in the sets of packages for Visual Machine Learning for code environments

  • Added Sort & Filter to the Predicted Data tab

  • Added the Lift metric to the model results

  • Fixed Distance weighting parameter not taken into account when training KNN models

  • Fixed failure of clustering scoring recipe when the scored dataset lacks some features that were rejected

  • Removed redundant split computation during training

  • Fixed intermittent failures of Model Document Generator on some models

  • Fixed a rare situation where the Cost Matrix Gain metric would not display

Visual Time Series Forecasting

  • Added ML Diagnostics to TS Forecasting

  • Added a result page to show ARIMA orders

  • Added a new Mean Absolute Error (MAE) metric

  • Switched to Mean Absolute Scaled Error (MASE) as the default optimization metric. The previous default (MAPE) could cause training to fail when a series has only 0s as target values (see the sketch after this list).

  • Improved display of various results for multiple-series models

  • Improved support of Month time unit, for periods ending on the last day of a month or spanning more than 12 months

  • Added more, and more prominent, warnings when a time series does not have enough (finite and well-defined) data points for forecasting

  • Fixed computation (and warning) of minimum required data points for external features in the scoring recipe

  • Fixed a bug where forecasting models trained in earlier DSS versions had their horizon changed to 0 when retrained

  • Fixed default value of low pass filter for Seasonal Trend when enabled and lower than the season length
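
To illustrate the metric change referenced above, a minimal numpy sketch of why an all-zero target series breaks MAPE but not MASE; toy data and simplified textbook definitions of the metrics, not DSS's internal implementation:

    import numpy as np

    def mape(y_true, y_pred):
        # Mean Absolute Percentage Error: divides by the actual values,
        # so it is undefined when y_true contains 0s
        return np.mean(np.abs((y_true - y_pred) / y_true))

    def mase(y_true, y_pred, y_train):
        # Mean Absolute Scaled Error: scales the MAE by the in-sample MAE
        # of a naive one-step forecast, so zero targets are fine as long
        # as the training series is not constant
        naive_mae = np.mean(np.abs(np.diff(y_train)))
        return np.mean(np.abs(y_true - y_pred)) / naive_mae

    y_train = np.array([3.0, 5.0, 4.0, 6.0])
    y_true = np.array([0.0, 0.0])   # a series with only 0s as target values
    y_pred = np.array([0.5, 0.2])

    print(mase(y_true, y_pred, y_train))  # well-defined (0.21)
    print(mape(y_true, y_pred))           # division by zero -> inf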

Charts & Dashboards

  • New feature: Filters: Added the ability to define filters with a single selected value

  • New feature: Mix chart: Added line smoothing option

  • Line chart: Fixed tooltips not correctly triggered in subcharts other than the first one

  • Line chart: Fixed axis minimum wrongly computed when switching to manual range

  • Scatter plot: Fixed axis and canvas not being aligned when the browser is zoomed

  • Scatter plot: Fixed tooltips not showing up for points where y=x

  • Treemap: Fixed treemap not rendered under certain circumstances on Firefox

  • Boxplot: Fixed sorting order

  • Filters: Fixed switching from date part to date range not resetting the date slider

  • Filters: Fixed a numeric slider being displayed instead of a checkbox list when pasting a URL containing values for a numerical filter

  • Filters: Fixed filter values not correctly displayed when using multiple date parts

  • Dashboards: Moved the fullscreen button outside the content area

  • Dashboards: Fixed “Play” button issuing an error on some dashboards

  • Fixed custom color assignments getting lost when changing the measures in the chart

Labeling

  • Added Undo/Redo when annotating images in a Labeling Task

Notebooks

  • Made Jupyter notebook export timeout configurable

Scenarios

  • New feature: Added the ability to define Cc and Bcc lists in scenario email reporters

  • Fixed timezone issue in the display of monthly triggers

Collaboration

  • Enabled email toggles in user profiles by default for new users

  • Fixed switching branches in a project causing the project to become inaccessible when the git branch was badly initialized

  • Fixed hyperlinks toward DSS objects in wiki exports

  • Dataset sharing: Fixed inability to import a dataset from another project when quick sharing is disabled on that project

  • Workspaces: Fixed public API disclosing permissions set on workspaces to users and contributors of the workspace.

  • Workspaces: Fixed error message wrongly displayed when a user with Reader profile publishes an object to a workspace

  • Workspaces: Fixed “Go to source dashboard” button incorrectly grayed out under some circumstances

Govern

  • Added the ability to customize the axis of the governed projects matrix view

  • Added the ability to configure a sign-off with only final review (no feedback groups)

  • Fixed the display of multiple governed projects at the same location in the matrix view

  • Fixed import/export of blueprints to remove user and group assignment rules in sign-off configuration

  • Fixed unselect action in the selection window for lists displayed as tables

  • Fixed an error happening when reordering attachment files

  • Fixed deduplication of items in lists to only apply to reference fields

  • Added the possibility to set data drift as a “metric to focus on” in the model registry

  • Fixed the removal of items from tables

  • Fixed the redirection to the home page when a custom page is not found

  • Fixed governed saved model versions or bundles being created twice when governing directly from the object page

MLOps

  • New feature: Added an option in the Evaluation and Standalone Evaluation Recipes to disable the sub-sampling for drift computation (sub-sampling is enabled by default)

  • New feature: Added data drift p-value as an evaluation metric

  • New feature: Added the ability to track Lab models metrics as experiment tracking runs

  • In Deployer, added an option to bundle only the required model versions.

  • Fixed drift computation in evaluation recipe failing when using pandas 1.0+

  • Fixed evaluation of MLflow models on datasets with integer columns containing missing values

  • Improved the selection of metrics to display in a Model Evaluation Store

  • Added support for MLflow’s search_experiments API method (see the sketch after this list)

  • Fixed handling of integer columns in the Standalone Evaluation Recipe for binary classification use cases

  • Fixed some flow-related public API methods when there is a model evaluation store in the flow

  • Fixed evaluation of MLflow models when there is a date column

  • Fixed empty versions list for MLflow models migrated from a previous version

  • In the Evaluation Recipe, added the ability to customize the handling of columns in data drift computation

  • Enriched Model Evaluations with additional univariate data and prediction drift metrics (can also be retrieved through the API)
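
As an illustration of the search_experiments support mentioned above, a minimal sketch against a plain MLflow tracking backend; the tracking URI and filter are placeholders, and how the tracking URI is obtained for a DSS experiment tracking backend depends on your setup:

    import mlflow

    # Placeholder: point the MLflow client at a tracking backend
    mlflow.set_tracking_uri("http://localhost:5000")

    # search_experiments supports filtering and ordering
    # (available in recent MLflow versions)
    experiments = mlflow.search_experiments(
        filter_string="name LIKE 'churn%'",
        order_by=["last_update_time DESC"],
    )
    for exp in experiments:
        print(exp.experiment_id, exp.name)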

Coding

  • Improved commit messages generated when creating, editing, or deleting files in folders in project libraries

  • Removed some unnecessary empty commits created when performing blank edits in project libraries

Plugins

  • Fixed several types of plugin components that did not work with Python 3.11

Performance & Scalability

  • Improved performance and responsiveness when DSS data dir IO is slow

  • Improved performance of starting jobs in projects involving shared datasets

  • Improved performance of validating very large SQL queries / scripts

  • Improved performance of some API calls returning large objects

  • Improved performance of sampling for Statistics worksheets

  • Improved performance in various other parts of the UI

Administration

  • New feature: Added reporting of SQL queries in Compute Resource Usage for several missing locations where DSS performs SQL queries

Setup

  • New feature: Added support for Python 3.9 for the DSS builtin environment

Dataiku Cloud

  • Code Studios: Fixed RStudio on Dataiku Cloud

Cloud Stacks

  • Switched OS for DSS instances from CentOS 7 to AlmaLinux 8

  • Switched R version for DSS instances from 3.6 to 4.2

  • Switched Python version for builtin env for DSS instances from 3.6 to 3.9

  • Fixed faulty display of errors while replaying setup actions

  • Fixed various issues with renaming instances

  • Made it easier to install the “tidyverse” R package out of the box

  • GCP: Fixed region for snapshots

  • GCP: Added ability to assign a static public IP for Fleet Manager

  • Fixed issue when declaring a govern node but not creating it

  • Made the “external URL” configurable for instances, for inter-instance links shown in the interface

Elastic AI

  • EKS: Fixed support for kubectl 1.26

  • GKE: Added support for Kubernetes 1.26

  • GKE: Fixed issue when creating cluster in a different zone than the DSS instance

  • Made it easier to debug issues with API nodes deployed on Kubernetes infrastructure (API node log now appears in pod logs)

Miscellaneous

  • Fixed broken/missing filtering (live search) in some dropdown menus

  • Fixed some Flow-related methods of the public API python client that would fail when used with labeling tasks

  • Fixed broken DSSDataset#create_analysis method of the public API python client

  • Removed limitations on the size of project variables

  • Fixed failure when invalid UIF rules are defined

  • Fixed renaming of To-do lists

  • Fixed possible failures of Jupyter notebooks to load

  • Fixed Admin > Monitoring screen failing to load if the instance contains a malformed dataset or chart definition.

  • Fixed issue with Python plugin recipes when installing plugin from Git in development mode

  • Fixed Parquet in Spark falling back to unoptimized path for minor ignorable differences in schema

  • Compute resource usage: Added a new indicator that provides a better approximation of CPU usage for quick-starting/stopping processes