DSS 13 Release notes

Migration notes

How to upgrade

Pay attention to the warnings described in Limitations and warnings.

Migration paths to DSS 13

Limitations and warnings

Automatic migration from previous versions is supported (see above). Please pay attention to the following cautions, removals, and deprecation notices.

Cautions

XGBoost models migration

DSS 13.0 now uses XGBoost 1.5 in the default Visual ML setup.

Existing models can still be used for scoring without retraining if Optimized scoring is used. Note that in particular, row-level explanations cannot use Optimized scoring.

If Optimized scoring cannot be used, you will need to either:

Python 2.7 builtin env removal

Note

If you are using Dataiku Cloud or Dataiku Cloud Stacks, you do not need to pay attention to this.

Very few Dataiku Custom customers are affected by this, as it only concerns a long-deprecated legacy setup.

Python 2.7 support for the builtin env of Dataiku was deprecated years ago and is now fully removed. If your builtin env was still Python 2.7, it will automatically migrate to Python 3. This may affect:

  • Existing code running on the builtin env, that may need adaptations to work in Python 3.

  • Machine Learning models, that will usually need to be retrained
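
As a rough illustration of the kind of adaptations existing code may need, the sketch below shows Python 3 versions of a few idioms that commonly break when Python 2.7 code is moved to Python 3. The function names and sample data are hypothetical and are not part of any Dataiku API; the actual changes needed depend on your own code.

```python
# Hypothetical helpers illustrating common Python 2.7 -> Python 3 adaptations.
# None of these names come from Dataiku; they only demonstrate language changes.

def rows_per_file(total_rows, file_count):
    # Python 2: 7 / 2 == 3 (floor division on ints). Python 3: 7 / 2 == 3.5.
    # Use // where the integer result was intended.
    return total_rows // file_count

def column_names(schema):
    # Python 2: dict.keys() returned a list. Python 3 returns a view object,
    # so build a list (here via sorted()) before indexing or slicing.
    return sorted(schema.keys())

def decode_payload(raw):
    # Python 2 str was a byte string. Python 3 separates bytes from text,
    # so bytes must be decoded explicitly.
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw

def describe(schema):
    # Python 2: dict.iteritems() and the `print "..."` statement.
    # Python 3: items() replaces iteritems(), and print() is a function.
    return ["{}: {}".format(name, dtype) for name, dtype in sorted(schema.items())]

if __name__ == "__main__":
    schema = {"price": "double", "id": "bigint"}
    print(rows_per_file(7, 2))
    print(column_names(schema))
    print(decode_payload(b"caf\xc3\xa9"))
    for line in describe(schema):
        print(line)
```

Tools such as `2to3` (shipped with older Python 3 releases) can automate part of this work, but the output should still be reviewed by hand.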

Support removal

Some features that were previously announced as deprecated are now removed or unsupported:

  • Hadoop distributions support

    • Support for Cloudera CDH 6

    • Support for Cloudera HDP 3

    • Support for Amazon EMR

  • OS support

    • Support for Red Hat Enterprise Linux before 7.9

    • Support for CentOS 7 before 7.9

    • Support for Oracle Linux before 7.9

    • Support for SUSE Linux Enterprise Server 15, 15 SP1, 15 SP2

    • Support for CentOS 8

  • Support for Java 8

  • Support for Python 2.7

Deprecation notices

DSS 13 deprecates support for some features and versions. Support for these will be removed in a later release.

  • Support for Ubuntu 18.04

  • Support for RedHat 7

  • Support for CentOS 7

  • Support for Oracle Linux 7

  • Support for SuSE Linux 12

  • Support for SuSE Linux 15 SP3

  • Support for Scala notebook for Spark

  • Support for multiple Hadoop clusters

Version 13.0.0 - June 25th, 2024

DSS 13.0.0 is a major upgrade to DSS with major new features.

Major new feature: Multimodal embeddings

In Visual ML, features can now leverage the LLM Mesh to use embeddings of images and text features

Major new feature: Deploy models to Snowflake Snowpark Container Services

In the API deployer, you can now deploy API services to Snowpark Container Services

Major new feature: Databricks Serving in Unified Monitoring

Databricks Serving endpoints can now be monitored from Dataiku Unified Monitoring

LLM Mesh

  • New feature: Added support for token streaming on local models (when using vLLM inference engine)

  • Added Langchain wrappers in the public Python API (was already available in the internal Python API). Using the API client, you can now use the LLM Mesh APIs from Langchain from outside Dataiku.

  • Added ability to share a Knowledge Bank to another project

  • Added ability to use a custom endpoint URL for OpenAI connections

  • Added ability to deep-link to a prompt inside a prompt studio

  • Added support for embedding models in SageMaker connections

  • Improved error reporting when a call to a RAG-augmented model fails

  • Faster local inference for Llama3 on Huggingface connections

  • Misc improvements to the prompt studio UI

  • Show a job warning when there were errors on some rows of a prompt recipe

  • Fixed erroneous accumulation of metadata when rebuilding a Qdrant Knowledge Bank

  • Fixed Flow propagation when it passes through a Knowledge Bank

  • Fixed RAG failure when using Llama2 on SageMaker

  • Fixed raw prompt display on custom LLM connections

Machine Learning

  • New feature: Added the HDBSCAN clustering algorithm.

  • Improved Feature effects chart (in feature importance) by coloring the top 6 modalities of categorical features.

  • Sped up computation of individual prediction explanations and feature importance.

  • Sped up retrieval of the active version of a Saved Model with many versions.

  • Fixed possible hang when creating an automation bundle including a Saved Model with many versions.

  • Fixed unclear error message in scoring recipe when the input dataset is too small to use as background rows for prediction explanation.

  • Fixed incorrect number of clusters for some AutoML clustering models.

  • Fixed incorrect filtering of time series when a multi-series forecasting model is published to a dashboard.

  • Fixed a rare breakage in feature importances on some models.

Charts & Dashboards

  • New feature: Added MAX and MIN aggregations for dates (as measures in KPI and pivot table charts, in tooltips and in custom aggregations)

  • New feature: Added the option to connect the points on scatter plot and multi-pair scatter plot

  • Added grid lines in Excel export

  • Added grid lines for cartesian charts

  • Added ability to configure max number of points in scatter plots

  • Added ability to customize the display of empty values in pivot tables

  • Added ability to set insight name for charts

  • Improved loading performance of charts with date dimensions

  • Fixed update of points size in scatter plots

  • Fixed rendering of charts when collapsing / expanding the help center

  • Fixed dimensions labels on treemaps

  • Fixed cache for COUNT aggregation

  • Fixed “link neighbors” option in line charts with SQL engine

  • Fixed “show y=x” option on scatter plot

  • Fixed dashboard’s filters when added directly after a dataset

  • Fixed “all values” filter option with SQL engine

  • Fixed dashboard filters when using mixed-case column names on a database that is case-insensitive for column names

  • Fixed excluding cross-filters for numerical dimensions using “Treat as alphanumerical”

  • Fixed link to insight from dashboards included in workspaces

  • Improved Scatter plot performance

  • Fixed filtering on “No value” in alphanumerical filters with in-database engine

  • Fixed dashboard’s filters migration script

  • Fixed intermittent issue on Chrome browser which prevents rendering of Jupyter notebook in dashboards

  • Fixed error when disabling the “force inclusion of zero” option in time series charts

Datasets

  • New feature: Sharepoint Online connector. DSS can now connect to Microsoft Sharepoint Online (lists and files) without requiring an additional plugin

  • Updated MongoDB support to handle versions from 3.6 up to 7.0, including Atlas and CosmosDB

  • Added read support for CSV and Parquet files compressed with Zstandard (zstd)

  • Added experimental support for Yellowbrick in JDBC connection

Data Quality

  • New feature: Added ability to create templates of Data Quality rules to reuse them across multiple datasets

MLOps

  • New feature: Added text input data drift analysis (standalone evaluation recipe only), relying on LLM Mesh embeddings

  • New feature: Added model export to Databricks Registry

  • Added the ability to create dashboard insights from the latest Model Evaluation in a Model Evaluation Store

  • Added the possibility to use plugins code environments in MLflow imported models

  • Added support for global proxy settings in Databricks managed model deployment connections

  • Added support for MLflow 2.13

  • Fixed incorrect ‘python_version’ field in MLflow exported models

  • Fixed listing of versions on Databricks registries when the model has a quote in its name

  • Fixed incorrect warnings in Evaluation recipe’s dataset diagnosis

Flow

  • Added ability to build Flows even if they contain loops

Recipes

  • Stack: Fixed wrong schema when stacking two datasets both containing a column of type string but with different maximum length

Deployer

  • API Deployer: Added a ‘run_test_queries’ endpoint in the public API to execute the test queries associated with a deployment.

  • Projects Deployer: Added the ability to define “additional content” also in the default configuration of bundles (not just directly on existing bundles)

  • Unified Monitoring: Added support for Unified Monitoring on automation nodes

  • Unified Monitoring: Added Data Quality status in Unified Monitoring

  • Unified Monitoring: Endpoint latency now displays 95th percentile

  • Unified Monitoring: Now displays project names rather than keys

  • Unified Monitoring: Fixed possible issue when opening project details

  • API designer: Fixed API designer test queries hanging in case of test server bootstrap failure

  • Added the ability to define environment variables for Kubernetes deployments

  • Added an “External URL” option for Project & API deployer infrastructures.

  • API Node: Added new commands to apinode-admin to clean disabled services (services-clean) and unused code environments (__clean-code-env-cache).
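
For readers unfamiliar with the 95th percentile latency mentioned in the Unified Monitoring item above, here is a minimal sketch of one common way to compute it (the nearest-rank method). This is an assumption for illustration only: the release notes do not say which percentile method Unified Monitoring uses, and the function name and sample latencies are hypothetical.

```python
def percentile_nearest_rank(values, pct):
    """Return the nearest-rank percentile of `values` for an integer `pct` in 1..100.

    The result is the smallest sample such that at least pct% of all
    samples are less than or equal to it.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical endpoint latencies in milliseconds: 18 fast calls, 2 slow ones.
latencies_ms = [12, 14, 11, 13, 15, 12, 16, 14, 13, 12,
                15, 11, 14, 13, 12, 16, 15, 14, 180, 420]

p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)

if __name__ == "__main__":
    print("median latency (ms):", p50)
    print("95th percentile latency (ms):", p95)
```

With this sample, the median hides the two slow calls while the 95th percentile surfaces them, which is why tail percentiles are a common choice for latency monitoring.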

Governance

  • New feature: Added ability to set filters on workflow and sign-off statuses

  • New feature: Added ability to use “negate” conditions in filters

  • New feature: Added visibility conditions based on a field for views

  • New feature: Added ability to add additional role assignment rules at the artifact level

  • Removed the workflow step prefix to use only the step name defined in the blueprint version

  • Improved the display of the Dataiku instance information

  • Added project’s cost rating to the overview

  • Fixed multi-selector search filters

  • Fixed possible deadlock in hooks

  • Fixed artifact creation so that it is possible with only the creation permission

  • Fixed file upload being cancelled on browser tab change

  • Fixed password reset for Cloud Stacks deployments

Statistics

  • Time series: when using Quarter or Year granularity, added ability to select on which month to align

Coding

  • Added support for Pandas 2.0, 2.1 and 2.2

  • Added support for conda for Python 3.11 code environments

  • Fixed write_dataframe failing in continuous Python for pandas >= 1.1

  • Upgraded Jupyter notebooks to version 6

Code studios

  • Improved performance when syncing a large number of files at once

  • Added support for ggplot2 in RStudio running inside Code Studios

Elastic AI

  • EKS: Added support for defining nodegroup-level taints

Cloud Stacks

  • Azure: Fixed deploying a new instance from a snapshot if the disk size was different from 50GB

  • Added more information (Ansible Facts) for use in Ansible setup actions

Dataiku Custom

Note: this only concerns Dataiku Custom customers

  • Added support for the following OS

    • RedHat Enterprise Linux 9

    • AlmaLinux 9

    • Rocky Linux 9

    • Oracle Linux 9

    • Amazon Linux 2, 2023

    • Ubuntu 22.04 LTS

    • Debian 11

    • SUSE Linux Enterprise Server 15 SP5

Security

  • Disabled HTTP TRACE verb

  • Fixed LDAP synchronization failing to synchronize the DSS groups of a user who is no longer in the required LDAP groups (access to DSS was already correctly denied for this user).

Misc

  • Switched default base OS for container images to AlmaLinux 8

  • Fixed a rare failure to restart DSS after a hard restart/crash occurring during a configuration transaction

  • Plugin usage now takes shared datasets into account

  • Added audit message for users dismissing the Alert banner

  • Fixed relative redirect for standard webapps

  • Fixed failure with non-ascii characters in plugin configuration and local UIF execution