DSS 13 Release notes¶
Migration notes¶
How to upgrade¶
For Dataiku Cloud users, your DSS will be upgraded automatically to DSS 13 within pre-announced timeframes
For Dataiku Cloud Stacks users, please see upgrade documentation
For Dataiku Custom users, please see upgrade documentation: Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
Migration paths to DSS 13¶
From DSS 12: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
From DSS 11: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 10.0 -> 11, 11 -> 12
From DSS 10.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 10.0 -> 11, 11 -> 12
From DSS 9.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 8.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 7.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 6.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 5.1: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 5.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 4.3: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 4.2: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 4.1: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
From DSS 4.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.0 -> 4.1, 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11, 11 -> 12
Migration from DSS 3.1 and below is not supported. You must first upgrade to 5.0. See DSS 5.0 Release notes
Limitations and warnings¶
Automatic migration from previous versions is supported (see above). Please pay attention to the following cautions, removal and deprecation notices.
Cautions¶
XGBoost models migration¶
(Introduced in 13.0)
DSS 13.0 now uses XGBoost 1.5 in the default VisualML setup.
No action is required on existing models when Optimized scoring is used for scoring. (Note that in particular, row-level explanations cannot use Optimized scoring.)
If Optimized scoring cannot be used, you can either:
Run the XGBoost models upgrade macros to automatically make the existing models compatible
Or, retrain the existing XGBoost models
Python 2.7 builtin env removal¶
(Introduced in 13.0)
Note
If you are using Dataiku Cloud or Dataiku Cloud Stacks, you do not need to pay attention to this
Very few Dataiku Custom customers are affected by this, as this was a very legacy setup.
Python 2.7 support for the builtin env of Dataiku was deprecated years ago and is now fully removed. If your builtin env was still Python 2.7, it will automatically migrate to Python 3. This may affect:
Existing code running on the builtin env, that may need adaptations to work in Python 3.
Machine Learning models, that will usually need to be retrained
Behavior change: handling of schema mismatch on SQL datasets¶
(Introduced in 13.1)
DSS will now by default refuse to drop SQL tables for managed datasets when the parent recipe is in append mode. In case of schema mismatch, the recipe now fails. This behavior can be reverted in the advanced settings of the output dataset
Models retraining¶
(Introduced in 13.2)
The following models, if trained using DSS’ built-in code environment, will need to be retrained after upgrading to remain usable for scoring:
Isolation Forest (AutoML Clustering Anomaly Detection)
Spectral clustering
KNN
Support removal¶
Some features that were previously announced as deprecated are now removed or unsupported
Hadoop distributions support
Support for Cloudera CDH 6
Support for Cloudera HDP 3
Support for Amazon EMR
OS support
Support for Red Hat Enterprise Linux before 7.9
Support for CentOS 7 before 7.9
Support for Oracle Linux before 7.9
Support for SUSE Linux Enterprise Server 15, 15 SP1, 15 SP2
Support fot CentOS 8
Support for Java 8
Support for Python 2.7
Support for Spark 2
Deprecation notices¶
DSS 13 deprecates support for some features and versions. Support for these will be removed in a later release.
Support for Python 3.6 and Python 3.7
Support for Ubuntu 18.04
Support for RedHat 7
Support for CentOS 7
Support for Oracle Linux 7
Support for SuSE Linux 12
Support for SuSE Linux 15 SP3
Support for Scala notebook for Spark
Support for multiple Hadoop clusters
Version 13.2.0 - October 3rd, 2024¶
DSS 13.2.0 is a significant new release with both new features, performance enhancements and bugfixes.
New feature: Column-level Data Lineage¶
Column-level data lineage offers a new view that allows performing Root cause and Impact analysis on dataset columns:
When identifying a data-related issue, investigate the upstream pipeline to find where the data comes from.
Before performing any change on a dataset column, discover the potential impact on downstream datasets and projects.
For more details, please see Column-level Data Lineage
New feature: LLM evaluation recipe¶
Note
This feature is available in Private Preview as part of the Advanced LLM Mesh Early Adopter Program
When building GenAI applications, evaluating the quality of the output is paramount. The LLM evaluation recipe uses specific GenAI & LLM techniques to compute several metrics that are relevant to the specific cases of GenAI.
The metrics can be output to a Model Evaluation Store and compared across runs.
Individual outputs of the LLMs can also be reviewed and compared across runs.
New feature: Delete & Reconnect recipes¶
From the Flow, you can now easily delete a recipe and reconnect the subsequent recipe, in order to avoid breaking the Flow.
For more information, please see Inserting and deleting recipes
New feature: Microsoft Fabric OneLake SQL Connection¶
This new connection allows you to access data stored in Microsoft Fabric OneLake through Microsoft Fabric Warehouses.
New feature: repeating mode for datasets¶
Some datasets now have the ability to “repeat” themselves based on the rows of a secondary dataset.
This feature allows for example to:
Create a files-from-folder dataset using only the files whose names come from a secondary dataset
Create a SQL dataset based on multiple tables whose names come from a secondary dataset
New feature: repeating mode for SQL query recipe¶
The SQL query recipe can now execute several times, using variables subtitution with variables coming from a secondary dataset, to generate a single concatenated output dataset
New feature: filtering & repeating mode for export recipe¶
The export recipe can now filter rows, and can now execute several times, using variables subtitution with variables coming from a secondary dataset.
This can be used to generate multiple export files, each containing a part of the data. For example, you can use this to create one file per year, one file per country, …
Upgrade notes¶
The following models, if trained using DSS’ built-in code environment, will need to be retrained after upgrading to remain usable for scoring:
Isolation Forest (AutoML Clustering Anomaly Detection)
Spectral clustering
KNN
LLM Mesh¶
New feature: Support for ElasticSearch and OpenSearch as vector store for Knowledge Banks
New feature: Support for Azure AI Search as vector store for Knowledge Banks
New feature: Prompt injection detection with Meta PromptGuard. This feature is available in Private Preview as part of the Advanced LLM Mesh Early Adopter Program.
New feature: Added support visual fine tuning on AWS Bedrock and Azure OpenAI. This feature is available in Private Preview as part of the Advanced LLM Mesh Early Adopter Program.
New feature: Added JSON mode, to ask LLMs to output valid JSON. This is supported on OpenAI & Azure OpenAI (gpt-4o, gpt-4o-mini), Mistral AI (7b, small, large), and VertexAI (Gemini)
New feature: Added an OpenAI-compatible completion API to query any completion model of the LLM Mesh (including non-OpenAI ones) from systems and libraries compatible with custom OpenAI endpoints. It supports tools calling, streaming, image input and JSON output
Added ability to select a different column for RAG augmentation than the one that was indexed for retrieval
Added simplified code environment creation and update for local LLMs (Huggingface connection), RAG and PII detection
Added support for API parameters
presencePenalty
,frequencyPenalty
,logitBias
,logProbs
on local Hugging Face modelsVertex AI: Added support for Gemini 1.5 Pro & Flash
Vertex AI: Added support for custom Vertex-supported models
Vertex AI: Added text & multimodal embedding models
Visual fine-tuning now selects the best checkpoint when fine-tuning with OpenAI and the latest checkpoint doesn’t improve on the validation loss
Visual fine-tuning can now use models from the model cache
Fixed support of LangChain shorthand syntax for tool choice when using the LangChain adapter for LLMs
Added variable expansion in Prompt studios & Prompt recipes
Machine Learning¶
New feature: Added ability to specify monotonicity constraints on numerical features when using XGBoost, LightGBM, Random Forest, Decision Tree, or Extra Trees models on binary classification and regression tasks. This requires scikit-learn at least at version 1.4, which requires the use of a dedicated code env
get_predictor
can now be used for visual AutoML models using an algorithm from a pluginImproved performance for training and scoring of Isolation Forest models
Added support for the feature effects charts in the documentation export of a multiclass classification model
Added support for XGBoost ≥1.6 <2, statsmodel 14, sklearn 1.3, and pandas 2.2 when using python 3.9+
Added support for numpy 1.24 (python 3.8) and 1.26 (python 3.9+)
Improved display of prediction error for regression models: in the Predicted Data tab, the error is no longer winsorized (for newly trained models), and the Error distribution report page shows more clearly the winsorized chart
Fixed a possible display issue when unselecting a metric on the Decision chart for a model using k-fold cross test
Fixed a possible display issue of decimal numbers on the y axis of the prediction density when doing a What-If analysis on a regression model
Fixed the engine selection of a scoring recipe from the flow when the previously selected engine is not available anymore
Datasets & Connections¶
New feature: ElasticSearch/OpenSearch: Added support for OAuth authentication
New feature: Excel: Added support for reading encrypted Excel files
Sharepoint: Added support for authentication via certificates, or user/password
Excel: Added ability to export datasets as encrypted Excel files
SCP/SFTP: Added support for SSH keys written in other formats than PEM RSA (notably the OpenSSH format)
SQream: Improved support of SQream regarding dates and other aggregation operations
S3: Added settings to configure STS endpoints for AssumeRole
Fixed issue where an empty user field in connections of type “Other databases (JDBC)” would yield connection failure even though user & password are provided in the JDBC URL or in the advanced properties.
Fixed issue where users could create a personal Athena connection using S3 connections whose details are not readable
Recipes¶
Prepare recipe: Updated INSEE data and added possibility to choose the year of the reference data
Prepare recipe: Improved AI Prepare generation when asked to parse dates
Sync recipe: Fixed possible date shift issue with Snowflake input datasets when DSS host is not on UTC timezone
Download recipe: Added repeating mode to download multiple files using variables coming from a secondary dataset
Charts and Dashboards¶
New feature: Added standard deviation as an aggregation for numeric column in charts
Added “display as percentage” number formatting option, i.e. 0.23 → 23%
Added “use parentheses” number formatting option for financial reporting, i.e -237 → (237)
Added “hide trailing zeros” number formatting option
Added support of percentiles aggregation for reference lines
Added number formatting options to use “m” instead of “M” as a suffix for Millions and “G” instead of “B” for Billions
Added the ability to display values in Lines and Mix charts
Fixed issues when dragging and dropping columns on filters (where the “ghost column” would remain visible)
Fixed flickering when dragging and dropping columns on filters
Fixed chart legend highlights sometimes not working when using number formatting options on axis.
Fixed filters in PDF export
Fixed tile size sometimes not properly computed when switching between view and edit mode
Fixed formatting pane not updating when changing binning mode
Fixed “Force inclusion of zero in axis” option in Lines and Mix charts
Fixed the ability to display pivot table despite reaching the objects count limit
Fixed Scatter multipair not refreshing when removing the X axis from the first pair, when there are more than 2 pairs
Data Quality¶
New rule: “Column value in set”. This rule checks that a particular column only contains specific values and nothing else.
New rule: “Compare values of two metrics”. This rule checks that two metrics defined on this dataset or on another dataset have the same value, or that one value is greater than the other, etc.
Scenarios¶
Disabling a step does not change its run condition anymore
MLOps & Deployer¶
Added support for Release Notes in API services
Added a deprecation warning for MLflow version below 2.0.0
Added support of the Monitoring Wizard for Dataiku Cloud instances
Fixed an error when trying to build the API service package of an ensemble model for which one of the source models was deleted and uses a plugin ML algorithm.
Labeling¶
New feature: the label can now be free text when labeling records (tabular data).
Fixed missing options when copying a single Labeling task in the Flow
Coding & API¶
Databricks-Connect: Added support for Databricks serverless clusters
Git¶
Added ability to choose the default branch name (main, master, …)
Added ability to resolve conflicts during a remote branch pull
Governance¶
Added search for the page dropdown list
Added multi-selection to the project filter on main pages
Added LLM filter checkbox on Governed Projects page
Fixed synchronization of API deployments on external infrastructure
Fixed view mapping refresh issue in custom page designer
Fixed permissions to edit blueprint migrations
Dataiku Applications¶
Added a notification on application instances when a new version is available
Code Studios¶
Added ability to configure pip options for code envs in Code Studio Templates
Workspaces¶
Fixed broken Dataiku Application link
Elastic AI¶
EKS: Added ability to add cloud tags to clusters
Fixed issue where the test button in Containerized execution configs would not work when using encrypted RPC
HOME and USER environment variables are now set properly in containers
Fixed pod leak when aborting a containerized notebook whose pod is in pending state
Cloud stacks¶
Azure: Switched from Basic SKU Public IPs to Standard SKU Public IPs
Azure: Added option to choose the Availability Zone when instantiating a DSS node, or creating a template
Azure: Added ability to choose in which Resource Group to store snapshots for a given instance
Python API: Added methods to start & stop instances from Fleet Manager
Misc¶
Added ability to connect third party accounts (such as OAuth connections to databases) directly from the dataset page
Added ability to see the members of a group in Administration > Security > Groups
Added ability to control job processes (JEK) resources consumption using cgroups
Plugins: Added ability for plugin recipes to write into an output dataset in append mode
Cloudera CDP: Added support for Impala Cloudera driver 4.2
Fixed error occurring when copying subflow containing a dataset on a deleted connection
Fixed issue that prevents deleting or modifying a user when the configuration file of a project contains invalid JSON
Fixed issue where Compute Resource Usages (CRU) when reading SQL data on a connection could be wrongly reported as being done on another connection
Version 13.1.4 - September 19th, 2024¶
DSS 13.1.4 is a bugfix release
LLM Mesh¶
Fixed broken display of Azure OpenAI connection page when it has a multimodal chat completion deployment
Fixed excessive logging when embedding images
Snowflake¶
Fixed Snowpark when the Snowflake connection uses private key authentication
Charts¶
Fixed broken display of scatter plot with some Content Security Policy headers
Version 13.1.3 - September 16th, 2024¶
DSS 13.1.3 is a feature, security and bugfix release
LLM Mesh & Generative AI¶
New feature: Added ability to use image inputs in the Prompt Studio & Prompt Recipe
Bedrock: Added Mistral Large 2 to the Bedrock connection, including tools call
Bedrock: Added Llama 3.1 8B/70B/405B models to the Bedrock connection
Anthropic: Added Claude 3.5 Sonnet to the Anthropic connection
Databricks: Added Llama 3.1 70B/405B models to the Databricks Mosaic AI connection
Bedrock: Added support for image embedding with Amazon Titan Multimodal Embeddings G1
Added support of gpt-4o-mini in the Fine-tuning recipe
Sped up inference of some LLMs that use LoRA
Added count of input & output tokens for local model inference
Added support for finish reason in streamed calls, for compatible models/connections
Added support for presence penalty and frequency penalty in Prompt Studio & Prompt Recipe
Added support for cost reporting on streamed calls (except on Azure OpenAI, which doesn’t support it)
Reduced the number of training evaluations when fine-tuning a local model
Bedrock: Fixed a UI issue enabling/disabling the Llama3 70B model on a Bedrock connection
Fixed possible issues with enforcement on cached responses when calling the LLM Mesh API
Fixed possible issue displaying the embedding model on a Knowledge Bank’s settings
Machine Learning¶
Added configurable “min samples leaf” parameters to he Gradient Tree Boosting algorithm
Time Series Forecasting: Improved API to change the forecast horizon on a time series forecasting task
Time Series Forecasting: Fixed possible failure of a time series forecasting training when using together “Equal duration folds” and “Skip too short time series” options with multiple time series
Time Series Forecasting: Fixed possible failure when using pandas 2.2+ with some algorithm/time steps combinations
Causal learning: Fixed possible training failure of causal model when using inverse propensity weighting with a calibrated propensity model
Fixed possible failure of a scoring recipe using the Spark engine in a pipeline with a model trained by a different user
Fixed display of a categorical feature in the Feature effects chart, when it only have numerical values
Fixed possibly broken display of trees on partitioned model details
Fixed possible issue with the ROC curve or PR curve plot when exporting a multiclass model’s documentation
Fixed possible scoring issue on some calibrated-probability classification models
Fixed failure to compute partial dependence plots on models with sample weights when the sample size is less than the test set size
Fixed failure to export model documentation when using time ordering and explicit extract from two datasets
Statistics¶
Fixed failure on the PCA recipe when the input dataset has fewer rows than columns
MLOps¶
Fixed Standalone Evaluation Recipe failing on classification task when using prediction weights
Fixed copy of Standalone Evaluation Recipes
Charts & Dashboards¶
Added a “Last 180 days” preset to relative date filters
Fixed failure when loading static insights with names containing underscore ( _ )
Fixed dashboard tile resizing when showing/hiding page titles in view mode
Fixed percentile calculation when there are multiple dimensions in a chart
Changed the filters mode to be “Include other values” by default
Fixed some chart options sometimes being reset on chart reload
Fixed date filter selection in charts being lost after engine or sampling change
Fixed dashboard wrongly seen as modified when clicking on saved model or model evaluation report tiles
Fixed the loading of fonts in gauge charts within dashboards
Fixed gauge chart Max/Min with very small values
Fixed gauge and scatter charts not loading when there is a relative date filter in combination with either a gauge target or a reference line aggregation
Governance¶
Added automated generation of step ID from the step name in the configuration of workflows
Added support for proxy settings for OIDC authentication
Added examples of Python logger usage and field migration to migration scripts
Added ability to collapse view containers
In the Blueprint Designer, added ability to search for fields by label or by ID when creating view components
Fixed upgrade when there are API keys without labels
Fixed deletion of reference from tables, to avoid selecting the deleted item in the right panel
Webapps¶
Added ability to have API access for Code Studio webapps (Streamlit, …)
Dataset and Connections¶
Fixed issue when building datasets using Database-to-Cloud fast paths with non-trivial partitions dependencies
Automatically refresh STS tokens when reading or writing S3 datasets using Spark
Scenarios and automation¶
Fixed scenario variable
firstFailedJobName
incorrect initialization when a build step failsAdded option to prevent DSS from escaping HTML tags in dataset cells when a dataset is rendered as an HTML variable (Starting with DSS 13.1.0, HTML tags are escaped by default)
Fixed issue where DSS reads more than the maximum number of rows indicated in SQL scenario steps when the provided SQL query starts with a comment
Deployer¶
Unified Monitoring: Fixed support for API endpoints deployed from automation nodes
Fixed code environment resources folder when deploying API services on Kubernetes infrastructures
Coding¶
Added button in Jupyter notebook right panel to delete output (useful to clean notebooks containing large outputs without actually loading them)
Fixed ability to import the
dataiku
package withoutpandas
Added
int_as_float
parameter toget_dataframe
anditer_dataframes
Added
pandas_read_kwargs
parameter toiter_dataframes
Git¶
Fixed issue where creating a remote branch does not create a local branch
Fixed issue where pulling from a remote would fail if Git has been configured without an author
Security¶
Fixed issue where DSS version is returned in HTTP response to non-logged users even when flag hideVersionStringsWhenNotLogged is set
Fixed credentials appearing in the logs when using Cloud-to-database fast paths between S3 and Redshift
Cloud stacks¶
Fixed replaying long setup actions displaying an error in the UI, even though it actually completes successfully
Performance & Scalability¶
Improved performance for
get_auth_info
API call
Misc¶
Added support for storing encryption key in Google Cloud Secrets Manager
Fixed HTML escaping issues in project timeline with names containing ampersand (&) characters
Version 13.1.2 - August 29th, 2024¶
DSS 13.1.2 is a bugfix release
Coding¶
Fixed authentication failure when connecting using python client running inside DSS and connecting to another DSS running 13.0 and below.
Spark¶
Fixed a failure on Spark jobs that need to retrieve credentials
Version 13.1.1 - August 26th, 2024¶
DSS 13.1.1 is a security and bugfix release
Recipes¶
Prepare recipe: Fixed failure when executing a “Compute difference between dates” step using SQL engine
Coding and API¶
Fixed
as_langchain_*
methods in a non-containerized kernel on Knowledge Banks built by another user
Security¶
Fixed authentication used by the Python client to connect to DSS, using Basic authentication instead of Bearer for backward compatibility with DSS versions 13.0 and below.
Fixed failure when enabling hashed API keys on upgraded Govern nodes
Fixed possible directory traversal during provisioning of a DSS node by Fleet Manager.
Version 13.1.0 - August 14th, 2024¶
DSS 13.1.0 is a significant new release with both new features, performance enhancements and bugfixes.
New feature: Managed LLM fine-tuning¶
Note
This feature is available in Private Preview as part of the Advanced LLM Mesh Early Adopter Program
LLM Fine-tuning allows you to fine-tune LLMs using your data.
Fine-tuning is available:
Using a visual recipe for local models (HuggingFace) and OpenAI models
Using Python recipes for local models (HuggingFace)
For more information, please see Model fine-tuning
New feature: Gauge chart¶
The Gauge chart, also known as speedometer, is used to display data along a circular axis to demonstrate performance or progress. This axis can be colored to offer better segmentation and clarity.
New feature: Chart median and percentile aggregations¶
Charts (and pivot tables) can now display median, as well as arbitrary percentiles of numerical values
New feature: enhanced Python dataset read API¶
The Python API to read datasets has been enhanced with numerous new capabilities and performance improvements.
The new fast-path reading Dataset.get_native_dataframe
method performs direct read from data sources. This provides massive performance improvements, especially when reading only a few columns out of a wide dataset. Fast-path reading is available for:
Parquet files stored in S3
Snowflake tables/views
For regular reading, the following have been added:
Ability to disable some thorough data checking, yielding performance improvements up to 50%
Ability to read some columns as categoricals to reduce memory usage (depending on the data, can be up to 10-100 times lower)
Ability to use pandas “nullable integers”, allowing to read integer columns with missing values as integers (rather than floating-point values)
Ability to precisely match integer types to reduce memory usage (up to 8x for columns containing only tinyints)
Added ability to completely override dtypes when reading
For samples and documentation, please see the Developer Guide
New feature: Builtin Git merging¶
In addition to the existing ability to push projects and branches to remote Git repositories and perform merges there, you can now perform Git merges directly within Dataiku, including the ability to view and resolve merge conflicts
Behavior change: handling of schema mismatch on SQL datasets¶
DSS will now by default refuse to drop SQL tables for managed datasets when the parent recipe is in append mode. In case of schema mismatch, the recipe now fails. This behavior can be reverted in the advanced settings of the output dataset
LLM Mesh¶
New feature: Added local models for toxicity detection (This feature is available in Private Preview as part of the Advanced LLM Mesh Early Adopter Program)
New feature: Added support for Tools calling (sometimes called “function calling”) in LLM API and Langchain wrapper. This is available for OpenAI, Azure OpenAI, Bedrock (for Claude 3 & 3.5), Anthropic, and Mistral AI connections
New feature: Added support for Gemma, Phi 3, Llama 3.1 8B & 70B, and Mistral NeMo 12B models on local Huggingface connection
Pinecone: Added support for Pinecone serverless indices
In API, added support for
presencePenalty
andfrequencyPenalty
for OpenAI, Azure OpenAI and VertexIn API, added support for
logProbs
andtopLogProbs
for OpenAI, Azure OpenAI and Vertex (PaLM only)In API, added support for
logitBias
for OpenAI and Azure OpenAIIn API, added
finishReason
to LLM responses, for LLMs/providers that support itAdded Langchain wrappers for embedding models in the public Python API (was already available in the internal Python API). Using the API client, you can now use the LLM Mesh APIs on embedding models with Langchain from outside Dataiku.
Added support for Embedding models in Snowflake Cortex connection
Improved API support for stop sequences on local models run with vLLM
Fixed issue in complete prompt display for RAG LLMs in Prompt Studio
Machine Learning¶
Isolation Forest: Made training up to ~4 times faster (using parallelism and sparse inputs)
Isolation Forest: Added support for “auto” contamination
Model Documentation Export: Added support for “Feature effects” chart from feature importance
Added ability to not specify an image input features in What-if
Improved performance for training of partitioned models with large number of partitions
Improved cleanup of temporary data when retraining partitioned models (reduce disk consumption)
Improved pre-training validation of ML Overrides and Assertions
Fixed computation of optimal threshold on binary classification models using k-fold cross-test
Fixed inability to upload 2 different images as input features in What-if
Fixed possible broken forecasting models when a model forecasts NaN values
Fixed a possible issue when deleting a partitioned model’s version while it was being retrained
Fixed some notebook model exports when using scikit-learn 1.2
MLOps¶
Added the possibility to do a full update in “Update API deployment” scenario step
Added the possibility to include or not editable datasets when creating bundles
Improved MLflow import code-environment errors reporting
Fixed the sorting on metrics in Model Evaluation Stores
Fixed the Monitoring Wizard to take into account deployment level auto logging settings
Charts and Dashboards¶
Dashboards: Added background opacity settings for chart, text and metrics tiles
Dashboards: Added border and title styling options to tiles
Dashboards: Added title styling options to dashboard pages
Dashboards: Added the ability to hide dashboard pages
Dashboards: Improved loading performance
Dashboards: Fixed dashboard’s save button wrongly becoming active when selecting a tile
Filters: Added support for alphanum filter facets on numerical columns in SQL, and the possibility to include/exclude null values
Scatter plots: Improved axis format for dates by displaying time when range is less than a single day
Scatter plots: Increased max scale limit when zooming with rectangle selection
Pivot tables: Persist column sizes, as well as folded state of rows or columns
Line charts: Fixed the “show X axis” option in line charts with a date axis
Added support for numeric custom aggregations used in the chart in reference lines displayed aggregations
Added an “auto” mode for the “one tick per bin” option, automatically switching to the most appropriate mode depending on the number of bins
Fixed locked tick options (interval/number) after switching between charts
Fixed the “Add insight (Add to dashboard)” action for chart insights
Fixed Y axis title options disappearing in vertical bar charts when there are 2 or more measures
Fixed broken X axis when switching to a dimension that doesn’t support log scale from a dimension where it was supported and activated
Fixed empty dashboard wrongly considered as modified
Fixed dashboard’s insights associated to deleted datasets loading forever
Governance¶
New feature: New Global Timeline: “Instance Timeline” page tracking all the item’s events
New feature: Custom filters are now available on all pages and various improvements were brought:
Added ability to filter on application template and application instance flags
Added support for search on reference fields
Added ability to filter on node type and node ID
Added ability filter on DSS tags
Added ability to filter Model versions and Bundles on deployment stages
Added text search filter for all types of fields
Added execution of hooks on govern action
Added ability to copy/paste view components in the Blueprint Designer
Added an option in the Blueprint Designer to allow only selection, only creation, or both, on reference fields
Added visual indicators of settings validation in the Blueprint Designer
Added validation of blueprint versions forked from the standard to detect issues that could break standard govern features
Added the synchronization of DSS project’s “short description” field and the ability to search on it
Fixed history of deleted signoff
Fixed sticky error panel on next user action
Fixed artifact create permission to not imply read permission anymore
Datasets and Connections¶
Fixed jobs writing multiple partitions on an SQL dataset failing when executed in containerized mode
Fixed an issue when navigating away from an ElasticSearch dataset before the sample is displayed
Data Quality¶
Added ability to publish Data Quality status of a dataset or a project to a dashboard
Added multi-column support to column validity, aggregation in range/set & unique rules
Added ability to create, view and edit Data Quality templates
Fixed Metrics computed with spark on HDFS partitioned datasets producing incorrect results
Flow¶
Added ability to rename a recipe directly from the Flow
Added ability to export the Flow documentation (without screenshots) when the graphics-export feature is not installed.
Added support for Spanish and Portuguese languages to AI Explain
Recipes¶
New feature: Prepare:
val
/strval
/numval
formula functions now support an additional argument to specify an offset. This allows retrieving values from previous rows to compute for example sliding averages or cumulative sums. This feature is only available on the DSS engine.New feature: Prepare: The new “Split into chunks” step can split a text into multiple chunks, with one new row for each chunk.
Prepare: Added a warning on recipes containing both Filter and Empty values steps, which might lead to unexpected output
Prepare: Fixed date difference step returning incorrect results on the Hive engine
Scenario and automation¶
New feature: Ability to send datasets with conditional formating, directly inline in email body
Added a “Build flow outputs” option in scenarios
Added ability to build a flow zone in scenarios
Deployer¶
New feature: added support for Snowflake Snowpark external endpoints in Unified Monitoring
Added governance status in Unified Monitoring
Added the possibility to define a specific connection for the monitoring of a managed infrastructure
Added the possibility to define an “API monitoring user” to support “per-user” connections in Unified Monitoring
Added support for labels and annotations in API deployer K8S infrastructure, optionally overridable in related deployments
Fixed the status of endpoints of external scopes in Unified Monitoring when there is an authentication issue
Fixed external scopes being monitored even when disabled
Coding¶
Added methods to interact with SQL notebooks (
DSSProject.list_sql_notebooks
,DSSProject.get_sql_notebook
, …)
Code Studios¶
Streamlit: Fixed forwarding of query parameters
Notebooks¶
Fixed HTML export of Jupyter notebooks with Python 3.7
Security¶
Added ability to authenticate on the API using a Bearer token (in addition to Basic authentication)
Added the ability to store API keys in irreversible hashed form
Fixed refresh tokens being requested too often
Cloud Stacks¶
Fixed HTTP proxy setup action not properly encoding passwords containing special characters
HTTP proxy setup action now sets the following environment variables: http_proxy, https_proxy and no_proxy, in addition to their uppercase equivalents
AWS: Switched to IMDSv2 to access instance metadata
Added ability to change the internal ports for DSS (not recommended, for very specific cases only)
Misc¶
Reduced the number of notifications enabled by default for new users
Fixed AI services when using authenticated proxies
Fixed trial seats when using authenticated proxies
Version 13.0.3 - August 1st, 2024¶
DSS 13.0.3 is a bugfix release
Dataiku Applications¶
Fixed the “Download file” tile
Charts¶
Fixed rectangle zoom when log scale option is enabled
Spark and Kubernetes¶
Fixed Spark engine on Azure datasets when DSS is installed with Java 17
Version 13.0.2 - July 25th, 2024¶
DSS 13.0.2 is a feature and bugfix release
LLM Mesh¶
New feature: AWS Bedrock: Added support for Claude 3.5 Sonnet
New feature: AWS Bedrock: Added support for Mistral models (Small, 7B, 8x7B, Large)
New feature: AWS Bedrock: Added support for Llama3 models (8B, 70B)
New feature: AWS Bedrock: Added support for Cohere Command R & R+
New feature: AWS Bedrock: Added support for Titan Embedding V2 and Titan Text Premier
New feature: AWS Bedrock: Added support for image input on Claude 3 and Claude 3.5
New feature: OpenAI: Added support for GPT-4o mini
New feature: Added support for generic chat and embedding models on AzureML
Added ability to Test custom LLM connections
Added ability to clear Knowledge Banks
Improved performance of builtin RAG LLMs
Improved performance of PII detection
HuggingFace: Improved performance of HuggingFace models download
HuggingFace: Increase default number of output tokens when using vLLM
Gemini: Fixed spaces wrongfully inserted in some LLM responses when using Gemini
Snowflake: Fixed Snowflake LLM models listed even when not enabled in the Snowflake Cortex connection
Limited ChromaDB version to prevent issues with ChromaDB 0.5.4
Dataset and Connections¶
New feature: Added support for YXDB file format
Fixed error message not displayed when previewing an indexed table on which users have no permission
Fixed scientific numbers written using the French format (example: “1,23e12”) not properly detected as “Decimal (Comma)” meaning
Disabled unimplemented normalization mode for regular expression matching custom column filter
Added statistics about length of alphanumerical columns in the Analyze dialog
Sharepoint built-in connection: Fixed UnsupportedOperationException returned for some lists
BigQuery: Added ability to configure connection timeouts
BigQuery: Added ability to include BigQuery datasets when importing/exporting projects or bundles.
BigQuery: Fixed error happening when parsing dates with timezone written using the short format (ex: “+0200”)
Athena: Fixed wrongful escaping of underscores in table names
Flow¶
When building downstream, correctly skip Flow datasets or models that are marked as “Explicit build” or “Write protected”
Recipes¶
Prepare: Improved wording of summary of empty values step when configured with multiple columns
Prepare: Fixed casting issue in Synapse/SQLServer when using a Filter by Value step on a Date column with SQL engine
Window: Disabled concat aggregation on Redshift as it is not supported by this database
Charts and Dashboards¶
Fixed Scatter Multi-Pair chart with DSS engine for some combinations of sample size and “Number of points” setting
Fixed incorrectly enabled save button in unmodified chart insight
Fixed dataset insight creation from the insight page
Fixed filtering in dashboard insights from workspace
Fixed reference lines sometimes getting doubled on scatter charts
Machine Learning¶
Multimodal models: Improved image embedding performance
Fixed serialization of very big models (>4GB)
Fixed possible UI slowness when a partitioned model has many partitions in its versions
Fixed possible UI issues when creating a clustering model with a hashed text feature
Fixed incorrect median prediction value for classification models with sample weights
Coding and API¶
Added ability to retrieve the trial status of users with the Python API
Fixed DSSDataset.iter_rows() not correctly returning an error in case of underlying failure
Fixed x0b and x0c characters in data producing incorrect results when reading datasets using Python API
Fixed DeprecationWarning: invalid escape sequence warnings reported by Python 3.7/3.8/3.9 when importing dataiku package
Code studios¶
Fixed Gradio block as webapp wrongly reported as timed-out after initial start
Fixed IDE block failing if python 3.7 is not available in the base image
Fixed Streamlit block failing with manual base image with AlmaLinux 8.10 when R is not installed
MLOps¶
Fixed interactive drift score computation
Fixed endpoint listing on Azure ML external models when in “environment” authentication mode
Fixed text drift section when using interactive drift computation buttons
Lowered the log level for too verbose External Models on AzureML
Fixed support for “Trust all certificates” when querying the MLflow artifact repository
Fixed code environment remapping for Model Evaluations
Webapps¶
Unified direct-access URL of webapps to /webapps/
Deployer & Automation¶
Fixed inability to edit additional code env settings in automation node
Fixed failure installing plugins with code env without a requirements.txt file on automation node
In Unified Monitoring, added support for new monitoring metrics available on Databricks external scope
Fixed API service error when switching from “multiple generation” with hash-based strategy to “single generation”
Added output of the logs to apimain.log file for containerized deployments even when using the “redirect logs to stdout” setting
Fixed error notification after a successful retry of an API service deployment
Fixed API deployer infrastructure creation when there are missing parameters
Fixed support for “Trust all certificates” settings in deployer hooks
Governance¶
Added the ability for an admin to invalidate the configuration cache
Prevent creation of items from backreference with blueprints that are not compliant with the backreference
Removed the “Open creation page” button for the creation of items from backreferences
Prevent the creation of Business Initiatives or Governed projects from inactive blueprint versions
Improved performances of table pages, especially when there is a matrix or kanban view
Fixed the typing of external deployments
Fixed disappearance of artifact table header when toggling edit mode
Performance & Scalability¶
Fixed possible hang when changing connections on a non-responsive data source
Fixed possible failures starting Jupyter notebooks when the Kubernetes cluster has no resources available
Security¶
Fixed DSS printing in the logs the whole authorization header (which might contains sensitive data) in case of unsupported authorization method
Fixed printing of the “token” field when using Snowpark with OAuth authentication
Miscellaneous¶
Fixed deletion of API keys in the API Designer that could delete the wrong key
Added support for CDP 7.1.9 with Java 17
Version 13.0.1 - July 16th, 2024¶
DSS 13.0.1 is a bugfix and security update. (12.6.5) denotes fixes that were also released in 12.6.5, which was published after 13.0.0
LLM Mesh¶
Improved parallelism and performance of locally-running HuggingFace models
Recipes¶
Join: Fixed loss of pre and post filter when replacing dataset in join (12.6.5)
Join: Fixed issue when doing a self-join with computed columns (12.6.5)
Prepare: Fixed help for “Flag rows with formula” (12.6.5)
Prepare: Fixed failing saving recipe when it contains certain types of invalid processors (12.6.5)
Stack: Fixed addition of datasets in manual remapping mode that caused issues with columns selection (12.6.5)
Charts & Dashboards¶
Re-added ability to view page titles in dashboards view mode (12.6.5)
Fixed filtering in dashboard on charts with zoom capability (12.6.5)
Fixed possible migration issue with date filters (12.6.5)
Fixed migration issue with alphanum filters filtering on “No value” (12.6.5)
Fixed filtering on “No value” with SQL engine (12.6.5)
Restore larger font size for metric tiles (12.6.5)
Fixed display of Jupyter notebooks in dashboards (12.6.5)
Added safety limit on number of different values returned for numerical filters treated as alphanumerical (12.6.5)
Fixed migration of MIN/MAX aggregation on alphanumerical measures
Scenarios and automation¶
Added support for Microsoft teams Workflows webhooks (Power Automate) (12.6.5)
Code Studios¶
Fixed Code Studios with encrypted RPC
Cloud Stacks¶
Fixed Ansible module dss_group
Elastic AI¶
Re-add missing Git binary on container images
Performance¶
Fixed performance issue with most activities in projects containing a very large number of managed folders (thousands) (12.6.5)
Improved short bursts of backend CPU consumption when dealing with large jobs database (12.6.5)
Fixed possible unbounded CPU consumption when renaming a dataset and a code recipe contains extremely long lines (megabytes) (12.6.5)
Visual ML: Clustering: Fixed very slow computation of silhouette when there are too many clusters (12.6.5)
Security¶
Fixed Insufficient permission checks in code envs API (12.6.5)
Misc¶
Fixed Dataset.get_location_info API
Fixed sometimes-irrelevant data quality warning when renaming a dataset (12.6.5)
Fixed EKS plugin with Python 2.7 (12.6.5)
Fixed wrongful typing of data when exporting SQL notebook results to Excel file (12.6.5)
Version 13.0.0 - June 25th, 2024¶
DSS 13.0.0 is a major upgrade to DSS with major new features.
Major new feature: Multimodal embeddings¶
In Visual ML, features can now leverage the LLM Mesh to use embeddings of images and text features
Major new feature: Deploy models to Snowflake Snowpark Container Services¶
In the API deployer, you can now deploy API services to Snowpark Container Services
Major new feature: Databricks Serving in Unified Monitoring¶
Databricks Serving endpoints can now be monitored from Dataiku Unified Monitoring
LLM Mesh¶
New feature: Added support for token streaming on local models (when using vLLM inference engine)
Added Langchain wrappers in the public Python API (was already available in the internal Python API). Using the API client, you can now use the LLM Mesh APIs from Langchain from outside Dataiku.
Added ability to share a Knowledge Bank to another project
Added ability to use a custom endpoint URL for OpenAI connections
Added ability to deep-link to a prompt inside a prompt studio
Added support for embedding models in SageMaker connections
Improved error reporting when a call to a RAG-augmented model fails
Faster local inference for Llama3 on Huggingface connections
Misc improvements to the prompt studio UI
Show a job warning when there were errors on some rows of a prompt recipe
Fixed erroneous accumulation of metadata when rebuilding a Qdrant Knowledge Bank
Fixed Flow propagation when it passes through a Knowledge Bank
Fixed RAG failure when using Llama2 on SageMaker
Fixed raw prompt display on custom LLM connections
Machine Learning¶
New feature: Added the HDBSCAN clustering algorithm.
Improved Feature effects chart (in feature importance) by coloring the top 6 modalities of categorical features.
Sped up computation of individual prediction explanations and feature importance.
Sped up retrieval of the active version of a Saved Model with many versions.
Fixed possible hang when creating an automation bundle including a Saved Model with many versions.
Fixed unclear error message in scoring recipe when the input dataset is too small to use as background rows for prediction explanation.
Fixed incorrect number of cluster for some AutoML clustering models.
Fixed incorrect filtering of time series when a multi-series forecasting model is published to a dashboard.
Fixed a rare breakage in feature importances on some models.
Charts & Dashboards¶
New feature: Added MAX and MIN aggregations for dates (as measures in KPI and pivot table charts, in tooltips and in custom aggregations)
New feature: Added the option to connect the points on scatter plot and multi-pair scatter plot
Added grid lines in Excel export
Added grid lines for cartesian charts
Added ability to configure max number of points in scatter plots
Added ability to customize the display of empty values in pivot tables
Added ability to set insight name for charts
Improved loading performance of charts with date dimensions
Fixed update of points size in scatter plots
Fixed rendering of charts when collapsing / expanding the help center
Fixed dimensions labels on treemaps
Fixed cache for COUNT aggregation
Fixed “link neighbors” option in line charts with SQL engine
Fixed “show y=x” option on scatter plot
Fixed dashboard’s filters when added directly after a dataset
Fixed “all values” filter option with SQL engine
Fixed dashboard filters when using mixed cased columns names on a database which is case insensitive on columns names
Fixed excluding cross-filters for numerical dimensions using “Treat as alphanumerical”
Fixed link to insight from dashboards included into workspaces
Improved Scatter plot performance
Fixed filtering on “No value” in alphanunerical filters with in-database engine
Fixed dashboard’s filters migration script
Fixed intermittent issue on Chrome browser which prevents rendering of Jupyter notebook in dashboards
Fixed error when disabling force inclusion of zero option in time series chart
Datasets¶
New feature: Sharepoint Online connector. DSS can now connect to Microsoft Sharepoint Online (lists and files) without requiring an additional plugin
Updated MongoDB support to handle versions from 3.6 up to 7.0, including Atlas and CosmosDB
Added read support for CSV and Parquet files compressed with Zstandard (zstd)
Added experimental support for Yellowbrick in JDBC connection
Data Quality¶
New feature: Added ability to create templates of Data Quality rules to reuse them across multiple datasets
MLOps¶
New feature: Added text input data drift analysis (standalone evaluation recipe only), relying on LLM Mesh embeddings
New feature: Added model export to Databricks Registry
Added the ability to create dashboard insights from the latest Model Evaluation in a Model Evaluation Store
Added the possibility to use plugins code environments in MLflow imported models
Added support for global proxy settings in Databricks managed model deployment connections
Added support for MLflow 2.13
Fixed incorrect ‘python_version’ field in MLflow exported models
Fixed listing of versions on Databricks registries when the model has a quote in its name
Fixed incorrect warnings in Evaluation recipe’s dataset diagnosis
Flow¶
Added ability to build Flows even if they contains loops
Recipes¶
Stack: Fixed wrong schema when stacking two datasets both containing a column of type string but with different maximum length
Deployer¶
API Deployer: Added a ‘run_test_queries’ endpoint in the public API to execute the test queries associated with a deployment.
Projects Deployer: Added the ability to define “additional content” also in the default configuration of bundles (not just directly on existing bundles)
Unified Monitoring: Added support for Unified Monitoring on automation nodes
Unified Monitoring: Added Data Quality status in Unified Monitoring
Unified Monitoring: Endpoint latency now displays 95th percentile
Unified Monitoring: display projects names rather than keys
Unified Monitoring: Fixed possible issue when opening project details
API designer: Fixed API designer test queries hanging in case of test server bootstrap failure
Added the ability to define environment variables for Kubernetes deployments
Added an “External URL” option for Project & API deployer infrastructures.
API Node: Added new commands to apinode-admin to clean disabled services (services-clean) and unused code environment (__clean-code-env-cache).
Governance¶
New feature: Added ability to set filters on workflow and sign-off statuses
New feature: Added ability to use “negate” conditions in filters
New feature: Added visibility conditions based on a field for views
New feature: Added ability to add additional role assignment rules at the artifact level
Removed the workflow step prefix to use only the step name defined in the blueprint version
Improved the display of the Dataiku instance information
Added project’s cost rating to the overview
Fixed multi-selector search filters
Fixed possible deadlock in hooks
Fixed artifact creation to be possible with just creation permission
Fixed file upload being cancelled on browser tab change
Fixed password reset for Cloud Stacks deployments
Statistics¶
Time series: when using Quarter or Year granularity, added ability to select on which month to align
Coding¶
Added support for Pandas 2.0, 2.1 and 2.2
Added support for conda for Python 3.11 code environments
Fixed write_dataframe failing in continuous Python for pandas >= 1.1
Upgraded Jupyter notebooks to version 6
Code studios¶
Improved performance when syncing a large number of files at once
Added support for ggplot2 in RStudio running inside Code Studios
Elastic AI¶
EKS: Added support for defining nodegroup-level taints
Cloud Stacks¶
Azure: Fixed deploying a new instance from a snapshot if the disk size was different from 50GB
Added more information (Ansible Facts) for use in Ansible setup actions
Dataiku Custom¶
Note: this only concerns Dataiku Custom customers
Added support for the following OS
RedHat Enterprise Linux 9
AlmaLinux 9
Rocky Linux 9
Oracle Linux 9
Amazon Linux 2, 2023
Ubuntu 22.04 LTS
Debian 11
SUSE Linux Enterprise Server 15 SP5
Security¶
Disabled HTTP TRACE verb
Fixed LDAP synchronization correctly denying access to DSS to a user that is no longer in the required LDAP groups but failing to synchronize the DSS groups for this user.
Misc¶
Switched default base OS for container images to AlmaLinux 8
Fixed a rare failure to restart DSS after a hard restart/crash occurring during a configuration transaction
Plugin usage now takes shared datasets into account
Added audit message for users dismissing the Alert banner
Fixed relative redirect for standard webapps
Fixed failure with non-ascii characters in plugin configuration and local UIF execution