DSS 12 Release notes

Migration notes

How to upgrade

Pay attention to the warnings described in Limitations and warnings.

Migration paths to DSS 12

Limitations and warnings

Automatic migration from previous versions is supported (see above). Please pay attention to the following cautions, removal and deprecation notices.

Cautions

  • The SQL engine can now be automatically selected on prepare recipes. In case of issues on prepare recipes that were working prior to the upgrade, you can revert to the DSS engine by clicking on the “Engine: In database” button in the prepare recipe settings.

  • Similarly, the Spark engine can now be automatically selected more eagerly when the storage and formats are compatible with fast Spark execution. In case of issues on recipes that were working prior to the upgrade, you can revert to the DSS engine by clicking on the “Engine: Spark” button in the recipe settings.

  • The Bokeh package has been removed from the builtin Python environment. If you have Bokeh webapps, please make sure to use a code environment. The Bokeh package in the builtin Python environment was using a very old version of Bokeh.

  • The Seaborn package has been removed from the builtin Python environment. If you use this package, please make sure to use a code environment.

  • For Cloud Stacks setups, the OS for the DSS nodes has been updated from CentOS 7 to AlmaLinux 8 (which is a RedHat-compatible distribution similar to CentOS). Custom setup actions may require some updates.

  • For Cloud Stacks setups, R has been upgraded from R 3 to R 4. You will need to rebuild all R code envs. Some updates to packages may be required.

  • For Cloud Stacks, the builtin Python environment has been upgraded from Python 3.6 to Python 3.9

  • The versions of some packages in the builtin Python environment have been upgraded, and your code may require some updates if you are not using your own code environment. The most notable updates are:

    • Pandas 0.23 to 1.3

    • Numpy 1.15 to 1.21

    • Scikit-learn 0.20 to 1.0

    • Matplotlib 2.2 to 3.6

  • The python packages used by Visual Machine Learning have changed, in the built-in code environment and in suggested packages. Notably, if you have KNN or SVM models trained using the built-in code environment, you will need to retrain these models to be able to use them for scoring.
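
If you rely on the builtin Python environment, it can help to check which package versions your code actually sees after the upgrade. The following is a minimal sketch, not a Dataiku API: the helper names (major_minor, is_at_least, report) are hypothetical, and the baselines are simply the post-upgrade versions listed above.

```python
import re
from importlib import metadata

# Post-upgrade versions listed in the cautions above (builtin environment).
BASELINES = {
    "pandas": "1.3",
    "numpy": "1.21",
    "scikit-learn": "1.0",
    "matplotlib": "3.6",
}


def major_minor(version):
    """Return (major, minor) as integers from a version string such as '1.3.5'."""
    numbers = re.findall(r"\d+", version)
    major = int(numbers[0])
    minor = int(numbers[1]) if len(numbers) > 1 else 0
    return major, minor


def is_at_least(installed, baseline):
    """True if the installed version is at or above the baseline (major.minor only)."""
    return major_minor(installed) >= major_minor(baseline)


def report(baselines=BASELINES):
    """Map each package to 'missing', 'older' or 'ok' for the current interpreter."""
    status = {}
    for name, baseline in baselines.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "missing"
            continue
        status[name] = "ok" if is_at_least(installed, baseline) else "older"
    return status


if __name__ == "__main__":
    for package, state in report().items():
        print(package, state)
```

Running this inside a notebook or recipe that uses the builtin environment gives a quick indication of whether code written against the older package versions needs review. Code that runs in its own code environment is not affected by the builtin upgrade.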

Support removal

Some features that were previously announced as deprecated are now removed or unsupported.

  • Support for H2O Sparkling Water as a backend for Visual Machine Learning has been removed

Deprecation notices

DSS 12 deprecates support for some features and versions. Support for these will be removed in a later release.

  • Support for Cloudera CDH 6

  • Support for Cloudera HDP 3

  • Support for Amazon EMR 5

  • Support for Java 8

  • Support for CentOS 8

Version 12.6.2 - May 16th, 2024

DSS 12.6.2 is a new feature and bugfix release.

LLM Mesh

  • New feature: Added support for Mistral AI La Plateforme, supporting Mistral Small, Large, Mistral Embed, Mistral 7B and Mixtral 8x7B

  • New feature: Added support for Snowflake Cortex LLMs, including Snowflake Arctic

  • New feature: Added support for Llama3 local LLM

  • New feature: Added support for OpenAI GPT-4o

  • New feature: Added support for Claude 3 on AWS Bedrock

  • New feature: Added support for DBRX-Instruct on Databricks Mosaic AI

  • New feature: Added support for token streaming on Azure OpenAI, Mistral AI, and custom LLM connections (when supported by the plugin).

  • Added support for Mistral 7B v0.2 local LLM

  • Added support for the non-preview GPT-4-Turbo LLMs on OpenAI connections

  • Added support for the Organization field in OpenAI connections

  • Improved connection remapping when importing projects that use LLM connections and were exported from DSS older than 12.6.0

  • Improved resiliency when calls to LLM service providers fail

  • Improved surfacing of error details when failing to download models from HuggingFace

  • Sped up single-completion API calls

  • Updated LLM pricing information

  • Fixed RAG usage of some augmented models

  • Fixed PII detection failure when a message is made of purely non-alphabetic characters

  • Fixed an error where some Anthropic connections lacked an “anthropic-version” header

  • Fixed an issue with DKULLM / DKUChatLLM when using stop sequences with some regex-significant characters
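
To illustrate why regex-significant characters in stop sequences are delicate, here is a minimal sketch of truncating generated text at the first stop sequence. It is an illustration of the general pitfall only and does not show how DKULLM or DKUChatLLM are implemented; the function name truncate_at_stop is hypothetical.

```python
import re


def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence.

    Each stop sequence is escaped with re.escape so that characters such as
    '.', '(', '$' or '+' are matched literally instead of being interpreted
    as regular-expression syntax.
    """
    literals = [re.escape(s) for s in stop_sequences if s]
    if not literals:
        return text
    match = re.search("|".join(literals), text)
    return text if match is None else text[: match.start()]


if __name__ == "__main__":
    # Without escaping, "." would match the very first character.
    print(repr(truncate_at_stop("a+b=c. done", ["."])))
```

Without the escaping step, a stop sequence such as "." would be treated as "any character" and the output would be cut immediately.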

Machine learning

  • Added ability to hold a configurable part of the train set to fit probability calibration, instead of fitting it on the test set.

  • Added support for custom metrics in learning curves

  • Improved consistency of Java scoring engine for XGBoost models

  • Improved memory usage and performance when deleting some partitioned models with a lot of partitions

  • Improved display of calibration curve

  • Added support for scipy 1.10
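
The idea behind holding out part of the train set for probability calibration can be sketched as a simple split. This is a hedged illustration of the concept only: the function name split_for_calibration and the calibration_ratio parameter are hypothetical, and the sketch does not describe how DSS performs the split internally.

```python
import random


def split_for_calibration(train_rows, calibration_ratio=0.2, seed=1337):
    """Split the train set into a model-fitting part and a calibration part.

    The test set is not involved at all, so it stays untouched for the
    final evaluation of the calibrated model.
    """
    if not 0.0 < calibration_ratio < 1.0:
        raise ValueError("calibration_ratio must be strictly between 0 and 1")
    rows = list(train_rows)
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    random.Random(seed).shuffle(rows)
    n_calibration = max(1, int(round(len(rows) * calibration_ratio)))
    n_calibration = min(n_calibration, len(rows) - 1)
    # First slice fits the model, second slice fits the calibrator.
    return rows[n_calibration:], rows[:n_calibration]


if __name__ == "__main__":
    fit_part, calibration_part = split_for_calibration(range(100), 0.2)
    print(len(fit_part), len(calibration_part))
```

The model would be trained on the first part and the calibrator fitted on the second, so that test metrics are not biased by calibration having seen the test rows.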

Charts and Dashboards

  • Fixed an issue with imported dashboards, charts or datasets that contain a date filter, preventing them from loading

  • Fixed editing of tile default title

  • Fixed repeated first page when exporting long dashboards to PDF

  • Fixed dashboard filters not working with shared datasets

  • Fixed an issue happening when moving from a chart with the “Generate one tick per bin” option enabled on a dimension to a chart which is not compatible with this option (e.g. scatter plot)

  • Fixed rectangle zoom selection on chart to actually set the zoom to the window defined by the user (rather than trying to keep the aspect ratio)

  • Fixed truncated chart when downloading as an image when the height is too small compared to the width

  • Made dashboard filters keep active user selection when switching pages

  • Fixed date range filter summary to actually reflect user selection

  • Decreased default point size for scatter multi-pair

  • Fixed numeric and date-part filters not clearable when in “multiple values” mode with lots of possible values

  • Fixed the “Generate one tick per bin” generating an empty chart

  • Fixed drag and drop of columns not opening tabs in left panel

  • Fixed reading shared datasets in dashboards without read permission on the source project

Code Studios

  • Fixed Gradio block failing

  • Disabled copying of non-needed expensive folders (‘CachedExtensions’ and ‘CachedExtensionVSIX’)

  • Added ability to select a specific user to run backend of webapps

  • Fixed template export failing when using ${template.resources}

Datasets and connections

  • Databricks: Added OAuth support on AWS

  • Databricks: Fixed issues with recipes failing when the input dataset is using Parquet format with logical types such as Date or Decimal

  • Snowflake: Fixed issues with recipes failing when the input dataset is using Parquet format with logical types such as Date or Decimal

  • GCS: Fixed authentication failure when using p12 credential files with Parquet

  • BigQuery: Added ability to specify a Customer-managed encryption key (CMEK) to encrypt/decrypt data in the built-in driver

  • Excel: Added ability to create multiple datasets when uploading files containing multiple sheets

  • Added progress dialog when downloading/exporting managed folders

  • Fixed issue where a dataset created from a managed folder stored on S3 could not be deleted

  • Fixed failing managed folder download when folder name is less than 3 characters long

Data Quality

  • Fixed column statistics metrics failing on partitioned datasets

  • Fixed computation of “unique value count” metric and rule

Visual recipes

  • Prepare: Fixed recipe failing to run on SQL engine if the same column is added twice or more in a “Keep columns” step

  • Prepare: Fixed “User Agent Classifier” step failing when running on Snowflake with UDF

  • Prepare: Fixed slowness in the user interface when a “Keep/remove column” step contains a large number of columns

  • Prepare: Fixed recipe failing to run on SQL if a “Find/Replace” step is misconfigured

  • Prepare: Added support for XML, JSON and “one record per line” formats in the “Enrich with record context” step

  • Sync: Fixed issue when running recipe where BigQuery input dataset contains a column of type boolean containing Yes/No values

  • Stack: Added ability to insert Stack recipes between 2 existing datasets

  • Fixed issue preventing the use of project libraries in Python/R continuous recipes

  • Plugin recipes are now displayed in alphabetical order in right panels

  • Fixed missing warning status on some jobs running in containers using the DSS engine

Scenarios and automation

  • Fixed copy of Python-based scenario that did not copy the script

  • Fixed creation of build steps that wrongfully displayed datasets from previous steps

Webapps

  • Added ability to load static resources for webapps from project-level libraries in addition to global ones

Deployer

  • Optimized the deployment of API services by avoiding multiple builds of the same code environments when used in multiple API endpoints

  • Fixed API deployer infrastructure extra options for building code environments not taken into account when deploying an API service

  • Fixed code environment resources initialization script not being executed when building API node image

  • Fixed failing deployments when infrastructure monitoring uses the “auto push” mode and deployer URL is empty (may happen with broken nodes directory).

Coding

  • Fixed issue with flow traversal APIs when there is a Knowledge Bank in the flow

  • Fixed calling read-data from R code running in parallel in containers causing failure

  • Added Dataset.to_html method to export a dataset as HTML with conditional formatting applied

  • Added as_type parameter to DSSLibraryFile.read method to allow reading files in binary format

  • Fixed DSSProjectGit.add_library, set_library and remove_library methods failing when called outside DSS

  • Fixed no_check_certificate not being taken into account when calling dataiku.set_remote_dss

Security

  • Added support for OAuth refresh token rotation

  • Removed ability for users to grant permission to “All users” built-in group when visibility of groups and users is restricted

Cloud Stacks

  • Fixed dss_group ansible module

Misc

  • Added ability to access logs of a Python macro even when the code doesn’t fail

  • Fixed inapplicable warning when installing an API node

  • Fixed compute resource information that could include wrongful context information

  • Automatically delete old images from repository when rebuilding code envs, or containerized execution images

  • Added explanation messages in the Jobs user interface when a job is waiting for external resources

  • Fixed API node status reporting failing when it is exposed through a load balancer

  • Fixed cases of stuck Python recipe appearing as wrongly successful when running in a container

  • Added ability for administrators to display a custom message for users who request to upgrade their profiles

  • Fixed failures with plugins with presets when enabling containerization for DSS engine

  • Fixed writing dataframe using Spark failing if dataframe is empty

Version 12.6.1 - April 26th, 2024

DSS 12.6.1 is a security, performance and bugfix release

Datasets and connections

  • Snowflake: Fixed per-user credentials in user/password mode

Charts

  • Fixed PDF export of scatter multi-pair chart

LLM Mesh

  • Fixed tokens streaming on AWS Bedrock

Jobs

  • Restored proper error message when job resolution fails

  • Fixed jobs hanging after aborting pending jobs

Cloud Stacks

  • Fixed GPU support on EKS and AKS

Security

Misc

  • Fixed instance crash when a library file name contains emoji characters

  • Fixed Spark SQL validation when encrypted RPC is enabled

Version 12.6.0 - April 3rd, 2024

DSS 12.6.0 is a new feature and bugfix release.

New feature: Data Quality

Data Quality offers pre-built dashboards for monitoring dataset quality within a single project or across an entire Dataiku instance. Users now have the option to select from a range of rules to evaluate the quality of data in datasets.

Data Quality replaces Checks for Datasets. Existing checks on datasets are seamlessly migrated to Data Quality rules.

New feature: Filter panel in Dashboards

It is now possible to display filters outside of the page’s grid with the flexibility to position them on top, right, or left. It is also still possible to put filters directly within the grid, as previously.

Filters layout has been optimized for both horizontal and vertical display.

It’s also now possible to define the order of filters by drag-and-drop.

LLM Mesh

  • New feature: Added support for Claude 3 (Opus, Sonnet, Haiku) models in the Anthropic connection.

  • New feature: Added support for Mixtral-8x7B on HuggingFace local connection

  • Improved performance of local HuggingFace inference, as well as improved stability in low memory situations

  • Added ability to remap LLM connections and Knowledge Bank code environments when exporting/importing a project.

  • Added contextual menu to Knowledge Banks in the Flow.

  • Added support for stop sequences in the completion API, for API-based LLMs that support it.

  • Fixed RAG compatibility with langchain_community 0.0.27

  • Fixed rebuild of a ChromaDB or Qdrant local Knowledge Bank that could cause duplicate content or metadata.

  • Added proxy support to the Databricks Mosaic AI connection.

  • Removed support for MosaicML connections (MosaicML Inference was retired on February 29, 2024). Use Databricks Mosaic AI connections instead.

Charts & Dashboards

  • New feature: Added Min and Max aggregations for alphanumeric columns

  • Added more predefined options to relative date filters

  • Added options to persist zoom and to display timeline when publishing a line chart on a dashboard

  • Fixed an issue with dashboard title edits not being applied

  • Fixed an issue with pivot tables and tree map when not using the “Group extra values as ‘others’” option with SQL engine

  • Fixed an issue when switching from 2D distribution to box plot, then to pie chart

  • Fixed a few issues downloading charts as images

  • Fixed an issue on line chart when Y axis has a manual range set

  • Fixed an issue with zoom not being well preserved on scatter plots when reloading the chart

  • Fixed a few issues with zooming on scatter multi-pair plot

  • Fixed an issue with zoom brush not being updated on line charts when using the mouse wheel

  • Fixed an issue with keyboard shortcuts for rectangle zoom selection on Windows (now using Alt+Shift)

  • Fixed an issue with axis ticks configuration on scatter multi-pair plot

  • Fixed an issue with the axis scale not being updated on scatter plot when zooming in.

  • Fixed the scatter plot to disable axis padding when a manual range is set

  • Fixed scatter plot to not display “chart saved” after zooming when “preserve zoom” option is not used

  • Fixed an issue with the “revert to DSS engine” button

Machine learning

  • New feature: Regression models can now output prediction intervals, and those intervals are also usable in ML Overrides and Java model export.

  • Added override information to Java export of overridden models.

  • Added a Predicted data preview in saved models, similar to that of the analysis Lab.

  • Added support for Poisson and Tweedie objectives for XGBoost regression models.

  • Added support for scikit-learn 1.3 in Visual Machine Learning.

  • Added ability to use sparse matrices with Random Forest and Gradient Boosting models, that can help train faster and with less memory.

  • Added support for Python export and SQL scoring for XGBoost models trained with sparse matrices.

  • Added configurable GPU settings on scoring and evaluation recipes using DNN, XGBoost and some time series forecasting models.

  • Time series: Added ability to cross-test time series forecasting model on folds of equal duration.

  • Time series: Added ability to select time series forecast alignment month within a quarter or year.

  • Improved automated selection of features when computing feature importance of models using PCA feature reduction in preprocessing.

  • Improved compatibility for y_valid in custom metrics between evaluation and optimization.

  • Improved training speed of some time series forecasting models.

  • Fixed class switching on Feature Importance charts for multiclass classification models.

  • Fixed possible race condition causing the retraining of multiple partitions of a partitioned saved model to fail.

  • Fixed failing Java scoring, in partition dispatch mode, of a retrained partitioned saved model.

  • Fixed failing computation of learning curves in Lab models for some preprocessing configurations.

  • Fixed display of only the first tree on the Decision Trees reports of MLlib models that use multiple trees.

  • Fixed incorrect early stopping info in XGBoost Training Information report.

  • Improved memory efficiency of impact coding

  • Fixed impact coding with SQL scoring
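
As an illustration of what "folds of equal duration" means for a time range, the sketch below cuts a period into contiguous windows of identical length. It is only a generic example under stated assumptions: the function name equal_duration_folds is hypothetical, and the sketch does not describe the splitting scheme DSS uses for time series cross-testing.

```python
from datetime import datetime, timedelta


def equal_duration_folds(start, end, n_folds):
    """Split [start, end) into n_folds contiguous windows of identical duration."""
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    if end <= start:
        raise ValueError("end must be after start")
    step = (end - start) / n_folds
    # Each fold is a (window_start, window_end) pair; windows do not overlap.
    return [(start + i * step, start + (i + 1) * step) for i in range(n_folds)]


if __name__ == "__main__":
    for window_start, window_end in equal_duration_folds(
        datetime(2024, 1, 1), datetime(2024, 1, 13), 4
    ):
        print(window_start.date(), "->", window_end.date())
```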

Datasets

  • Fixed the sampling user interface when a dataset is displayed within a Workspace

  • Added ability to write BigQuery datasets without going through GCS (via Storage API)

  • Added thousands-separators on numbers in Analyze view to improve readability

  • Editable dataset: Added right-click menu on headers

  • SQL: Added ability to write DSS dates (i.e. date+timestamp) into existing SQL tables with a SQL date (i.e. date only) type

  • Delta Lake: Fixed reading of Decimal data type on Delta Lake files written by PySpark

Recipes

  • Filter/Sampling: Changed defaults for newly created recipes to use no sampling

  • Prepare: Fixed Create If/then rules not detecting that some conditions cannot be translated to SQL, leading to wrong engine selection.

  • Prepare: Fixed filtering on NUMERIC SQL data types

  • Prepare: Fixed filtering on boolean SQL data types on SQLServer

  • Prepare: Fixed computation error with date difference processor on PostgreSQL

  • Fixed some inconsistencies in partitioning testing

AI Assistants

  • Added ability to explain a project & automatically generate its description even when its Flow contains zones

  • Fixed Code assistant in Code Studios when encrypted RPC is enabled

Coding & API

  • Added Python API to send emails via the channels defined by administrators

  • Added Python API for git capabilities of project libraries

  • Fixed DSSWiki.get_export_stream and DSSWikiArticle.get_export_stream to correctly take into account the export_attachment parameter when it’s set to True.

  • Fixed REST API listing projects with a single value for the optional tag parameter

  • Fixed Python API DSSScenarioRun.duration property raising an error when invoked

Code studios

  • Added impersonation support to Streamlit webapps

  • Fixed RStudio preferences not correctly synchronized back to DSS

  • Fixed deployment of webapps with code envs on automation node

Governance

  • New feature: added the ability to specify custom filters on Business Initiatives, Governed Projects and custom pages

  • New feature: Added a link on Dataiku projects to materialize the relationship between Dataiku Applications and their instances

  • Added indicators for bundles making use of external AI services or local LLMs

  • Added blueprint version migrations in exported blueprint versions

  • Improved the robustness of the full synchronization of a design node by not failing the whole process when one item fails to sync

  • Fixed deletion of empty elements in lists

Statistics

  • New feature: Univariate Analysis recipe. Export a univariate analysis card from the Statistics tab into its own recipe, for more automation and flexibility. Create Statistics recipes (Univariate Analysis, Statistical tests, PCA) from the flow, by selecting a dataset and going to the right side menu, or by clicking the +Recipe button (in Visual > Generate Statistics).

  • Added support for missing value in custom binning Split.

  • Exposed the ANOVA degrees of freedom, in both card and recipe output.

  • Fixed configuration UI of Bivariate analysis cards, options were not always immediately reflecting changes in column selection.

  • Fixed refresh of Statistics Dashboard insights when changing filter settings.

  • Fixed possible out of bound error in time series cards on large datasets.

  • Fixed possible statistics computation slowness when using more recent versions of pandas.

Labeling

  • Added ability to use empty annotations for Object Detection and Text Span Annotation tasks, i.e. it’s possible to label an image where there are none of the objects in scope.

MLOps & API Deployer

  • Fixed build of API deployer image with Python 3.10 built-in environment

  • Fixed usage of inherited code environment in API test queries

  • Fixed an issue with updating SageMaker deployments when no change happened in the deployment configuration

  • Allow ‘/’ in images prefix setting on Deploy Anywhere infrastructures

  • Added the ability to edit existing external monitoring scopes in Unified Monitoring settings

  • Added an option to clean generated files when deleting a scope in Unified Monitoring settings (true by default).

  • Added an option in monitoring infrastructure settings to automate the connection to the event server

  • Fixed an issue with monitoring wizard when “path within connection” is empty in the event server connection settings

  • Fixed the display of “run as” user in Unified Monitoring when scenario is set to run as last author

Cloud Stacks

  • Fleet manager images now run AlmaLinux 8

  • Added session expiration option on Fleet Manager

  • Azure: Allow data disk resize (while instance is deprovisioned)

  • AWS / GCP: Added ability to specify tags that will be added to resources deployed by a Fleet manager template.

  • Fixed rare startup failure

Performance & Scalability

  • New feature: The DSS engine can now run containerized for most visual recipes, which alleviates load on the DSS machine and permits more scalability even with the DSS engine

  • New feature: Tableau and PowerBI exports can now run containerized

  • Added a maximum number of concurrently running jobs (not only job activities). Excess jobs are automatically queued

  • Improved responsiveness when starting a job

  • Improved memory efficiency of Prepare recipe previews with many deleted columns.

  • Improved performance for computing code env usage

  • Performance improvements in various API calls

  • Fixed possible instance hang when using extremely nested formulas in computed columns

  • Improved performance of starting up Spark jobs with large number of connections

Hadoop

  • Added support for CDP 7.1.9

  • Added timeout for validation of Hive queries to avoid blocking further validations

Security

  • Added ability to force LDAP (as well as Azure AD and custom suppliers) group synchronization before starting a scenario on behalf of another user

  • Fixed issues with login groups restriction for groups containing commas in their names.

  • Sessions of users that become disabled due to an LDAP group synchronization are now immediately invalidated

  • User names from 1 to 241 characters are now accepted (was previously 3 to 80).

  • Added ability to disable signature checks for self signed certificates on API nodes with OAuth2

Elastic AI

  • Added support for configurable shared memory (/dev/shm) size in container execution configuration. This is useful notably for multi-GPU scenarios

  • Automatically gather pod CPU usage while running jobs

  • Fixed case where terminated pods for failed containerized notebooks could remain registered in the cluster

Miscellaneous

  • Added ability for administrators to display a custom message on login screen, for example to explain how to get granted access, or to reset a password, etc.

Version 12.5.2 - February 26th, 2024

DSS 12.5.2 is a new feature and bugfix release.

LLM Mesh

  • New feature: Added support for Databricks Mosaic AI LLMs (completion + embedding)

  • New feature: Added support for AWS Sagemaker JumpStart LLMs

  • OpenAI: Added text-embedding-3-small and text-embedding-3-large embedding models

  • Azure OpenAI: Added ability to customize cost

  • Bedrock: Added support for embedding models

  • Bedrock: Added support for custom models

  • Bedrock: Added configurable timeout

  • HuggingFace: Added support for multi-GPUs for LLM inference in container execution

  • HuggingFace: Added support for batching for HuggingFace completion

  • HuggingFace: Fixed download of DistilBERT SST-2 when using model cache

  • Vertex: Fixed Vertex AI LLM connection’s network settings

  • Fixed display of icons for Knowledge Banks & Prompt Studios in Wikis

Machine learning

  • Fixed minor issues in model overrides

  • Fixed subpopulation analysis for MLFlow models and containerized execution

  • Fixed training of LightGBM algorithm with Bayesian search and K-fold enabled when an explicit number of leaves is set

  • Added “Minimum sampled per leaf” option in LightGBM algorithm settings

  • Added support for MLFlow and External models to the “get_predictor” method

  • Fixed too strict workspace name validation for AzureML External Models

  • Fixed “Test” action of Databricks model deployment connection at creation time

MLOps

  • Improved UI performance when a single store contains thousands of evaluations

  • Fixed subpopulation computation on Model Evaluation from standalone evaluate recipe when run in a container

Charts

  • New feature: Added zoom on scatter and scatter multi-pair charts by drawing a rectangle

  • Improved performance of scatter multi-pair charts with large number of points

  • Fixed wrong safety on X-axis when there are more than 10k points to be drawn

  • Fixed possible overlap between chart values and axis with negative values

  • Fixed Sankey chart if curvature parameter had been previously set to 0

  • Fixed scatter charts with reference line display

  • Various small bug fixes on reference lines, scatter multi-pair and Geo Map charts

Dashboards

  • Fixed dashboard filters in dashboard exports

AI Assistants

  • Fixed missing scrollbars when using AI Explain on a highlighted segment of code

Governance

  • Fixed item editing when opened from a table row

Code Studios

  • Fixed possible error when a Code Studio template does not have a label defined

  • Fixed Dataiku API initialization in RStudio when encrypted RPC is enabled

Datasets and connections

  • Elasticsearch: added typing for nested object fields

  • Elasticsearch: Fixed filters on time dimensions for field-based partitioning

  • Elasticsearch: Fixed export to recipe does not include Elasticsearch query string anymore

  • Snowflake: Fixed project-level override of connection variables with Snowpark

  • Databricks: Fixed project-level override of connection variables

  • Files in folder: Fixed the “Test & Get schema” button on existing datasets

  • GCS: Fixed access token refresh

  • Added ability to define extra Hadoop configuration keys on cloud storage connections

Visual recipes

  • Prepare: Prevented “Date Part” mode settings of the “Keep rows” processor from being reset when opening the recipe

  • Prepare: Fixed disabled run button after running “remove column” step with error

  • Prepare: Fixed find and replace step on PostgreSQL if the replacement value is not the same type as the input column

  • Sync: Fixed sync from BigQuery to GCS when source is a BigQuery view

  • Improved the display of datasets with long names in dataset selection fields

  • Improved validation error when some connections have been deleted

Scenarios and automation

  • Fixed UI issue when adding several unsaved steps of the same type

  • Fixed “Include only required saved model versions” option when the bundle contains a clustering model

  • Fixed API service package generation from the automation node

Deployer

  • Fixed code env variables access from API endpoint when running on Kubernetes

Coding

  • Added Python API to set up SSO, LDAP and AzureAD settings

  • Added Python API to set up FM keyvault

  • Added helper to manage code env presets

  • Added safety to prevent creation of DSS_INTERNAL code environment

Cloud Stacks

  • Fixed “re-provision from snapshot” action in Fleet Manager if existing disk cannot be saved

  • GCP: default to “eu” image for region outside of us/eu/asia

  • GCP: avoid NTP drift if “restrict metadata access” option is checked

  • Upgraded kubectl version in DSS images. Old managed Kubernetes clusters may need to be recreated.

Elastic AI

  • Added a mechanism to retry instead of fail in case a ResourceQuota error is raised
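
The general retry-instead-of-fail pattern can be sketched as follows. This is a generic illustration with exponential backoff, not DSS's actual mechanism: the ResourceQuotaError class, the run_with_retry function and all parameter names and defaults are hypothetical.

```python
import time


class ResourceQuotaError(Exception):
    """Hypothetical stand-in for a quota rejection raised by the cluster."""


def run_with_retry(action, max_attempts=5, initial_delay=1.0, backoff=2.0,
                   sleep=time.sleep):
    """Call action(), retrying on ResourceQuotaError with growing delays.

    The last failure is re-raised once max_attempts has been reached.
    The sleep function is injectable so the helper can be tested quickly.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except ResourceQuotaError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= backoff


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_submit():
        # Fails twice with a quota error, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ResourceQuotaError("quota exceeded")
        return "submitted"

    print(run_with_retry(flaky_submit, sleep=lambda seconds: None))
```

Retrying is useful here because quota pressure is often transient: resources are released as other pods finish, so a later attempt can succeed without user intervention.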

Misc

  • Fixed Requests database not copied when migrating internal database from H2 engine to PostgreSQL

  • Fixed “Add remote” in project version control when setting a custom ssh command

  • Fixed a potential issue saving a deeply nested file

Version 12.5.1 - January 31st, 2024

DSS 12.5.1 is a performance and bugfix release.

Code studios

  • Fixed existing Code Studios not starting

Deployer

  • Fixed error when saving project deployment settings

Kubernetes

  • Fixed an issue leading to Kubernetes pods leak when aborting a job

Code environments

  • Fixed possible issue when listing code environments if there is a corrupted code environment definition

Version 12.5.0 - January 23rd, 2024

DSS 12.5.0 is a significant new release with new features, performance enhancements and bugfixes.

New feature: Unified Monitoring

Unified Monitoring provides out-of-the-box dashboards to monitor the health and status of both Dataiku projects and API deployments.

Unified Monitoring is part of the Dataiku Deployer.

Monitoring of API deployments includes both DSS-managed endpoints and other endpoints not managed in DSS (referred to as External Endpoints).

New feature: LLM Mesh General Availability

The LLM Mesh is now Generally Available

New feature: AI Code Assistant

The new Dataiku AI Code Assistant for Python helps you write, explain, or debug code, comment and document your work, create unit tests, and more.

The Dataiku AI Code Assistant is available:

  • In Visual Studio Code (in newly created Code Studios)

  • In Jupyter notebooks. Run %load_ext ai_code_assistant to start using it

AI Code Assistant must first be enabled by the administrator in Admin > Settings > AI Services. It requires a connection to the LLM through the LLM Mesh

New feature: AI Prepare

In the prepare recipe, AI Prepare allows you to describe the transformation you want to apply. The AI assistant generates the necessary data preparation steps.

AI Prepare must first be enabled by the administrator in Admin > Settings > AI Services, and requires agreeing to Dataiku AI Services Terms of Service.

New feature: AI Explain Flow

In the flow, AI Explain can automatically generate descriptions of what the whole Flow, or individual Flow zones do.

AI Explain Flow must first be enabled by the administrator in Admin > Settings > AI Services, and requires agreeing to Dataiku AI Services Terms of Service.

New feature: AI Explain Code

In Python code recipes, this feature can automatically generate descriptions of what the code does.

AI Explain Code must first be enabled by the administrator in Admin > Settings > AI Services, and requires agreeing to Dataiku AI Services Terms of Service.

New feature: OpenAPI generation

In the API designer, you can now define and publish OpenAPI specifications for API endpoints. OpenAPI, formerly known as Swagger, is a standard specification for APIs. OpenAPI support helps make Dataiku API endpoints accessible and usable by standard API portals. It also helps organize and advertise to end users how to use these endpoints.
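
For readers unfamiliar with the format, the sketch below builds a minimal OpenAPI 3.0 document for a single JSON endpoint. It only illustrates the general shape of an OpenAPI specification: the function name minimal_openapi_spec, the "/predict" path and the titles are hypothetical, and this is not the document that the API designer generates.

```python
import json


def minimal_openapi_spec(title, version, path, summary):
    """Build a minimal OpenAPI 3.0 document describing one POST endpoint."""
    json_object = {"application/json": {"schema": {"type": "object"}}}
    return {
        "openapi": "3.0.3",
        "info": {"title": title, "version": version},
        "paths": {
            path: {
                "post": {
                    "summary": summary,
                    "requestBody": {"required": True, "content": json_object},
                    "responses": {
                        "200": {
                            "description": "Successful response",
                            "content": json_object,
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical endpoint path and titles, for illustration only.
    spec = minimal_openapi_spec(
        "Example service", "1.0.0", "/predict", "Score one record"
    )
    print(json.dumps(spec, indent=2))
```

A document of this shape is what API portals and client generators consume, which is why publishing one makes endpoints easier to discover and use.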

New feature: conditional formatting

You can now color cells of a Dataset containing values that satisfy particular conditions or criteria.

Conditional formatting is available in Explore, in Excel exports, and in Excel attachments in scenarios

New feature: text and record labeling

In addition to image classes, object bounding boxes and text spans, Dataiku managed labeling can now label records (dataset rows) and text.

LLM Mesh

  • Added ability to use custom HuggingFace models. Only fully-checkpointed models are supported (adapter models are not) and must belong to the family of a supported built-in model. This allows the use of LLMs based on Mistral, Llama2, Falcon… as well as embedding, text classification or text summarization models.

  • Knowledge Banks can be included in project exports and automation bundles

  • Knowledge Banks can be built in scenarios

  • Prompt Studio: Added Chain-of-Thought prompt sample

  • Prompt Studio: Improved experience for prompt creation

  • Added ability to customize the contextual insertion message for RAG

  • Fixed broken toxicity detection and forbidden terms filtering in recipes

  • Fixed prompt formatting for Mosaic MPT model

Charts & Dashboards

  • Formatting of charts is now centralized in a new “Format” panel

  • In dashboards, added the ability to create Exclude cross-filters for charts and datasets

  • Filters applied on Explore are now kept when publishing a dataset to a dashboard

  • Added the ability to format both left and right axes in the following charts: vertical bars, lines and mix charts.

  • Added the possibility to show reference lines on Scatter multi-pair charts

  • Added the ability for the user to disable tooltips

  • Fixed records count display in scatter multi-pair

  • Fixed an issue in pivot tables when dealing with large numbers

  • Fixed an issue when copying charts with a large number of settings (reference lines, filters, …)

  • Added the ability to set thumbnails and descriptions for workspaces

  • Fixed issue with refresh of workspace permissions when losing access to a workspace

Flow

  • Added a new Flow view, “Column usage”, to see where some given columns appear

Recipes

  • Prepare & Sample/Filter: Added “is between” and “is not between” operators

  • Prepare: Added SQL support for date parsing patterns with 2-digit years

Datasets

  • Filters in Explore view now display up to 500 values (up from 50)

  • Metrics: Fixed display of history of metrics of type “Most frequent values > Mode”

  • BigQuery: Fixed error when listing the partitions of a table partitioned in BigQuery with BigQuery option “partition filter required”

  • New view in dataset to display one or multiple large values from a single line

  • Databricks: Fixed error when performing fast write to Databricks on partitioned datasets with an empty partition

Visual Machine Learning

  • New feature: Clustering models now feature a chart plotting the model’s performance against the number of clusters. This helps to visually check the optimal number of clusters (elbow chart)

  • New feature: Prediction Overrides can now use Uncertainty (for binary or multiclass classification tasks) as criteria for matching. This enables rules such as “if the uncertainty of the model is above 30%, decline to make a prediction”.

  • Added support for sparse matrices with XGBoost and LightGBM, allowing for faster and more memory-efficient training

  • Added support for scikit-learn 1.2

  • Added support for Python 3.11 (except for Visual Deep Learning and Causal ML)

  • Added support for pmdarima 2.0.x and prophet 1.1.x for Time series forecasting

  • Removed upper limit on L1 and L2 regularization for XGBoost

  • Fixed display of value distribution for text features in Design settings

  • Fixed incorrect reuse of older dataset extracts for training following reversal of design settings to a previous session

  • Fixed failing download of training diagnostic in some partitioned training cases

  • Fixed clustering & scoring recipes using Isolation Forest with Outlier Detection enabled

  • Fixed clustering training recipe on automation node when no version was bundled in the Saved Model

  • Fixed possible permission issue in Model Error Analysis when a model was trained by another user
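The uncertainty-based override rule quoted above ("if the uncertainty of the model is above 30%, decline to make a prediction") can be sketched in a few lines. This is only an illustration: the notes do not say how DSS computes uncertainty, so the definition used here (1 minus the highest class probability) is an assumption, and the function names are hypothetical rather than part of any DSS API.

```python
from typing import Dict, Optional


def uncertainty(probabilities: Dict[str, float]) -> float:
    # ASSUMPTION: uncertainty = 1 - highest predicted class probability.
    # The release notes do not define the measure DSS actually uses.
    return 1.0 - max(probabilities.values())


def predict_with_override(
    probabilities: Dict[str, float], max_uncertainty: float = 0.30
) -> Optional[str]:
    """Return the most likely class, or None to decline to predict."""
    if uncertainty(probabilities) > max_uncertainty:
        return None  # override matched: decline to make a prediction
    return max(probabilities, key=probabilities.get)


# Made-up probabilities for illustration.
confident = predict_with_override({"churn": 0.9, "no_churn": 0.1})
hesitant = predict_with_override({"churn": 0.55, "no_churn": 0.45})
```

Here the first call returns a class because the model is confident, while the second returns None because the assumed uncertainty (0.45) exceeds the 30% threshold.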

Govern

  • New feature: Artifact timeline: Enriched the timeline view in artifacts with details about the changes made to the artifact, and with filtering capabilities

  • Added indicators for projects making use of external AI services or local LLMs

  • Added indicators for projects that are either a Dataiku application template or an application instance

  • Added support for AWS SES in email settings

  • Fixed blueprint version list issue when deleting a blueprint version that was forked to create other blueprint versions

  • Fixed an issue preventing a blueprint version from being saved again after a first save was rejected because of validation errors

Webapps

  • Fixed an issue accessing webapps with a vanity URL on Cloud Stacks

  • Fixed accessing Dash, Shiny & Bokeh Webapps with a URL that does not end with a slash

Code Studios

  • New feature: Added ability to create and deploy Gradio webapps

  • New feature: Added ability to create and deploy Voila webapps

  • Added ability to disable the container cache when building templates

MLOps

  • In the Evaluation recipe for AWS Sagemaker external models, the data drift column handling now relies on the model’s schema rather than on the dataset’s columns

  • Added support for Sagemaker log format in the Evaluation recipe for models deployed through the Deployer

  • Fixed an issue in the Evaluation recipe when the evaluation dataset contains columns with a name that looks like a probability column

  • Fixed a compatibility issue for model import from Databricks

Labeling

  • Added support for comments on Labeling tasks. When enabled on a Labeling task, Annotators and Reviewers can add a comment along with their annotations. Comments show up in Review, and are included in the output dataset.

APIs and coding experience

  • Added support for Pandas 1.4 and 1.5 in code environments

  • Added support for using Conda for Python 3.10 code environments

  • Added the ability to edit SQL recipes in SQL notebooks

  • Added APIs to rename datasets and recipes

  • Code Studios: Added an option in the API to rebuild dependent templates when updating a Code env

  • Added DSSCodeStudioObject.change_owner method to change the owner of a Code studio

  • Fixed an issue with R recipes sometimes reporting a successful run whereas code failed

Elastic AI

  • Fixed an issue preventing Jobs from starting when executed on Pods with init containers

  • Added ability to use Spark on Kubernetes with GCS connections where the private key is passed as a file path

Streaming

  • Improved latency of StreamingEndpointStream.get_message_iterator()

  • Fixed Scala streaming recipe from a DSS Dataset to a streaming endpoint

Plugins

  • More icons: Plugin developers can now choose their icons among all icons available in Font Awesome 5.15.4

Performance & Scalability

  • Reduced IO cost of starting a job

  • Reduced perceived latency of starting a job in many cases

  • Improved startup performance for containerized execution, when large data exists in project libraries

  • Improved startup performance for containerized execution, when scoring large models

  • Added the ability to control resource consumption (through cgroup) for plugin data-related components (custom datasets, custom formats, custom exporters, custom FS providers)

  • Improved performance when modifying permissions on DSS instances with a large number of both projects and users

  • Fixed possible hangs related to datasets using a custom FS provider from plugins

  • Added ability to “pre-heat” job execution processes to lower the job start latency

  • Added protection against too large API service package exports (20GB max)

  • Added protection against hangs caused by a very large number of AzureAD groups

Cloud Stacks

  • Allow provisioning DSS instances with a temporary self-signed certificate (while waiting for the actual certificate to be installed)

  • Fixed an issue when trying to create several setup actions of same type (only affects DSS 12.4)

  • Added ability to automatically set up a Visual Machine Learning code env

Miscellaneous

  • Added ability to use Python 3.10 for the builtin environment

  • Added new built-in macro to clear assets from expired trial users

  • Added ability to filter datasets by data steward in Data Catalog

  • Added ability to quickly search for scenario steps in the UI

  • Fixed an error happening when users that do not have access to any connection try to create a managed folder

  • Version control: When creating a new branch and duplicating the project, the newly created project now keeps the same name, description, permissions as the original project

  • Fixed a possible R setup issue

  • Fixed inconsistencies in accessing DSS login page, depending on the presence of a trailing slash

Version 12.4.2 - January 12th, 2024

DSS 12.4.2 is a security, performance and bugfix release.

LLM Mesh

  • Added support for Google Vertex Gemini Pro model

  • Added support for custom models in OpenAI connection

  • Added support for embedding and multiple models in plugin LLMs

  • Added ability to enter a base URL instead of resource name in Azure OpenAI connections

  • HuggingFace: Fixed Mistral model when DSS-managed model cache is in use

  • HuggingFace: DSS-managed model cache now abides by proxy settings

  • RAG: Added ability to build Knowledge Banks from scenarios

  • RAG: Added display of estimated cost for Augmented LLMs

  • RAG: Added support for local Qdrant as the vector store for a Knowledge Bank

  • RAG: Improved Knowledge Bank display in flow zones, flow views and project timeline

  • RAG: Fixed support for ChromaDB on Cloud Stacks and RHEL 8 OS

  • API: Added cost support to DKULLM and DKUChatLLM

  • API: Added batching support to DSSLLM and DKULLM

  • API: Fixed KnowledgeBank usage from code recipes

Datasets and connections

  • BigQuery: Fixed issue with native BigQuery partition on ‘DATE’ column when require_partition_filter is activated

  • Databricks: Fixed connection issue when using Databricks with S3 Fastpath

  • ElasticSearch: Fixed error handling when testing connections

  • SQLServer: Fixed all-catalogs table listing when credentials do not allow access to some databases

  • Plugin datasets (such as Sharepoint): Fixed process and resource leaks in case of error during initialization

  • Fixed object authorization when sharing multiple objects from the side panel

Charts and Dashboards

  • Dashboards: Fixed the “Clear all filters” action leading to wrong results until page reload

  • Geo charts: Fixed log scale for colors

  • Fixed an issue with the isNumeric function for null values in User Defined Aggregation Functions definition

Visual recipes

  • Prepare: Added additional safety when converting DSS formula to SQL to prevent possible hangs

  • Prepare: Fixed “Create If… Then” rule on Snowflake when the expression involves a GeoPoint column with literals

  • Prepare: Fixed “is any of the strings” filter condition when run with Spark engine

  • Prepare: Fixed Python step when run on Spark with local config

  • Fixed issue on SQL Server when the database user does not have “SHOWPLAN” permission

Machine learning

  • Fixed “What If” analysis on Object detection and Image classification models in containers

  • Fixed Lasso path model training when using bayesian hyper parameter search

  • Fixed failure on evaluate recipe on MLFlow model when the target column has missing values

  • Fixed error when a user browses a visual ML task without “manage-all code envs” privilege

  • Fixed displayed models in sentence embedding preprocessing with “inherit” or “builtin code-env” settings

  • Fixed output of probabilities in scoring recipe

  • Fixed evaluation and scoring of MLFlow binary classification models when the prediction is a boolean

API Deployer

  • Fixed deployment on Kubernetes of endpoints with a code env with resources

  • Deploy Anywhere now supports deploying an API service on an endpoint that provisions GPU(s)

  • Improved deployment health check computation for VertexAI, Azure ML and SageMaker model

  • SageMaker: Fixed deletion of cloud resources when deleting SageMaker deployments

  • Databricks: Fixed OAuth token settings not being persisted on Databricks model deployment connections

Automation

  • Fixed missing Git tag creation when a bundle is created from a scenario or from the API

  • Fixed error when manually creating an analysis on automation (not recommended)

Coding

  • Added API to manipulate project level git capabilities

  • Fixed the user profile returned by the public API when the user is involved in a trial

Code Studios

  • Fixed download action from the “Resources” tab

Elastic AI

  • Fixed image building for conda-based code environments

Cloud Stacks

  • Fixed “Install system packages” setup action

  • Fixed safety limits that could hinder SSH login

Security

  • Prevented users without “freely use” privilege from writing custom SQL filters

Performance & Scalability

  • Improved dataset column statistics computation scalability

  • Fixed possible hang when reconfiguring auditing settings while the event server is unresponsive

  • Fixed performance degradation when adding a vast number of files at once in an existing upload dataset

Version 12.4.1 - December 21st, 2023

DSS 12.4.1 is a security and bugfix release. It contains a critical security fix. We strongly encourage all customers to upgrade to this version.

LLM Mesh

  • Fixed failure accessing private HuggingFace models when using DSS-managed model cache

  • Fixed support for Mistral when using DSS-managed model cache

Recipes

  • Prepare: Fixed failure of “if-then-else” processor running on SQL databases when setting date columns

Collaboration

  • Fixed broken styling of Wiki PDF export

Webapps

  • Fixed redirection to login page when accessing a public webapp URL

Code Studios

  • Fixed Code Studios with R Code Envs on automation node

Cloud Stacks

  • Azure: Added the ability to provision instances with data disks above 4 TB

Upgrade

  • Fixed possible upgrade failure to DSS 12.4

Security

Version 12.4.0 - December 6th, 2023

DSS 12.4.0 is a significant new release with new features, performance enhancements and bugfixes.

New feature: Deploy models to AWS Sagemaker, Azure Machine Learning and Google Vertex

Dataiku can now deploy API services designed in Dataiku to other platforms, besides Dataiku API nodes.

This capability is available for AWS Sagemaker, Azure Machine Learning and Google Vertex

New feature: connect to external models hosted in Databricks Serving

It is now possible to create External Models from Databricks Serving endpoints

New feature: import models from Databricks MLflow registry

It is now possible to import MLflow models directly from a Databricks-hosted MLflow registry

New feature: Dashboard Cross-filters

The Dashboard now features cross-filters. Cross-filters allow users to interactively explore and analyze data from different perspectives simultaneously. By applying filters across multiple visualizations or data sources, users can dynamically drill down into specific data subsets. This interactivity enables users to gain a comprehensive understanding of their data.

New feature: “Insert recipe”

You can now insert a new recipe after an existing dataset or between an existing dataset and recipe. This makes it easy, for example, to add a prepare recipe in an existing Flow.

New feature: Learning curves

On a trained prediction model, you can now compute Learning curves to see how metrics evolve on the train and test set when training the model with only a part of the data.

New feature: Statistical tests recipe

A new Statistical Test recipe lets you run many statistical tests in the Flow, and can be automated from a scenario. Create it from a test card, in a dataset’s Statistics tab.

Redesigned home page

The home page has been slightly redesigned and now features a panel with recommended content, aimed at improving self-onboarding on DSS.

LLM Mesh & Prompt Studio

  • New feature: Added support for embedding with locally-running models using HuggingFace

  • New feature: Added ability to import, export and delete the weights in the DSS model cache for local HuggingFace LLMs

  • New feature: Added Cohere and Llama2 models to Bedrock connections

  • New feature: Added Mistral and Zephyr models to Huggingface connections

  • Added an option for LLM-based text analysis recipes to show the underlying prompt

  • Added OAuth authentication for Azure OpenAI

  • New Public API to list LLMs and Knowledge Banks

  • Fixed copy of a part of a flow that would fail if it contained a Knowledge bank

  • Fixed Azure OpenAI connection test

Charts & Dashboards

  • New feature: A new chart has been added to the builtin charts: the Sankey diagram

  • New feature: Reference lines can now use dynamic aggregation values. For example, on a bar chart showing the average of price per department, you can also have a reference line showing the global average.

  • New feature: Scatter plot: Added support for multiple pairs of (X, Y) in scatter plots

  • New feature: Added a color dimension to the Boxplot and Mix chart

  • New feature: Added ability to automatically refresh tiles on a loaded dashboard when a scenario impacting them is executed

  • Added the ability to copy content from pivot tables

  • Improved reliability of DSS engine when working with datasets containing high-cardinality alphanumeric columns

  • Fixed the pinning of tooltips in Boxplot

  • Fixed grand total not showing in pivot table when either rows or columns are empty

  • Fixed an error happening in Scatter plots when changing the meaning of a column from numeric to alphanumeric

  • Fixed removal of filter tile not taken into account in “edit” mode

  • Fixed behavior difference between DSS and SQL engines when using numerical axis

  • Fixed behavior difference between DSS and SQL engines when filtering on dates

  • Fixed tooltip display on chart miniatures with no thumbnail

  • Fixed “out of memory” error that could happen when building charts on big datasets
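The dynamic reference line example above (average price per department, plus a line for the global average) can be sketched as follows. The sample rows and function names are made up for illustration, and the sketch assumes the global average is computed over all rows rather than over the per-department averages; the notes do not specify which one DSS uses.

```python
from collections import defaultdict
from statistics import mean

# Made-up (department, price) rows, for illustration only.
rows = [
    ("toys", 10.0),
    ("toys", 20.0),
    ("books", 30.0),
    ("books", 50.0),
    ("games", 40.0),
]


def average_by_department(rows):
    """One bar per department: the average price within that department."""
    grouped = defaultdict(list)
    for department, price in rows:
        grouped[department].append(price)
    return {department: mean(prices) for department, prices in grouped.items()}


def global_average(rows):
    """Reference line value, assumed here to be the average over all rows."""
    return mean(price for _, price in rows)


bars = average_by_department(rows)
reference_line = global_average(rows)
```

Note that the global average over all rows is generally not the same as the average of the bar heights, because departments can contain different numbers of rows.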

Datasets

  • Improved user experience when creating datasets from managed folder

  • Added default steward to datasets. In case no steward is explicitly defined, the user who created the dataset is now considered as the default steward

  • Databricks: Automatically refresh access tokens before their expiration to support long running jobs

  • Databricks: Added support for per-user personal access tokens on the Databricks Connect integration

  • Editable dataset: Added back the ability to enable/disable “Keep track of manual changes” setting

  • Editable dataset: Added action to allow a row to be used as column names

  • Editable dataset: Added ability to quickly filter rows using keywords

  • Editable dataset: Added ability to sort rows by column

  • BigQuery: Fixed jobs failure with SQL pipelines and BigQuery datasets with a variable in the schema field

  • Fixed sharing a dataset resetting its object authorizations

  • Faster and more robust read support for Delta format

  • Only drop metastore tables associated with managed datasets upon project deletion if also dropping the managed dataset data

Recipes

  • Prepare: Added ability to create If … then Rule based on cell values with right-click menu

  • Prepare: Added ability to increment date by hours, minutes or seconds in Date Increment steps

  • Prepare: Deprecated Anonymizer processor in favor of Pseudonymize processor

  • Prepare: Fixed an issue where a new column could not be renamed immediately after being created by a “split … on” step

  • Prepare: Improved distances precision in the Compute distance processor on DSS engine (we now use an ellipsoidal coordinates reference system instead of the previous spherical one)

  • Prepare: Fixed various casting issues with BigQuery and PostgreSQL input datasets

  • Prepare: Fixed SQL pipeline failing with Snowflake input datasets when some processors are present in the recipe

  • Prepare: Fixed parsing a date with multiple formats failing with Databricks input datasets and SQL engine

  • Prepare & Sample/Filter: Added “Does not contain” operator

  • Sample/Filter: Fixed possible time-shifts with dates with Snowflake input datasets with TIMESTAMP_NTZ columns

  • Sync: Added support for Parquet file format when syncing datasets between BigQuery and GCS

  • SparkSQL: Fixed checking consistency failing

Webapps

  • Fixed missing redirection to actual public webapps URL after a user logs in with SSO

  • Added automatic restart of Webapps when their security settings are modified

  • Added support for Bokeh 3.1.1

  • Fixed issue deploying Streamlit as webapp with Streamlit versions 1.23 and above.

Machine Learning

  • New feature: A Model override outcome can now be set to Decline to predict, allowing you to define explicit cases where the model will refuse to give a prediction.

  • New feature: Causal ML: a new Treatment Analysis option lets you use inverse probability weighting for causal metrics, mitigating against misleading metrics in the case of non-random or imbalanced treatments. New ML Diagnostics automatically detect such imbalanced cases.

  • Added support for Python 3.10 on Visual Machine Learning

  • Improved GPU settings on modeling tasks that support GPU acceleration

  • Feature Importance now displays reading tips for the current features

  • Custom scoring for model optimization can now reuse any of the model’s custom metrics used for performance evaluation

  • Sped up computation of feature importance for some preprocessing options

  • Added more public APIs to manage Causal ML models

  • Fixed permission error when using code to access a model trained by a different user

  • Fixed time series forecasting with integer external features used as categorical

  • Fixed scoring failure on the SQL engine of models that use feature selection

  • Fixed scoring using scikit-learn ≥ 0.24 of calibrated or SVM models trained with scikit-learn < 0.24

  • Fixed training failure on some cases of feature reduction and all-numeric features

  • Fixed export of Predicted data when the meaning does not correspond to the data (NaN values were incorrectly inserted)

  • Fixed a model documentation generation bug that left it stuck at the “resolving placeholders” stage

  • Fixed a race condition on ML recipes when a partitioned dataset is both being built and used in concurrent activities of a single job

  • Fixed a rare bug where a schema change would cause an evaluation recipe to fail but keep running

  • Added option to not fail model-less evaluation recipe when some metrics computations fail

  • Added custom metrics for time series forecasting models

  • Added model coefficients to time series forecasting model reports for algorithms that support it

  • Fixed slow time series forecasting training in containers when using ARIMA
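The inverse probability weighting mentioned in the Causal ML item above can be illustrated with a minimal sketch. It uses a normalized weighting estimator on made-up data with known propensities; this is a textbook-style illustration under those assumptions, not necessarily the computation DSS performs, and all names below are hypothetical.

```python
from statistics import mean


def naive_effect(records):
    """Difference of raw mean outcomes between treated and control rows."""
    treated = [outcome for is_treated, outcome, _ in records if is_treated]
    control = [outcome for is_treated, outcome, _ in records if not is_treated]
    return mean(treated) - mean(control)


def ipw_effect(records):
    """Normalized inverse-probability-weighted treatment effect.

    Each record is (is_treated, outcome, propensity), where propensity is
    the probability of being treated, assumed known here.
    """
    t_num = t_den = c_num = c_den = 0.0
    for is_treated, outcome, propensity in records:
        if is_treated:
            weight = 1.0 / propensity
            t_num += weight * outcome
            t_den += weight
        else:
            weight = 1.0 / (1.0 - propensity)
            c_num += weight * outcome
            c_den += weight
    return t_num / t_den - c_num / c_den


# Made-up data: the true effect is +2 in both groups, but group A (high
# outcomes) is mostly treated and group B (low outcomes) is mostly untreated.
records = (
    [(True, 12.0, 0.8)] * 4 + [(False, 10.0, 0.8)]    # group A
    + [(True, 4.0, 0.2)] + [(False, 2.0, 0.2)] * 4    # group B
)

naive = naive_effect(records)
weighted = ipw_effect(records)
```

On this data the naive difference is about 6.8, which overstates the effect because treatment assignment is imbalanced, while the weighted estimate recovers a value close to 2.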

Computer Vision

  • Take into account EXIF rotation information in images (when training and scoring)

Labeling

  • New feature: Labeling tasks can now take into account pre-existing labels, that annotators can edit, remove or validate as their answer. Such pre-labels also help to configure the labeling task automatically.

  • Improved support of right-to-left languages on Text labeling tasks

  • Added ability to quickly delete all annotations when annotating a given entry on Object detection or Text annotation tasks

  • Fixed unusable task when an input column is named state

  • Fixed hung task when source dataset creation fails

  • Fixed unusable task when source dataset is on a connection with per-user credentials

Visual Statistics

  • Added support for confidence interval in 1-sample and 2-sample t-tests as well as ANOVA

  • Added support for one-sided testing on 1-sample, 2-sample and pairwise t-tests

  • Automatically refresh statistics card published as Dashboard Insights

Govern

  • New feature: Blueprint migration (Advanced Govern only): In some cases, you might want to switch a Govern artifact from one template (blueprint version) to another. With the advanced option, it is now possible to create a template migration to preserve, remove, or add information for the migration.

  • New feature: Custom Pages Designer revamp (Advanced Govern only): Advanced Dataiku Govern provides the capability to create custom pages with the Custom Pages Designer. The UI of the Custom Pages Designer has been improved and some functionalities have been added:

    • Ability to show/hide pages (incl. standard pages)

    • Ability to specify the order of the pages in the navigation bar (incl. for standard pages)

    • Possibility to create a page defined by pure HTML code. This allows the embedding of external content such as dashboards or videos for example.

  • Notifications: Added more notification triggers on sign-off events (feedback or final approval submitted / edited, sign-off abandon, cancelation, reset, or scheduled reset)

  • Notifications: improved email content

  • Notifications: added the ability to subscribe/unsubscribe to notifications

  • Improved fields in the standard templates to help better address the Pipeline Management and Value Monitoring use case. Note: The “Business functions” list has been updated. It has been removed from the Govern project standard blueprint version. This list has been enhanced with new values and is now located at the Business Initiative level. As a consequence, existing filled-in values are not available anymore.

  • Added Kanban and Matrix views in the Business Initiatives page

  • Added a default 3-step workflow in the Business Initiatives standard template

  • Added the possibility to create a Governed Project directly from the projects page (not linked to a Dataiku Project)

  • Eased the selection of users by groups in user selectors within artifacts

  • Added the ability to edit values from a row directly from a table view

  • Added a new “Govern manager” permission with all administrator permissions except Users, Groups, LDAP, SSO, and everything in administration settings

  • In Roles & Permissions settings, added the ability to inherit role assignment roles from multiple sources

  • Added more evaluation metrics in the Model Registry

  • Fixed table width inconsistencies among screens

  • Added a general setting to disable the signoff delegation feature for non-admin users

  • Fixed an issue preventing field ids from being displayed in view components in the Blueprint Designer

  • Fixed an issue with PDF export of artifacts in case of line breaks in the fields

  • Fixed migration issue with PostgreSQL server distribution that has a modified maximum length for identifiers (NAMEDATALEN)

  • Fixed the “all day” option removing the entry from list of dates when unchecked

  • Fixed the search in category fields where selecting a category from the search results was unselecting other categories not displayed in the search results

  • Fixed date column overlapping when included in a nested table view

MLOps

  • Added support for “common” SageMaker inference data formats for External Models

  • Added a check that the classes defined for External Models or MLFlow Models match the specified evaluation dataset.

  • Added a public API method for importing Lab Models into Experiment Tracking

  • Fixed some cases where the Evaluation Recipe was failing in drift computation even though “Consider drift computation failures as errors” option was disabled.

  • Fixed monitoring wizard to require only read access to the infrastructure (and not admin as before)

  • Fixed the evaluation of MLflow models imported through mlflow_extension.deploy_run_model()

  • Added support for column handling in the Evaluation Recipe for External Models

Deployer

  • Improved performance for bundle preparation

  • Fixed the creation & update of a plugin’s Python code environment via API on an automation node

  • Added support for custom code environment and removed the need for “unsafe code” permission in API test queries

  • Fixed an issue with remapped and bundled connections in API deployments when intermediate build of a code environment image is activated

  • Added automatic rebuild of code environment image if base image is modified

  • Improved the handling of “internal” code environments when deploying to automation (i.e. computer vision and External Models code environments)

  • Fixed remapping of code environment for Continuous Python recipes

Code Studios

  • Added ability for users to upload files in templates

  • Fixed build of conda code envs for Code Studios on automation node

API

  • Fixed DSSDataCollectionItem.get_as_dataset Python function

  • Added new helper function DSSManagedFolder.create_dataset_from_files to create a new dataset of type ‘FilesInFolder’

  • Added new helper functions DSSProject.create_gcs_dataset and DSSProject.create_azure_blob_dataset to create GCS and Azure Blob storage datasets

  • Added support for catalog in DSSProject.create_sql_table_dataset

Performance & Scalability

  • Improved performance for tag listing on DSS instances with many projects

  • Improved performance for computing code environment usage

  • Improved performance for notifying users when changing access of many project/users

Cloud Stacks

  • Fixed time zone issues with user’s last activity time on licence management page

  • Fixed issue caused by Java upgrading itself while DSS is running

  • Fixed issue preventing users from downloading files larger than 1GB

  • Added ability to reorder setup actions in Fleet Manager

  • AWS: Added ability to use a static private IP for Fleet Manager

  • Azure: Fixed issue where reprovisioning from a snapshot changed the type of the data disk

Elastic AI

  • GCP: Fixed ability to attach GKE clusters instantiated in another project than DSS

  • Fixed Spark failing to delete pods due to insufficient permissions

  • Fixed horizontal pod autoscaling with K8S versions 1.26 and above

  • Fixed possible failure of DSS jobs when scheduled on a K8S node that is still starting

Version control

  • Modified storage of DSS objects to remove the versionTag to make them more git-friendly (less “useless changes” generated)

  • Changed the way Git security rules are matched. DSS now considers the first rule for which both the group and repository match instead of the first rule matching the repository. This allows for more fine-grained control of which SSH key is used, for example.
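The change in Git security rule matching described above can be sketched as follows. The rule structure, field names, prefix-based repository matching and function names are all hypothetical, chosen only to illustrate the difference between the two lookups; they are not the actual DSS configuration format or implementation.

```python
# Hypothetical rule list: NOT the real DSS configuration format.
RULES = [
    {"group": "data-engineers", "repo_prefix": "git@github.com:acme/", "ssh_key": "key_engineers"},
    {"group": "analysts", "repo_prefix": "git@github.com:acme/", "ssh_key": "key_analysts"},
]


def repo_matches(rule, repo_url):
    # Assumed matching strategy: simple URL prefix comparison.
    return repo_url.startswith(rule["repo_prefix"])


def group_matches(rule, user_groups):
    return rule["group"] in user_groups


def first_rule_by_repo(rules, repo_url):
    """Previous lookup: first rule whose repository matches, whatever its group."""
    for rule in rules:
        if repo_matches(rule, repo_url):
            return rule
    return None


def first_rule_by_group_and_repo(rules, repo_url, user_groups):
    """New lookup: first rule for which both the group and the repository match."""
    for rule in rules:
        if repo_matches(rule, repo_url) and group_matches(rule, user_groups):
            return rule
    return None


repo = "git@github.com:acme/project.git"
old_choice = first_rule_by_repo(RULES, repo)
new_choice = first_rule_by_group_and_repo(RULES, repo, {"analysts"})
```

With these made-up rules, a user in the analysts group ends up on the second rule (and its SSH key) under the new lookup, whereas the repository-only lookup stops at the first rule.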

Plugins

  • Fixed an issue where renaming a file or folder in the plugin editor did not refresh the view

  • Fixed plugin recipes failing when containerized execution is selected

  • Allow DSS admins to overwrite the OAuth token endpoint in the presets

Miscellaneous

  • Fixed export of Pandas dataframes from Jupyter notebooks when package requests is at or newer than version 2.29

  • Fixed Flow zoom level being sometimes incorrectly reset when navigating away from a project

  • Fixed permission issue preventing users with only “Read dashboards” permission on a project from accessing a shared dataset included in a dashboard

  • Fixed permission issue preventing users from viewing whole preview of shared datasets if they did not have the “Read project content” permission on the source project

  • Added a setting for administrators to prevent users from changing their emails

  • Help center is now displayed as a draggable panel on small screens

  • Help center: Added setting for administrators to prevent users from opening Dataiku support tickets

  • Made ability to delete code env dependent on permission to manage it

  • Fixed Event server losing data when the flush interval is higher than 50 minutes

  • Fixed issues with PostgreSQL runtime databases

  • Fixed an issue where an App-as-recipe created without a name/label prevented users from creating new recipes

  • Fixed authentication issue when connecting to strict OAuth2 servers

  • Fixed authentication issue preventing some users from logging in after ‘enable group support’ has been unchecked

  • Code envs: Fixed issue with “Update all packages” option

Version 12.3.2 - November 15th, 2023

DSS 12.3.2 is a security, performance and bugfix release

LLM Mesh & Prompt Studio

  • New feature: Added a Convert to prompt recipe button in the classification and summarization recipes

  • OpenAI: Added GPT4-turbo model

  • Azure OpenAI: Added & Fixed embedding support

  • AWS Bedrock: Adapted to new AWS API

  • AWS Bedrock: Added support for topP and topK

  • RAG: Improved error handling in augmented LLMs

  • Prompt studio: Added ability to select a recipe to update from the Save as recipe option

  • Prompt studio: improved UX when switching between Managed & Advanced modes

  • API: Fixed dataiku.KnowledgeBank python API

  • PII: Added possibility to ignore instead of failing if unsupported language is detected

  • PII: Fixed PII detection on embedding queries

Datasets

  • New feature: Added an experimental dialect to connect to Sqream (through the “other databases” connection)

  • Editable dataset: Fixed editing after the dataset has been cleared

  • Editable dataset: Added an option to handle large copy/paste of data

  • Excel export: Added support for exporting more than 1 million records

  • Improved error handling when trying to analyze a previously removed or renamed column

Recipes

  • Prepare: Improved mass renaming of columns based on column name pattern (regex)

  • Prepare: Fixed migration of ‘Flag rows on values’ processor

  • Prepare: Fixed creation of “filter on” step by selecting a substring in cell

  • Prepare: Do not select Athena engine by default

  • Join: Fixed icons display when using unmatched row option

  • Pivot: Improved error message when summing non numerical columns

Notebooks

  • Fixed importing from Git when the notebook’s name contains a dot

Dashboards

  • Fixed scrolling in dataset insight

Govern

  • Fixed installation and update of Govern node when the DB schema is not “public”

Spark

  • Prevented auto-selection of Spark engine when one of the outputs is in append mode

  • Fixed SparkSQL / Scala and SparkSQL notebook on CDH 6 (deprecated)

  • Fixed MLLIB training on old hadoop distributions: HDP 3.1.5 (deprecated), CDH6 (deprecated)

Machine Learning

  • Fixed display of median and standard deviation in cluster profiles if the value is zero

  • Fixed MLFlow error with ignoring TLS certificate validity check when DSS is configured for HTTPS

  • External models: Added support of non-JSON probat format for SageMaker CSV

Cloud Stacks

  • AWS: Fixed installation using the fleet-manager-network template

SSO and LDAP

  • Fixed on demand provisioning on Azure AD

Security

  • Fixed Directory Traversal in cluster logs retrieval endpoint

  • Added new HTTP security header to all requests: Permissions-Policy: fullscreen.

  • Added ability to specify additional HTTP security headers: Referrer-Policy, Permissions-Policy, Cross-Origin-Embedder-Policy, Cross-Origin-Opener-Policy and Cross-Origin-Resource-Policy.

Version 12.3.1 - October 30th, 2023

DSS 12.3.1 is a bugfix release

LLM Mesh

  • New feature: Added support for MosaicML

  • Fixed support for GPT 3.5 Instruct

  • Fixed embedding recipe with Azure Open AI models

  • Fixed embedding recipe when containerized execution is enabled by default

  • Fixed “embedding settings” display in Knowledge Banks

  • Fixed NLP classification recipe when output mode is “All classes”

Coding

  • Fixed installation of the “dataiku” Python package outside of DSS

Machine Learning

  • Fixed usage of TF-IDF Text preprocessing in Visual ML when stop words are enabled

API Designer

  • Fixed set_remote_dss dataiku api function when used in API designer

Bundle and Automation

  • Fixed revert of bundles on the design node

Charts

  • Fixed usage of Snowflake engine when the database is set at the session level

Jobs

  • Fixed the “Re-run this job” action from the job page

Webapps

  • Fixed login redirection for public webapps created from Code Studios

Cloud Stacks

  • Fixed loss of LDAP and Azure AD settings when Fleet Manager is restarted

Version 12.3.0 - October 23rd, 2023

DSS 12.3.0 is a significant new release with new features, performance enhancements and bugfixes.

The LLM Mesh

With the recent advances in Generative AI and particularly large language models, new kinds of applications are ready to be built, leveraging their power to structure natural language, generate new content, and provide powerful question answering capabilities.

However, there is a lack of oversight, governance, and centralization, which hinders deployment of LLM-based applications.

The LLM Mesh is the common backbone for Enterprise Generative AI Applications.

It provides:

  • Connectivity to a large number of Large Language Models, either as APIs or locally hosted

  • Full permissioning of access to these LLMs, through new kinds of connections

  • Full support for locally-hosted HuggingFace models running on GPU

  • Audit tracing

  • Cost monitoring

  • Personally Identifiable Information (PII) detection and redaction

  • Toxicity detection

  • Caching

  • Native support for the Retrieval Augmented Generation pattern, using connections to Vector Stores and the Embedding recipe.

The LLM Mesh is available in Public Preview in DSS 12.3.0.

For more details, please see Generative AI and LLM Mesh.

Prompt Studios and LLM-powered recipes

On top of the LLM Mesh, Dataiku now includes a full-featured development environment for Prompt Engineering, the Prompt Studio. In the Prompt Studio, you can test and iterate on your prompts, compare prompts, compare various LLMs (either APIs or locally hosted), and, when satisfied, deploy your prompts as Prompt Recipes for large-scale batch generation.

In addition, Dataiku now includes two new recipes that make it very easy to perform two common LLM-powered tasks:

  • Classifying text (either using classes that have been trained into the model, or classes that are provided by the user)

  • Summarizing text

Prompt Studio and LLM-powered recipes are available in Public Preview in DSS 12.3.0.

For more details, please see Generative AI and LLM Mesh.

Datasets

  • Databricks: Added support for global (non-per-user) OAuth login

  • Snowflake: Added support for global (non-per-user) OAuth login

  • Snowflake: Added support for variables in the Scope field for OAuth mode

  • JSON: Fixed Spark engine not properly unnesting JSON fields

Machine Learning

  • Added support for What-if on partitioned models without the need to go in an individual partition

  • Added support of custom model views when the view backend runs in containerized execution

  • Added ability to use a Visual model’s predictor Python API from code running in containerized execution

  • Fixed computation of feature importance when there are fewer than 15 rows in the test set

  • Fixed failing training of deep neural network visual model when the only feature is text using sentence embedding

  • Fixed DSSTrainedPredictionModelDetails.compute_shapley_feature_importance Python API that was broken for saved models

Dashboards

  • Fixed downloading of filtered datasets within dashboards not applying the filters

  • Fixed inability to copy chart from insight view to any other chart

  • Fixed error display when a chart hits the limit of displayed points in a dashboard

Flow

  • Fixed “Generate Flow Documentation” failing on servers with non-English locales

Recipes

  • Shell: Fixed renaming not taking into account dataset references in pipes

  • Prepare: Fixed “Filter and flag on formula” step causing SQL engine to fail on some databases such as Redshift.

  • Prepare: Fixed “Rename” step causing SQL engine to fail in some situations such as renaming a column twice, or renaming a column with an empty string.

Deployer

  • Fixed issue with custom base image tag in API Deployer Kubernetes images (custom base images remain discouraged)

  • Added more details in the right panel of API services

Governance

  • Fixed Kanban views not bucketing projects correctly

MLOps

  • Fixed incorrect trainDate in the return of the list_versions() API method for MLflow models

IAM

  • Fixed fetching LDAP users with “Import from external source” not returning usernames if Display name attribute is different from Username attribute

  • Fixed LDAP bind password being wrongfully required, whereas it’s optional

Cloud Stacks

  • Added setup action to add a custom CA into the trust stores of DSS

  • Added ability to reload security settings without having to restart Fleet Manager

Code envs

  • Added support for per-code-env Dockerfile additions

  • Added support for per-code-env CUDA support, removing future need for CUDA-specific container images

Misc

  • Fixed catalog or global search failing when query contained special characters such as @ or ~.

  • Compute Resource Usage: added CPU and memory request and limit to Kubernetes CRU events

Version 12.2.3 - October 10th, 2023

DSS 12.2.3 is a bugfix release

Charts

  • Fixed thumbnails display of Boxplot charts

Recipes

  • Fixed usage of isBlank() formula function in a recipe causing incorrect results when executed with SQL engine

Misc

  • Fixed error occurring when an event server target is configured with a “Path within connection”

  • Fixed exception being added to the logs each time an API node starts

Version 12.2.2 - September 25th, 2023

DSS 12.2.2 is a bugfix release

Machine Learning

  • Fixed the metrics comparison chart for time series forecasting models in the models list

  • Fixed a rare race condition causing training failures with distributed hyperparameter search

Datasets

  • S3: Reduced memory consumption when writing multiple files on S3 in parallel

  • BigQuery: Fixed memory leak

  • Editable dataset: Fixed pressing “enter” in the “edit column” modal not closing the modal

  • Editable dataset: Fixed redo mechanism when a new row had been added

  • Fixed renaming of partitioned datasets causing downstream recipes to fail at runtime

  • Fixed inability to import Excel files containing Boolean cells computed with formulas

Recipes

  • Join: Fixed occasional job failures with DSS engine

  • Join: Fixed wrongly detected duplicate column name when 2 columns only differ by their case

  • Prepare: Fixed “Extract Date components” with SQL engine

  • Prepare: Fixed display issue when rearranging steps order

  • Sync: Fixed schema and catalog not taken into account when executing a Sync recipe from a Databricks dataset to an Azure Blob storage dataset.

  • Shell: Fixed quotes incorrectly added around variables

  • Fixed expansion of variables in partitioning when running a recipe from its edition screen

Deployer

  • API Designer: Fixed inability to run test queries with Python endpoints

  • Improved error message about deployer hooks code

  • Fixed an issue with the selection of core packages for Python 3.8 code environments on deployer and automation nodes

  • Added a “Validate” button in the Deployer Hooks’ code edition screen

Experiment Tracking

  • Added ability to ignore invalid SSL certificates in experiment tracking

  • Fixed several issues with starting runs (when no end time is specified, or when a name is specified but no tags)

Governance

  • Fixed workflow step not being displayed at creation time when there is one mandatory field defined (Advanced Govern only)

  • Fixed the filling of the signoff history on step deletion

Misc

  • ElasticSearch: Fixed invalid projectKey passed in custom headers

  • Charts: Fixed empty legend section displayed in the left pane for charts in Insight view mode

  • Charts: Fixed timeout when exporting a dashboard containing donut charts that need scrolling to be visible

  • Fixed “Assumed time zone” not displaying the correct default value on existing connections

  • Webapps: Fixed ability to use dkuSourceLibR in Shiny webapps

  • Fixed required permissions to import and export projects using the public API (aligning to UI behavior)

Version 12.2.1 - September 12th, 2023

DSS 12.2.1 is a bugfix release

Machine Learning

  • Fixed UI issue disabling the creation of AutoML Clustering models

Cloud Stacks

  • Fixed the reprovisioning of DSS instances from Fleet Manager following a change in PostgreSQL repositories

Misc

  • Fixed a memory leak when enumerating Azure Storage containers with a very large number of files

Version 12.2.0 - September 1st, 2023

DSS 12.2.0 is a significant new release with new features, performance enhancements and bugfixes.

New features and enhancements

Custom aggregations on charts

UDAF (User Defined Aggregation Functions) allow users to create custom aggregations based on a powerful formula language directly from the chart builder.

For example, you can directly create an aggregation of sum(sell_price - cost) to compute an aggregated gross margin, without having to first create that column.
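To make the gross margin example concrete, here is a minimal sketch in plain Python of what an aggregation such as sum(sell_price - cost) computes per group. The sample rows and the `category` grouping column are invented for illustration, and this is not the chart formula language itself.

```python
from collections import defaultdict

# Invented sample data: one row per sale
rows = [
    {"category": "books", "sell_price": 12.0, "cost": 7.5},
    {"category": "books", "sell_price": 20.0, "cost": 14.0},
    {"category": "games", "sell_price": 60.0, "cost": 41.0},
]


def gross_margin_by(rows, key):
    """Sum (sell_price - cost) per value of `key`, without storing a margin column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["sell_price"] - row["cost"]
    return dict(totals)


margins = gross_margin_by(rows, "category")
print(margins)
```

The difference is computed row by row inside the aggregation, so no intermediate margin column needs to exist in the dataset.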

Radar chart

The Radar chart is now available. Radar Charts are a way of comparing multiple quantitative variables. This makes them useful for seeing which variables have similar values or if there are any outliers amongst each variable.

Radar Charts are also useful for seeing which variables are scoring high or low within a dataset, making them suited for displaying performance.

Govern Sign-off enhancements

The sign-off feature has been improved. It is now possible to:

  • Reset a finished sign-off

  • Reload an updated configuration from the Blueprint Designer

  • Create a sign-off on an active step if its configuration has been created afterwards

  • Setup recurrence to automatically reset an approved sign-off

  • Have multiple feedback reviews per user

  • Edit and delete feedback reviews and approvals

  • Change the sign-off status to go back to a previous state

  • Send an email to the reviewers when the final approval is added and deleted

  • Additionally, a new validation option has been added in the sign-off configuration to prevent the workflow from going past an unapproved sign-off step.

It also comes with UI improvements such as:

  • Expand and collapse long feedback reviews

  • Display the sign-off description below the title

  • Show the feedback and approver groups with detailed information on which users are configured

Warning: Some changes have been made to the API around the sign-off feature. Please review your usages of the Public API and, for Advanced Govern instances, the logical hooks around the sign-off feature. On Advanced Govern instances only, logical hooks that check the sign-off status (to prevent the workflow from going past an unapproved sign-off step) will no longer work in 12.2.0 due to the API changes. They can be replaced by the new validation option in the sign-off configuration. After enabling it, you will need to reset the corresponding sign-offs and reload their configuration.

PCA recipe

A new PCA recipe was added. The PCA recipe produces projections, eigenvectors and eigenvalues as three separate datasets.

You can create the PCA recipe from a PCA card in a dataset’s Statistics worksheets.
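As a rough illustration of the three outputs, the sketch below computes projections, eigenvectors and eigenvalues with NumPy from the covariance matrix of centered data. Whether the recipe centers, rescales or uses a different decomposition is not stated here, so those choices are assumptions of this example, as is the sample data.

```python
import numpy as np


def pca(X, n_components):
    """Return (projections, eigenvectors, eigenvalues) of the centered data."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)              # assumption: center, do not rescale
    cov = np.cov(centered, rowvar=False)       # feature-by-feature covariance
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = eigenvalues[order]           # largest variance first
    eigenvectors = eigenvectors[:, order]      # one column per component
    projections = centered @ eigenvectors      # coordinates in component space
    return projections, eigenvectors, eigenvalues


# Invented sample: 5 rows, 2 numerical features
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]]
projections, eigenvectors, eigenvalues = pca(X, n_components=2)
```

Each eigenvalue equals the variance of the corresponding projected column, which is why the three results are naturally kept as separate tables.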

External Models

External Models allow a user to surface within Dataiku a model available as an endpoint on SageMaker, Azure ML or Vertex AI. Those models can be used like other Saved Models and, most notably, can be scored and evaluated.

This feature is currently Experimental.

For more details, please see External Models

Deployer Hooks

Deployer hooks allow administrators of a Project or API Deployment Infrastructure to define pre- and post-deployment hooks written in Python. For instance, a pre-deployment hook could perform a check and prevent the deployment if the check fails; a post-deployment hook could send a notification.
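The pre/post pattern can be sketched in plain Python as follows. Every name here (`HookOutcome`, `pre_deployment_check`, `post_deployment_notify`, `deploy`, the `release_note` key) is hypothetical and chosen for illustration; the actual hook signature and result type expected by DSS are not shown in these notes.

```python
from dataclasses import dataclass


@dataclass
class HookOutcome:
    # Hypothetical result type, NOT the DSS hook API
    allowed: bool
    message: str


def pre_deployment_check(bundle_metadata: dict) -> HookOutcome:
    """Hypothetical pre-deployment hook: refuse bundles without a release note."""
    note = (bundle_metadata.get("release_note") or "").strip()
    if not note:
        return HookOutcome(False, "Deployment refused: missing release note")
    return HookOutcome(True, "Pre-deployment check passed")


def post_deployment_notify(deployment_id: str, send=print) -> None:
    """Hypothetical post-deployment hook: send a notification."""
    send(f"Deployment {deployment_id} completed")


def deploy(deployment_id: str, bundle_metadata: dict, send=print) -> HookOutcome:
    """Run the pre hook, stop if it refuses, otherwise deploy and run the post hook."""
    outcome = pre_deployment_check(bundle_metadata)
    if not outcome.allowed:
        return outcome
    # ... the actual deployment would happen here ...
    post_deployment_notify(deployment_id, send)
    return outcome


sent = []
refused = deploy("d1", {"release_note": ""}, sent.append)
accepted = deploy("d2", {"release_note": "Adds churn model v3"}, sent.append)
```

In this sketch, a refused pre-deployment check short-circuits the flow, so the notification is only sent for deployments that actually ran.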

Other enhancements and fixes

Flow

  • The “Records count” view now displays the exact records count under each dataset in the Flow

  • Added ability to export flow documentation when having read-only access to the project

  • Added ability to choose the name of the new zone when you copy a Flow zone

  • Added ability to copy a zone directly from the right panel

  • Fixed copying the default zone into a new zone duplicating flow objects into the original zone instead of the new zone

  • Fixed copying a zone not duplicating datasets without inputs

  • Fixed copying a zone to another project creating 2 zones into the destination project

  • Fixed “Recipe Engines” view not listing some engines such as “Snowflake to S3”.

  • Fixed creation of new datasets when creating a new recipe from “+ Recipe” button with no input selected

Datasets

  • BigQuery: Added ability to specify labels that will be applied to BigQuery jobs

  • Editable: Automatically add additional rows and columns when pasting data larger than the current table

  • Excel files: When selecting sheets by a pattern, matching sheets are now displayed

  • CSV: Fixed possible issue reading some CSV files

  • Snowflake: Fixed fast-path from cloud storage with date-partitioned datasets but non-date partitioning column

  • Snowflake: Fixed “Parse SQL dates as DSS date” setting not taken into account for Snowflake

  • Snowflake: Fixed issue with sync from non-SQL datasets with Spark engine

  • Prevented renaming datasets with the same name as a streaming endpoint

  • Fixed renaming datasets when only changing the case (from “DS1” to “ds1” for example)

Recipes

  • Generate features: Fixed failure when input dataset contains column names longer than what the output database can accept (the limit is 59 characters on PostgreSQL for example).

  • Split: Fixed adding a second input before selecting the output during creation

Data Catalog

  • Added ability to add multiple datasets to a Data Collection (either from the Flow or from a Data Collection)

Machine Learning

  • New feature: Causal Prediction now supports multiple treatments

  • New feature: Model comparisons now allow comparing feature importance between models

  • Fixed a failure to compute the feature importance of a model causing the whole training to fail

  • Fixed failure to compute partial dependencies on features with a single value

  • Fixed missing option to use a Custom model in clustering model design settings

  • Fixed scoring of a model with Overrides using the Spark engine

  • Fixed missing dashboard model insight/tile option to show the Hyperparameter optimization report

  • Fixed incorrect aggregate computation of cost-matrix gain when using kfold cross-validation

  • Fixed possible hang of DSS when computing interactive scoring (What-if)

  • Fixed automatic selection of the code environment that could sometimes suggest an incompatible environment when creating a new modeling task

MLOps

  • When exporting a model to the MLflow format, add its required packages to the requirements.txt

  • In evaluation recipes, added ability to skip rescoring and use the prediction if provided in the evaluation dataset.

  • When computing univariate drift, better deal with missing categories by showing a very high PSI rather than having an infinite/missing value

  • With the public API, added ability to create custom model evaluation with arbitrary metrics.

  • Scoring recipes can now compute explanations for MLflow models

  • A model can now be deployed with the GUI from an Experiment Tracking run without being evaluated

  • Non classification/regression models can now be deployed with the GUI from an Experiment Tracking run

  • Monitoring wizard: only suggest the deployments that are relevant for the current project
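The univariate drift item above mentions showing a very high PSI instead of an infinite or missing value when a category is absent. A common way to obtain that behavior is to floor empty proportions at a small value before taking the logarithm. The sketch below illustrates this idea; the floor value and the function name are assumptions, and DSS may handle missing categories differently.

```python
import math
from collections import Counter


def psi(reference, current, floor=1e-4):
    """Population Stability Index between two categorical samples.

    Proportions are floored (assumed value: 1e-4) so that a category missing
    from one sample yields a large finite contribution instead of infinity.
    """
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        p_ref = max(ref_counts[category] / len(reference), floor)
        p_cur = max(cur_counts[category] / len(current), floor)
        total += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return total


reference = ["a", "b"] * 50
same = psi(reference, ["a", "b"] * 50)
missing_category = psi(reference, ["a"] * 100)  # "b" disappeared entirely
```

Without the floor, the second call would involve log(0) and the drift score could not be displayed or compared.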

Statistics

  • Added support for the FDR Benjamini-Hochberg method for p-values adjustment on the pairwise t-test and pairwise Mood test
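For readers unfamiliar with the Benjamini-Hochberg adjustment, here is a minimal pure-Python sketch of the standard procedure: each p-value is multiplied by m / rank, then monotonicity is enforced from the largest p-value down. It illustrates the method only and is not the DSS implementation; the sample p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running minimum
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

A comparison is then declared significant when its adjusted p-value is below the chosen false discovery rate.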

Charts

  • New feature: Added ability to copy charts from one dataset to another

  • New feature: Added ability to customize tick marks

  • Scatter: Added ability to configure number of displayed records

  • Scatter: Various zoom and pan improvements

  • Scatter: Zoom and pan can now be persisted

  • Scatter: Fixed issues when there are too many colors

  • Bar charts: Improved color contrast for displayed values

  • Pivot table: Added ability to customize font size and color

  • Pie/Donut: Added option to position “others” group at the end

  • Treemap: Fixed tooltip color indicator

  • Added reset buttons for axis customization options

  • Improved zoom buttons on relevant charts (Treemap/Geometry/Grid/Scatter/Administrative filled/Administrative bubbles/Density Maps)

  • Added digit grouping formatting options

  • Fixed measure formatting update on tooltip

  • Fixed display formula for regression line

  • Increased precision for pivot table and maps tooltips

  • Improved legends display performance with many items

  • Fixed number formatting for reference lines on vertical bar and scatter charts

Workspaces and dashboards

  • Improved view/edit navigation on dashboards

  • Improved behavior of date range filter on dashboard

  • Fixed deletion of dashboard filters

  • Fixed dashboard export on air-gapped DSS instances

  • Added ability for users to override the name of workspace objects

  • Improved display of empty workspaces

  • Persist sort during a session on datasets

Coding

  • New feature: Added the ability to edit Jupyter notebooks in Visual Studio Code or JupyterLab via Code Studios

  • Project libraries: Added History tab to track, compare and revert changes.

  • Code Studios: automatically recover in case of network issues

  • Added ability to use the dataikuscoring library in the Python processor of the prepare recipe

  • Fixed ability to run a Python or R recipe from a SQL query dataset

  • Upgraded the builtin version of Visual Studio Code in Code Studios to 4.13

  • Fixed issues with uploading Jupyter notebooks from Databricks or Jupyter notebooks that do not specify a kernel

  • Code Studios: Fixed issue with Unicode characters in project libraries

  • Code Studios: Fixed ability to use Jupyter support in Visual Studio Code

Labeling

  • Input records with invalid or empty identifier / path / text are now ignored

Collaboration & Onboarding

  • Home page: Fixed clicking on a project folder after scrolling opening the wrong folder

  • Project activity > Contributors: Fixed error occurring on projects with a very large number of contributions

  • Help center: Added tutorials with progress tracking in Help > Educational Content > Onboarding

  • Project Version control: Added ability to create a tag from a commit

  • Project Version control: Added ability to push & pull tags when using a remote git repository

  • Project Version control: Fixed errors happening during force commit not being displayed

API Deployer

  • Pre-build required code environments during image build when deploying on a Kubernetes cluster, to speed up actual deployment

  • Added ability to add a commit and a tag when a bundle is created

  • Added an option to trust all certificates for infrastructures of static API nodes

  • Added support for variables for the specification of “New service id” in the “Create API service version” scenario step

  • Fixed running test queries on multi-endpoint API services

Project Deployer & Bundles

  • Added ability to add a commit and a tag when an API Service package is created

  • Improved handling of bundles including a Saved Model with no active version: warn on pre-activation and activation, and raise a clearer exception when using the Saved Model

  • Static insights can now be included in bundles

Scenarios

  • Do not clear retry settings when disabling/enabling a scenario step

  • Added a new mail channel to send emails using Microsoft365 with OAuth.

Govern

  • Added new artifact admin permission that grants all permissions for a specific artifact

  • Added the ability to export an item and its content (workflow state, field values) to CSV or PDF files

  • Governed Project’s Kanban view now also includes projects using custom templates

  • Added the ability to add a project directly from a business initiative page

  • Fixed display issue with very long Blueprint name in the Blueprint Designer

  • Fixed standard deviation display issue on Model version metrics

  • Fixed display issue for field of type number and value 0

  • Improved the performance of some queries

Cloud Stacks

  • Fixed display issue of the “Please wait, your Dataiku DSS instance is getting ready” screen

  • Fixed missing display of some errors in Fleet Manager

  • Added warning when trying to set a too small data volume

  • Moved some temporary folders to the data volume to avoid filling the OS volume

  • Fixed default value for IOPS on EBS

  • Fixed issues making the Save button unavailable

Elastic AI

  • Fixed ability to create a SparkSQL recipe based on a SQL query dataset (it however remains a very bad idea)

  • Simplified interaction with Kubernetes for containerized execution: Kubernetes Jobs are not used anymore. DSS now creates pods directly

  • Added display of DSS user / project / … to Cluster Monitoring screens

  • GKE: Improved error message when gcloud does not have authentication credentials

  • GKE: Improved handling of pod and service IP ranges

  • GKE: Added support for spot VMs

  • Added support for using a proxy for building the API deployer base image with R enabled

Streaming

  • Fixed default code sample for Spark Scala Streaming recipe

  • Fixed default code sample for Python streaming recipe

  • Added ability to perform regular reads of datasets in a Spark Scala Streaming recipe

  • Fixed read of array subfields in Kafka+Avro

  • Fixed issue with using “recursive build downstream” in flow branches containing streaming recipes

Performance and scalability

  • Improved performance for listing jobs

  • Improved IO performance for starting up jobs

  • Improved memory usage

  • Fixed possible hang when creating an editable dataset from a large existing dataset

Security

  • Fixed credentials appearing in the logs when using Cloud-to-database fast paths

  • OpenID login: added ability to configure the “prompt” parameter of OpenID

  • User provisioning: clarified how group profile mappings are applied

  • Azure AD integration: Fixed support for users having more than 20 groups

  • OAuth2 authentication on API node: added configurable timeout for fetching the JWKs

  • Jupyter notebooks trust system is now on a per-user basis

Misc

  • Added settable random seed for pseudo random sampling methods, allowing for reproducible sampling.

  • Fixed display issue with “Use global proxy” setting in connection getting wrongfully reset

  • Analyses: Fixed adding or removing tags from the right panel

  • Improved display of code env usage in code env settings

  • Fixed cases where building a code env could silently fail

  • Fixed possible failure aborting a job

  • Fixed issue with displaying large RMarkdown reports

  • Fixed possible error in Jupyter

  • Fixed possible UNIX user race condition when starting a large number of webapps at once

  • dataiku.api_client() is now available from within exporter and fs-provider plugin components
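The settable random seed mentioned at the top of this list is what makes pseudo-random sampling reproducible. The sketch below shows the general principle with Python's standard library; the `sample_rows` helper and its parameters are hypothetical and unrelated to the DSS sampling engine.

```python
import random


def sample_rows(rows, ratio, seed=None):
    """Keep each row with probability `ratio`, using a dedicated seeded generator."""
    rng = random.Random(seed)  # same seed gives the same sequence of draws
    return [row for row in rows if rng.random() < ratio]


first = sample_rows(range(1000), 0.1, seed=42)
second = sample_rows(range(1000), 0.1, seed=42)
```

With a fixed seed, both calls select exactly the same rows, so downstream results can be compared across runs.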

Version 12.1.3 - August 17th, 2023

DSS 12.1.3 is a security, performance and bugfix release

Machine Learning

  • Fixed UI issue in model assertions

  • Fixed partial dependencies failure with sample weights

  • Fixed computation of partial dependencies when rows are dropped by processing

MLOps

  • Fixed possible failure to display model results for imported MLflow models built from recent scikit-learn versions

  • Fixed display of model results for imported MLflow models for which performance was not evaluated

  • Fixed display of API endpoint URL in API deployer

  • Fixed ability to deploy MLflow models that are not tabular classification nor regression

  • Fixed Python requirements for exported MLflow models

Govern

  • Fixed validation error when custom templates have been deployed and standard ones have been archived

Dashboards

  • Fixed filter on “no value” when downloading dataset data from dashboards

Cloud Stacks

  • Fixed issue with authentication when upgrading Fleet Manager directly from 10 to 12.1

Performance

  • Improved performance for reading records with dates from Snowflake

  • Fixed potential slow query and failure on the “Automation monitoring” page

  • Fixed flooding of logs with bad data in Excel export

Security

Misc

  • Added the ability to embed Dataiku in another website through setting “SameSite=None” for cookies

  • Fixed Databricks sync to Azure/S3 with pass-through credentials when Unity Catalog is disabled

  • Fixed issues with display of list of scenarios in some upgrade situations

  • Fixed minor display issue in Wiki taxonomy tree

  • Fixed display of Flow in jobs page with big flows

Version 12.1.2 - July 31st, 2023

DSS 12.1.2 is a security, performance and bugfix release

Datasets

  • Explore: Fixed filtering of Decimal columns with “text facet” filtering mode

  • Editable dataset: increased display density

  • Editable dataset: fixed bad interaction with the Tab key

  • Editable dataset: improved column edition and autosizing experience

  • Editable dataset: fixed bad interaction with keyboard shortcuts while editing a column

  • Snowflake: Strongly improved performance of verifying table existence and importing tables

  • Presto/Trino: Strongly improved performance of verifying table existence and importing tables

  • Databricks: Fixed wrongful cleanup of temporary tables for auto-fast-write

Recipes

  • Prepare: Fixed a case where the formula parser would wrongfully ignore invalid formula and only execute parts of the formula

  • Prepare: removed a wrongful warning regarding dates with SQL engine

  • Prepare: fixed wrongful data loss when using “if then else” to write into an existing column with SQL engine

  • Prepare: fixed number of steps appearing in the description in the right panel of the recipe

  • Window: Fixed pre-computed columns when “always retrieve all” is selected and Spark engine is used

  • Window: Fixed display when “always retrieve all” is selected

Machine Learning

  • Removed ability to export train set if datasets export is disabled

  • Fixed wrongful binary classification threshold in evaluation recipe

  • Fixed wrongful fugacity matrix not taking threshold into account in drift evaluation

  • Fixed precision-recall curve with Python 2.7

  • Fixed what-if when a feature is empty and selected to “drop row if empty”

  • Fixed SQL scoring on BigQuery

Labeling

  • Object detection: fixed an issue when a single image has more than 5 labels

Dashboards and workspaces

  • Fixed display of Dataiku applications viewed through a workspace

Webapps

  • Fixed ability to retrieve headers for Bokeh 2

Dataiku Govern

  • Fixed improper status computation on the review step when there are unvalidated signoffs in the following steps

  • Fixed display of SSO settings

Elastic AI

  • Fixed ability to run Spark History Server behind a reverse proxy

Cloud Stacks

  • Fixed issues saving forms in the Fleet Manager UI

  • Pre-create the “cpu/DSS” cgroup to make it easier to control CPU through cgroups

  • Increased too low system limits on some components

Performance and scalability

  • Fixed performance issue when renaming datasets on extremely large instances

  • Fixed possible instance crash when using the “compute ngrams” prepare processor with extremely large number of ngrams

  • Improved performance of the “Automation monitoring” page

Miscellaneous

  • Remove extra whitespaces in logging remapping rules to avoid hard-to-investigate issues

Version 12.1.1 - July 19th, 2023

DSS 12.1.1 is a security, performance and bugfix release

Statistics

  • Fixed STL decomposition analysis when resampling is disabled

Machine Learning

  • Fixed charts on predicted data when a date filter is set

Performance and Scalability

  • Fixed performance issue when switching from recipe to notebook, when the recipe code contains a lot of spaces

  • Fixed issue with notebooks startup when kernel takes too long to start

Security

Version 12.1.0 - June 29th, 2023

DSS 12.1.0 is a significant new release with new features, performance enhancements and bugfixes.

New features and enhancements

Dataset preview on the Flow

You can now preview the content of datasets directly from the Flow. Simply click on “Preview”.

Databricks Connect

Support for Databricks Connect was added in Python recipes.

It is now possible to push down PySpark code to Databricks clusters using a Databricks connection.

More charts customization and features

Many new capabilities and customization options were added to charts and dashboards

  • Added the ability to set the position of the legend of charts on dashboards

  • Added the ability to customize font size and colors for values, legend items, reference lines, axis labels and axis values

  • Added “relative date range” filters for charts and dashboards (“last week”, “this year”, …)

  • Added ability to force displayed values to overlap

  • Bar charts: Added reference lines (horizontal lines)

  • Scatter plots: Added reference lines (horizontal lines)

  • Scatter plots: Added regression lines

  • Scatter plots: Added zoom and pan

New join types

The join recipe now supports 2 new types of joins:

  • Left anti join: keep rows of the left dataset without a match from the right

  • Right anti join: keep rows of the right dataset without a match from the left
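
To make the semantics of the two anti joins concrete, here is a minimal plain-Python sketch. It is not DSS code: the `anti_join` helper and the sample rows are hypothetical and only illustrate which rows are kept.

```python
def anti_join(keep, other, key):
    """Return the rows of `keep` whose `key` value has no match in `other`."""
    other_keys = {row[key] for row in other}
    return [row for row in keep if row[key] not in other_keys]


# Hypothetical sample data, for illustration only
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
    {"customer_id": 3, "name": "Linus"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 4},
]

# Left anti join: customers (left) without any matching order (right)
left_anti = anti_join(customers, orders, "customer_id")

# Right anti join: orders (right) without any matching customer (left)
right_anti = anti_join(orders, customers, "customer_id")
```

In this sketch a right anti join is simply a left anti join with the two inputs swapped.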

Text Labeling

In addition to image classes and object bounding boxes, Dataiku managed labeling can now label text spans in text fields.

Visual Time series decomposition

Visual Statistics now includes visual STL time series decomposition (trend and seasonality)

New editable dataset UI

A new UI for the “editable” dataset adds many new capabilities:

  • Easier resizing of columns

  • Auto-sizing of columns

  • Click-and-drag to fill

  • Ability to add several rows and columns at once

  • Ability to reorder & pin columns with drag-and-drop

  • Fixed various issues with undo/redo

  • Added warning when attempting concurrent editing

Excel sheet selection enhancements

Excel files sheet selection was revamped. It is now possible to select sheets manually or via rules based on their names or indexes, or to always select all sheets.

In addition, it is now possible to add a column containing the source sheet name.

Enhanced user management capabilities

  • Added the ability to automatically provision users at login time from their SSO identity

  • Added Azure AD integration to provision and sync users

  • Added the ability to explicitly resync users (either from the UI or from the API) from their LDAP or Azure AD identity

  • Added the ability to browse LDAP and Azure AD directories to provision users from their LDAP or Azure AD identity at will (without them having to login first)

  • Added the ability to define and use custom authentication and provisioning mechanisms

Other enhancements and fixes

Machine learning

  • New feature: Added a Precision-Recall curve to classification model reports, as well as Average-Precision metric approximating the area under this curve

  • Added support of ML Overrides to Model Documentation Generation

  • Added indicators in What-if when a prediction was overridden

  • Now showing preprocessed features in model reports even when K-fold cross test was enabled on this model

  • Added option to export the data underlying Shapley feature importance

  • Sped up training of partitioned models

  • Added a “model training diagnosis” in Lab model trainings, to download information needed for troubleshooting technical issues

  • Fixed reproducibility of Ridge regression models

  • Fixed the computation of the multiclass ROC AUC metric in the rare case of a validation set with only 2 classes

  • Fixed a possible scoring failure of ensemble models on the API node

  • Fixed overridden threshold of binary classification model when scoring with Spark, Snowflake (with Java UDF) or SQL engines

  • Fixed a failure when an evaluation recipe was run on a spark-based model with either only a metrics output or only a scored output dataset

  • Fixed a failure to score time-based partitioned models using the python (original) backend when the partitioning column is a date or timestamp

  • Fixed a scoring failure when using time-based partitioning on year only

  • Fixed inability to delete an analysis containing a Keras / Tensorflow model

  • Fixed a condition where an erroneous user-defined metric would cause the whole training to fail

  • Fixed training failure caused by incorrect stratification of stratified group k-fold with some datasets

  • Fixed a possible hang of a containerized train when the training data is very large

  • Fixed broken Design page for modeling tasks in some rare cases

  • Fixed MLlib clustering with outlier detections

Time series forecasting

  • New feature: Added Model Documentation Generation for forecasting models

  • Added experimental support for forecasting models with more than 20000 series

  • Added option to sample the first N records sorted by a given column

  • Added ML diagnostics to the evaluation & scoring recipes, warning instead of failing when a time series is too short to be evaluated or resampled, or when a new series was not seen at train time by a statistical model

  • Added an option in multi-series forecasting models to ignore time series that are too short

  • Sped up the loading & display of multi-series forecasting models

  • Set the default thread count of forecasting models hyperparameter search to 1, to ensure full reproducibility

  • Fixed distributed hyperparameter search of time series forecasting models

  • Fixed evaluation recipe schema recomputation always using the saved model’s active version even when overridden in the recipe

  • Fixed failure when the time column contains timezone and using recent version of pandas

  • Fixed a training failure when some modalities of a categorical external feature are present in the test set but not in the train set

  • Fixed a failing train of multi-series models when an identifier column contains special characters in its name

  • Fixed a training failure when using Prophet with the growth parameter set to “logistic”

Computer Vision

  • Added support for log loss metric in Image Classification tasks

  • Added ability to publish a Computer Vision model’s What-If page to a Dashboard

  • Fixed a possible failure when coming back to the What-If screen of Computer Vision models after visiting another page

  • Fixed a possible training failure when Computer Vision models are trained in containers

  • Fixed incorrect learning rate scheduling on Computer Vision model trainings

Charts & Dashboards

  • Fixed dashboard export with filter tile

  • Fixed dataset insights appearing unfiltered for a short moment when opening a dashboard

  • Stacked bars chart: Added ability to remove totals when “displaying value”

  • Bars: Fixed Excel export

  • Horizontal bars: Fixed X axis disappearing

  • Line charts: Fixed axis scale update on line charts

  • Pivot table: remove “value” column if only one measure is displayed

  • Scatter plots: Made maximum number of displayed points configurable

  • Maps: Fixed display of legends with “in chart” option

  • Boxplots: Fixed chart display when the minimum is equal to zero

  • Boxplots: Fixed display of min and max when a manual range is set

  • Added reference lines in Excel export

  • Fixed Excel export for charts with measures

  • Fixed “export insight as image” not displaying legend

  • Fixed tooltip display on each subchart

  • Improve empty state and wording for workspaces

  • Fixed issue with selecting text in chart configuration forms

  • Fixed thumbnail generation when using manual axis

  • Fixed discrepancy in filter behavior between DSS and SQL engines when data contains null values

Notebooks

  • Added “Search notebooks” to easily search within ElasticSearch datasets

Code Studios

  • Streamlit: Allowed changing the config.toml

  • Streamlit: Allowed specifying a code env for the Streamlit block, making it possible to choose a custom Streamlit version

  • JupyterLab: Fixed block building failing on AlmaLinux

  • JupyterLab: Added warning when stopping Code Studio and some files have been written in JupyterLab’s root directory

  • JupyterLab: Fixed renaming folders whose names contain whitespaces in JupyterLab

  • Fixed unexpected visual behavior when clicking on a DSS link inside Code Studio while not authorized

  • Fixed wrongful display of old log messages

  • Fixed “popout the current tab” button not working under some circumstances

  • Set ownership of code-envs created with the “add code environment” block to dataiku user

Flow

  • Added “stop at flow zone boundary” option when building multiple datasets at once.

  • Fixed incorrect layout when a metrics dataset or a cycle is present in a flow zone

  • Fixed unbuilt datasets appearing as built after a change in an upstream recipe caused their schemas to be updated

  • Fixed zone coloring when doing rectangular selection on the Flow

  • Added support for “metrics” dataset when doing schema propagation

  • Fixed “copy subflow to another project” failing when quick sharing is enabled on the first element

  • Fixed “Drop data” option for “Change connection” action

  • Fixed update of code recipes when renaming a dataset while copying a subflow

Datasets

  • Fixed leftover file when deleting an editable dataset without checking drop data

  • Added support for direct read of JSON files from Spark

  • Fixed dataset explore view not behaving correctly if the last column is named “constructor”

  • Added support for “_source” keyword in Custom Query DSL for ElasticSearch datasets

  • Added support for Azure Blob to Synapse fast path when network restriction is enabled on the Azure Blob storage account

  • Do not propose “Append instead of overwrite” for Parquet datasets, as it’s not supported

  • Improved error reporting for various cases of invalidly-configured datasets

  • Fixed BigQuery auto-write fast path with non-string partitioning columns

  • Added support for S3/Redshift fast path when using STS tokens

Recipes

  • New feature: Generate features: now supports Spark engine

  • New feature: Added recipe summary in right panel for sample/filter, group, join and stack recipes

  • Prepare: Fixed “concat” processor on Synapse

  • Prepare: Fixed preview of Formula editor not showing anything when the formula generates null values for all input values in the sample

  • Prepare: Fixed a possible timeshift when input Snowflake datasets contain columns of type “date”.

  • Prepare: Fixed possible error when moving preparation steps when input dataset is SQL

  • Prepare: Fixed possible incorrect engine selection when input dataset is SQL

  • Prepare: Added SQL engine support for “Concatenate columns” steps on Synapse datasets.

  • Prepare: Fixed wrongful change tracking for changes made on columns that have just been added by a processor

  • Prepare: Fixed wrongful “Save” indicator whereas recipe was already saved

  • Prepare: Disable Spark engine when “Enrich with context information” processor is used

  • Prepare: Fixed saving of output schema with complex types with detailed definition

  • Group and window: Fixed unexpected error when using an aggregation on a column that doesn’t exist in the input of a Group or Window recipe.

  • Fuzzy Join: fixed wrongful “metadata” output when using multiple join conditions

  • Window: Added “Retrieve all” checkbox to automatically retrieve all columns in the input dataset. This option is checked by default for all newly created recipes.

  • Sync: Fixed possible timeshift when input Databricks datasets contain columns of type date.

  • Sync: Fixed redispatching partitions with both a discrete and a time-based dimension

  • Sync: Fixed computing of metrics on output dataset with partition redispatch

  • Pivot: Fixed issue with BigQuery geography columns

  • Join: Fixed “match on nearest date” on Synapse

Data collections

  • Improved loading time of the various screens

  • Fixed filters being reset when refreshing data collection page

Labeling

  • New feature: Added ability to specify additional columns to be displayed next to the image or text being annotated

  • New feature: Added ability for reviewers to reject an annotated item and send it back for annotation

  • Fixed inability to delete a Labeling Task’s data when its input dataset is shared from another project

Jobs

  • New feature: Job view now displays Flow with Flow zones

  • Fixed clicking on a Job activity for a dataset that has been deleted

  • Fixed blank flow in Jobs screen on some large flows

  • Fixed Job failure incorrectly reported when building datasets with option “Stop at zone boundary” and a dependency located outside the flow zone is not built.

  • Fixed “there was nothing to do” displayed while job is still computing dependencies

Webapps

  • Webapps do not auto-start anymore at creation

Scenarios

  • Added “stop at flow zone boundary” option.

  • Fixed unexpected error generated when a scenario “Run checks” step references a non-existing dataset.

MLOps

  • Added support for MLflow 2.3

  • Added support for Transformers, LangChain and LLM flavors of MLflow

  • Added support of MLflow model outputs as lists

  • Added a project macro to delete model evaluations.

  • Create metrics and checks datasets in the same flow zone as the object they relate to.

  • Added the ability to define a seed in the evaluation recipes when using random sampling

  • In the standalone evaluation recipe, ease the setup of classes when there are many by allowing to copy / paste them.

  • Fixed Python 2.7 encoding issues in the evaluation recipe when dealing with non-ASCII characters

  • Fixed support of MLflow models returning non-numeric results

  • Ease the setup of the standalone evaluation recipe for pure data drift monitoring (prediction column is now optional)

  • Fixed incorrect handling of forced threshold in a proba-aware, perfless standalone evaluation recipe

  • Fixed the computation of the confusion matrix with Python 3.7

  • Avoid creating a Saved Model when errors occur during the deployment of a model from an experiment tracking run.

  • Fixed the creation of API service endpoint from a MLflow imported model with prediction type “Other”

  • Fixed the import of a new Saved Model Version into an existing Saved model from a model from an experiment tracking run with prediction type “Other”.

  • Fixed an issue preventing the import of new MLflow model versions into an existing Saved Model from a plugin recipe.

  • Fixed import of projects exported with experiment tracking

Deployer

  • New feature: Added the ability to publish a bundle to the deployer without being project admin

  • Added historization and display of deployments logs in project and API deployers

  • Added autocompletion on connection remappings in deployments and deployer infrastructure settings

  • Added infrastructure status for the API node in API deployer

  • Prevent the creation of two bundles with the same name

  • Fixed the setup of permissions of deployer related folders when installing impersonation

  • Enhanced the ability of deployments to customize parts of the exposition settings of the infrastructure

Dataiku Govern

  • New feature: Improved the graphical structure of artifact pages and the way fields are displayed within it

  • New feature: Added the custom metrics in the Model Registry

  • New feature: Added the ability to filter on multiple business initiatives

  • New feature: Added the possibility to set a reference from a back reference field

  • Explicitly labeled default governance templates as “Dataiku Standard”

  • Improved the creation of items inside tables (do not propose already selected items, redirect back to the table after item creation).

  • More explicit message for object deletion

  • Simplified breadcrumb on object pages, it’s now only based on object hierarchy and not on navigation history anymore.

  • Fixed an issue with the selection of a Business Initiative at govern time when the govern template doesn’t have a Business Initiative

  • In all custom pages, by default, prevented the display of archived objects and added a checkbox to display them

  • Forbid the usage of an archive blueprint version when governing an object or creating a new one (Note: “auto” governance doesn’t take archived blueprint versions into account anymore either)

  • More explicit button labels for blueprint version activation and archiving

  • Fixed a refresh issue on the object breadcrumb when updating the object’s parent

  • Fixed an issue on deployment update when the govern API key is missing from the deployer’s settings

  • Fixed the application of the node size selected during installation

  • Fixed filters not being taken into account when mass selecting users in the administration menu

  • Various small UI enhancements

Elastic AI

  • Clusters monitoring: added CPU and memory usage information on nodes

  • Clusters monitoring: improved sorting

  • AKS: Added support for selecting subscription when using managed identity

  • AKS: Added support for deleting nodegroups

  • EKS: Fixed failures with some specific kubectl binaries

  • EKS: Wait for nodegroup to be deleted before giving back control, when resizing it to 0

  • EKS: Fixed “test network” macro

  • Fixed invalid labels that could be generated with some exotic project keys

Cloud Stacks

  • Added ability to resize root disk on Azure

  • Fixed handling of “sshv2” format for SSH keys

  • Added ability to enable assignment of public IP in subnets created with the network template

  • Added ability to retrieve Fleet Manager SSL certificate from Cloud’s secret manager.

Performance & Stability

  • Major performance enhancements on handling of datasets with double or date columns, especially when using CSV. Performance for reading datasets in Python recipes and notebooks can be increased by up to 50%

  • Added safety limits to CSV parsing, to avoid cases where broken or misconfigured CSV escaping can cause a job to fail or hang

  • Added safety limit on the number of garbage collection threads to DSS job processes and Spark processes, to limit the risk of runaway garbage collection overconsuming CPU

  • Added safety limit on filesystem and cloud storage enumerations to avoid crashes when enumerating folders containing dozens of millions of files

  • Fixed possible crash when computing extreme number of metrics (such as when performing analysis on all columns on all data with thousands of columns)

  • Performance enhancement when custom policy hooks (such as GDPR or Connections/Projects restrictions) are in use

  • Fixed possible instance hanging when a lot of job activities are running concurrently

  • Fixed possible instance slowdown when a custom filesystem provider / plugin uses partitioning variables

  • The startup phase of a new Jupyter notebook kernel will not cause pauses for other notebooks running at the same time anymore

Code envs

  • Made dsscli command to rebuild code envs more robust on automation node

  • Fixed ability to use manually uploaded code env resources without a script

Plugins

  • Fixed “run as local process” flag on plugin webapps

  • Fixed code environment of some plugins failing to install when using conda

Misc

  • New feature: DSS administrators can now display messages to DSS end users in their browser to alert them of some imminent event.

  • Fixed a bug where some deleted project library files would remain loaded after reloading a notebook

  • Fixed RFC822 date parsing with non-US locale

  • Fixed link to managed folders located in a different project from the global search page

  • Renamed “Drop pipeline views” macro to “Drop DSS internal views” macro as it can also be used to drop views created by the Generate features recipe.

  • Added back the ability for users to choose - in their profile page - whether they receive notifications when others run jobs/scenarios/ML tasks.

  • Projects API: New projects are now created with the new permission scheme introduced in DSS 10

  • Fixed deletion of foreign datasets in a project incorrectly warning that recipes associated with the original dataset in the source project would be deleted.

  • Fixed sort of dataset by size/records in datasets list view

  • Fixed listing Jupyter notebooks from git when some .ipynb files are invalid

  • Fixed dataset metrics/checks computed using Python probes considered as valid even in case an exception is raised from the code

  • Improved search for wiki articles with words in camel case (Searching for “MachineLe” would not return articles containing “machine learning”)

  • Formula: Some invalid expressions are no longer accepted and now can yield errors. Some of these invalid expressions were previously incorrectly considered as valid and accepted. An example of such an expression is “Age * 10 (-#invalid”. It is invalid yet was previously accepted and evaluated as “Age * 10”.

  • Streaming: Fixed various issues with containerized continuous Python recipe

  • Fixed deletion of secrets from connection settings

  • Fixed wrongful caching of Git repositories with experimental caching modes

Version 12.0.1 - June 23rd, 2023

DSS 12.0.1 is a security, performance and bugfix release

Datasets

  • Fixed format preview when creating dataset from folder with XML files

  • Fixed error when reading a Snowflake dataset with a DATE column containing nulls

Streaming

  • Fixed continuous Python recipe in function mode when dataframe is empty

Machine Learning

  • Fixed scoring recipe when the treatment column is missing in the input dataset

  • Cloud Stacks: Fixed usage of Snowflake UDF in scoring recipe

Spark

  • Fixed support of INT type with parquet files in Spark 3

Notebooks

  • Fixed notebooks export when DSS Python base env is Python 3.7 or Python 3.9

Performance

  • Fixed run comparison charts of experiment tracking when there are > 100k steps (11.4.4)

API

  • Allowed read-only users to retrieve, through the REST API, the metadata of projects they have access to (11.4.4)

Security

Version 12.0.0 - May 26th, 2023

DSS 12.0.0 is a major upgrade to DSS with major new features.

Major new features

Machine Learning overrides

ML models today can achieve very high levels of performance and reliability but unfortunately this is not the general case, and often, they cannot be fully trusted for critical processes. There are many known reasons for this, including overfitting, incomplete training data, outdated models, differences between testing environment and real world…

Model overrides allow you to add an extra layer of human control over the models’ predictions, to ensure that they:

  • don’t predict outlandish values on critical systems,

  • comply with regulations,

  • enforce ethical boundaries.

By defining Overrides, you ensure that the model behaves in an expected manner under specific conditions.

Please see Prediction Overrides for more details.
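
As an illustration of the idea only (this is not the DSS implementation or API), the sketch below layers a list of rules on top of a model's raw prediction. The `apply_overrides` helper, the rule names, the thresholds and the "first matching rule wins" policy are all assumptions made for the example.

```python
def apply_overrides(row, prediction, overrides):
    """Return (final_prediction, name_of_applied_rule_or_None).

    `overrides` is an ordered list of (name, condition, outcome) tuples.
    The first rule whose condition matches is applied (assumption of this sketch).
    """
    for name, condition, outcome in overrides:
        if condition(row, prediction):
            return outcome(prediction), name
    return prediction, None


# Hypothetical rules for a regression model predicting a loan amount
overrides = [
    # Regulatory rule: never grant a loan to a minor
    ("minor_applicant", lambda row, pred: row["age"] < 18, lambda pred: 0.0),
    # Safety rule: cap outlandish predictions
    ("cap_amount", lambda row, pred: pred > 50_000.0, lambda pred: 50_000.0),
]

final_minor, rule_minor = apply_overrides({"age": 16}, 12_000.0, overrides)
final_adult, rule_adult = apply_overrides({"age": 40}, 80_000.0, overrides)
untouched, no_rule = apply_overrides({"age": 30}, 10_000.0, overrides)
```

Returning the name of the applied rule alongside the value keeps it possible to tell, for each row, whether the output came from the model or from a human-defined rule.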

Universal Feature Importance

While some models are interpretable by design, many advanced algorithms appear as black boxes to decision-makers or even data scientists themselves. The new model-agnostic global feature importance capabilities help you:

  • explain models that could not be explained until now

  • explain models in an agnostic, comparable way (rather than only using algorithm specific methods)

  • aggregate importance across categories of a single column

  • assess relative direction (in addition to magnitude of importance)

This new feature extends and enhances the existing feature importance and individual explanation capabilities. It is fully based on Shapley values and enriched with state-of-the-art visualisation

This capability is even available for MLflow models imported into DSS.
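
For readers unfamiliar with Shapley values, the following sketch computes them exactly for a tiny model by enumerating every feature coalition. It illustrates the concept only and makes no claim about how DSS computes or approximates them; the `shapley_values` helper, the toy model and the choice of replacing absent features with a baseline value are assumptions of the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    The cost is exponential in the number of features, so this only suits toy cases.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model: linear, so each value equals coefficient * (x - baseline)
def toy_model(x):
    return 2.0 * x[0] + 3.0 * x[1] - 1.0 * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The values sum to the difference between the prediction for the instance and the prediction for the baseline. Their sign gives the direction of each feature's effect, and a global importance can be obtained by averaging absolute values over many rows.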

Causal Prediction

The most common Data Science projects in Machine Learning involve predicting outcomes. However, in many cases, the focus shifts towards optimizing outcomes based on actionable variables rather than just predicting them. For example, you may desire to improve business results by identifying customers who will respond best to certain actions, rather than simply predicting which customers will churn.

Traditional prediction models are built with the assumption that their predictions will remain valid when actionable variables are manipulated. However, this assumption is often false, as there can be various reasons why acting on an actionable variable doesn’t have the expected outcome. For example, acting on one variable may have unforeseen consequences on other variables, or the actionable variable may be unevenly distributed in the population, making it difficult to compare individuals with different values of the variable.

To address these challenges, the field of Causal Machine Learning (Causal ML) has emerged, incorporating econometric techniques into the Data Science toolbox. In Causal ML, a Data Scientist selects a treatment variable (such as a discount or an ad) and a control value to tag rows where the treatment was not received. Causal ML then performs additional steps to identify individuals who are likely to benefit the most from the treatment. This information can then be used for treatment allocation optimization, such as determining which customers are expected to respond most positively to a discount.

The Causal Prediction analysis available in the Lab provides a ready-to-use solution for training Causal models and using them to predict the effects of actionable variables, optimize interventions, and improve business outcomes.

Please see Causal Prediction for more details.
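
The notions of treatment variable, control value and treatment effect can be illustrated with a deliberately naive sketch: per customer segment, compare the mean outcome of treated rows with the mean outcome of control rows. This is not the method used by DSS, and it is only meaningful when the treatment was assigned at random. The `uplift_by_segment` helper, the column names and the data are hypothetical.

```python
from collections import defaultdict


def uplift_by_segment(rows, control_value="none"):
    """Estimate, per segment, mean(outcome | treated) - mean(outcome | control)."""
    # For each segment, keep [sum_of_outcomes, row_count] for both groups
    sums = defaultdict(lambda: {"treated": [0.0, 0], "control": [0.0, 0]})
    for row in rows:
        group = "control" if row["treatment"] == control_value else "treated"
        acc = sums[row["segment"]][group]
        acc[0] += row["outcome"]
        acc[1] += 1

    uplift = {}
    for segment, groups in sums.items():
        (t_sum, t_n), (c_sum, c_n) = groups["treated"], groups["control"]
        if t_n and c_n:  # both groups are needed for a comparison
            uplift[segment] = t_sum / t_n - c_sum / c_n
    return uplift


# Hypothetical data: outcome is 1 if the customer stayed, 0 if they churned
rows = [
    {"segment": "new", "treatment": "discount", "outcome": 1},
    {"segment": "new", "treatment": "discount", "outcome": 1},
    {"segment": "new", "treatment": "none", "outcome": 0},
    {"segment": "new", "treatment": "none", "outcome": 1},
    {"segment": "loyal", "treatment": "discount", "outcome": 1},
    {"segment": "loyal", "treatment": "discount", "outcome": 1},
    {"segment": "loyal", "treatment": "none", "outcome": 1},
    {"segment": "loyal", "treatment": "none", "outcome": 1},
]

uplift = uplift_by_segment(rows)
best_segment = max(uplift, key=uplift.get)
```

In this toy data, loyal customers stay with or without the discount, so the discount is better spent on new customers. A churn prediction model alone would not reveal this.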

Auto feature generation

The new “Generate Features” recipe makes it easy to enrich a dataset with new columns in order to improve the results of machine learning and analytics projects. You can define relationships between datasets in your project.

DSS will then automatically join, transform, and aggregate this data, ultimately creating new features.

Please see Generate features for more details.

Data Collections and Data Catalog

Data collections allow you to gather key datasets by team or use case, so that users can easily find and share quality datasets to use in their projects.

Data Collections, Data Sources search and Connections explorer now live together as the new Data Catalog in DSS.

Run subsequent recipes and on-the-fly schema propagation

For all intermediate recipes in a flow, when you click “run” from within the recipe, you now have an option to either:

  • Run just that recipe

  • Or run that recipe and all subsequent ones in the Flow, with the effect of making the whole “downstream” branch of the Flow up-to-date.

“Run this recipe and all subsequent ones” also applies schema changes on the fly to the output datasets, until the end of the Flow

It is now also possible, from the Flow, to build “downstream” (from left to right) all datasets that are after a given starting point. This also includes the ability to perform on-the-fly schema propagation
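
Conceptually, building “downstream” amounts to walking the Flow graph from left to right from the chosen starting point. The sketch below shows that traversal on a hypothetical Flow described as a list of (inputs, outputs) pairs. It does not use the DSS API, and the `downstream_datasets` helper and dataset names are made up for the example.

```python
from collections import deque


def downstream_datasets(recipes, start):
    """Return the datasets reachable from `start` by following recipes left to right."""
    # Map each dataset to the outputs of the recipes that consume it
    consumers = {}
    for inputs, outputs in recipes:
        for dataset in inputs:
            consumers.setdefault(dataset, []).extend(outputs)

    seen, order, queue = {start}, [], deque([start])
    while queue:
        current = queue.popleft()
        for dataset in consumers.get(current, []):
            if dataset not in seen:
                seen.add(dataset)
                order.append(dataset)
                queue.append(dataset)
    return order


# Hypothetical Flow: each recipe is (input datasets, output datasets)
flow = [
    (["raw_orders"], ["orders_prepared"]),
    (["raw_customers"], ["customers_prepared"]),
    (["orders_prepared", "customers_prepared"], ["orders_joined"]),
    (["orders_joined"], ["orders_by_country"]),
]

to_rebuild = downstream_datasets(flow, "raw_orders")
```

Starting from `raw_orders`, the traversal reaches every dataset to its right, but not `customers_prepared`, which sits on a parallel branch.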

Help Center

Dataiku now includes a brand new integrated Help Center that provides comprehensive support, including a searchable database, onboarding materials, and step-by-step tutorials. It offers contextually relevant information based on the page you’re viewing, aiding in feature discovery and keeping you updated with the latest additions.

This Help Center serves as a one-stop solution for all user needs, ensuring a seamless and efficient user experience.

Other notable enhancements and features

Build Flow Zones

It is now possible to build an entire Flow zone. This builds all “final” datasets of this zone, and does not go beyond the boundary of the zone.

Deployer permissions management upgrades

When deploying projects from the Deployer, it is now possible to choose the “Run as” user for scenarios and webapps in the deployed project on the automation node. This change can only be performed by the infrastructure administrator on the Deployer.

In addition, the infrastructure administrator on the Deployer can also configure:

  • Under which identity projects are deployed to the automation node

  • Whether to propagate the permissions from the project in the design node to the automation node

Engine selection enhancements

Various enhancements were made to engine selection, so that users need to care much less often about which engines to select. In the vast majority of cases, we recommend that auto selection of engine is left to DSS, without manually selecting engines, or without setting preferred or forbidden engines.

The most notable changes are:

  • Automatically select SQL engine for prepare recipes when possible and efficient (i.e. when both input and output are the same database)

  • Do not automatically select Spark engine when it will for sure be inefficient (when the input or output cannot use fast Spark access)

Prophet algorithm for Time Series Forecasting

Visual Time Series Forecasting now includes the popular Prophet algorithm.

API service monitoring wizard

A new wizard makes it much easier to set up a full API service monitoring loop that gathers the query logs from the API nodes in order to automate drift computation.

Govern: Management of deployments

Added the synchronization of deployments and infrastructure information from the deployer node into the govern node, providing more information in the Model and Bundle registries about how and where those objects are used.

Govern: Kanban View

A new Kanban view allows you to easily get a view of all your governed projects

Charts: Reference lines

It is now possible to define horizontal reference lines on Line charts and Mixed charts

Request plugin installation

Users who are not admin can now request installation of a plugin from the plugin store. The request is then sent to administrators, and the user is notified when the request is processed.

Request code env setup

Users who do not have the permission to create code envs can now request the setup of a code env from the code envs list. The request is then sent to administrators, and the user is notified when the request is processed.

Model Document Generation for imported MLflow models

The automatic Model Document Generator now supports MLflow imported models.

Other enhancements and fixes

Datasets

  • Added settings to enable the Image View for a dataset as the default view

  • Added time part in addition to the date in Last modification column in folders content listing

  • Fixed “copy row as JSON” on filtered datasets

  • Explore: Fixed issue when using relative range and alphanum values filters together

  • Fixed “Edit” tab incorrectly displayed on shared editable datasets

  • S3: increased the default max size for S3 created files to 100 GB

  • Snowflake: Added support for custom JDBC properties when using the Spark-Snowflake connector

  • Snowflake: Fixed timezone issues on fields of type DATE when parsed as a DSS date

  • Snowflake: Added support for privatekey in advanced JDBC properties when using Snowpark

  • BigQuery: Fixed internal error happening if user has access to 0 GCP projects

  • BigQuery: Fixed syncing of RECORD and JSON columns containing NULL values

  • BigQuery: Fixed missing error message when table listing is denied by BigQuery

  • BigQuery: Fixed date issues on Pivot, Sort and Split recipes

Visual recipes

  • Prepare: Stricter default behavior of column type inference at creation time. The column types of strongly typed datasets (e.g. SQL, Parquet) are kept. Behavior can be changed in Administration > Settings > Misc.

  • Prepare: Improved summary section in the right panel to quickly assess what the recipe is doing.

  • Join: Added a new mode to automatically select columns if they do not cause name conflicts

  • Join: Fixed second dataset’s columns selection being reset when opening a recipe with a cross-join

  • Join: Fixed the possibility of defining a Join recipe whose output dataset is one of its input datasets

  • Pivot: Fixed empty screen for “Other columns” step displayed when switching tabs

  • Group: Fixed concat distinct option being disabled even for SQL databases that support it

  • Formula language: Fixed now() function in formula generating a result that cannot be compared to other dates using >, >=, < or <= operators.
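
As an illustration of the new Join column selection mode mentioned above, here is a minimal sketch of one possible reading of "select columns if they do not cause name conflicts". The function name and the rule that the left dataset takes precedence are assumptions made for this example; the actual DSS behavior (for instance regarding join keys or prefixes) may differ.

```python
def auto_select_columns(left_columns, right_columns):
    """Hypothetical helper: keep every left column, and keep a right
    column only if its name is not already used by a left column."""
    taken = set(left_columns)
    selected_right = [name for name in right_columns if name not in taken]
    return list(left_columns), selected_right


# "id" exists on both sides, so only the left-hand "id" is kept.
left, right = auto_select_columns(["id", "name"], ["id", "price"])
```

With this rule, no output column name is ever produced twice, so no renaming or prefixing is needed.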

Flow

  • Fixed running job icons in Flow not always correctly displayed

  • Fixed Flow zoom incorrectly reset when navigating between projects with and without zones

Visual Machine Learning

  • Added support for Python 3.8 and 3.9 to Visual Machine Learning, including Visual Time Series Forecasting and Computer Vision tasks.

  • Added support for Scikit-learn 1.0 and above for Visual Machine Learning. Note that existing models previously trained with scikit-learn below 1.0 and using the following algorithms need to be retrained when switching to scikit-learn 1.0 (which may happen if the DSS builtin env is upgraded to Python 3.7 or Python 3.9): KNN, SVM, Plugin algorithms, Custom Python algorithms

  • Updated the default versions of scikit-learn and scipy in the sets of packages for Visual Machine Learning for code environments

  • Added Sort & Filter to the Predicted Data tab

  • Added the Lift metric to the model results

  • Fixed Distance weighting parameter not taken into account when training KNN models

  • Fixed failure of clustering scoring recipe when the scored dataset lacks some features that were rejected

  • Removed redundant split computation during training

  • Fixed intermittent failures of Model Document Generator on some models

  • Fixed a rare situation where the Cost Matrix Gain metric would not display

Visual Time Series Forecasting

  • Added ML Diagnostics to TS Forecasting

  • Added a result page to show ARIMA orders

  • Added a new Mean Absolute Error (MAE) metric

  • Switched to Mean Absolute Scaled Error (MASE) as the default optimization metric. The previous default (MAPE) may lead to training failure when a series has only 0s as target values.

  • Improved display of various results for multiple-series models

  • Improved support of Month time unit, for periods ending on the last day of a month or spanning more than 12 months

  • Added more warnings, and made them more prominent, when a time series does not have enough (finite & well-defined) data points for forecasting

  • Fixed computation (and warning) of minimum required data points for external features in the scoring recipe

  • Fixed a bug where forecasting models trained in earlier DSS versions had their horizon changed to 0 when retrained

  • Fixed default value of low pass filter for Seasonal Trend when enabled and lower than the season length
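
The switch from MAPE to MASE as the default metric can be illustrated with a small sketch. The functions below use the textbook definitions of both metrics and are not the DSS implementation; the sample data is made up. MAPE divides by each actual value, so it is undefined as soon as a target value is 0, whereas MASE divides by the in-sample error of a naive forecast, which is non-zero as long as the training history is not constant.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error: divides by each actual value."""
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mase(actual, forecast, train, season=1):
    """Mean Absolute Scaled Error: scales the forecast MAE by the
    in-sample MAE of a (seasonal) naive forecast on the training data."""
    naive_errors = [abs(train[i] - train[i - season])
                    for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Made-up data: the evaluation window contains only zeros.
train = [0, 2, 0, 3, 0]
actual = [0, 0, 0]
forecast = [1, 0, 1]

try:
    mape(actual, forecast)
    mape_defined = True
except ZeroDivisionError:
    mape_defined = False  # MAPE cannot be computed on zero targets

mase_value = mase(actual, forecast, train)  # well defined here
```

Note that MASE is itself undefined if the training series is constant, since the naive error used as the scale is then 0.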

Charts & Dashboards

  • New feature: Filters: Added ability to define filters with a single selected value

  • New feature: Mix chart: Added line smoothing option

  • Line chart: Fixed tooltips not correctly triggered in subcharts other than the first one

  • Line chart: Fixed axis minimum wrongly computed when switching to manual range

  • Scatter plot: Fixed axis and canvas not aligned when the browser is zoomed

  • Scatter plot: Fixed tooltips not showing up for points where y=x

  • Treemap: Fixed treemap not rendered under certain circumstances on Firefox

  • Boxplot: Fixed sorting order

  • Filters: Fixed date slider not being reset when switching from date part to date range.

  • Filters: Fixed numeric slider displayed instead of checkboxes list when pasting a URL containing values for a numerical filter.

  • Filters: Fixed filter values not correctly displayed when using multiple date parts

  • Dashboards: Moved the fullscreen button outside the content area

  • Dashboards: Fixed “Play” button issuing an error on some dashboards

  • Fixed custom color assignments getting lost when changing the measures in the chart

Labeling

  • Added Undo/Redo when annotating images in a Labeling Task

Notebooks

  • Made Jupyter notebook export timeout configurable

Scenarios

  • New feature: Added the ability to define Cc and Bcc lists in scenario email reporters

  • Fixed timezone issue in the display of monthly triggers

Collaboration

  • Enabled email toggles in user profile by default for new users

  • Fixed switching branch in a project that would cause the project to become inaccessible in case the git branch was badly initialized

  • Fixed hyperlinks toward DSS objects in wiki exports

  • Dataset sharing: Fixed inability to import a dataset from another project P if quick sharing is disabled on project P

  • Workspaces: Fixed public API disclosing permissions set on workspaces to users and contributors of the workspace.

  • Workspaces: Fixed error message wrongly displayed when a user with Reader profile publishes an object to a workspace

  • Workspaces: Fixed “Go to source dashboard” button incorrectly grayed out under some circumstances

Govern

  • Added the ability to customize the axes of the governed projects matrix view

  • Added the ability to configure a sign-off with only final review (no feedback groups)

  • Fixed the display of multiple governed projects at the same location in the matrix view

  • Fixed import/export of blueprints to remove user and group assignment rules in sign-off configuration

  • Fixed unselect action in the selection window for lists displayed as tables

  • Fixed an error happening when reordering attachment files

  • Fixed deduplication of items in lists to only apply on reference fields

  • Added the possibility to set data drift as a “metric to focus on” in the model registry

  • Fixed the removal of items from tables

  • Fixed the redirection to home page in case of a custom page not found

  • Fixed governed saved model versions or bundles being created twice when governing directly from the object page

MLOps

  • New feature: Added an option in the Evaluation and Standalone Evaluation Recipes to disable the sub-sampling for drift computation (sub-sampling is enabled by default)

  • New feature: Added data drift p-value as an evaluation metric

  • New feature: Added the ability to track Lab models metrics as experiment tracking runs

  • In Deployer, added an option to bundle only the required model versions.

  • Fixed drift computation in evaluation recipe failing when using pandas 1.0+

  • Fixed evaluation of MLflow models on dataset with integer column with missing values

  • Improved the selection of metrics to display in a Model Evaluation Store

  • Added support for MLflow’s search_experiments API method

  • Fixed handling of integer columns in the Standalone Evaluation Recipe for binary classification use cases

  • Fixed some flow-related public API methods when there is a model evaluation store in the flow

  • Fixed evaluation of MLflow models when there is a date column

  • Fixed empty versions list for MLflow models migrated from a previous version

  • In the Evaluation Recipe, added the ability to customize the handling of columns in data drift computation

  • Enriched Model Evaluations with additional univariate data and prediction drift metrics (can also be retrieved through the API)

Coding

  • Improved commit messages generated when creating, editing or deleting files in a folder in project libraries

  • Removed some useless empty commits when performing blank edits in project libraries

Plugins

  • Fixed several types of plugin components that did not work with Python 3.11

Performance & Scalability

  • Improved performance and responsiveness when DSS data dir IO is slow

  • Improved performance of starting jobs in projects involving shared datasets

  • Improved performance of validating very large SQL queries / scripts

  • Improved performance of some API calls returning large objects

  • Improved performance of sampling for Statistics worksheets

  • Improved performance of various other UI locations

Administration

  • New feature: Added reporting of SQL queries in Compute Resource Usage for several missing locations where DSS performs SQL queries

Setup

  • New feature: Added support for Python 3.9 for the DSS builtin environment

Dataiku Cloud

  • Code Studios: Fixed RStudio on Dataiku Cloud

Cloud Stacks

  • Switched OS for DSS instances from CentOS 7 to AlmaLinux 8

  • Switched R version for DSS instances from 3.6 to 4.2

  • Switched Python version for builtin env for DSS instances from 3.6 to 3.9

  • Fixed faulty display of errors while replaying setup actions

  • Fixed various issues with renaming instances

  • Made it easier to install the “tidyverse” R package out of the box

  • GCP: Fixed region for snapshots

  • GCP: Added ability to assign a static public IP for Fleet Manager

  • Fixed issue when declaring a govern node but not creating it

  • Made the “external URL” configurable for instances, for inter-instance links shown in the interface

Elastic AI

  • EKS: Fixed support for kubectl 1.26

  • GKE: Added support for Kubernetes 1.26

  • GKE: Fixed issue when creating cluster in a different zone than the DSS instance

  • Made it easier to debug issues with API nodes deployed on Kubernetes infrastructure (API node log now appears in pod logs)

Miscellaneous

  • Fixed broken/missing filtering (live search) in some dropdown menus

  • Fixed some Flow-related methods of the public API python client that would fail when used with labeling tasks

  • Fixed broken DSSDataset#create_analysis method of the public API python client

  • Removed limitations on size of project variables

  • Fixed failure when invalid UIF rules are defined

  • Fixed renaming of To do lists

  • Fixed Jupyter notebooks possibly failing to load

  • Fixed Admin > Monitoring screen failing to load if the instance contains a malformed dataset or chart definition.

  • Fixed issue with Python plugin recipes when installing plugin from Git in development mode

  • Fixed Parquet in Spark falling back to unoptimized path for minor ignorable differences in schema

  • Compute resource usage: added a new indicator that provides a better approximation of CPU usage on quick starting/stopping processes