DSS 12 Release notes¶
Migration notes¶
How to upgrade¶
For Dataiku Cloud users, your DSS will be upgraded automatically to DSS 12 within pre-announced timeframes
For Dataiku Cloud Stacks users, please see upgrade documentation
For Dataiku Custom users, please see upgrade documentation: Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
Migration paths to DSS 12¶
From DSS 11: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
From DSS 10.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 10.0 -> 11
From DSS 9.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 9.0 -> 10.0, 10.0 -> 11
From DSS 8.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
From DSS 7.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
From DSS 6.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
From DSS 5.1: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
From DSS 5.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
From DSS 4.3: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
From DSS 4.2: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
From DSS 4.1: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
From DSS 4.0: Automatic migration is supported. In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.0 -> 4.1, 4.1 -> 4.2, 4.2 -> 4.3, 4.3 -> 5.0, 5.0 -> 5.1, 5.1 -> 6.0, 6.0 -> 7.0, 7.0 -> 8.0, 8.0 -> 9.0, 9.0 -> 10.0, 10.0 -> 11
Migration from DSS 3.1 and below is not supported. You must first upgrade to 5.0. See DSS 5.0 Release notes
Limitations and warnings¶
Automatic migration from previous versions is supported (see above). Please pay attention to the following cautions, removal and deprecation notices.
Cautions¶
The SQL engine can now be automatically selected on prepare recipes. In case of issues on prepare recipes that were working prior to the upgrade, you can revert to the DSS engine by clicking on the “Engine: In database” button in the prepare recipe settings.
Similarly, the Spark engine can now be automatically selected more eagerly when the storage and formats are compatible with fast Spark execution. In case of issues on recipes that were working prior to the upgrade, you can revert to the DSS engine by clicking on the “Engine: Spark” button in the recipe settings.
The Bokeh package has been removed from the builtin Python environment. If you have Bokeh webapps, please make sure to use a code environment. The Bokeh package in the builtin Python environment was using a very old version of Bokeh
The Seaborn package has been removed from the builtin Python environment. If you use this package, please make sure to use a code environment.
For Cloud Stacks setups, the OS for the DSS nodes has been updated from CentOS 7 to AlmaLinux 8 (which is a RedHat-compatible distribution similar to CentOS). Custom setup actions may require some updates.
For Cloud Stacks setups, R has been upgraded from R 3 to R 4. You will need to rebuild all R code envs. Some updates to packages may be required
For Cloud Stacks, the builtin Python environment has been upgraded from Python 3.6 to Python 3.9
The versions of some packages in the builtin Python environment have been upgraded, and your code may require some updates if you are not using your own code environment. The most notable updates are:
Pandas 0.23 to 1.3
Numpy 1.15 to 1.21
Scikit-learn 0.20 to 1.0
Matplotlib 2.2 to 3.6
The Python packages used by Visual Machine Learning have changed, in the built-in code environment and in suggested packages. Notably, if you have KNN or SVM models trained using the built-in code environment, you will need to retrain these models to be able to use them for scoring.
Support removal¶
Some features that were previously announced as deprecated are now removed or unsupported.
Support for H2O Sparkling Water as a backend for Visual Machine Learning has been removed
Deprecation notices¶
DSS 12 deprecates support for some features and versions. Support for these will be removed in a later release.
Support for Cloudera CDH 6
Support for Cloudera HDP 3
Support for Amazon EMR 5
Support for Java 8
Version 12.3.2 - November 15th, 2023¶
DSS 12.3.2 is a security, performance and bugfix release
LLM Mesh & Prompt Studio¶
New feature: Added a Convert to prompt recipe button in the classification and summarization recipes
OpenAI: Added GPT-4 Turbo model
Azure OpenAI: Added & Fixed embedding support
AWS Bedrock: Adapted to new AWS API
AWS Bedrock: Added support for topP and topK
RAG: Improved error handling in augmented LLMs
Prompt studio: Added ability to select a recipe to update from the Save as recipe option
Prompt studio: Improved UX when switching between Managed & Advanced modes
API: Fixed dataiku.KnowledgeBank python API
PII: Added the possibility to ignore, instead of failing, when an unsupported language is detected
PII: Fixed PII detection on embedding queries
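For context on the dataiku.KnowledgeBank fix above, here is a minimal sketch of querying a Knowledge Bank from Python. The Knowledge Bank id is a placeholder, and the method names are assumptions based on the LLM Mesh documentation; verify them against Generative AI and LLM Mesh.

    import dataiku

    # "my_kb_id" is a placeholder Knowledge Bank id.
    # as_langchain_retriever() is assumed from the LLM Mesh docs; check the exact name.
    kb = dataiku.KnowledgeBank("my_kb_id")
    retriever = kb.as_langchain_retriever()
    # Standard LangChain retriever call: fetch the most relevant chunks
    docs = retriever.get_relevant_documents("What does the refund policy say?")
    for doc in docs:
        print(doc.page_content[:200])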
Datasets¶
New feature: Added an experimental dialect to connect to Sqream (through the “other databases” connection)
Editable dataset: Fixed editing after the dataset has been cleared
Editable dataset: Added an option to handle large copy/paste of data
Excel export: Added support for exporting more than 1 million records
Improved error handling when trying to analyze a previously removed or renamed column
Recipes¶
Prepare: Improved mass renaming of columns based on column name pattern (regex)
Prepare: Fixed migration of ‘Flag rows on values’ processor
Prepare: Fixed creation of “filter on” step by selecting a substring in cell
Prepare: Do not select Athena engine by default
Join: Fixed icons display when using unmatched row option
Pivot: Improved error message when summing non-numerical columns
Notebooks¶
Fixed importing from Git when the notebook’s name contains a dot
Dashboards¶
Fixed scrolling in dataset insight
Govern¶
Fixed installation and update of Govern node when the DB schema is not “public”
Spark¶
Prevented auto-selection of Spark engine when one of the outputs is in append mode
Fixed SparkSQL / Scala and SparkSQL notebook on CDH 6 (deprecated)
Fixed MLLIB training on old hadoop distributions: HDP 3.1.5 (deprecated), CDH6 (deprecated)
Machine Learning¶
Fixed display of median and standard deviation in cluster profiles if the value is zero
Fixed MLflow error when ignoring TLS certificate validity checks while DSS is configured for HTTPS
External models: Added support of non-JSON proba format for SageMaker CSV
Cloud Stacks¶
AWS: Fixed installation using the fleet-manager-network template
SSO and LDAP¶
Fixed on-demand provisioning on Azure AD
Security¶
Fixed Directory Traversal in cluster logs retrieval endpoint
Added new HTTP security header to all requests: Permissions-Policy: fullscreen.
Added ability to specify additional HTTP security headers: Referrer-Policy, Permissions-Policy, Cross-Origin-Embedder-Policy, Cross-Origin-Opener-Policy and Cross-Origin-Resource-Policy.
Version 12.3.1 - October 30th, 2023¶
DSS 12.3.1 is a bugfix release
LLM Mesh¶
New feature: Added support for MosaicML
Fixed support for GPT 3.5 Instruct
Fixed embedding recipe with Azure OpenAI models
Fixed embedding recipe when containerized execution is enabled by default
Fixed “embedding settings” display in Knowledge Banks
Fixed NLP classification recipe when output mode is “All classes”
Coding¶
Fixed installation of the “dataiku” Python package outside of DSS
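As a reminder of how the package is used outside of DSS (host, port and API key below are placeholders; the client package is downloaded from your own DSS instance as described in the documentation):

    # Install the client from your DSS instance, for example:
    #   pip install "https://DSS_HOST:11200/public/packages/dataiku-internal-client.tar.gz"
    import dataiku

    # Placeholders: point the package at the remote DSS and authenticate
    dataiku.set_remote_dss("https://DSS_HOST:11200", "YOUR_API_KEY_SECRET")

    # Datasets are then addressed as PROJECTKEY.dataset_name
    df = dataiku.Dataset("MYPROJECT.my_dataset").get_dataframe()
    print(df.head())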
Machine Learning¶
Fixed usage of TF-IDF Text preprocessing in Visual ML when stop words are enabled
API Designer¶
Fixed the set_remote_dss function of the dataiku Python API when used in the API Designer
Bundle and Automation¶
Fixed revert of bundles on the design node
Charts¶
Fixed usage of Snowflake engine when the database is set at the session level
Jobs¶
Fixed the “Re-run this job” action from the job page
Webapps¶
Fixed login redirection for public webapps created from Code Studios
Cloud Stacks¶
Fixed loss of LDAP and Azure AD settings when Fleet Manager is restarted
Version 12.3.0 - October 23rd, 2023¶
DSS 12.3.0 is a significant new release with new features, performance enhancements and bugfixes.
The LLM Mesh¶
With the recent advances in Generative AI, and particularly large language models, new kinds of applications can now be built, leveraging their power to structure natural language, generate new content, and provide powerful question answering capabilities.
However, there is a lack of oversight, governance, and centralization, which hinders deployment of LLM-based applications.
The LLM Mesh is the common backbone for Enterprise Generative AI Applications.
It provides:
Connectivity to a large number of Large Language Models, both as APIs and locally hosted
Full permissioning of access to these LLMs, through new kinds of connections
Full support for locally-hosted HuggingFace models running on GPU
Audit tracing
Cost monitoring
Personally Identifiable Information (PII) detection and redaction
Toxicity detection
Caching
Native support for the Retrieval-Augmented Generation pattern, using connections to Vector Stores and the Embedding recipe.
The LLM Mesh is available in Public Preview in DSS 12.3.0.
For more details, please see Generative AI and LLM Mesh.
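As an illustration, a minimal sketch of querying an LLM Mesh connection through the Python API (the LLM id is a placeholder; available ids can be listed with project.list_llms()):

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # Placeholder id: the exact value depends on your LLM connections
    llm = project.get_llm("openai:my-openai-connection:gpt-4")

    completion = llm.new_completion()
    completion.with_message("Summarize this in one sentence: the LLM Mesh is ...")
    resp = completion.execute()
    if resp.success:
        print(resp.text)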
Prompt Studios and LLM-powered recipes¶
On top of the LLM Mesh, Dataiku now includes a full-featured development environment for Prompt Engineering, the Prompt Studio. In the Prompt Studio, you can test and iterate on your prompts, compare prompts, compare various LLMs (either APIs or locally hosted), and, when satisfied, deploy your prompts as Prompt Recipes for large-scale batch generation.
In addition, Dataiku now includes two new recipes that make it very easy to perform two common LLM-powered tasks:
Classifying text (either using classes that have been trained into the model, or classes that are provided by the user)
Summarizing text
Prompt Studio and LLM-powered recipes are available in Public Preview in DSS 12.3.0.
For more details, please see Generative AI and LLM Mesh.
Datasets¶
Databricks: Added support for global (non-per-user) OAuth login
Snowflake: Added support for global (non-per-user) OAuth login
Snowflake: Added support for variables in the Scope field for OAuth mode
JSON: Fixed Spark engine not properly unnesting JSON fields
Machine Learning¶
Added support for What-if on partitioned models without the need to go into an individual partition
Added support of custom model views when the view backend runs in containerized execution
Added ability to use a Visual model’s predictor Python API from code running in containerized execution
Fixed computation of feature importance when there are fewer than 15 rows in the test set
Fixed failing training of deep neural network visual model when the only feature is text using sentence embedding
Fixed DSSTrainedPredictionModelDetails.compute_shapley_feature_importance Python API that was broken for saved models
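For reference, a minimal sketch of calling this API on a saved model (the model id is a placeholder; assuming the call returns a future, as is usual in the DSS public API):

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    sm = project.get_saved_model("MY_SAVED_MODEL_ID")  # placeholder id
    version_id = sm.get_active_version()["id"]
    details = sm.get_version_details(version_id)

    # Fixed by this release for saved models; assumed to return a DSSFuture
    future = details.compute_shapley_feature_importance()
    print(future.wait_for_result())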
Dashboards¶
Fixed downloading of filtered datasets within dashboards not applying the filter
Fixed inability to copy chart from insight view to any other chart
Fixed error display when a chart hits the limit of displayed points in a dashboard
Flow¶
Fixed “Generate Flow Documentation” failing on servers with non-English locales
Recipes¶
Shell: Fixed renaming not taking into account dataset references in pipes
Prepare: Fixed “Filter and flag on formula” step causing SQL engine to fail on some databases such as Redshift.
Prepare: Fixed “Rename” step causing SQL engine to fail in some situations such as renaming a column twice, or renaming a column with an empty string.
Deployer¶
Fixed issue with custom base image tag in API Deployer Kubernetes images (custom base images remain discouraged)
Added more details in the right panel of API services
Governance¶
Fixed Kanban views not bucketing projects correctly
MLOps¶
Fixed incorrect trainDate in the return of the list_versions() API method for MLflow models
IAM¶
Fixed fetching LDAP users with “Import from external source” not returning usernames if Display name attribute is different from Username attribute
Fixed LDAP bind password being wrongfully required, whereas it’s optional
Cloud Stacks¶
Added setup action to add a custom CA into the trust stores of DSS
Added ability to reload security settings without having to restart Fleet Manager
Code envs¶
Added support for per-code-env Dockerfile additions
Added support for per-code-env CUDA support, removing future need for CUDA-specific container images
Misc¶
Fixed catalog or global search failing when query contained special characters such as @ or ~.
Compute Resource Usage: added CPU and memory request and limit to Kubernetes CRU events
Version 12.2.3 - October 10th, 2023¶
DSS 12.2.3 is a bugfix release
Charts¶
Fixed thumbnails display of Boxplot charts
Recipes¶
Fixed usage of isBlank() formula function in a recipe causing incorrect results when executed with SQL engine
Misc¶
Fixed error occurring when an event server target is configured with a “Path within connection”
Fixed exception being added to the logs each time an API node starts
Version 12.2.2 - September 25th, 2023¶
DSS 12.2.2 is a bugfix release
Machine Learning¶
Fixed the metrics comparison chart for time series forecasting models in the models list
Fixed a rare race condition causing training failures with distributed hyperparameter search
Datasets¶
S3: Reduced memory consumption when writing multiple files on S3 in parallel
BigQuery: Fixed memory leak
Editable dataset: Fixed pressing “enter” in the “edit column” modal not closing the modal
Editable dataset: Fixed redo mechanism when a new row had been added
Fixed renaming of partitioned datasets causing downstream recipes to fail at runtime
Fixed inability to import Excel files containing Boolean cells computed with formulas
Recipes¶
Join: Fixed occasional job failures with DSS engine
Join: Fixed wrongly detected duplicate column name when 2 columns only differ by their case
Prepare: Fixed “Extract Date components” with SQL engine
Prepare: Fixed display issue when rearranging steps order
Sync: Fixed schema and catalog not taken into account when executing a Sync recipe from a Databricks dataset to an Azure Blob storage dataset.
Shell: Fixed quotes incorrectly added around variables
Fixed expansion of variables in partitioning when running a recipe from its edition screen
Deployer¶
API Designer: Fixed inability to run test queries with Python endpoints
Improved error message about deployer hooks code
Fixed an issue with the selection of core packages for Python 3.8 code environments on deployer and automation nodes
Added a “Validate” button in the Deployer Hooks’ code editing screen
Experiment Tracking¶
Added ability to ignore invalid SSL certificates in experiment tracking
Fixed several issues with starting runs (when no end time is specified, or when a name is specified but no tags)
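For context on these run fixes, a minimal sketch of starting an experiment tracking run through the DSS MLflow integration (the managed folder id and names are placeholders; using setup_mlflow() as a context manager follows the documented examples):

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()
    folder = project.get_managed_folder("FOLDER_ID")  # placeholder: backing storage

    # The handle returned by setup_mlflow() is used like the mlflow module
    with project.setup_mlflow(managed_folder=folder) as mlflow:
        mlflow.set_experiment("demo-experiment")
        with mlflow.start_run(run_name="run-1"):
            mlflow.log_param("alpha", 0.5)
            mlflow.log_metric("accuracy", 0.92)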
Governance¶
Fixed workflow step not being displayed at creation time when there is one mandatory field defined (Advanced Govern only)
Fixed the filling of the signoff history on step deletion
Misc¶
ElasticSearch: Fixed invalid projectKey passed in custom headers
Charts: Fixed empty legend section displayed in the left pane for charts in Insight view mode
Charts: Fixed timeout when exporting a dashboard containing donut charts that need scrolling to be visible
Fixed “Assumed time zone” not displaying the correct default value on existing connections
Webapps: Fixed ability to use dkuSourceLibR in Shiny webapps
Fixed required permissions to import and export projects using the public API (aligning to UI behavior)
Version 12.2.1 - September 12th, 2023¶
DSS 12.2.1 is a bugfix release
Machine Learning¶
Fixed UI issue disabling the creation of AutoML Clustering models
Cloud Stacks¶
Fixed the reprovisioning of DSS instances from Fleet Manager following a change in PostgreSQL repositories
Misc¶
Fixed a memory leak when enumerating Azure Storage containers with a very large number of files
Version 12.2.0 - September 1st, 2023¶
DSS 12.2.0 is a significant new release with new features, performance enhancements and bugfixes.
New features and enhancements¶
Custom aggregations on charts¶
User Defined Aggregation Functions (UDAFs) allow users to create custom aggregations, based on a powerful formula language, directly from the chart builder.
For example, you can directly create an aggregation of sum(sell_price - cost) to compute an aggregated gross margin, without having to first create that column.
Radar chart¶
The Radar chart is now available. Radar Charts are a way of comparing multiple quantitative variables. This makes them useful for seeing which variables have similar values or if there are any outliers amongst each variable.
Radar Charts are also useful for seeing which variables are scoring high or low within a dataset, making them suited for displaying performance.
Popular datasets¶
The Data Catalog home page now lists datasets that have been detected as popular on the DSS instance. Popular datasets are datasets that have been shared with multiple other projects and are actively used as sources of new recipes. They are the ideal candidates to bootstrap your first data collections or to be added to existing data collections.
Govern Sign-off enhancements¶
Improvements to the sign-off feature allow you to:
Reset a finished sign-off
Reload an updated configuration from the Blueprint Designer
Create a sign-off on an active step if its configuration has been created afterwards
Set up a recurrence to automatically reset an approved sign-off
Have multiple feedback reviews per user
Edit and delete feedback reviews and approvals
Change the sign-off status to go back to a previous state
Send an email to the reviewers when the final approval is added or deleted
Additionally, a new validation option has been added in the sign-off configuration to prevent the workflow from going past an unapproved sign-off step.
It also comes with UI improvements such as:
Expand and collapse long feedback reviews
Display the sign-off description below the title
Show the feedback and approver groups with detailed info on which users are configured
Warning: Some changes have been made to the API around the sign-off feature. Pay attention to your usages of the Public API and, for Advanced Govern instances, to the logical hooks around the sign-off feature. On Advanced Govern instances only, you may currently use logical hooks that check the sign-off status (preventing the workflow from going past an unapproved sign-off step); these will no longer work in 12.2.0 due to the API changes. They can be replaced by the new validation option in the sign-off configuration to prevent going past an unapproved sign-off step. After enabling it, you will need to reset the corresponding sign-offs and reload their configuration.
PCA recipe¶
A new PCA recipe was added. The PCA recipe produces projections, eigenvectors and eigenvalues as three separate datasets.
You can create the PCA recipe from a PCA card in a dataset’s Statistics worksheets.
External Models¶
External Models allow a user to surface within Dataiku a model available as an endpoint on SageMaker, Azure ML or Vertex AI. Those models can be used like other Saved Models and, most notably, can be scored and evaluated.
This feature is currently Experimental.
For more details, please see External Models
Deployer Hooks¶
Deployer hooks allow administrators of a Project or API Deployment Infrastructure to define pre- and post-deployment hooks written in Python. For instance, a pre-deployment hook could perform some checks and prevent a deployment if they fail; a post-deployment hook could send a notification.
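As a conceptual sketch only (the actual hook contract and available context objects are defined in the Deployer documentation; all names below are illustrative assumptions), a pre-deployment hook is Python code that can veto a deployment:

    # Conceptual sketch: hook_context and its fields are illustrative assumptions,
    # not the documented DSS hook signature.
    def pre_deployment_check(hook_context):
        stage = hook_context.get("infrastructure_stage")
        if stage == "production" and not hook_context.get("change_ticket"):
            # Raising an exception prevents the deployment from proceeding
            raise Exception("Production deployments require a change ticket")

    def post_deployment_notify(hook_context):
        # e.g. post to a chat webhook or send an email
        print("Deployed %s" % hook_context.get("deployment_id"))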
Other enhancements and fixes¶
Flow¶
The “Records count” view now displays the exact records count under each dataset in the Flow
Added ability to export flow documentation when having read-only access to the project
Added ability to choose the name of the new zone when you copy a Flow zone
Added ability to copy a zone directly from the right panel
Fixed copying the default zone into a new zone duplicating flow objects into the original zone instead of the new zone
Fixed copying a zone not duplicating datasets without inputs
Fixed copying a zone to another project creating 2 zones into the destination project
Fixed “Recipe Engines” view not listing some engines such as “Snowflake to S3”.
Fixed creation of new datasets when creating a new recipe from “+ Recipe” button with no input selected
Datasets¶
BigQuery: Added ability to specify labels that will be applied to BigQuery jobs
Editable: Automatically add additional rows and columns when pasting data larger than the current table
Excel files: When selecting sheets by a pattern, matching sheets are now displayed
CSV: Fixed possible issue reading some CSV files
Snowflake: Fixed fast-path from cloud storage with date-partitioned datasets but non-date partitioning column
Snowflake: Fixed “Parse SQL dates as DSS date” setting not taken into account for Snowflake
Snowflake: Fixed issue with sync from non-SQL datasets with Spark engine
Prevented renaming datasets with the same name as a streaming endpoint
Fixed renaming datasets when only changing the case (from “DS1” to “ds1” for example)
Recipes¶
Generate features: Fixed failure when input dataset contains column names longer than what the output database can accept (the limit is 59 characters on PostgreSQL for example).
Split: Fixed adding a second input before selecting the output during creation
Data Catalog¶
Added ability to add multiple datasets to a Data Collection (either from the Flow or from a Data Collection)
Machine Learning¶
New feature: Causal Prediction now supports multiple treatments
New feature: Model comparisons now allow comparing feature importance between models
Fixed failure to compute the feature importance of a model causing the whole training to fail
Fixed failure to compute partial dependencies on features with a single value
Fixed missing option to use a Custom model in clustering model design settings
Fixed scoring of a model with Overrides using the Spark engine
Fixed missing dashboard model insight/tile option to show the Hyperparameter optimization report
Fixed incorrect aggregate computation of cost-matrix gain when using kfold cross-validation
Fixed possible hang of DSS when computing interactive scoring (What-if)
Fixed automatic selection of the code environment that could sometimes suggest an incompatible environment when creating a new modeling task
MLOps¶
When exporting a model to the MLflow format, its required packages are now added to the requirements.txt
In evaluation recipes, added ability to skip rescoring and use the prediction if provided in the evaluation dataset.
When computing univariate drift, better deal with missing categories by showing a very high PSI rather than having an infinite/missing value
With the public API, added ability to create custom model evaluation with arbitrary metrics.
Scoring recipes can now compute explanations for MLflow models
A model can now be deployed with the GUI from an Experiment Tracking run without being evaluated
Non classification/regression models can now be deployed with the GUI from an Experiment Tracking run
Monitoring wizard: only suggest the deployments that are relevant for the current project
Statistics¶
Added support for the FDR Benjamini-Hochberg method for p-values adjustment on the pairwise t-test and pairwise Mood test
Charts¶
New feature: Added ability to copy charts from one dataset to another
New feature: Added ability to customize tick marks
Scatter: Added ability to configure number of displayed records
Scatter: Various zoom and pan improvements
Scatter: Zoom and pan can now be persisted
Scatter: Fixed issues when there are too many colors
Bar charts: Improved color contrast for displayed values
Pivot table: Added ability to customize font size and color
Pie/Donut: Added option to position “others” group at the end
Treemap: Fixed tooltip color indicator
Added reset buttons for axis customization options
Improved zoom buttons on relevant charts (Treemap/Geometry/Grid/Scatter/Administrative filled/Administrative bubbles/Density Maps)
Added digit grouping formatting options
Fixed measure formatting update on tooltip
Fixed display formula for regression line
Increased precision for pivot table and maps tooltips
Improved legends display performance with many items
Fixed number formatting for reference lines on vertical bar and scatter charts
Workspaces and dashboards¶
Improved view/edit navigation on dashboards
Improved behavior of date range filter on dashboard
Fixed deletion of dashboard filters
Fixed dashboard export on air-gapped DSS instances
Added ability for users to override the name of workspace objects
Improved display of empty workspaces
Sort order on datasets now persists during a session
Coding¶
New feature: Added the ability to edit Jupyter notebooks in Visual Studio Code or JupyterLab via Code Studios
Project libraries: Added History tab to track, compare and revert changes.
Code Studios: automatically recover in case of network issues
Added ability to use the dataikuscoring library in the Python processor of the prepare recipe
Fixed ability to run a Python or R recipe from a SQL query dataset
Upgraded the builtin version of Visual Studio Code in Code Studios to 4.13
Fixed issues with uploading Jupyter notebooks from Databricks or Jupyter notebooks that do not specify a kernel
Code Studios: Fixed issue with Unicode characters in project libraries
Code Studios: Fixed ability to use Jupyter support in Visual Studio Code
Labeling¶
Input records with invalid or empty identifier / path / text are now ignored
Collaboration & Onboarding¶
Home page: Fixed clicking on a project folder after scrolling opening the wrong folder
Project activity > Contributors: Fixed error occurring on projects with a very large number of contributions
Help center: Added tutorials with progress tracking in Help > Educational Content > Onboarding
Project Version control: Added ability to create a tag from a commit
Project Version control: Added ability to push & pull tags when using a remote git repository
Project Version control: Fixed errors occurring during a force commit not being displayed
API Deployer¶
Pre-build required code environments during image build when deploying on a Kubernetes cluster, to speed up actual deployment
Added ability to add a commit and a tag when a bundle is created
Added an option to trust all certificates for infrastructures of static API nodes
Added support for variables for the specification of “New service id” in the “Create API service version” scenario step
Fixed running test queries on multi-endpoint API services
Project Deployer & Bundles¶
Added ability to add a commit and a tag when an API Service package is created
Better handling of bundles including a Saved Model with no active version: warn on pre-activation and activation, and raise a clearer exception when using the Saved Model
Static insights can now be included in bundles
Scenarios¶
Do not clear retry settings when disabling/enabling a scenario step
Added a new mail channel to send emails using Microsoft365 with OAuth.
Govern¶
Added new artifact admin permission that grants all permissions for a specific artifact
Added the ability to export an item and its content (workflow state, field values) to CSV or PDF files
Governed Project’s Kanban view now also includes projects using custom templates
Added the ability to add a project directly from a business initiative page
Fixed display issue with very long Blueprint name in the Blueprint Designer
Fixed standard deviation display issue on Model version metrics
Fixed display issue for field of type number and value 0
Improved the performance of some queries
Cloud Stacks¶
Fixed display issue of the “Please wait, your Dataiku DSS instance is getting ready” screen
Fixed missing display of some errors in Fleet Manager
Added warning when trying to set too small a data volume
Moved some temporary folders to the data volume to avoid filling the OS volume
Fixed default value for IOPS on EBS
Fixed issues making the Save button unavailable
Elastic AI¶
Fixed ability to create a SparkSQL recipe based on a SQL query dataset (it however remains a very bad idea)
Simplified interaction with Kubernetes for containerized execution: Kubernetes Jobs are not used anymore. DSS now creates pods directly
Added display of DSS user / project / … to Cluster Monitoring screens
GKE: Improved error message when gcloud does not have authentication credentials
GKE: Improved handling of pod and service IP ranges
GKE: Added support for spot VMs
Added support for using a proxy for building the API deployer base image with R enabled
Streaming¶
Fixed default code sample for Spark Scala Streaming recipe
Fixed default code sample for Python streaming recipe
Added ability to perform regular reads of datasets in a Spark Scala Streaming recipe
Fixed read of array subfields in Kafka+Avro
Fixed issue with using “recursive build downstream” in flow branches containing streaming recipes
Performance and scalability¶
Improved performance for listing jobs
Improved IO performance for starting up jobs
Improved memory usage
Fixed possible hang when creating an editable dataset from a large existing dataset
Security¶
Fixed credentials appearing in the logs when using Cloud-to-database fast paths
OpenID login: added ability to configure the “prompt” parameter of OpenID
User provisioning: clarified how group profile mappings are applied
Azure AD integration: Fixed support for users having more than 20 groups
OAuth2 authentication on API node: added configurable timeout for fetching the JWKs
Jupyter notebooks trust system is now on a per-user basis
Misc¶
Added settable random seed for pseudo random sampling methods, allowing for reproducible sampling.
Fixed display issue with the “Use global proxy” connection setting getting wrongfully reset
Analyses: Fixed adding or removing tags from the right panel
Improved display of code env usage in code env settings
Fixed cases where building a code env could silently fail
Fixed possible failure aborting a job
Fixed issue with displaying large RMarkdown reports
Fixed possible error in Jupyter
Fixed possible UNIX user race condition when starting a large number of webapps at once
dataiku.api_client() is now available from within exporter and fs-provider plugin components
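A minimal sketch of what this enables inside an exporter or fs-provider plugin component:

    import dataiku

    # Inside the plugin component's code, the authenticated API client can now
    # be obtained the same way as in recipes and notebooks:
    client = dataiku.api_client()
    print(client.get_auth_info())  # identity the component runs as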
Version 12.1.3 - August 17th, 2023¶
DSS 12.1.3 is a security, performance and bugfix release
Machine Learning¶
Fixed UI issue in model assertions
Fixed partial dependencies failure with sample weights
Fixed computation of partial dependencies when rows are dropped by processing
MLOps¶
Fixed possible failure to display model results for imported MLflow models built from recent scikit-learn versions
Fixed display of model results for imported MLflow models for which performance was not evaluated
Fixed display of API endpoint URL in API deployer
Fixed ability to deploy MLflow models that are not tabular classification nor regression
Fixed Python requirements for exported MLflow models
Govern¶
Fixed validation error when custom templates have been deployed and standard ones have been archived
Dashboards¶
Fixed filter on “no value” when downloading dataset data from dashboards
Cloud Stacks¶
Fixed issue with authentication when upgrading Fleet Manager directly from 10 to 12.1
Performance¶
Improved performance for reading records with dates from Snowflake
Fixed potential slow query and failure on the “Automation monitoring” page
Fixed flooding of logs with bad data in Excel export
Security¶
Added more sensitive information removal from support diagnostic archives
Misc¶
Added the ability to embed Dataiku in another website through setting “SameSite=None” for cookies
Fixed Databricks sync to Azure/S3 with pass-through credentials when Unity Catalog is disabled
Fixed issues with display of list of scenarios in some upgrade situations
Fixed minor display issue in Wiki taxonomy tree
Fixed display of Flow in jobs page with big flows
Version 12.1.2 - July 31st, 2023¶
DSS 12.1.2 is a security, performance and bugfix release
Datasets¶
Explore: Fixed filtering of Decimal columns with “text facet” filtering mode
Editable dataset: increased display density
Editable dataset: fixed bad interaction with the Tab key
Editable dataset: improved column editing and autosizing experience
Editable dataset: fixed bad interaction with keyboard shortcuts while editing a column
Snowflake: Strongly improved performance of verifying table existence and importing tables
Presto/Trino: Strongly improved performance of verifying table existence and importing tables
Databricks: Fixed wrongful cleanup of temporary tables for auto-fast-write
Recipes¶
Prepare: Fixed a case where the formula parser would wrongfully ignore invalid formula and only execute parts of the formula
Prepare: removed a wrongful warning regarding dates with SQL engine
Prepare: fixed wrongful data loss when using “if then else” to write into an existing column with SQL engine
Prepare: fixed number of steps appearing in the description in the right panel of the recipe
Window: Fixed pre-computed columns when “always retrieve all” is selected and Spark engine is used
Window: Fixed display when “always retrieve all” is selected
Machine Learning¶
Removed ability to export train set if datasets export is disabled
Fixed wrongful binary classification threshold in evaluation recipe
Fixed wrongful confusion matrix not taking threshold into account in drift evaluation
Fixed precision-recall curve with Python 2.7
Fixed what-if when a feature is empty and selected to “drop row if empty”
Fixed SQL scoring on BigQuery
Labeling¶
Object detection: fixed an issue when a single image has more than 5 labels
Dashboards and workspaces¶
Fixed display of Dataiku applications viewed through a workspace
Webapps¶
Fixed ability to retrieve headers for Bokeh 2
Dataiku Govern¶
Fixed improper status computation on the review step when there are unvalidated signoffs in the following steps
Fixed display of SSO settings
Elastic AI¶
Fixed ability to run Spark History Server behind a reverse proxy
Cloud Stacks¶
Fixed issues saving forms in the Fleet Manager UI
Pre-create the “cpu/DSS” cgroup to make it easier to control CPU through cgroups
Increased system limits that were too low on some components
Performance and scalability¶
Fixed performance issue when renaming datasets on extremely large instances
Fixed possible instance crash when using the “compute ngrams” prepare processor with extremely large number of ngrams
Improved performance of the “Automation monitoring” page
Miscellaneous¶
Removed extra whitespace in logging remapping rules to avoid hard-to-investigate issues
Version 12.1.1 - July 19th, 2023¶
DSS 12.1.1 is a security, performance and bugfix release
Statistics¶
Fixed STL decomposition analysis when resampling is disabled
Machine Learning¶
Fixed charts on predicted data when a date filter is set
Performance and Scalability¶
Fixed performance issue when switching from recipe to notebook, when the recipe code contains a lot of spaces
Fixed issue with notebooks startup when kernel takes too long to start
Security¶
Version 12.1.0 - June 29th, 2023¶
DSS 12.1.0 is a significant new release with new features, performance enhancements and bugfixes.
New features and enhancements¶
Dataset preview on the Flow¶
You can now preview the content of datasets directly from the Flow. Simply click on “Preview”.
Databricks Connect¶
Support for Databricks Connect was added in Python recipes.
It is now possible to push down PySpark code to Databricks clusters using a Databricks connection.
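For reference, a rough sketch of the underlying Databricks Connect mechanism (shown outside DSS with databricks-connect 13+; in DSS, the Databricks connection supplies the credentials, and the table name below is a Databricks sample):

    # Sketch of the Databricks Connect mechanism the DSS integration builds on
    from databricks.connect import DatabricksSession

    spark = DatabricksSession.builder.getOrCreate()  # uses local Databricks config
    df = spark.read.table("samples.nyctaxi.trips")   # executed on the Databricks cluster
    print(df.limit(5).toPandas())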
More charts customization and features¶
Many new capabilities and customization options were added to charts and dashboards
Added the ability to set the position of the legend of charts on dashboard
Added the ability to customize font size and colors for values, legend items, reference lines, axis labels and axis values
Added “relative date range” filters for charts and dashboards (“last week”, “this year”, …)
Added ability to force displayed values to overlap
Bar charts: Added reference lines (horizontal lines)
Scatter plots: Added reference lines (horizontal lines)
Scatter plots: Added regression lines
Scatter plots: Added zoom and pan
New join types¶
The join recipe now supports 2 new types of joins:
Left anti join: keep rows of the left dataset without a match from the right
Right anti join: keep rows of the right dataset without a match from the left
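For reference, the semantics of the two new join types, sketched in pandas:

    import pandas as pd

    left = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    right = pd.DataFrame({"id": [2, 3, 4]})

    # Left anti join: rows of `left` with no match in `right` -> id 1
    m = left.merge(right[["id"]], on="id", how="left", indicator=True)
    left_anti = m[m["_merge"] == "left_only"].drop(columns="_merge")

    # Right anti join: rows of `right` with no match in `left` -> id 4
    m = right.merge(left[["id"]], on="id", how="left", indicator=True)
    right_anti = m[m["_merge"] == "left_only"].drop(columns="_merge")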
Text Labeling¶
In addition to image classes and object bounding boxes, Dataiku managed labeling can now label text spans in text fields.
Visual Time series decomposition¶
Visual Statistics now includes visual STL time series decomposition (trend and seasonality)
New editable dataset UI¶
A new UI for the “editable” dataset adds many new capabilities:
Easier resizing of columns
Auto-sizing of columns
Click-and-drag to fill
Ability to add several rows and columns at once
Ability to reorder & pin columns with drag-and-drop
Fixed various issues with undo/redo
Added warning when attempting concurrent edition
Excel sheet selection enhancements¶
Excel files sheet selection was revamped. It is now possible to select sheets manually or via rules based on their names or indexes, or to always select all sheets.
In addition, it is now possible to add a column containing the source sheet name.
Enhanced user management capabilities¶
Added the ability to automatically provision users at login time from their SSO identity
Added Azure AD integration to provision and sync users
Added the ability to explicitly resync users (either from the UI or from the API) from their LDAP or Azure AD identity
Added the ability to browse LDAP and Azure AD directories to provision users from their LDAP or Azure AD identity at will (without them having to login first)
Added the ability to define and use custom authentication and provisioning mechanisms
Other enhancements and fixes¶
Machine learning¶
New feature: Added a Precision-Recall curve to classification model reports, as well as Average-Precision metric approximating the area under this curve
Added support of ML Overrides to Model Documentation Generation
Added indicators in What-if when a prediction was overridden
Now showing preprocessed features in model reports even when K-fold cross test was enabled on this model
Added option to export the data underlying Shapley feature importance
Sped up training of partitioned models
Added a “model training diagnosis” in Lab model trainings, to download information needed for troubleshooting technical issues
Fixed reproducibility of Ridge regression models
Fixed the computation of the multiclass ROC AUC metric in the rare case of a validation set with only 2 classes
Fixed a possible scoring failure of ensemble models on the API node
Fixed overridden threshold of binary classification model when scoring with Spark, Snowflake (with Java UDF) or SQL engines
Fixed a failure when an evaluation recipe was run on a spark-based model with either only a metrics output or only a scored output dataset
Fixed a failure to score time-based partitioned models using the python (original) backend when the partitioning column is a date or timestamp
Fixed a scoring failure when using time-based partitioning on year only
Fixed inability to delete an analysis containing a Keras / Tensorflow model
Fixed a condition where an erroneous user-defined metric would cause the whole training to fail
Fixed training failure caused by incorrect stratification of stratified group k-fold with some datasets
Fixed a possible hang of a containerized train when the training data is very large
Fixed broken Design page for modeling tasks in some rare cases
Fixed MLlib clustering with outlier detections
Time series forecasting¶
New feature: Added Model Documentation Generation for forecasting models
Added experimental support for forecasting models with more than 20000 series
Added option to sample the first N records sorted by a given column
Added ML diagnostics to the evaluation & scoring recipes, warning instead of failing when a time series is too short to be evaluated or resampled, or when a new series was not seen at train time by a statistical model
Added an option in multi-series forecasting models to ignore time series that are too short
Sped up the loading & display of multi-series forecasting models
Set the default thread count of forecasting models hyperparameter search to 1, to ensure full reproducibility
Fixed distributed hyperparameter search of time series forecasting models
Fixed evaluation recipe schema recomputation always using the saved model’s active version even when overridden in the recipe
Fixed failure when the time column contains timezone and using recent version of pandas
Fixed a training failure when some modalities of a categorical external feature are present in the test set but not in the train set
Fixed a failing train of multi-series models when an identifier column contains special characters in its name
Fixed a training failure when using Prophet with the growth parameter set to “logistic”
Computer Vision¶
Added support for log loss metric in Image Classification tasks
Added ability to publish a Computer Vision model’s What-If page to a Dashboard
Fixed a possible failure when coming back to the What-If screen of Computer Vision models after visiting another page
Fixed a possible training failure when Computer Vision models are trained in containers
Fixed incorrect learning rate scheduling on Computer Vision model trainings
Charts & Dashboards¶
Fixed dashboard export with filter tile
Fixed dataset insights appearing unfiltered for a short moment when opening a dashboard
Stacked bars chart: Added ability to remove totals when “displaying value”
Bars: Fixed Excel export
Horizontal bars: Fixed X axis disappearing
Line charts: Fixed axis scale update on line charts
Pivot table: remove “value” column if only one measure is displayed
Scatter plots: Made maximum number of displayed points configurable
Maps: Fixed display of legends with “in chart” option
Boxplots: Fixed chart display when the minimum is equal to zero
Boxplots: Fixed display of min and max now that a manual range can be set
Added reference lines in Excel export
Fixed Excel export for charts with measures
Fixed “export insight as image” not displaying legend
Fixed tooltip display on each subchart
Improved empty state and wording for workspaces
Fixed issue with selecting text in chart configuration forms
Fixed thumbnail generation when using manual axis
Fixed discrepancy in filter behavior between DSS and SQL engines when data contains null values
Notebooks¶
Added “Search notebooks” to easily search within ElasticSearch datasets
Code Studios¶
Streamlit: Allowed changing the config.toml
Streamlit: Allowed specifying a code env for the Streamlit block, allowing a custom Streamlit version to be chosen
JupyterLab: Fixed block building failing on AlmaLinux
JupyterLab: Added warning when stopping Code Studio and some files have been written in JupyterLab’s root directory
JupyterLab: Fixed renaming folders whose names contain whitespaces in JupyterLab
Fixed unexpected visual behavior when clicking on a DSS link inside Code Studio while not authorized
Fixed wrongful display of old log messages
Fixed “popout the current tab” button not working under some circumstances
Set ownership of code-envs created with the “add code environment” block to dataiku user
Flow¶
Added “stop at flow zone boundary” option when building multiple datasets at once.
Fixed incorrect layout when a metrics dataset or a cycle is present in a flow zone
Fixed unbuilt datasets appearing as built after a change in an upstream recipe caused their schemas to be updated
Fixed zone coloring when doing rectangular selection on the Flow
Added support for “metrics” dataset when doing schema propagation
Fixed “copy subflow to another project” failing when quick sharing is enabled on the first element
Fixed “Drop data” option for “Change connection” action
Fixed update of code recipes when renaming a dataset while copying a subflow
Datasets¶
Fixed leftover file when deleting an editable dataset without checking drop data
Added support for direct read of JSON files from Spark
Fixed dataset explore view not behaving correctly if the last column is named “constructor”
Added support for “_source” keyword in Custom Query DSL for ElasticSearch datasets
Added support for Azure Blob to Synapse fast path when network restriction is enabled on the Azure Blob storage account
Do not propose “Append instead of overwrite” for Parquet datasets, as it’s not supported
Improved error reporting for various cases of invalidly-configured datasets
Fixed BigQuery auto-write fast path with non-string partitioning columns
Added support for S3/Redshift fast path when using STS tokens
Recipes¶
New feature: Generate features: now supports Spark engine
New feature: Added recipe summary in right panel for sample/filter, group, join and stack recipes
Prepare: Fixed “concat” processor on Synapse
Prepare: Fixed preview of Formula editor not showing anything when the formula generates null values for all input values in the sample
Prepare: Fixed a possible timeshift when input Snowflake datasets contain columns of type “date”.
Prepare: Fixed possible error when moving preparation steps when input dataset is SQL
Prepare: Fixed possible incorrect engine selection when input dataset is SQL
Prepare: Added SQL engine support for “Concatenate columns” steps on Synapse datasets.
Prepare: Fixed wrongful change tracking for changes made on columns that have just been added by a processor
Prepare: Fixed wrongful “Save” indicator whereas recipe was already saved
Prepare: Disable Spark engine when “Enrich with context information” processor is used
Prepare: Fixed saving of output schema with complex types with detailed definition
Group and window: Fixed unexpected error when using an aggregation on a column that doesn’t exist in the input of a Group or Window recipe.
Fuzzy Join: fixed wrongful “metadata” output when using multiple join conditions
Window: Added “Retrieve all” checkbox to automatically retrieve all columns in the input dataset. This option is checked by default for all newly created recipes.
Sync: Fixed possible timeshift when input Databricks datasets contain columns of type date.
Sync: Fixed redispatching partitions with both a discrete and a time-based dimension
Sync: Fixed computing of metrics on output dataset with partition redispatch
Pivot: Fixed issue with BigQuery geography columns
Join: Fixed “match on nearest date” on Synapse
Data collections¶
Improved loading time of the various screens
Fixed filters being reset when refreshing data collection page
Labeling¶
New feature: Added ability to specify additional columns to be displayed next to the image or text being annotated
New feature: Added ability for reviewers to reject an annotated item and send it back for annotation
Fixed inability to delete a Labeling Task’s data when its input dataset is shared from another project
Jobs¶
New feature: Job view now displays Flow with Flow zones
Fixed clicking on a Job activity for a dataset that has been deleted
Fixed blank flow in Jobs screen on some large flows
Fixed Job failure incorrectly reported when building datasets with option “Stop at zone boundary” and a dependency located outside the flow zone is not built.
Fixed “there was nothing to do” displayed while job is still computing dependencies
Webapps¶
Webapps do not auto-start anymore at creation
Scenarios¶
Added “stop at flow zone boundary” option.
Fixed unexpected error generated when a scenario “Run checks” step references a non-existing dataset.
MLOps¶
Added support for MLflow 2.3
Added support for Transformers, LangChain and LLM flavors of MLflow
Added support of MLflow model outputs as lists
Added a project macro to delete model evaluations.
Metrics and checks datasets are now created in the same flow zone as the object they relate to.
Added the ability to define a seed in the evaluation recipes when using random sampling
In the standalone evaluation recipe, eased the setup of classes when there are many by allowing them to be copied/pasted.
Fixed Python 2.7 encoding issues in the evaluation recipe when dealing with non-ASCII characters
Fixed support of MLflow models returning non-numeric results
Eased the setup of the standalone evaluation recipe for pure data drift monitoring (prediction column is now optional)
Fixed incorrect handling of forced threshold in a proba-aware, perfless standalone evaluation recipe
Fixed the computation of the confusion matrix with Python 3.7
Avoid creating a Saved Model when errors occur during the deployment of a model from an experiment tracking run.
Fixed the creation of API service endpoint from an MLflow imported model with prediction type “Other”
Fixed the import of a new Saved Model Version, from an experiment tracking run model with prediction type “Other”, into an existing Saved Model.
Fixed an issue preventing the import of new MLflow model versions into an existing Saved Model from a plugin recipe.
Fixed import of projects exported with experiment tracking
Deployer¶
New feature: Added the ability to publish a bundle to the deployer without being project admin
Added historization and display of deployment logs in project and API deployers
Added autocompletion on connection remappings in deployments and deployer infrastructure settings
Added infrastructure status for the API node in API deployer
Prevent the creation of two bundles with the same name
Fixed the setup of permissions of deployer related folders when installing impersonation
Enhanced the ability of deployments to customize parts of the exposition settings of the infrastructure
Dataiku Govern¶
New feature: Improved the graphical structure of artifact pages and the way fields are displayed within it
New feature: Added the custom metrics in the Model Registry
New feature: Added the ability to filter on multiple business initiatives
New feature: Added the possibility to set a reference from a back reference field
Explicitly labeled default governance templates as “Dataiku Standard”
Improved the creation of items inside tables (do not propose already selected items, redirect back to the table after item creation).
More explicit message for object deletion
Simplified breadcrumb on object pages: it is now based only on object hierarchy, no longer on navigation history.
Fixed an issue with the selection of a Business Initiative at govern time when the govern template doesn’t have a Business Initiative
In all custom pages, by default, prevented the display of archived objects and added a checkbox to display them
Forbade the usage of an archived blueprint version when governing an object or creating a new one (Note: “auto” governance doesn’t take archived blueprint versions into account anymore either)
More explicit button labels for blueprint version activation and archiving
Fixed a refresh issue on the object breadcrumb when updating the object’s parent
Fixed an issue on deployment update when the govern API key is missing from the deployer’s settings
Fixed the application of the node size selected during installation
Fixed filters not being taken into account when mass selecting users in the administration menu
Various small UI enhancements
Elastic AI¶
Clusters monitoring: added CPU and memory usage information on nodes
Clusters monitoring: improved sorting
AKS: Added support for selecting subscription when using managed identity
AKS: Added support for deleting nodegroups
EKS: Fixed failures with some specific kubectl binaries
EKS: Wait for nodegroup to be deleted before giving back control, when resizing it to 0
EKS: Fixed “test network” macro
Fixed invalid labels that could be generated with some exotic project keys
Cloud Stacks¶
Added ability to resize root disk on Azure
Fixed handling of “sshv2” format for SSH keys
Added ability to enable assignment of public IP in subnets created with the network template
Added ability to retrieve Fleet Manager SSL certificate from Cloud’s secret manager.
Performance & Stability¶
Major performance enhancements on handling of datasets with double or date columns, especially when using CSV. Performance for reading datasets in Python recipes and notebooks can be increased by up to 50%
Added safety limits to CSV parsing, to avoid cases where broken or misconfigured CSV escaping can cause a job to fail or hang
Added safety limit on the number of garbage collection threads to DSS job processes and Spark processes, to limit the risk of runaway garbage collection overconsuming CPU
Added safety limit on filesystem and cloud storage enumerations to avoid crashes when enumerating folders containing dozens of millions of files
Fixed possible crash when computing extreme number of metrics (such as when performing analysis on all columns on all data with thousands of columns)
Performance enhancement when custom policy hooks (such as GDPR or Connections/Projects restrictions) are in use
Fixed possible instance hanging when a lot of job activities are running concurrently
Fixed possible instance slowdown when a custom filesystem provider / plugin uses partitioning variables
The startup phase of a new Jupyter notebook kernel will not cause pauses for other notebooks running at the same time anymore
Code envs¶
Made dsscli command to rebuild code envs more robust on automation node
Fixed ability to use manually uploaded code env resources without a script
Plugins¶
Fixed “run as local process” flag on plugin webapps
Fixed code environment of some plugins failing to install when using conda
Misc¶
New feature: DSS administrators can now display messages to DSS end users in their browser to alert them of some imminent event.
Fixed a bug where some deleted project library files would remain loaded after reloading a notebook
Fixed RFC822 date parsing with non-US locale
Fixed link to managed folders located in a different project from the global search page
Renamed “Drop pipeline views” macro to “Drop DSS internal views” macro as it can also be used to drop views created by the Generate features recipe.
Added back the ability for users to choose, in their profile page, whether they receive notifications when others run jobs/scenarios/ML tasks.
Projects API: New projects are now created with the new permission scheme introduced in DSS 10
Fixed deletion of foreign datasets in a project incorrectly warning that recipes associated with the original dataset in the source project would be deleted.
Fixed sorting of datasets by size/records in the datasets list view
Fixed listing Jupyter notebooks from git when some .ipynb files are invalid
Fixed dataset metrics/checks computed using Python probes being considered valid even when the code raises an exception
Improved search for wiki articles with words in camel case (previously, searching for “MachineLe” would not return articles containing “machine learning”)
Formula: Some invalid expressions are no longer accepted and now yield errors; they were previously incorrectly considered valid. For example, “Age * 10 (-#invalid” is invalid yet was previously accepted and evaluated as “Age * 10”.
Streaming: Fixed various issues with containerized continuous Python recipe
Fixed deletion of secrets from connection settings
Fixed wrongful caching of Git repositories with experimental caching modes
Version 12.0.1 - June 23rd, 2023¶
DSS 12.0.1 is a security, performance and bugfix release
Datasets¶
Fixed format preview when creating dataset from folder with XML files
Fixed error when reading a Snowflake dataset with a DATE column containing nulls
Streaming¶
Fixed continuous Python recipe in function mode when dataframe is empty
Machine Learning¶
Fixed scoring recipe when the treatment column is missing in the input dataset
Cloud Stacks: Fixed usage of Snowflake UDF in scoring recipe
Spark¶
Fixed support of INT type with parquet files in Spark 3
Notebooks¶
Fixed notebooks export when DSS Python base env is Python 3.7 or Python 3.9
Performance¶
Fixed run comparison charts of experiment tracking when there are > 100k steps (11.4.4)
API¶
Allowed read-only users to retrieve, through the REST API, the metadata of a project they have access to (11.4.4)
Security¶
Fixed scenarios being abortable by users with only read-only permission on the project (11.4.4)
Version 12.0.0 - May 26th, 2023¶
DSS 12.0.0 is a major upgrade to DSS with many new features.
Major new features¶
Machine Learning overrides¶
ML models today can achieve very high levels of performance and reliability, but unfortunately this is not always the case, and they often cannot be fully trusted for critical processes. There are many known reasons for this, including overfitting, incomplete training data, outdated models, and differences between the testing environment and the real world.
Model overrides allow you to add an extra layer of human control over the models’ predictions, to ensure that they:
don’t predict outlandish values on critical systems,
comply with regulations,
enforce ethical boundaries.
By defining Overrides, you ensure that the model behaves in an expected manner under specific conditions.
Please see Prediction Overrides for more details.
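For intuition only, here is a minimal sketch of what an override rule conceptually does, written as plain Python. This is not DSS's override engine, and every name in it is hypothetical:

```python
# Illustrative only: the concept behind prediction overrides, not DSS's
# actual implementation. A rule replaces the model's raw prediction when
# a condition on the input row holds.

def apply_overrides(row, raw_prediction, rules):
    """Return the prediction after applying the first matching override rule."""
    for condition, override_value in rules:
        if condition(row):
            return override_value
    return raw_prediction

# Hypothetical business rule: never quote a premium below a regulatory
# floor for young motorcycle drivers, whatever the model says.
rules = [
    (lambda r: r["age"] < 21 and r["vehicle"] == "motorcycle", 500.0),
]

print(apply_overrides({"age": 19, "vehicle": "motorcycle"}, 230.0, rules))  # 500.0
```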
Universal Feature Importance¶
While some models are interpretable by design, many advanced algorithms appear as black boxes to decision-makers, or even to data scientists themselves. The new model-agnostic global feature importance capability helps you:
explain models that could not be explained until now
explain models in an agnostic, comparable way (rather than only using algorithm specific methods)
aggregate importance across categories of a single column
assess relative direction (in addition to magnitude of importance)
This new feature extends and enhances the existing feature importance and individual explanation capabilities. It is fully based on Shapley values and enriched with state-of-the-art visualisations.
This capability is even available for MLflow models imported into DSS.
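As a rough illustration of the Shapley-based approach (DSS's internal computation may differ), the open-source shap package can produce both the magnitude and a direction signal of the kind described above:

```python
# Sketch: model-agnostic global feature importance from Shapley values.
# Mean |SHAP| per feature gives the magnitude; correlating each feature's
# values with its SHAP values gives a rough direction signal.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Model-agnostic explainer: only needs a predict function and background data
explainer = shap.Explainer(model.predict, X[:100])
shap_values = explainer(X[:100]).values  # shape: (rows, features)

magnitude = np.abs(shap_values).mean(axis=0)
direction = [np.corrcoef(X[:100, j], shap_values[:, j])[0, 1]
             for j in range(X.shape[1])]
print(magnitude, direction)
```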
Causal Prediction¶
Most Machine Learning projects involve predicting outcomes. However, in many cases, the focus shifts towards optimizing outcomes based on actionable variables rather than just predicting them. For example, you may want to improve business results by identifying customers who will respond best to certain actions, rather than simply predicting which customers will churn.
Traditional prediction models are built with the assumption that their predictions will remain valid when actionable variables are manipulated. However, this assumption is often false, as there can be various reasons why acting on an actionable variable does not have the expected outcome. For example, acting on one variable may have unforeseen consequences on other variables, or the actionable variable may be unevenly distributed in the population, making it difficult to compare individuals with different values of the variable.
To address these challenges, the field of Causal Machine Learning (Causal ML) has emerged, incorporating econometric techniques into the Data Science toolbox. In Causal ML, a Data Scientist selects a treatment variable (such as a discount or an ad) and a control value to tag rows where the treatment was not received. Causal ML then performs additional steps to identify individuals who are likely to benefit the most from the treatment. This information can then be used for treatment allocation optimization, such as determining which customers are expected to respond most positively to a discount.
The Causal Prediction analysis available in the Lab provides a ready-to-use solution for training Causal models and using them to predict the effects of actionable variables, optimize interventions, and improve business outcomes.
Please see Causal Prediction for more details.
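For intuition, the sketch below implements the simplest causal meta-learner (a "T-learner") with scikit-learn. DSS's Causal Prediction is a visual analysis and may rely on different estimators, so treat this only as an illustration of the concept:

```python
# Sketch of a T-learner: fit one outcome model per treatment arm, then
# estimate each individual's treatment effect as the difference between
# the two predictions. Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
treatment = rng.integers(0, 2, size=1000)   # 1 = received the treatment (e.g. a discount)
true_effect = 2.0 * (X[:, 0] > 0)           # effect only exists for half the population
y = X @ np.array([1.0, -0.5, 0.2]) + treatment * true_effect \
    + rng.normal(scale=0.1, size=1000)

m_treated = RandomForestRegressor(random_state=0).fit(X[treatment == 1], y[treatment == 1])
m_control = RandomForestRegressor(random_state=0).fit(X[treatment == 0], y[treatment == 0])

uplift = m_treated.predict(X) - m_control.predict(X)  # estimated individual effects
print(uplift[:5])  # allocate the treatment to rows with the largest predicted uplift
```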
Auto feature generation¶
The new “Generate Features” recipe makes it easy to enrich a dataset with new columns in order to improve the results of machine learning and analytics projects. You can define relationships between datasets in your project.
DSS will then automatically join, transform, and aggregate this data, ultimately creating new features.
Please see Generate features for more details.
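Conceptually, the recipe automates the kind of join-and-aggregate enrichment sketched below in pandas. The actual recipe is visual and engine-backed, and the dataset and column names here are made up:

```python
# Sketch: turning a one-to-many relationship into per-row features by
# aggregating the secondary dataset, then joining it back.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["A", "B"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 25.0, 7.5]})

order_feats = (orders.groupby("customer_id")["amount"]
               .agg(order_count="count", order_total="sum", order_mean="mean")
               .reset_index())

enriched = customers.merge(order_feats, on="customer_id", how="left")
print(enriched)
```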
Data Collections and Data Catalog¶
Data collections allow you to gather key datasets by team or use case, so that users can easily find and share quality datasets to use in their projects.
Data Collections, Data Sources search and Connections explorer now live together as the new Data Catalog in DSS.
Run subsequent recipes and on-the-fly schema propagation¶
For all intermediate recipes in a flow, when you click “run” from within the recipe, you now have an option to either:
Run just that recipe
Or run that recipe and all subsequent ones in the Flow, with the effect of making the whole “downstream” branch of the Flow up-to-date.
“Run this recipe and all subsequent ones” also applies schema changes on the fly to the output datasets, until the end of the Flow
It is now also possible, from the Flow, to build “downstream” (from left to right) all datasets that are after a given starting point. This also includes the ability to perform on-the-fly schema propagation
Help Center¶
Dataiku now includes a brand new integrated Help Center that provides comprehensive support, including a searchable database, onboarding materials, and step-by-step tutorials. It offers contextually relevant information based on the page you’re viewing, aiding in feature discovery and keeping you updated with the latest additions.
This Help Center serves as a one-stop solution for all user needs, ensuring a seamless and efficient user experience.
Other notable enhancements and features¶
Build Flow Zones¶
It is now possible to build an entire Flow zone. This builds all “final” datasets of this zone, and does not go beyond the boundary of the zone.
Deployer permissions management upgrades¶
When deploying projects from the Deployer, it is now possible to choose the “Run as” user for scenarios and webapps in the deployed project on the automation node. This change can only be performed by the infrastructure administrator on the Deployer.
In addition, the infrastructure administrator on the Deployer can also configure:
Under which identity projects are deployed to the automation node
Whether to propagate the permissions from the project in the design node to the automation node
Engine selection enhancements¶
Various enhancements were made to engine selection, so that users need to care much less often about which engines to select. In the vast majority of cases, we recommend leaving engine selection to DSS, without manually selecting engines or setting preferred or forbidden engines.
The most notable changes are:
Automatically select the SQL engine for prepare recipes when possible and efficient (i.e. when both input and output are in the same database)
Do not automatically select the Spark engine when it is certain to be inefficient (i.e. when the input or output cannot use fast Spark access)
Prophet algorithm for Time Series Forecasting¶
Visual Time Series Forecasting now includes the popular Prophet algorithm.
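For reference, here is the typical usage pattern of the open-source prophet package that backs this algorithm (in DSS you simply select Prophet in the visual interface; the ds/y column names are Prophet's own convention):

```python
# Sketch: fitting Prophet on a daily series and forecasting 30 days ahead.
import pandas as pd
from prophet import Prophet

df = pd.DataFrame({
    "ds": pd.date_range("2022-01-01", periods=120, freq="D"),  # timestamps
    "y": range(120),                                           # target values
})

model = Prophet()  # additive trend and seasonalities by default
model.fit(df)

future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```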
API service monitoring wizard¶
A new wizard makes it much easier to set up a full API service monitoring loop that gathers the query logs from the API nodes in order to automate drift computation.
Govern: Management of deployments¶
Added the synchronization of deployments and infrastructure information from the deployer node into the govern node, providing more information in the Model and Bundle registries about how and where those objects are used.
Govern: Kanban View¶
A new Kanban view allows you to easily get a view of all your governed projects
Charts: Reference lines¶
It is now possible to define horizontal reference lines on Line charts and Mixed charts
Request plugin installation¶
Users who are not admin can now request installation of a plugin from the plugin store. The request is then sent to administrators, and the user is notified when the request is processed.
Request code env setup¶
Users who do not have the permission to create code envs can now request the setup of a code env from the code envs list. The request is then sent to administrators, and the user is notified when the request is processed.
Model Document Generation for imported MLflow models¶
The automatic Model Document Generator now supports MLflow imported models.
Other enhancements and fixes¶
Datasets¶
Added a setting to make the Image View the default view for a dataset
Added the time, in addition to the date, to the Last modification column in folder content listings
Fixed “copy row as JSON” on filtered datasets
Explore: Fixed issue when using relative range and alphanum values filters together
Fixed “Edit” tab incorrectly displayed on shared editable datasets
S3: Increased the default maximum size of files created on S3 to 100 GB
Snowflake: Added support for custom JDBC properties when using the Spark-Snowflake connector
Snowflake: Fixed timezone issues on fields of type DATE when parsed as a DSS date
Snowflake: Added support for privatekey in advanced JDBC properties when using Snowpark
BigQuery: Fixed internal error happening if user has access to 0 GCP projects
BigQuery: Fixed syncing of RECORD and JSON columns containing NULL values
BigQuery: Fixed missing error message when table listing is denied by BigQuery
BigQuery: Fixed date issues on Pivot, Sort and Split recipes
Visual recipes¶
Prepare: Stricter default behavior of column type inference at creation time. The column types of strongly typed datasets (e.g. SQL, Parquet) are kept. This behavior can be changed in Administration > Settings > Misc.
Prepare: Improved summary section in the right panel to quickly assess what the recipe is doing.
Join: Added a new mode to automatically select columns if they do not cause name conflicts
Join: Fixed second dataset’s columns selection being reset when opening a recipe with a cross-join
Join: Fixed ability to define a Join recipe that uses one of its input datasets as its output dataset
Pivot: Fixed empty screen for “Other columns” step displayed when switching tabs
Group: Fixed concat distinct option being disabled even for SQL databases that support it
Formula language: Fixed the now() function generating a result that could not be compared to other dates using the >, >=, < or <= operators
Flow¶
Fixed running job icons in Flow not always correctly displayed
Fixed Flow zoom incorrectly reset when navigating between projects with and without zones
Visual Machine Learning¶
Added support for Python 3.8 and 3.9 to Visual Machine Learning, including Visual Time Series Forecasting and Computer Vision tasks.
Added support for Scikit-learn 1.0 and above for Visual Machine Learning. Note that existing models previously trained with scikit-learn below 1.0 and using the following algorithms need to be retrained when switching to scikit-learn 1.0 (which may happen if the DSS builtin env is upgraded to Python 3.7 or Python 3.9): KNN, SVM, Plugin algorithms, Custom Python algorithms
Updated the default versions of scikit-learn and scipy in the sets of packages for Visual Machine Learning for code environments
Added Sort & Filter to the Predicted Data tab
Added the Lift metric to the model results
Fixed Distance weighting parameter not taken into account when training KNN models
Fixed failure of clustering scoring recipe when the scored dataset lacks some features that were rejected
Removed redundant split computation during training
Fixed intermittent failures of Model Document Generator on some models
Fixed a rare situation where the Cost Matrix Gain metric would not display
Visual Time Series Forecasting¶
Added ML Diagnostics to TS Forecasting
Added a result page to show ARIMA orders
Added a new Mean Absolute Error (MAE) metric
Switched to Mean Absolute Scaled Error (MASE) as the default optimization metric. The previous default (MAPE) may lead to training failure when a series has only 0s as target values (see the sketch after this list).
Improved display of various results for multiple-series models
Improved support of Month time unit, for periods ending on the last day of a month or spanning more than 12 months
Added more, and more prominent, warnings when a time series does not have enough (finite and well-defined) data points for forecasting
Fixed computation (and warning) of minimum required data points for external features in the scoring recipe
Fixed a bug where forecasting models trained in earlier DSS versions had their horizon changed to 0 when retrained
Fixed default value of low pass filter for Seasonal Trend when enabled and lower than the season length
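To make the MASE-vs-MAPE note above concrete, here is a small sketch of both formulas (illustrative only, not DSS's exact implementation):

```python
# MAPE divides by the actual values, so an all-zero target series makes
# it undefined; MASE scales by the in-sample naive-forecast error instead.
import numpy as np

def mape(actual, forecast):
    return np.mean(np.abs((actual - forecast) / actual))   # breaks when actual == 0

def mase(actual, forecast, train):
    naive_error = np.mean(np.abs(np.diff(train)))          # one-step naive baseline
    return np.mean(np.abs(actual - forecast)) / naive_error

train = np.array([3.0, 5.0, 4.0, 6.0])
actual, forecast = np.array([5.0, 7.0]), np.array([4.5, 6.5])
print(mase(actual, forecast, train))     # well-defined
print(mape(np.zeros(2), forecast))       # inf: undefined on all-zero targets
```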
Charts & Dashboards¶
New feature: Filters: Added ability to define filters with single selected value
New feature: Mixed chart: Added line smoothing option
Line chart: Fixed tooltips not correctly triggered in subcharts other than the first one
Line chart: Fixed axis minimum wrongly computed when switching to manual range
Scatter plot: Fixed axis and canvas not being aligned when the browser is zoomed
Scatter plot: Fixed tooltips not showing up for points where y=x
Treemap: Fixed treemap not rendered under certain circumstances on Firefox
Boxplot: Fixed sorting order
Filters: Fixed the date slider not being reset when switching from date part to date range
Filters: Fixed numeric slider displayed instead of checkbox list when pasting a URL containing values for a numerical filter
Filters: Fixed filter values not correctly displayed when using multiple date parts
Dashboards: Moved the fullscreen button outside the content area
Dashboards: Fixed “Play” button issuing an error on some dashboards
Fixed custom color assignments getting lost when changing the measures in the chart
Labeling¶
Added Undo/Redo when annotating images in a Labeling Task
Notebooks¶
Made Jupyter notebook export timeout configurable
Scenarios¶
New feature: Added the ability to define Cc and Bcc lists in scenario email reporters
Fixed timezone issue in the display of monthly triggers
Collaboration¶
Enabled email notification toggles in user profiles by default for new users
Fixed branch switching that could make a project inaccessible when the git branch was badly initialized
Fixed hyperlinks toward DSS objects in wiki exports
Dataset sharing: Fixed being unable to import a dataset from another project when quick sharing is disabled on that project
Workspaces: Fixed public API disclosing permissions set on workspaces to users and contributors of the workspace.
Workspaces: Fixed error message wrongly displayed when a user with Reader profile publishes an object to a workspace
Workspaces: Fixed “Go to source dashboard” button incorrectly grayed out under some circumstances
Govern¶
Added the ability to customize the axis of the governed projects matrix view
Added the ability to configure a sign-off with only final review (no feedback groups)
Fixed the display of multiple governed projects at the same location in the matrix view
Fixed import/export of blueprints to remove user and group assignment rules in sign-off configuration
Fixed unselect action in the selection window for lists displayed as tables
Fixed an error happening when reordering attachment files
Fixed deduplication of items in lists to only apply to reference fields
Added the possibility to set data drift as a “metric to focus on” in the model registry
Fixed the removal of items from tables
Fixed the redirection to the home page when a custom page is not found
Fixed governed saved model versions or bundles being created twice when governing directly from the object page
MLOps¶
New feature: Added an option in the Evaluation and Standalone Evaluation Recipes to disable the sub-sampling for drift computation (sub-sampling is enabled by default)
New feature: Added data drift p-value as an evaluation metric (see the illustration after this list)
New feature: Added the ability to track Lab models metrics as experiment tracking runs
In Deployer, added an option to bundle only the required model versions.
Fixed drift computation in evaluation recipe failing when using pandas 1.0+
Fixed evaluation of MLflow models on dataset with integer column with missing values
Improved the selection of metrics to display in a Model Evaluation Store
Added support for MLflow’s search_experiments API method
Fixed handling of integer columns in the Standalone Evaluation Recipe for binary classification use cases
Fixed some Flow-related public API methods failing when there is a model evaluation store in the Flow
Fixed evaluation of MLflow models when there is a date column
Fixed empty versions list for MLflow models migrated from a previous version
In the Evaluation Recipe, added the ability to customize the handling of columns in data drift computation
Enriched Model Evaluations with additional univariate data and prediction drift metrics (can also be retrieved through the API)
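To illustrate the drift p-value idea mentioned above, here is a back-of-the-envelope univariate example using a two-sample Kolmogorov-Smirnov test from scipy. DSS computes its own drift metrics, so this only shows the statistical intuition:

```python
# Sketch: comparing a reference (training-time) feature distribution with
# recent scoring-time values. A small p-value suggests the feature drifted.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, size=2000)   # values seen at training time
current = rng.normal(loc=0.3, size=2000)     # values seen in production

stat, p_value = ks_2samp(reference, current)
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}")
```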
Coding¶
Improved commit messages generated when creating, editing, or deleting files in folders in project libraries
Removed some useless empty commits when performing blank edits in project libraries
Plugins¶
Fixed several types of plugin components that did not work with Python 3.11
Performance & Scalability¶
Improved performance and responsiveness when DSS data dir IO is slow
Improved performance of starting jobs in projects involving shared datasets
Improved performance of validating very large SQL queries / scripts
Improved performance of some API calls returning large objects
Improved performance of sampling for Statistics worksheets
Improved performance of various other UI locations
Administration¶
New feature: Added reporting of SQL queries in Compute Resource Usage for several missing locations where DSS performs SQL queries
Setup¶
New feature: Added support for Python 3.9 for the DSS builtin environment
Dataiku Cloud¶
Code Studios: Fixed RStudio on Dataiku Cloud
Cloud Stacks¶
Switched OS for DSS instances from CentOS 7 to AlmaLinux 8
Switched R version for DSS instances from 3.6 to 4.2
Switched Python version for builtin env for DSS instances from 3.6 to 3.9
Fixed faulty display of errors while replaying setup actions
Fixed various issues with renaming instances
Made it easier to install the “tidyverse” R package out of the box
GCP: Fixed region for snapshots
GCP: Added ability to assign a static public IP for Fleet Manager
Fixed issue when declaring a govern node but not creating it
Made the “external URL” configurable for instances, for inter-instance links shown in the interface
Elastic AI¶
EKS: Fixed support for kubectl 1.26
GKE: Added support for Kubernetes 1.26
GKE: Fixed issue when creating cluster in a different zone than the DSS instance
Made it easier to debug issues with API nodes deployed on Kubernetes infrastructure (API node log now appears in pod logs)
Miscellaneous¶
Fixed broken/missing filtering (live search) in some dropdown menus
Fixed some Flow-related methods of the public API python client that would fail when used with labeling tasks
Fixed broken DSSDataset#create_analysis method of the public API Python client
Removed limitations on the size of project variables
Fixed failure when invalid UIF rules are defined
Fixed renaming of To do lists
Fixed possible failures of Jupyter notebooks to load
Fixed Admin > Monitoring screen failing to load if the instance contains a malformed dataset or chart definition.
Fixed issue with Python plugin recipes when installing plugin from Git in development mode
Fixed Parquet in Spark falling back to unoptimized path for minor ignorable differences in schema
Compute resource usage: Added a new indicator that provides a better approximation of CPU usage for quick-starting/stopping processes