DSS 10.0 Release notes

Migration notes

Migration paths to DSS 10.0

How to upgrade

It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
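
A minimal sketch of such a backup, assuming the data directory can be archived as a plain tarball while DSS is stopped. The helper name and the paths are hypothetical and are not part of DSS:

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir, backup_dir):
    """Archive data_dir into a timestamped .tar.gz under backup_dir.

    Hypothetical helper: stop DSS before running it so that files are
    not modified while they are being archived.
    """
    data_dir = Path(data_dir)
    backup_dir = Path(backup_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data directory not found: {data_dir}")
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"{data_dir.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store entries under the data directory's own name.
        tar.add(data_dir, arcname=data_dir.name)
    return archive
```

Keep the backup directory outside the data directory so that the archive does not end up containing itself.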

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

Automatic migration from previous versions (see above) is supported. Please pay attention to the following removal and deprecation notices.

Support removal

Some features that were previously announced as deprecated are now removed or unsupported.

  • Support for Ubuntu 16.04 LTS is now removed

  • Support for Debian 9 is now removed

  • Support for SuSE 12 SP2, SP3 and SP4 is now removed. SuSE 12 SP5 remains supported

  • Support for AmazonLinux 1 is now removed

  • Support for Hortonworks HDP 2 is now removed

  • Support for Cloudera CDH 5 is now removed

  • Support for HDInsight is now removed

Deprecation notice

DSS 10.0 deprecates support for some features and versions. Support for these will be removed in a later release.

  • The “Build missing datasets” build mode is deprecated and will be removed in a future release. This mode only worked in very specific cases and was never fully operational.

  • Support for MapR is deprecated and will be removed in a future release.

  • Support for training Machine Learning models with H2O Sparkling Water is deprecated and will be removed in a future release.

  • As a reminder from DSS 9.0, support for EMR below 5.30 is deprecated and will be removed in a future release.

  • As a reminder from DSS 9.0, support for Elasticsearch 1.x and 2.x is deprecated and will be removed in a future release.

  • As a reminder from DSS 7.0, support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.

  • As a reminder from DSS 7.0, support for Microsoft HDInsight is now deprecated and will be removed in a future release. We recommend that users plan a migration toward a Kubernetes-based infrastructure.

Version 10.0.9 - September 9th, 2022

DSS 10.0.9 is a security release. All users are strongly encouraged to update to this release.

Version 10.0.8 - August 24th, 2022

DSS 10.0.8 is a security and bugfix release. All users are strongly encouraged to update to this release.

Recipes

  • SQL: Fixed execution of multiple SQL recipes at the same time on Redshift when using the Redshift driver (11.0.1)

  • Prepare: Fixed possible internal error with Spark engine (11.0.0)

  • Plugin recipes: Fixed dynamic select in plugin recipes for OBJECT_LIST parameter type (11.0.0)

Cloud Stacks

  • Fixed upgrade issue for Govern node

  • Fixed issue when using automatically updated license mode (11.0.0)

Elastic AI

  • Fixed failure creating AKS clusters due to third-party API change

APIs

  • Fixed “GET user” API with logins containing ‘@’ or ‘.’ (11.0.0)

Misc

  • Fixed possible failure using empty files-based datasets and folders (11.0.0)

  • Fixed DSS upgrade if previous install directory has been removed (11.0.0)

Version 10.0.7 - May 30th, 2022

DSS 10.0.7 is a security and bugfix release. All users are strongly encouraged to update to this release.

Cloud Stacks

  • AWS: Fixed per instance custom certificates

  • Azure: Fixed incompatibility when deploying new DSS with previous Fleet Manager version when the SSL certificate key storage mode is SECRETS_MANAGER

  • Fixed issue saving instance settings when root volume type was not properly set

Misc

  • Fixed issue in the UI when deleting personal API keys from user profile page

Version 10.0.6 - May 20th, 2022

DSS 10.0.6 is a very significant new release with new features, performance enhancements and bugfixes.

Machine Learning

  • New feature: Added no-code image classification

  • New feature: Added automated data augmentation for object detection and image classification

  • Object detection and image classification: improved display of the loss graph

  • Added “Max delta step” as configurable parameter for XGBoost

  • Added “Column subsample ratio for splits / levels” as configurable parameter for XGBoost

  • LightGBM: Switched to using gain for variable importance

  • Improved the way model views are chosen and activated

  • Fixed explanation text for lift charts

  • Fixed failure scoring with models trained with older DSS, with impact coding and unseen categories

  • Fixed ability to resume a session after some of its models have been deleted

  • Fixed ugly names for hyperparameters for LightGBM in the training details screens

  • Fixed small UI issues for clustering

  • Fixed computation of feature distributions on fully-empty numerical features

  • Added missing algorithm details for partitioned models

  • Fixed a race condition in training of partitioned models

  • Fixed handling of project libraries for custom algorithms

  • Fixed number of retrained layers for Object Detection and Image Classification

  • Object Detection and Image Classification: added ability to select GPU for training recipe

  • Object Detection and Image Classification: fixed display of images feed when using a foreign managed folder

  • Fixed case where both retraining and using a model in the same job led to the old model being reused

Elastic AI

  • New feature: Brand new monitoring UI for managed clusters, allowing you to view all activity on your managed clusters

  • New feature: Cleanup actions to remove all failed and finished items on managed clusters

  • New feature: EKS: Added ability to use spot instances

  • New feature: EKS: Added ability to automatically install Kubernetes Metric Server

  • EKS: Added ability to tag nodes

  • EKS: Added ability to assume a role to create the cluster

  • Fixed failure to run containerized execution jobs when they need more than 30 minutes to start

  • Added ability for streaming Python recipes to have extraLabels and extraAnnotations

  • Fixed cases where SparkSQL recipe validation could fail and keep failing

  • AKS: fixed support for taints

  • Fixed settings warning staying displayed after switching back to local backend environment for webapps

  • Fixed GPU images on GKE

  • Fixed build of GPU images following NVidia repository changes

  • Fixed ability to use custom ingress classes

Datasets & Managed Folders

  • New feature: When uploading multiple files at once, you can now choose between creating a single dataset or one dataset per file

  • New feature: Redshift: added ability to read external tables (also known as “Redshift Spectrum”)

  • DynamoDB: Vastly improved write performance (up to 30 times faster)

  • Teradata: Fixed reading of dates prior to 1582

  • Snowflake: Added caching for OAuth tokens in the case of using “Snowflake OAuth” to reduce number of calls to authorization server

  • Managed folders: Fixed actions from the folder view

  • Managed folders: Fixed “move” and “rename” actions on Azure Blob Storage

  • Connection explorer: fixed useless listing of tables when previewing data

  • Fixed numerical filter losing its settings on explore page

Statistics

  • New feature: Added native support for time series in Visual Statistics (stationarity tests, trend tests, ACF, PACF, autocorrelation statistic)

  • Added loading plot support for PCA

  • Improved axis ranges for scatter plots
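
For readers unfamiliar with the autocorrelation function (ACF) listed above, here is a minimal pure-Python sketch of the standard sample ACF estimator. It illustrates the statistic only and makes no claim about how DSS computes it; the function name is hypothetical.

```python
def sample_acf(series, max_lag):
    """Return the sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    if n < 2 or max_lag >= n:
        raise ValueError("need at least 2 points and max_lag < len(series)")
    mean = sum(series) / n
    centered = [x - mean for x in series]
    # Sum of squared deviations, used to normalise every lag.
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("series is constant; ACF is undefined")
    acf = []
    for lag in range(max_lag + 1):
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        acf.append(cov / variance)
    return acf
```

By construction the value at lag 0 is always 1, and a strictly alternating series gives a strongly negative value at lag 1.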

Flow

  • Added direct ability to move recipes between flow zones from the contextual menu, and in API

  • Fixed issues with “copy data” when copying filesystem datasets and folders

Hadoop

  • New feature: Added support for Cloudera CDP Private Cloud Base 7.1.7.p1000

  • Cloudera CDP: Fixed sort recipe “order by” clause in Hive engine

  • Cloudera CDP: Fixed join recipe when a date is involved in joining conditions

  • Changed Hive queries to be explicit on null / empty behavior when ordering

Charts & Dashboards

  • Added “Sampled” badge on filters tile to show that you are only seeing partial values

  • Fixed display error when a date filter has no more available values

  • Fixed issue with dimensions “graying out” when dragging/dropping them in some circumstances

Formula

  • Fixed silent error in SQL translation of some formulas

  • Fixed mishandling of the PI function

MLOps

  • New feature: Added ability to compute data drift in standalone evaluation recipes

  • Added ability to use plugins and project libraries for MLflow models

  • Added ability to use a saved model as output of a Python recipe, in order to facilitate MLflow models creation

  • Various UI and API enhancements for MLflow models import

  • Added ability to publish metrics from a model evaluation to the dashboard

  • Fixed “compute_schema_updates” on evaluation recipes with model evaluation stores

  • Fixed ability to use variables expansion for partition dependencies in evaluation recipe

  • Fixed possible failure computing metrics for MLflow models when there are not enough different values in test set

Collaboration

  • Fixed copy of attachments when copying Wiki articles

  • Fixed issue with displaying tag categories on home page

Visual recipes

  • Prepare: Fixed chained pivot steps in Prepare recipe losing output columns when run with Spark

  • Prepare: added SQL support for “extract from geo column” processor

  • Geo Join: fixed handling of variables expansion in pre/post filters

API Node

  • New feature: Added ability to authenticate API calls using JWT Bearer Token
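
As an illustration of what a bearer-token call can look like from a client, the sketch below builds (but does not send) an HTTP request carrying a JWT in the standard "Authorization: Bearer" header. The URL, service and endpoint names, payload shape and token are placeholders, not values taken from these notes.

```python
import json
import urllib.request


def build_bearer_request(url, token, payload):
    """Build a POST request that carries a JWT as a bearer token."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Content-Type", "application/json")
    return request


# Placeholder values for illustration only.
req = build_bearer_request(
    "https://apinode.example.com/my-service/my-endpoint",
    "<jwt-issued-by-your-identity-provider>",
    {"features": {"age": 42}},
)
# urllib.request.urlopen(req) would send the request; it is omitted here.
```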

Scenarios

  • Fixed some issues with relocatability of scenarios (ability to run in a different project key)

  • Fixed handling of content-type header on webhook reporters

  • Fixed a case where a scenario might not appear as aborted after aborting it

  • Fixed ability for read-only users who have “run scenarios” permission to run directly from the scenario page

API

  • New feature: Added last login and last activity (opening DSS) to users API

  • New feature: Added an API to get information about dataset last build

  • New feature: Added an API to manage personal API keys

  • Added ability for non-admins to use code envs API

  • Added ability to create Kubernetes clusters through the API

Plugins

  • Added support of dynamic select on the plugin’s settings page

  • Fixed support for dynamic select for OBJECT_LIST type

  • Added ‘triggerParameters’ on getChoicesFromPython to reload only when a subset of fields is updated

  • Fixed issue setting value for STRINGS parameter

  • Added ability to use “contextual” code env for model views

Scalability and performance

  • Strong performance enhancements (especially startup times) for jobs leveraging S3, Azure Blob and Google Cloud storage

  • Catalog: strongly improved performance for “External tables” tab

  • Machine Learning performance enhancement for categorical features with a vast number of distinct values in train set

  • Added ability to export projects with extremely large .git folders

  • Fixed severe performance degradation when translating to SQL “Find/Replace” processors with vast amounts of empty entries

  • Fixed severe performance degradation when translating to SQL a vast number of “Formula” processors

  • Fixed possible failure to delete Kubernetes jobs from aborted DSS jobs

  • Fixed performance degradation related to metrics API

  • Fixed potential hang when listing paths of a managed folder that does not respond

  • Fixed potential hang when submitting a SQL query with hundreds of thousands of lines to some databases, leading to issues parsing the resulting error message

  • Fixed potential hang with Webapps on Kubernetes

  • Fixed potential hangs with external hosting of runtime databases under very high load, notably with many active scenario triggers

  • Fixed potential hangs with external hosting of runtime databases under very high load, when all available connections are used

  • Fixed potential hang related to users API

  • Fixed potential hang related to schema consistency check on non-responding datasets

Administration

  • New feature: Added last login and last activity (opening DSS) to users screen

  • Fixed failure of “per-connection data” screen in the case where some plugins were uninstalled

  • Fixed refresh of data in “per-connection data” when clearing datasets

  • Automatically ignore empty pip / conda options

Deployer

  • Projects: Fixed ability to save settings of infrastructures when they are managed by Fleet Manager

  • Projects: Fixed issue with setting scenario states from the deployer

Cloud Stacks

  • Improved display of virtual network details for Azure

  • Fixed system limits that could make it impossible to log in with SSH

  • Fixed reprovisioning on instances with lots of settings, especially when using many containerized execution configurations, or SSO

  • Azure: Added support for certificates coming from Keyvault

  • Fixed issue with deploying instances with some recent licenses

  • Added an instance diagnosis ability to Fleet Manager

  • Fixed starting Kubernetes clusters on DSS nodes reprovisioned by Fleet Manager 10.0.5

  • Fixed support of zipped JDBC drivers

Misc

  • Fixed compatibility issue with the “Reverse Geocoding” plugin

  • Fixed login issue on Safari 15.4

  • Fixed aborted jobs still appearing as running (UI-only issue)

  • Fixed logs in application-as-recipe

  • Fixed default name of notebooks created based on foreign datasets

Version 10.0.5 - March 10th, 2022

DSS 10.0.5 is a bugfix release

Recipes

  • Join recipe: fixed “match on nearest date” and “match on date range” options

Misc

  • Fixed an issue causing malfunctions with some types of customer licenses

Version 10.0.4 - March 7th, 2022

DSS 10.0.4 is a very significant new release with new features, performance enhancements and bugfixes.

Coding

  • New feature: Added support for Python 3.8, Python 3.9 and Python 3.10

  • New feature: Added support for Pandas 1.1, Pandas 1.2 and Pandas 1.3

  • New feature: When running a coding recipe, the “raw” output of the code can now be displayed in the logs (without Dataiku infrastructure logs)

  • Updated dependency on “requests” for better compatibility with 3rd party libraries that require newer “requests”

  • Managed folders API: added “upload_folder” function

  • Fixed continuous python activities not getting project python libraries

  • Fixed SparkSQL insertable fragments using wrong quoting char

  • API: Python: Added a Python method to clear the remote DSS previously set by set_remote_dss

  • API: Fixed a bug in get_latest_model_evaluation not providing the latest model evaluation id

  • API: Added an API method to add several items to a zone

Explore

  • New feature: Automatically display whether you are seeing the complete data or a sample

  • New feature: Added total number of records in the dataset, when sampling is not “first records”

  • New feature: Added total number of records in the dataset, when sampling is “first records”, on Snowflake and BigQuery

Charts

  • New feature: Automatically display whether a chart is running on sampled data or whole data

  • Performance enhancement: Faster charts rendering on dashboards

  • Performance enhancement: Reduced the number of times where chart cache needs to be rebuilt, leading to overall improved performance for charts

  • Binned scatter plot: Do not mistakenly accept geo columns as X or Y

  • Scatter plot: Fixed display of axis margins when enabling log scale

  • Fixed useless scroll bar with Firefox

  • Improved preservation of chart settings when changing the type of chart

  • Fixed failure on animated charts if a bin disappears after chart setting changes

  • Fixed thumbnail generation

  • Prevented user from saving color palettes with invalid colors

Flow

  • New feature: Uploaded Datasets can now be created by directly dragging-and-dropping files on the Flow

  • Performance enhancement: Improved performance of panning large flows

  • Performance enhancement: Improved performance of hovering and selecting items in large flows

  • Improved behavior when removing partitioning on SQL datasets

  • Marked “missing data only” build mode as deprecated

  • Improved accuracy of rectangular selection (Ctrl+mouse drag)

  • Fixed usage of SQL pipelines when schema/catalog of virtualised datasets contains a variable

  • Fixed Flow disappearing with invalid characters in Flow zone name

  • Fixed external dataset appearing as “not built” if a managed dataset of the same name previously existed and was never built

Workspaces & Dashboards

  • Slack notifications: Fixed notification text when items are shared to workspaces

  • Fixed collapse of long descriptions on workspaces

  • Prevented the full screen in dashboard from overlapping with the “close error” button

Snowflake

  • New feature: Added native integration with Snowpark Python

  • New feature: Added in-Snowflake support for URL Splitter prepare processor (through Java UDF)

  • New feature: Added in-Snowflake support for Currency Conversion prepare processor (through Java UDF)

  • New feature: Added in-Snowflake support for Normalize measures prepare processor (through Java UDF)

  • Improved in-Snowflake support for regular expression extraction processor (through Java UDF)

  • Added support for proxy for OAuth endpoints

  • Prepare recipe: Fixed string concatenation processor with null values

  • Fixed possible issue on pivot recipe when QUOTED_IDENTIFIERS_IGNORE_CASE is set to TRUE

  • Fixed issues with Cloud-to-Snowflake synchronization with date columns containing null values

BigQuery

  • Enabled the DSS builtin driver by default for new BigQuery connections

  • DSS builtin driver: Much faster read of large datasets

  • DSS builtin driver: Added support for reading from views

Datasets

  • GCS: Added support for proxy

  • ElasticSearch: fixed support for authenticated proxy

  • Synapse: Added support for Parquet for fast-sync from Azure Blob Storage

  • S3: Fixed usage of connections with specific interface endpoints

  • Shapefile: Fixed format options when manually selecting Shapefile format

  • Fixed ‘Move To’ folder action being limited to a small number of items

  • Fixed “max length” display in the schema of some datasets

Formula

  • New feature: switch() function for easy switch/case support (SQL pushdown supported)

  • New feature: uuid() function generating a UUID

  • Fixed highlighting of unknown fields in formula editor

  • Added SQL support for substring function

  • Added SQL support for now function on BigQuery and PostgreSQL

Visual Recipes

  • New feature: Prepare recipe: New processor: ‘Enrich with last build time’, adding a column containing the recipe run date

  • Prepare recipe: Fixed “clear cells” option in the Analyse modal

  • Prepare recipe: Fixed a bug on DSS engine when using several consecutive pivot steps

  • Prepare recipe: fixed missing refresh when removing a value from the “Find/Replace” replacements list

  • Prepare recipe: report warnings for CRS change and Geometry info extraction processors

  • Prepare recipe: fixed small UI issues in the “merge categorical values” modal

  • Prepare recipe: Fixed plugin processors with Spark engine

  • Filter/Sampling recipe: fixed usage of variables when sampling is disabled

  • Split recipe: Fixed changing input

  • Split recipe: fixed failure when dropping some percentile of data

  • Stack recipe: Improved support of variables in the pre/post filters

  • Join recipe: Fixed “auto select all columns” with Spark engine

  • Join recipe: fixed join suggestions when columns use non-Latin characters

  • Join recipe: various interface improvements in join conditions modal

  • Join recipe: Made “+0000” timezone usable with DSS Engine

  • Sync recipe: added fast-path support to “files in folder” dataset

Machine Learning

  • New feature: Added sentence embedding as a text feature handling option

  • New feature: Added a diagnostic that detects if the model predicts the same class more than 99% of the time

  • Performance enhancement: Improved performance of opening clustering models

  • Multiple UX enhancements in “Explore neighborhood” (aka counterfactuals)

  • Added a warning when “drop rows when empty” would lead to dropping a large number of rows

  • Fixed interactive scoring with date features and ensemble models

  • Fixed Keras models deletion on UIF instances

  • Fixed distributed hyperparameter search failing in case of an unexpected failure on one worker

  • Object Detection: Fixed CPU scoring on a model trained on GPU if there is no GPU available on the instance

  • Fixed creation of scoring recipes with existing datasets as output

  • Fixed possible error while viewing a clustering model

  • Fixed possible error when deploying models trained with old DSS versions

  • Fixed model creation modal images on Firefox

  • Fixed new diagnostics not being displayed in the settings of old analyses

  • Fixed display of number of training rows when the model is trained on the full dataset

  • Fixed possible errors showing a model when training has been aborted by an unexpected event

  • Fixed “calibration loss” not displayed for multiclass in the “Metrics and assertions” page

  • Fixed unexpected reset of the partitions filtering widget when selecting a partition to train a model

  • Fixed multiclass prediction summary page not showing metric used for training when it was not mAUC

  • Removed irrelevant random state selection from time-based K-Fold (always deterministic)

  • Fixed interactive scoring when training in containers with “skip expensive reports” option

  • Switched to using train set instead of test set to compute features distribution for model explanations

  • Fixed display of cost Matrix Gain in decision chart when some metrics are deselected
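
The diagnostic listed above, which flags a model that predicts the same class more than 99% of the time, can be illustrated with a short sketch. This is not the DSS implementation; the function names are hypothetical and the 0.99 threshold is simply the figure quoted in the note.

```python
from collections import Counter


def dominant_class_share(predictions):
    """Return (most_common_class, share_of_predictions)."""
    if not predictions:
        raise ValueError("no predictions to inspect")
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)


def predicts_single_class(predictions, threshold=0.99):
    """True when one class accounts for more than `threshold` of predictions."""
    _, share = dominant_class_share(predictions)
    return share > threshold
```

A check of this kind is cheap to run on a test set and tends to surface severe class imbalance or a degenerate model early.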

MLOps

  • New feature: MLflow import: Added support for containerized execution for evaluation and scoring

  • New feature: MLflow import: Added support for input data drift computation

  • MLflow import: Added ability to read features from MLflow model signature

  • MLflow import: Added ability to load MLflow models from DSS managed folders

  • MLflow import: Added support of Evaluation diagnostic

  • MLflow import: Added support for sampling of input dataset for evaluation recipe

  • MLflow import: Added ability to directly input the features list in the API

  • MLflow import: easier to use API for evaluate

  • MLflow import: Fixed the case where the MLflow model returns NaN for some predictions

  • MLflow import: Improved handling of errors in interactive scoring

  • MLflow import: Fixed possible failure in computing counterfactuals

  • MLflow import: Prevented invalid version ids

  • Evaluation recipe: Added support for sampling of input dataset

  • Evaluation recipe: fixed preselection of test dataset when using a shared dataset

  • Model comparison: Fixed the reduce button of “configure” modal

  • Model comparison: Made model coming from analysis available for drift computation

  • Drift: Improved progress bar when computing drift analysis

  • Drift: Added a warning on new modalities in univariate drift analysis

  • Performance enhancement: Model Evaluation Store: Better performance for model evaluation stores UI

  • Model Evaluation Store: made summary sections collapsible

  • Model Evaluation Store: Added tags on the side panel

  • Model Evaluation Store: Allow exposing Model Evaluation Stores between projects

  • Model Evaluation Store: Disabled unwanted scientific notation in some result screens

  • Model Evaluation Store: Removed evaluations that are still being computed from charts

  • Standalone Evaluation recipe: Fixed computation of probabilistic evaluation when target has NaN value

  • Standalone Evaluation Recipe: Added the ability to create it using the public API

  • Standalone Evaluation Recipe: Added evaluation diagnostics when classes are missing

  • Standalone Evaluation Recipe: Fixed wrong “training data” information in result screens

  • Made Model Comparator and Model Evaluation Store searchable in the global finder

Notebooks

  • New feature: SQL notebooks: added ability to execute only the selected part of the query

  • SQL notebooks: Added display of JDBC warnings

  • Jupyter: Install Jupyter Widgets extension by default

  • Jupyter: Predefined notebooks on datasets are now Python 3 compatible

  • Jupyter: Fixed some issues with autocompletion on Jupyter notebooks

Scenarios

  • Fixed the “define project variables” scenario step not escaping value properly when logging

  • Added missing check when starting a scenario using a “Run scenario” step that could lead to running the same scenario twice in parallel

Automation

  • Fixed connection remapping failure if a plugin is missing on the automation node

  • Added Wiki attachments in bundles

Geospatial

  • New feature: Added ability to export Geospatial datasets as Shapefiles

  • GeoJSON import: Added support for importing GeoJSON files with missing geometries

  • GeoJSON export: Added stricter handling of types (numericals will now be numericals in the generated GeoJSON)

  • GeoJoin: Fixed issue when joining with the same dataset and using different filters
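
To make the GeoJSON points above concrete, the sketch below serializes a feature whose numeric property is written as a JSON number rather than a string, and a second feature whose geometry is missing (encoded as null, which the GeoJSON format permits). It is a generic illustration, not the DSS exporter or importer; the helper name and sample values are made up.

```python
import json


def to_feature(geometry, properties):
    """Build a GeoJSON Feature; geometry may be None when it is missing."""
    return {
        "type": "Feature",
        "geometry": geometry,
        "properties": dict(properties),
    }


point = {"type": "Point", "coordinates": [2.35, 48.85]}
with_geometry = to_feature(point, {"name": "Paris", "population": 2148000})
without_geometry = to_feature(None, {"name": "Unknown", "population": 0})

# Numbers stay numbers in the output, and None becomes JSON null.
text = json.dumps([with_geometry, without_geometry])
```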

Statistics

  • Fixed support of cgroups for statistics computation

  • Fixed broken chart auto-resizing when resizing browser window

  • Fixed possible out of memory with a very specific series of numbers

  • Improved error handling if a failure occurs while computing automated card suggestions

Managed folders

  • New feature: Added ability to have “Filesystem” managed folders on NFS or CIFS, or other locations where managing ACLs is not supported

Webapps

  • Improved user experience on the “rename webapp” modal

  • Fixed reverting a web app previously exposed on K8S to local run

Collaboration

  • Performance enhancement: Improved performance of home page for fetching projects list

  • Performance enhancement: Strongly reduced the cost of notifications (“red bell”)

  • Fixed discussions when their underlying project is watched by deactivated users

  • Fixed setting to disable login/logout notifications

  • Fixed error when duplicating a project from the project folder list

  • Allow explorers to edit wiki

Governance

  • Multiple UX improvements

  • Fixed sync of object detection models to Govern

  • Fixed various issues with advanced permission criteria

  • Fixed non-editable fields that still appeared as editable

  • Fixed issue with displaying related artifacts in the “Graph” view

  • Fixed various robustness issues with DSS-govern project synchronization in the presence of errors

  • Disabled sync of partitioned models, which are not available in Govern

  • Fixed the “Synchronize DSS Items” button in DSS admin settings not displayed without refreshing the page

  • Fixed the “Test” button of Govern integration not taking the value without saving

  • Added an option to not synchronize in Govern a specific model evaluation in the model evaluation store

Cloud stacks

  • New feature: Added centralized license reporting in Fleet Manager, to get a complete view on license usage across instances

  • New feature: Added a “sublicense” mechanism which allows limiting the number of users that can be assigned to an instance (to a subset of your total number of licensed seats)

  • Fixed issues with user names containing @ or too long user names

  • When using self-signed certificates, generate a Subject Alternative Name to improve browser compatibility

  • Automatically mark cookies as secure when deploying DSS over HTTPS

  • Fixed login screen on Fleet Manager appearing before Fleet Manager itself is ready

  • Fixed license check when reprovisioning an instance with a Discover or Business license

  • Added log rotation for agent logs

  • Azure: Fixed issues logging in with SSH after 30 days

  • Azure: fixed possible issues with AKS clusters when using user-assigned-managed-identities

  • Azure: added ability to restrict the IPs allowed to connect to Fleet Manager

  • Azure: added ability to use an existing VNET in a different RG in the ARM template

  • Azure: Added ability to specify a resource group for data disks when using blueprints

  • Azure: added ability to choose Internet traffic mode

  • Azure: Improved error message when SSL key stored in Azure KeyVault is not properly set

  • Azure: Fixed creation of initial password with special characters

  • AWS: Added support for gp3 volumes

Elastic AI and Spark

  • Fixed possible leak of pods when a job is aborted. Pods are now automatically cleaned up, both for containerized execution and Spark execution, when the job finishes, even after an abort

  • Fixed various issues which could cause jobs or notebooks failures when the Kubernetes cluster is overloaded or temporarily unable to respond

  • When running Spark on Kubernetes jobs, the logs and pods status of Spark executors are now automatically collected and can be viewed in the UI to facilitate troubleshooting

  • When running Spark jobs, some common configuration issues are now more clearly highlighted to facilitate troubleshooting

  • Added ability to automatically install Python 3.8, 3.9 and 3.10 in container images

  • New feature: EKS clusters: Added support for automatically installing the GPU driver

  • EKS clusters: upgrade to a newer eksctl for better compatibility

  • EKS clusters: Added support for Python 3 for the creation environment

  • Improved support for multiple sets of Azure credentials in a single Spark job

  • Fixed excessive refresh of GCS tokens when using GCS connections with OAuth2 credentials in Spark jobs

  • AKS clusters: fixed issue with “inherit DSS host settings” when deploying the cluster in another resource group

  • Save settings before “push base images” in order to use latest settings

  • Added code env resources support for spark executors

  • Fixed leak of pods when aborting a training or scoring recipe on Kubernetes

Hadoop

  • Fixed hive validation on CDP 7.1.7 when using “ADD JAR” commands (or other DDL)

  • Fixed search box for Hive database on new Hive dataset screen in Chrome

Streaming

  • Fixed “save and refresh sample” button on streaming endpoints

Plugins development

  • Fixed error message not displaying when more than 2 columns are selected in a COLUMNS field of a plugin recipe

  • Fixed wrongful error message when recreating a plugin that was just deleted

  • Added support for dynamic select in auto config form for custom fields

  • Added ability to get the expanded version of a preset in Python custom UI setup code

Administration

  • New feature: Authorization matrix: added ability to export the authorization matrix to CSV, Excel, dataset, …

  • New feature: Added ability to restrict allowed sender domains in SMTP and Amazon SES channels

  • Authorization matrix: Improved UI

  • Authorization matrix: Improved scalability with very large instances

  • Automatically cleanup some very large files in the “jobs” folder to save space

  • Various logs in the “jobs” folder are now automatically compressed to save space

  • When deleting a project, automatically propose to delete job and scenario logs

  • Added encryption of proxy password

  • Fixed issue with projects permission upgrades (for workspaces)
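
The sender-domain restriction mentioned above for SMTP and Amazon SES channels amounts to an allowlist check on the sender address. A minimal sketch of that idea follows; it does not reflect how DSS implements or configures the restriction, and the function name and domains are hypothetical.

```python
from email.utils import parseaddr


def sender_allowed(sender, allowed_domains):
    """Return True if the sender's domain is in the allowlist."""
    # parseaddr handles both "user@host" and "Name <user@host>" forms.
    _, address = parseaddr(sender)
    if "@" not in address:
        return False
    domain = address.rsplit("@", 1)[1].lower()
    return domain in {d.lower() for d in allowed_domains}
```

Comparing in lower case avoids rejecting a legitimate sender because of capitalization in the domain part.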

Other performance & stability enhancements

  • Performance enhancement: Strongly reduced cost and impact on other users of starting jobs on highly loaded instances

  • Performance enhancement: Strongly reduced cost and impact on other users of changing permissions on large projects

  • Performance enhancement: Reduced cost and impact on other users of using scenario reporters with large scenario runs history

  • Performance enhancement: Reduced cost and impact on other users of activating saved model versions on partitioned models with large number of partitions

  • Performance enhancement: Reduced disruption caused by initial data catalog indexing in the first minutes after DSS startup

  • Performance enhancement: Improved scenario UI performance for projects with large number of datasets

  • Performance enhancement: Overall performance enhancements for projects with large number of datasets

  • Stability: Fixed potential instance hang when dealing with lots of webapps on Kubernetes

  • Stability: Fixed potential instance hang when using managed folders Python API

macOS Launcher

  • Disabled “Check for updates” while DSS is starting up

  • Do not display “Git is not installed” popup anymore

  • Added display of DSS and launcher versions

Misc

  • Added safety on corrupted params.json project file blocking the whole instance

  • Fixed managed folders not being deleted when used by an App as recipes

  • Fixed DSS stream engine when sorting double columns that contain NaN values

Version 10.0.3 - January 28th, 2022

DSS 10.0.3 is a bugfix and security release. All users are strongly encouraged to update to this release.

Items marked with (9.0.7) are also present in DSS 9.0.7

Recipes

  • Prepare recipe: Fixed formula preview (9.0.7)

  • Code recipes: Fixed access to Flow variables (9.0.7)

Flow

  • Fixed flow graph disappearing from job page at each refresh for large flows (9.0.7)

Projects

  • Fixed “Code env selection” settings resetting to default when the tab is open. (9.0.7)

Cloud Stacks

  • Fixed scheduled snapshots not taking changes of snapshot settings into account (9.0.7)

Performance

  • Fixed instance lockup when copying very large managed folders for Python function endpoints

Miscellaneous

  • Fixed invalid actions displayed on the home page of the automation node when there are no projects (9.0.7)

Security

Version 10.0.2 - December 13th, 2021

DSS 10.0.2 is a significant new release with new features, performance enhancements and bugfixes.

Items marked with (9.0.6) are also present in DSS 9.0.6

Datasets

  • New feature: Added per-user login for Google Cloud Storage (OAuth) (9.0.6)

  • New feature: Added per-user login for BigQuery (OAuth) (9.0.6)

  • When creating a dataset from file names with Unicode characters (including CJK), an equivalent ASCII dataset name is automatically generated (9.0.6)

  • Fixed possible UI overlapping between different custom exporters (9.0.6)

  • Fixed creation of managed SQL datasets from “New Dataset > Internal > Managed”

Machine Learning

  • Fixed creation of cluster recipes on foreign datasets (9.0.6)

  • Fixed creation of scoring recipes from MLflow models

  • Fixed import of MLflow models on UIF-enabled DSS

Hadoop, Spark, Elastic AI

  • New feature: Added support for CDP Private Cloud Base 7.1.7 (9.0.6)

  • Added the ability to import EMR-created tables from Glue as S3 datasets when not using EMR with DSS (9.0.6)

  • Fixed failure of Spark recipes when project variables contain Unicode characters (including CJK) (9.0.6)

  • Fixed SparkSQL recipe validation failure when the code contains Unicode characters (9.0.6)

  • Fixed issue with Kubernetes namespace policies (9.0.6)

  • Fixed direct write to Snowflake from Spark with OAuth authentication and variables (9.0.6)

Dashboards

  • Fixed truncation of large dashboard exports (9.0.6)

  • Fixed opening of insights when clicking their title

Cloud Stacks

  • New feature: Azure: Added ability to create a subnet that does not cover the entire vnet (9.0.6)

  • New feature: Azure: Support for static private IP for Fleet Manager (9.0.6)

  • New feature: Azure: Support for static private IP for DSS instances (9.0.6)

  • New feature: Azure: Added ability to create resources in a specific resource group instead of always using the vnet resource group (9.0.6)

  • New feature: Azure: Added ability to fully control the name of created resources (machines, disks, network interface, …) (9.0.6)

  • New feature: AWS: Added support for Hong Kong, Osaka, Milan and Bahrain regions (9.0.6)

Flow

  • Fixed Flow filtering with flow zones and exposed objects (9.0.6)

Recipes

  • Prepare recipe: “Simplify column names” now automatically translates Unicode characters (including CJK) to equivalent ASCII (9.0.6)

  • Prepare recipe: Snowflake: Fixed date parsing with timezone being sensitive to the JDBC session timezone (9.0.6)

  • Code recipes: When creating the recipe with input or output managed folder with Unicode names (including CJK), generate an equivalent ASCII variable name for the starter code (9.0.6)

  • Join recipe: Improved input preview

  • Join recipe: Better warning at recipe validation when there are unusable characters in column names (9.0.6)

  • SQL recipe: Fixed usage of explicit DKU_END_STATEMENT (9.0.6)

  • Fixed possible failure with Snowflake/Synapse/BigQuery auto-fast-paths with date columns (9.0.6)

  • Fixed failure with Snowflake auto-fast-path and incomplete configuration (9.0.6)

API

  • Added ability to modify containerization settings of code envs (9.0.6)

  • Fixed creation of prepare recipe with existing outputs from the Python public API (9.0.6)

  • Fixed the direction argument of the SelectQuery.order_by method (9.0.6)

  • Fixed invalid removal of default Flow zone through the API (9.0.6)

Notebooks and webapps

  • Fixed renaming of SQL notebooks created from the side panel (9.0.6)

  • Fixed possible issue when saving standard webapps (9.0.6)

  • Fixed write to Snowflake/Synapse/BigQuery auto-fast-path from Jupyter notebooks and webapps (9.0.6)

  • Fixed failure of webapps when the project variables contain Unicode characters (including CJK) (9.0.6)

Performance and scalability

  • Improved performance of flow zones listing (9.0.6)

  • Improved performance on home page with large number of project folders (9.0.6)

  • Fixed leak of Python processes from custom filesystem providers such as Sharepoint (9.0.6)

  • Fixed memory leak in Cloud Stacks for Azure (9.0.6)

  • Fixed failure on dashboards for datasets with large number of charts (9.0.6)

  • Added pagination on users list and UIF rules screens (9.0.6)

  • Improved CPU consumption of eventserver reporting (9.0.6)

Misc

  • Dataiku Applications: Added an option to hide the “Switch to project view” button (9.0.6)

  • Added ability for non-admins to create plugin code envs if they have plugin development rights (9.0.6)

  • Fixed bug when duplicating a plugin component

Version 10.0.0 - November 15th, 2021

This release is dedicated to the memory of our dear colleague Mark Treveil.

DSS 10.0.0 is a major upgrade to DSS with major new features.

New features

MLOps: Models Comparison and Drift Analysis

Model evaluations now allow you to capture the performance and behavior of a model after it has been trained, in order to analyze how its behavior evolves over time. This enables Drift analysis.

Visual model comparisons allow you to quickly compare different models or different versions of the same model. They can be used both during the Machine Learning design phase and to compare behavior and performance over time.

For more details, please see MLOps

MLOps: Centralized Models registry

Part of the new Govern Node, the centralized models registry provides a centralized way to see all models (whether developed in Dataiku or externally) in one place, versioned and with performance metrics and project summaries for leaders and project managers. This includes Drift analysis metrics

MLOps: Models deployment signoff workflows

With the new Govern Node, you can now require sign-off and approval of models before they can be deployed in production. Model sign-off can involve multiple, customizable reviewers and approvers.

MLOps: MLflow Models import

DSS can now import models from the MLflow Models framework. MLflow Models imported into DSS benefit from all the capabilities of DSS-trained models.

For more details, please see MLflow Models

Governance: Projects governance, risk & value assessments

Part of the new Govern Node, the centralized projects governance framework allows leaders and project managers to keep an eye on the lifecycle of all AI initiatives, with clear steps and gates, in order to keep proper oversight of business initiatives.

Risk and value assessment matrices provide a standardized framework to compare initiatives for investment and determine the appropriate oversight level.

For more details, please see Governance

Data consumers: Workspaces, a new home for data consumers

Outputs of complex data projects are often scattered across multiple projects and locations, making it challenging for business stakeholders and data consumers to quickly gain access to the needed data.

Workspaces provide dedicated, secure landing pages where data consumers can easily browse Dataiku dashboards, webapps, datasets, applications, wikis, etc. to get direct access to the most relevant insight or to take direct action using applications and webapps.

For more details, please see Workspaces

Data consumers: cross-chart filters on dashboards

You can now add cross-chart filters on dashboards. The filter can affect all charts on a slide.

For more details, please see Dashboard concepts

Geospatial analytics: Geo-join recipe

The new geo-join recipe allows you to visually match and enrich geospatial datasets.

For more details, please see Geo join: joining datasets based on geospatial features

Geospatial analytics: Density chart

The Geo heatmap chart provides “density”-based analytics in order to quickly visualize the most important locations on a map.

Geospatial analytics: preparation tools

New tools in the prepare recipe facilitate Geospatial analytics:

  • New processor and formula function: Create an area around a geopoint

  • Formula function: Simplify a geometry (including SQL support for PostGIS and Snowflake)

  • Formula function: Get the bounding box of a geometry

  • Formula function: Compute distance between geometries

  • Formula function: Check for intersection between geometries

  • The Change CRS processor can now run in SQL (with PostGIS)
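
As a rough illustration of what two of the functions listed above compute (bounding box and distance between geometries), here is a minimal plain-Python sketch. It is not the DSS implementation: the names `bounding_box` and `haversine_km` are hypothetical, geometries are reduced to lists of (longitude, latitude) points, and the haversine great-circle formula is an assumed choice of distance metric.

```python
import math

def bounding_box(points):
    """Return (min_lon, min_lat, max_lon, max_lat) for a list of (lon, lat) pairs."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return (min(lons), min(lats), max(lons), max(lats))

def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

# Illustrative coordinates only (three points forming a triangle).
polygon = [(2.29, 48.85), (2.35, 48.86), (2.33, 48.89)]
box = bounding_box(polygon)
```

In DSS itself these operations are exposed as formula functions in the prepare recipe; the sketch only shows the underlying idea.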

Machine Learning: Object detection

Object detection is now a top-level task in DSS. You can now easily leverage leading, pre-trained deep learning models for detecting objects, and fine tune them to your specific labeled datasets.

Like all models trained visually in DSS, object detection models provide detailed results screens, builtin scoring ability, versioning and governance.

For more details, please see Computer vision

Machine Learning: Counterfactuals and Actionable Recourse

Counterfactuals and Actionable Recourse analysis enhance Interactive scoring with insights about the behavior of the model in the vicinity of a reference example.

Counterfactuals generate various records that are similar to the reference example but lead to a different predicted class.

Actionable recourse generates the records with the smallest possible perturbations compared to the reference example that lead to a specific predicted class, different from the one of the reference example. Interactive scoring is a simulator that enables any AI builder or consumer to run “what-if” analyses (i.e., qualitative sensitivity analyses).
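
The difference between the two analyses can be sketched on a toy example. Everything below is hypothetical: `predict` is a stand-in rule rather than a trained DSS model, candidates come from a brute-force grid of small integer shifts, and closeness is measured with an L1 distance. DSS may use entirely different search strategies and metrics.

```python
from itertools import product

def predict(record):
    """Hypothetical toy classifier standing in for a trained model."""
    return "approved" if record["income"] - 2 * record["debt"] >= 10 else "rejected"

def candidates(reference, steps=(-2, -1, 0, 1, 2)):
    """Yield records obtained by shifting each feature by a small integer step."""
    keys = sorted(reference)
    for deltas in product(steps, repeat=len(keys)):
        yield {k: reference[k] + d for k, d in zip(keys, deltas)}

def distance(a, b):
    """L1 distance between two records that share the same keys."""
    return sum(abs(a[k] - b[k]) for k in a)

reference = {"income": 12, "debt": 2}  # 12 - 2 * 2 = 8, so predicted "rejected"

# Counterfactuals: nearby records whose predicted class differs from the reference.
counterfactuals = [c for c in candidates(reference)
                   if predict(c) != predict(reference)]

# Actionable recourse: the closest record that reaches the target class.
recourse = min((c for c in counterfactuals if predict(c) == "approved"),
               key=lambda c: distance(c, reference))
```

Here the recourse is the single smallest change (reducing `debt` by one unit), while the counterfactual list contains every nearby record that flips the prediction.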

Machine Learning: LightGBM

The fast and powerful LightGBM algorithm joins the family of algorithms that can be trained by the DSS AutoML component

Machine Learning: expanded feature encodings

Several new feature encodings are now available in AutoML:

  • Enhanced impact (target) encoding

  • Rank encoding

  • Frequency encoding

  • Cyclical encodings for date/time

For more details, please see Features handling
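
To make three of these encodings concrete, here is a minimal plain-Python sketch. It is not the DSS implementation, and the function names are hypothetical. The exact definitions are also assumptions, since this page does not specify them: frequency encoding is shown as relative frequency, rank encoding as the position among sorted distinct values, and cyclical encoding as the usual sine/cosine pair.

```python
import math
from collections import Counter

def frequency_encode(values):
    """Replace each category by its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

def rank_encode(values):
    """Replace each value by its rank among the sorted distinct values."""
    ranks = {v: i for i, v in enumerate(sorted(set(values)))}
    return [ranks[v] for v in values]

def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour of day) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return (math.sin(angle), math.cos(angle))

colors = ["red", "blue", "red", "green"]
freq = frequency_encode(colors)
ranks = rank_encode(colors)
hour_23 = cyclical_encode(23, 24)
hour_0 = cyclical_encode(0, 24)
```

The point of the cyclical form is that 23:00 and 00:00 end up close together on the circle, which a raw 0 to 23 integer cannot express.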

Machine Learning: Queues

While training machine learning models, you can now enqueue several trainings that will all execute without further intervention. This allows you to schedule many experiments at the end of the day, and come back the next day with all your models trained and ready to be compared in the new Models Comparison.

Statistics: Augmented Exploratory Data Analysis

When performing exploratory data analysis on wide or complex datasets, it can be challenging and overwhelming for users to understand which columns might be most important to their analysis, how the columns relate to each other, and to identify patterns and insights.

Within Statistics, a new wizard interactively suggests statistical analyses that may be interesting, along with new advanced charting capabilities such as 3-D scatter plots and parallel coordinates plots.

Other notable enhancements

Charts: Customizable axis ranges

Ranges on both the X and Y axes of charts can now be customized.

Charts: Color assignments

It is now easier to manually control color assignments on charts in order to have consistent colors between charts.

Charts: numerical formatting

New numerical formatting options are available for charts (for values displayed in the chart and in the tooltips)

Git push and pull for libraries

In addition to the existing capability to fetch project libraries from existing Git repositories, it is now possible to push them back to their origin.

For more details, please see Importing code from Git in project libraries

Code env resources

When installing some packages in code envs, such as NLTK or Spacy, you frequently need to download additional resources, such as pretrained models. Previously, each user had to download the resource in a specific folder, and sometimes tweak options of the packages in order to point to the downloaded resources.

Code env resources allow you to download resources directly to the code env folder, making them available for all users

For more details, please see Operations (Python)

Data preparation: Easy extraction with Grok

You can now leverage the “Grok” pattern extraction mechanism that allows you to easily parse logs using predefined patterns. A visual editor makes it easy to view what your expression matches and to troubleshoot it.

For more details, please see Extract with grok
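
For readers unfamiliar with Grok, the sketch below shows the general idea behind `%{PATTERN:field}` expressions: each token expands to a named regular-expression group. This is an illustration in plain Python, not the DSS processor. The `PATTERNS` table is a tiny, simplified stand-in for a real Grok pattern library, and `grok_to_regex` is a hypothetical helper.

```python
import re

# Hypothetical, simplified pattern library; real Grok libraries ship many more
# patterns with stricter definitions.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "URIPATH": r"/\S*",
}

def grok_to_regex(expression):
    """Expand %{NAME:field} tokens into named regex groups and compile the result."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<%s>%s)" % (field, PATTERNS[name])
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, expression))

pattern = grok_to_regex("%{IP:client} %{WORD:method} %{URIPATH:path} %{NUMBER:status}")
fields = pattern.match("203.0.113.7 GET /index.html 200").groupdict()
```

Each named field becomes one extracted value, which is what makes predefined patterns convenient for log parsing.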

Wiki: quality-of-life enhancements

It is now possible to attach images in the wiki by directly dragging and dropping them.

Adding attachments does not require saving edits first anymore.

Other enhancements and fixes

Elastic AI

  • Spark version has been upgraded to 3.1.2

Visual recipes

  • Prepare: Fixed invalid JSON in “shift+V” on a cell

  • Prepare: Fixed issue with the Nest processor on Spark

  • Grouping: Fixed UI issue with CJK characters in column names

  • Grouping: Improved discoverability of “First/Last”

  • Distinct, Pivot, Grouping: Fixed error on partitioned SQL datasets when the partition column was also used as a key

Machine Learning

  • Fixed possible permissions issues with UIF enabled

  • Variables importance and partial dependencies can now be exported (CSV, Excel, Tableau, dataset, …)

  • Fixed failure when copying feature handling between clustering tasks

  • Fixed score discrepancy with partitioned models in SQL mode with “redispatch”

  • Fixed UI issue with mass actions on features handling

  • Fixed clustering recipe failure when a column is fully empty

  • Fixed faulty ability to remove models while they were training

  • Fixed performance issue with distributed hyperparameters search

  • Updated the computation of individual explanations to improve their correctness

Snowflake

  • Preparation: URL parser can now be pushed down to Snowflake

  • Preparation: Email parser can now be pushed down to Snowflake

Datasets

  • Fixed issues with autodetection of Parquet on S3/Azure/GCS datasets

  • Faster datetime-based partitioning on PostgreSQL

Flow

  • The “Schema changes” modal will not display anymore when modifying the last dataset in the Flow. Schema changes are auto-accepted.

  • Added ability to select zone when copying a subflow

  • Added connection information on dataset right panel

  • Better error handling when using invalid values in a Time Range partitioning dependency

  • Fixed various issues with managed folders from foreign projects

  • Fixed navigation bar when using the catalog from a project

Charts

  • Fixed color and size on “Binned XY” chart

  • Fixed possible misalignment on date axis for column charts

Dashboards

  • Fullscreen mode is now preserved after a redirection to SSO login

API

  • Added ability to create evaluation recipes in the API

Administration

  • It is now possible to view all usages of a code env

  • Fixed possible hang in airgapped environments

  • Fixed browser window title in administration pages

Security

  • Removed plain-text credentials from the Twitter connector

Misc

  • Fixed wiki search when using “:” in the searched term

  • Performance enhancements for instances with large number of users

  • Fixed issue with “Test” button for containerized execution config with multiple clusters