DSS 9.0 Release notes

Migration notes

Migration paths to DSS 9.0

How to upgrade

It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

Automatic migration from previous versions (see above) is supported. Please pay attention to the following removal and deprecation notices.

Support removal

Some features that were previously announced as deprecated are now removed or unsupported.

  • Support for RedHat 6, CentOS 6 and Oracle Linux 6 is removed

  • Support for Amazon Linux 2017.XX is removed

  • Support for Spark 1 (1.6) is removed. We strongly advise you to migrate to Spark 2. All supported Hadoop distributions can use Spark 2.

  • Support for Pig is removed

  • Support for Machine Learning through Vertica Advanced Analytics is removed. We recommend that you switch to in-memory machine learning models. In-database scoring of in-memory-trained models remains available

  • Support for Hive SequenceFile and RCFile formats is removed

Deprecation notice

DSS 9.0 deprecates support for some features and versions. Support for these will be removed in a later release.

  • Support for Ubuntu 16.04 LTS is deprecated and will be removed in a future release

  • Support for Debian 9 is deprecated and will be removed in a future release

  • Support for SuSE 12 SP2, SP3 and SP4 is deprecated and will be removed in a future release. SuSE 12 SP5 remains supported

  • Support for Amazon Linux 1 is deprecated and will be removed in a future release.

  • Support for Hortonworks HDP 2.5 and 2.6 is deprecated and will be removed in a future release. These platforms are no longer supported by Cloudera.

  • Support for Cloudera CDH 5 is deprecated and will be removed in a future release. This platform is no longer supported by Cloudera.

  • Support for EMR below 5.30 is deprecated and will be removed in a future release.

  • Support for Elasticsearch 1.x and 2.x is deprecated and will be removed in a future release.

  • As a reminder from DSS 7.0, support for “Hive CLI” execution modes for Hive is deprecated and will be removed in a future release. We recommend that you switch to HiveServer2. Please note that “Hive CLI” execution modes are already incompatible with User Isolation Framework.

  • As a reminder from DSS 7.0, support for Microsoft HDInsight is deprecated and will be removed in a future release. We recommend that users plan a migration toward a Kubernetes-based infrastructure.

Version 9.0.4 - June 21st, 2021

DSS 9.0.4 is a significant new release with new features, performance enhancements and bugfixes.

Snowflake integration

  • New feature Experimental support for leveraging Snowflake Java UDF for faster (up to 3x faster) in-database scoring of ML models. Requires Snowflake Java UDF (preview) on Snowflake side.

  • New feature Experimental support for leveraging Snowflake Java UDF for exporting ML models to Snowflake functions that can be reused by any Snowflake user or client application. Requires Snowflake Java UDF (preview) on Snowflake side.

  • New feature Experimental support for leveraging Snowflake Java UDF for data preparation, allowing push down of the following processors: String transformer, Currency extractor, Filter on bad meaning, Query string parsing, Holidays flagging, GeoIP resolution

  • New feature Experimental support for direct fast write into Snowflake from any recipe, without having to sync through cloud storage anymore

  • New feature Fast load and fast unload from/to Google Cloud Storage

  • New feature Fast load from Azure Blob to Snowflake in Parquet format

  • Increased the maximum column name length to the maximum supported by Snowflake (251)

Datasets & Formats

  • Redshift: New feature Experimental support for direct fast write into Redshift from any recipe, without having to sync through S3 anymore

  • Synapse: New feature Experimental support for direct fast write into Synapse from any recipe, without having to sync through Azure Blob anymore

  • Synapse: New feature Ability to set distribution policy in the UI

  • Synapse: Fixed issue with table creation on partitioned datasets

  • BigQuery: New feature Added ability to read and write nested and repeated fields

  • BigQuery: Experimental alternate mode to interact with BigQuery, providing ability to read data samples without incurring the cost of a full scan

  • BigQuery: Experimental support for displaying query cost estimation in notebook

  • ElasticSearch: New feature Added ability to use a custom query DSL filter

  • ElasticSearch: New feature Experimental support for index patterns (for reading)

  • S3: Fixed issue with partitioned datasets with spaces in partition names

  • New feature: Added ability to import table definitions from an external Hive-compatible metastore, such as a Databricks metastore

  • Improved “skip first line” detection for CSV files

  • Improved detection of schema for CSV files with empty column names

  • Added ability to override Parquet message style when DSS fails to recognize it

  • Fixed ability to create a dataset from the files of a shared managed folder

  • Fixed display of query in “SQL query” datasets

Machine Learning

  • New feature: Added stop words for Afrikaans, Albanian, Arabic, Armenian, Basque, Bengali, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, Estonian, Finnish, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malayalam, Marathi, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Sanskrit, Serbian, Sinhala, Slovak, Slovenian, Spanish, Swedish, Tagalog, Tamil, Tatar, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba - These are available in “Simplify Text”, “Tokenize text”, “Analyze text” and Text feature handling

  • New plugin: Plugin Model error analysis adds a Saved model custom view to highlight the samples that contribute most to a predictive model’s errors

  • Fixed class weights with XGBoost

  • Updated the available code samples when changing prediction type

  • Fixed possible breakage of models when the preparation script contained a “filter on date range” processor

  • Fixed issue with duplicating ML Tasks

  • Fixed wrong result of SQL scoring with numerical columns stored as strings

  • Fixed various small numerical inconsistencies in SQL scoring

  • Fixed issues with colors in clustering models reports

  • Fixed display of custom scores in binary classification model reports

Coding and API

  • Added ability to specify the strings that Pandas should consider as “NA” values (i.e., making it possible to not treat the literal string “NA” as missing)

  • Added ability to autodetect ElasticSearch dataset settings

  • Added API for Model Documentation Generator

  • Fixed creation of values-based meanings with Python 3

  • Fixed dataiku.Folder.upload_file method with binary files and Python 3

  • Fixed mangling of name when importing a notebook with CJK characters in the name

  • Fixed bad interaction between “Edit recipe in notebook” and notebooks imported from Git

  • Copying notebooks will now clear the “associated recipe”

  • Fixed link to editor settings from library editor

  • Fixed ability to use scenario-level variables in code recipes

  • Fixed issues with “Git references” in code libraries

  • Fixed conversion of notebooks with Markdown cells to recipes

  • Improved the API for setting values of numerical hyperparameters on ML tasks

  • When creating a code environment, go directly to its page

  • Fixed display issue when deleting a code env while being on the code env page

  • Added search in all code env dropdowns
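The Pandas “NA” option listed above can be illustrated with plain pandas (an analogous sketch, not the DSS API itself; the sample CSV data is invented for the example):

```python
import io

import pandas as pd

# Sample data where "NA" is a real value (a country code), not a missing value.
csv_data = "code,label\nNA,North America\nFR,France\n,unknown\n"

# Default behaviour: the literal string "NA" is parsed as a missing value.
df_default = pd.read_csv(io.StringIO(csv_data))

# Custom behaviour: keep "NA" as a real value; only empty fields are missing.
df_custom = pd.read_csv(io.StringIO(csv_data), keep_default_na=False, na_values=[""])

print(df_default["code"].isna().sum())  # 2 (the "NA" row and the empty row)
print(df_custom["code"].isna().sum())   # 1 (only the empty row)
```

The `keep_default_na`/`na_values` pair of `pandas.read_csv` is the underlying mechanism that makes this kind of control possible.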

Flow

  • Performance: Strongly improved display performance of Flow page for very large flows with many zones and many projects in the instance

  • Performance: Strongly improved display performance for “Jobs” page with very large flows

  • Fixed display of Flow if you enter a wrongful pattern in a Flow filter

  • Fixed recipes being moved to the default zone when moving them

  • Fixed bad error display when trying to rebuild a write-protected dataset

  • Removed bogus ability to edit tags on shared datasets

  • Removed bogus ability to edit tags in the quick Flow navigator

  • Fixed display of “collapsed” Flow zones

  • Fixed failures copying flows or subflows with SQL recipes

Data preparation

  • Pattern generator: Improved support for detecting non-ASCII text

  • Fixed support for val(“column”, default_value) in formula

  • Formula editor: fixed support for regular expressions

  • Formula editor: fixed support for datePart function

  • Fixed support for default value on strval for SQL engine

  • Fixed handling of null values in “min” and “max” formula functions

  • Fixed the “real Python” mode of the Python processor when running on Spark

  • Fixed issue in the “Impute missing values” processor

  • Fixed the help tooltip for “Force numerical range” processor

Plugins and extensibility

  • Fixed duplicate columns appearing in “column name” fields

  • Fixed various other issues with column name autocompletion

  • Fixed dynamic select choices for custom views (managed folders and models)

  • Fixed dynamic select choices for “create cluster” scenario step

  • Fixed dynamic select choices for PRESET fields

  • Fixed dynamic select choices in custom Kubernetes exposition plugins

  • Added ability to use presets in custom Kubernetes exposition plugins

  • Automatically commit plugin.json at first commit

  • Fixed typo when reverting plugin to a previous revision

Collaboration & Applications

  • Fixed disappearing “users”, “creation” and “last modification” fields in catalog

  • Strongly increased maximum character limit of Wiki pages

  • Fixed missing scroll in profile page

  • Experimental ability to hide unwanted recipes (legacy Hadoop, R, Scala, …)

  • Fixed Wiki export when working on a machine without user namespaces enabled

  • Fixed Dataiku Applications flooding the logs

  • Dataiku Applications: added the “is a Dataiku application” visual indicator in all project listing pages

Deployer

  • Added ability to ignore SSL certificate validation for design-node-to-deployer communication

  • Various UI fixes

Automation and scenarios

  • Fixed wrongful display of “Created on” in the “Triggers” page of Automation monitoring

  • Fixed small display issues in Triggers page

Cloud Stacks

  • Better default volume sizes and volume resizing strategies for high-activity and high-volumetry instances

  • Added ability to define tags at fleet creation, which will be propagated at both the instance and network levels

  • AWS: Added ability to encrypt the root EBS volume. Default to encrypting both root and data EBS volumes

  • AWS: Added ability to use a custom CMK for encrypting root and data EBS volumes on Fleet Manager instance

  • AWS: Install the AWS Systems Manager agent on both Fleet Manager and DSS images

  • AWS: Default to automatically creating the security groups

  • AWS: Upgrade eksctl for compatibility with latest EKS versions

  • AWS: Fixed startup failures after too many reprovisionings of an instance

  • Additional hardening of the runtime images following CIS Benchmark guidelines

Streaming

  • Fixed support for continuous Python recipes when UIF is enabled

  • Fixed ability to create a continuous sync recipe directly from the streaming endpoint

Dashboards

  • Fixed dashboards PDF export sometimes being clipped

  • Fixed display of preview of files in “file from managed folder” insight

Elastic AI

  • Fixed TLS termination with nginx ingress

  • Added more transient errors that are recognized as non-fatal while monitoring Kubernetes jobs

  • Fixed startup failure with custom Kubernetes exposition plugins

  • Fixed support for webapps on Kubernetes

Security

  • Added an audit event when opening a Jupyter notebook

  • Added encryption of client secret fields in Azure Blob, SQL Server and Synapse connections

  • Fixed bad redirect to HTTP when fetching credentials for 3rd-party services with OAuth

  • Fixed display of error upon failure to acquire an OAuth2 authorization code

  • Fixed stored XSS in objects titles

Misc

  • Fixed typo when switching project to another branch

  • Fixed UI issue in export recipe

  • Fixed migration issue when a DSS 9 project had been imported to a DSS 8 instance and this instance is then added to DSS 9

  • Fixed bug when multiple DSS instances use the same PostgreSQL database and schema for runtime databases

  • Fixed failure to display data after migration to DSS 9 when some kinds of date filters were present

  • Fixed Excel export of charts when some kinds of date filters were present

  • Fixed default settings for “Push to editable” recipe

  • Fixed eventserver not refreshing the token when using a S3 connection with “Assume role”

Version 9.0.3 - May 10th, 2021

DSS 9.0.3 is a bugfix release. We recommend that you upgrade to DSS 9.0.3

Flow

  • Fixed inability to create recipes based on shared datasets

  • Fixed various errors in recipe edition screens

Scenarios

  • Fixed display issue in Automation Monitoring “Triggers” page

Version 9.0.2 - May 4th, 2021

DSS 9.0.2 is a significant new release with new features, performance enhancements and many bugfixes. Note that we recommend that you upgrade to 9.0.3 rather than 9.0.2

Datasets and connections

  • New feature Added OAuth2 login for Snowflake

  • New feature Added OAuth2 login for Azure Blob

  • Azure Blob: Made the “client secret” field hidden

  • MongoDB: Removed connection details from logs

  • BigQuery: Fixed metrics computation on BigQuery “SQL Query” datasets

  • SCP: Fixed write to managed folders based on SCP connections

  • Google Cloud Storage: Fixed PDF preview in managed folders

  • Fixed preview of images with special characters in their file names in managed folders

Visual recipes

  • New feature: Prepare: Added support for SQL pushdown of “inc” formula function (add to dates) to BigQuery and Snowflake

  • New feature: Prepare: Added support for SQL pushdown of “coalesce” formula function to BigQuery and Snowflake

  • New feature: Prepare: Added support for SQL pushdown of “rand” formula function to BigQuery and Snowflake

  • New feature: Prepare: Added support for SQL pushdown of trigonometric functions to Snowflake

  • Performance: Sync and Prepare: Strongly improved performance on large partitioned datasets (notably S3 / Azure Blob / Google Cloud Storage)

  • Performance: Join: Improved performance of DSS engine with non-equijoin conditions

  • Prepare: Fixed issue with SQL pushdown of “concat” formula function with Snowflake and NULL values

  • Prepare: Fixed possible SQL pushdown issue with formula processor

  • Prepare: Added ability to trim white spaces in “Find and replace”

  • Prepare: Fixed issue with SQL pushdown of date parsing on Snowflake with numeric columns

  • Prepare: Fixed issue when setting ‘cast output’ to None in Formula step

  • Prepare: Fixed formula validation issue with column names starting with numbers

  • Prepare: UX enhancements on “Pattern detector” and “Smart date”

Hadoop and Spark

  • New feature Added support for Cloudera Data Platform CDP Private Cloud Base (CDH 7)

  • New feature Added support to direct writes from Spark to S3 with SSE-KMS encryption

Deployer

  • Performance: Improved performance of Deployer dashboards with large number of deployments

  • Fixed deployment dialog being stuck when a warning happens during bundle activation

  • Fixed sticky tooltip for performance charts in API deployer

Collaboration

  • New feature Central tracking of project reporters in admin monitoring

  • Performance: Improved performance of home page on Firefox

  • Fixed a bug when importing a project containing API services that use a code environment with remapping

  • Fixed wrongful URLs in navigation bar when duplicating projects

Dataiku Applications

  • New feature The labels of the ‘run button’ and ‘edit variables application’ tiles are now customizable

  • Added possibility to mass delete application instances

  • By default app instances are now hidden in the ‘all projects’ list

  • In application designer always prompt for saving when updating the test instance

  • Improved error message for ‘download file from folder’ tile

  • Improved error handling for application-as-recipes

  • Fixed ‘Append instead of overwrite’ in application-as-recipes

Flow

  • Performance: Strongly improved network and UI performance when creating or opening recipes in projects with large flows or large number of columns

  • Performance: Strongly improved performance of “Computing job dependencies” for very large flows and flows with large number of “branches”

  • Fixed possible crash when using flow zones

  • When copying a recipe, the new recipe now appears in the same zone as the original recipe

  • Fixed “Set auto count of records” action

Machine Learning

  • Performance: Individual explanations: Improved performance with large number of categories and for text features

  • Performance: Improved memory usage for ML training

  • Individual explanations: Fixed scoring recipe with computation of Individual explanations and ‘output probabilities’ disabled

  • Preprocessing: Fixed “MINMAX” mode of feature rescaling

  • Preprocessing: Fixed display of feature generation

  • Preprocessing: Fixed wrong stop words usage for Saved models training

  • SQL scoring: fixed issue with rejected features

  • Interactive scoring: Fixed empty categorical dropdown for some preprocessing

  • Interactive scoring: Fixed first loading of threshold on Firefox

  • Interactive scoring: Fixed issue with UIF

  • Custom algorithms: Added ability to display regression coefficients for custom linear models

  • Custom algorithms: Fixed possible failure when scoring with explanations

  • Custom views: Made Saved model custom views exportable in the dashboard

  • Custom views: Made available for analysis models (in addition to saved models)

  • Partitioned models: Fixed race condition for partitioned training recipe

  • Partitioned models: Fixed detection of unused partitions of partitioned models when partition name contains extended charsets

  • Partitioned models: Fixed display of insight for partitioned models

  • Partitioned models: Fixed duplicated tabs

  • Notebook export: Added support for instance weights

  • Model Document Generation: Fixed issue with models coming from imported or duplicated projects

  • Fixed training edge cases of numeric features with few values, including invalid values, on Python 3

  • Fixed discrepancy between Java scoring and SQL batch scoring on models trained with Python 3

  • Made the seed of the hyper-parameter search independent from the seed of the train-test split

  • Rounded display of threshold when evaluating a binary classification model

  • Fixed scoring recipe with multiclass prediction and Python scoring if “Output probabilities” is disabled

  • Removed the incompatible exponential loss training option for Gradient Boosted Trees on multiclass problems

  • PMML export: Added back support for dummy-encoded categorical features

  • PMML export: Improved consistency between PMML models and DSS scoring

  • PMML export: Added support for models with “treat missing as regular” for categorical features

  • PMML export: Added support for Extra Trees algorithm

  • PMML export: Explicitly listed “drop rows” as incompatible preprocessing for PMML

  • API: Enforced PMML compatibility check

  • API: Added helpers to manage time-based prediction

Scenarios

  • New feature Central tracking of scenario reporters on automation monitoring

  • New feature: Ability to configure the max number of results for “Execute SQL”

  • Fixed infinite loop with monthly triggers running “on first week”

  • Made webhook reporters appear as failed if the webhook gets a non-2XX HTTP return code

  • Fixed connection remapping for “Execute SQL” steps during bundle activation

  • Fixed Python API methods ‘add_monthly_trigger’ and ‘add_periodic_trigger’

  • Fixed addition of newly-created scenarios in catalog

Notebooks

  • New feature: Support for installing Jupyter nbextensions

  • When deleting a notebook, unload it first

  • Fixed adding new tags to notebooks

  • Fixed unloading notebooks of users with a dot in their name

  • Fixed “Explain” in SQL notebooks with very large queries

Coding

  • Javascript client: Fixed Javascript dataiku.getSchema.getColumnNames() method

  • API: Added API to get project creation and last modification dates

  • API: Fixed “DSSFuture” API in Python client

  • Webapps: Fixed Bokeh webapps behind a reverse proxy

  • SQL recipes: Added ability to access recipe inputs by position rather than name

Cloud Stacks

  • Added support for static private IP for nodes

  • Fixed display of the “Clusters” tab in Deployer node

  • Fixed support of special characters in passwords

Security

  • Added the ‘require authentication’ option at webapp levels for plugin webapps

  • Fixed “admin-connection-save” audit log entry

  • Upgraded the Nginx version in container images to avoid an Nginx 1.16 vulnerability (CVE-2019-20372)

Misc

  • Large update of the administrative boundaries for reverse geocoding and administrative charts (notably fixing an issue with some US states)

  • Performance: Performance enhancement for the “Line chart” with “Interrupt line” mode

  • Fixed UI issue in “Enrichments” page of API designer

  • Improved handling of errors in statistics screen

  • Fixed leak of folders in /tmp when UIF is enabled

Version 9.0.1 - April 6th, 2021

DSS 9.0.1 is a significant new release with both performance enhancements and bugfixes

Datasets and connections

  • Azure Synapse: Fixed “contains” formula function and visual operator

  • Snowflake: Added support for explain plans

  • Azure Blob: Fixed issue with restrictive ACLs on parent folders of datasets

  • Delta Lake: Fixed preview of large Delta datasets

Deployer

  • Fixed failure when re-deploying API services from pre-existing infrastructures and deployments from DSS 7.0

  • Fixed project list search in Project Deployer

  • When bundle preload fails, keep the failure logs visible

  • Improved error message readability in the health status of a deployment

  • Improved Deployer integration in the Global Finder

  • Fixed inability to import projects in case of failure during code env remapping

Machine learning

  • Regression: Fixed scatter plot when model is trained in Python 3

Performance and scalability

  • Performance: Improved performance when a very large number of scenarios start at the same time

  • Performance: Improved performance of automation home page with high number of projects and scenario runs

  • Performance: Improved performance of project home page with high number of scenario runs

  • Performance: Improved performance of scenario page with high number of runs

  • Performance: Improved performance of automation monitoring pages with high number of runs

  • Performance: Reduced resource consumption of backend with very large number of triggers

  • Performance: Reduced resource consumption of backend with very large number of “Build” scenario steps

  • Performance: Reduced resource consumption of backend with very large number of connected users

Prepare recipe

  • UI and UX improvements on the smart pattern generator

  • UI and UX improvements on the smart date modal

Coding

  • New feature: Added an API for listing and managing Jupyter notebooks

Cloud stacks

  • Made public IP optional on Fleet Manager CloudFormation template

  • Added EBS encryption for the Fleet Manager EBS

Notebooks

  • SparkSQL notebook: fixed issues with very large Parquet datasets

Misc

  • Added support for Microsoft Edge browser

  • Fixed possible failures of the “clear scenario logs” macro

  • Fixed possible upgrade failure when time-based triggers contained invalid settings

Version 9.0.0 - March 1st, 2021

DSS 9.0.0 is a major upgrade to DSS with major new features.

New features

Unified Deployer

The DSS Deployer provides a unified environment for fully-managed production deployments of both projects and API services. It allows you to have a central view of all of your production assets, to manage CI/CD pipelines with testing/preproduction/production stages, and is fully API-drivable.

For more details, please see Production deployments and bundles.

Interactive scoring and What-if

Interactive scoring is a simulator that enables any AI builder or consumer to run “what-if” analyses (i.e., qualitative sensitivity analyses): by displaying the resulting prediction and the individual prediction explanations in real time, it gives a better understanding of the impact that changing a given feature value has on the prediction.

For more details, please see Interactive scoring.

Dash Webapps

Dash by Plotly is a framework for easily building rich web applications. DSS now includes the ability to write, deploy and manage Dash webapps. Dash joins Flask, Bokeh and Shiny as webapp-building frameworks that help data scientists go much further than simple dashboards and provide full interactivity to users.

For more details, please see Dash web apps.

Fuzzy join recipe

A very frequent data wrangling use case is to join datasets with “almost equal” data. The new “fuzzy join” recipe is dedicated to joins between two datasets when join keys don’t match exactly. It handles inner, left, right and outer fuzzy joins, and handles text, numerical and geographic fuzziness.

For more details, please see Fuzzy join: joining two datasets
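Conceptually, a fuzzy join matches keys that are close rather than strictly equal. A minimal sketch of a left fuzzy join on text keys (this uses `difflib` string similarity purely as a stand-in for the recipe's actual matching logic; all names and the cutoff are illustrative):

```python
import difflib

# Left table rows, and a right-side lookup with slightly different spellings.
left = [{"city": "San Francisco"}, {"city": "Marseille"}]
right = {"san francisco": "USA", "marseilles": "France"}

def fuzzy_lookup(key, candidates, cutoff=0.8):
    # Return the closest candidate above the similarity cutoff, or None.
    matches = difflib.get_close_matches(key.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Left join: keep every left row, attach the fuzzily-matched right value.
for row in left:
    match = fuzzy_lookup(row["city"], list(right))
    row["country"] = right[match] if match else None

print([r["country"] for r in left])  # ['USA', 'France']
```

Here "Marseille" matches "marseilles" despite the trailing "s"; an outer fuzzy join would additionally emit unmatched rows from both sides.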

Smart Pattern Builder

In Data Preparation, you can now highlight a part of a cell in order to automatically generate suggestions to extract information “similar” to the one you highlighted. You can then add other examples to guide the automated pattern builder of DSS, and choose the pattern that provides you with the best results.

Visual ML Diagnostics

ML Diagnostics help you detect common pitfalls while training models, such as overfitting, leakage, or insufficient learning, and can suggest possible improvements.

For more details, please see ML Diagnostics
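As an illustration, the simplest diagnostic of this kind can be thought of as a train/test score comparison (a conceptual sketch, not the DSS implementation; the gap threshold is an arbitrary assumption):

```python
# Conceptual overfitting check: flag models whose training score far
# exceeds their test score. The 0.1 threshold is illustrative only.
def overfitting_diagnostic(train_score, test_score, gap_threshold=0.1):
    gap = train_score - test_score
    if gap > gap_threshold:
        return f"Warning: possible overfitting (train-test gap {gap:.2f})"
    return "No overfitting detected"

print(overfitting_diagnostic(0.98, 0.75))  # flags a large gap
print(overfitting_diagnostic(0.82, 0.80))  # small gap, no warning
```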

Model assertions

Model assertions streamline and accelerate the model evaluation process, by automatically checking that predictions for specified subpopulations meet certain conditions. You can automatically compare “expected predictions” on segments of your test data with the model’s output. DSS will check that the model’s predictions are aligned with your business judgment.

For more details, please see ML Assertions
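The idea can be sketched as follows (a conceptual illustration, not the DSS API; the column names, condition and validity ratio are invented for the example):

```python
# A model assertion: on a filtered subpopulation of the test set, the share
# of rows receiving the expected prediction must reach a minimum ratio.
rows = [
    {"age": 70, "prediction": "high_risk"},
    {"age": 72, "prediction": "high_risk"},
    {"age": 75, "prediction": "low_risk"},
    {"age": 30, "prediction": "low_risk"},
]

def check_assertion(rows, condition, expected, valid_ratio=0.6):
    subpop = [r for r in rows if condition(r)]
    hits = sum(1 for r in subpop if r["prediction"] == expected)
    return len(subpop) > 0 and hits / len(subpop) >= valid_ratio

# Business expectation (assumed): people over 65 should be predicted high risk.
print(check_assertion(rows, lambda r: r["age"] > 65, "high_risk"))  # True (2/3 pass)
```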

Git push and pull for notebooks

It is now possible to fetch Jupyter notebooks from existing Git repositories, and to push them back to their origin. Pulls and pushes can be made notebook-per-notebook or for a group of notebooks.

For more details, please see Importing Jupyter Notebooks from Git

Wiki Export

Wikis can now be exported to PDF, either on a per-article basis or globally.

For more details, please see Wikis

Model Fairness report

Evaluating the fairness of machine learning models has been a topic of both academic and business interest in recent years. Before prescribing any resolution to the problem of model bias, it is crucial to learn more about how biased a model is, by measuring some fairness metrics. The model fairness report provides you with assistance in this measurement task.

For more details, please see Model fairness report

Streaming (experimental)

DSS now features an experimental real-time processing framework, notably targeting Kafka and Spark Structured Streaming.

For more details, please see Streaming data

Delta Lake reading (experimental)

DSS now features experimental support for directly reading the latest version of Delta Lake datasets.

For more details, please see Delta Lake

Other notable enhancements

Azure Synapse support

DSS now officially supports Azure Synapse (dedicated SQL pools)

For more details, please see Azure Synapse

Date Preparation

DSS brings a lot of new capabilities for date preparation:

  • New visual prepare processors for incrementing or truncating dates, and for finding differences between dates

  • New ability to delete, keep or flag rows based on various time intervals

  • Better date filtering capabilities for Explore view

For more details, please see Managing dates

Formula editor

The formula editor has been strongly enhanced with better code completion, inline help for all functions and features, and better examples.

For more details, please see Formula language

Spark 3

DSS now supports Spark 3.

If using Dataiku Cloud Stacks for AWS or Elastic AI for Spark, Spark 3 is builtin.

It is also now possible to use SparkSession in PySpark code

Python 3.7

DSS now supports Python 3.7

You can now create Python 3.7 code envs. In addition, on Linux distributions where Python 3.7 is the default, DSS will automatically use it.

In addition, new DSS setups will now use Python 3.6 or Python 3.7 as the default builtin environment.

In Python 3.7, async is promoted to a reserved keyword and thus can no longer be used as a keyword argument in a method or function. As a consequence, the DSS Scenario API replaces the async keyword argument, formerly used in some methods, with the asynchronous keyword argument. Please make sure to update uses of the Scenario class accordingly if running Python scenarios or Python scenario steps with Python 3.7.

Impacted methods are: run_scenario, run_step, build_dataset, build_folder, train_model, invalidate_dataset_cache, clear_dataset, clear_folder, run_dataset_checks, compute_dataset_metrics, synchronize_hive_metastore, update_from_hive_metastore, execute_sql, set_project_variables, set_global_variables, run_global_variables_update, create_jupyter_export, package_api_service.
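The rename is forced by the language itself: under Python 3.7+, a call that still passes async= as a keyword argument does not even compile. A minimal demonstration (the scenario call inside the strings is illustrative):

```python
# Under Python 3.7+, `async` is a reserved keyword, so the old-style call
# is rejected at compile time, before any DSS code runs.
old_style = "scenario.build_dataset('mydataset', async=False)"
new_style = "scenario.build_dataset('mydataset', asynchronous=False)"

def compiles(source):
    # Report whether the snippet is syntactically valid Python.
    try:
        compile(source, "<scenario-step>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(old_style))  # False: SyntaxError on Python 3.7+
print(compiles(new_style))  # True
```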

Builtin Snowflake driver

DSS now comes with the Snowflake JDBC driver and native Spark connector builtin. You do not need to install JDBC drivers for Snowflake anymore.

Enhanced “time-based” trigger

The time-based trigger in scenario has been strongly enhanced with the following capabilities:

  • Ability to show and handle triggering times in all timezones, not only server timezone

  • Ability to run every X hours instead of only every hour

  • Ability to run every X days instead of only every day

  • Ability to run every X weeks instead of only every week

  • Ability to run every X months instead of only every month

  • For “every X months” triggers, ability to run on “last Monday” or “third Tuesday”

  • Ability to set a starting date for a trigger

Enhanced cross-connection and no-input SQL recipes

SQL recipes can now work without an input dataset. The recipe will run in the connection of the output dataset.

For SQL recipes with both inputs and outputs, it is now possible to enable “cross-connection” handling while using the connection of the output (previously, only inputs could be selected).

Addition of individual users to projects

You can now grant access to projects to individual users, in addition to groups.

Pan/Zoom control in Flow

You can now zoom and pan on the flow with the keyboard, and zoom and reset the zoom with dedicated buttons.

Variables expansion support in “Build”

The “Build” dialog now supports variables expansion for partitioned datasets

Variables expansion support in “Explicit values”

The “Explicit Values” partition dependency function now supports variables expansion

Schema reload and propagation as scenario steps

In many situations, it is expected that the schema of a Flow input dataset will change frequently, and that these changes should be accepted and their impacts propagated without further manual intervention.

To ease these situations, DSS 9 introduces two new scenario steps:

  • “Reload dataset schema” to reload the schema of an input dataset from the underlying data source

  • “Propagate schema” to perform an automated schema propagation across the Flow.

These steps should usually be used before a recursive Build step.
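The recommended ordering can be sketched as a plain list of step descriptors (the type names and keys below are illustrative, not the exact DSS scenario-step JSON):

```python
# Hedged sketch: reload the input schema, propagate it across the Flow,
# then run the recursive build last.
steps = [
    {"type": "reload_dataset_schema", "dataset": "input_ds"},
    {"type": "propagate_schema", "source": "input_ds"},
    {"type": "build_flowitem", "dataset": "output_ds", "mode": "RECURSIVE_BUILD"},
]

order = [s["type"] for s in steps]
print(order)
```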

Experimental read support for kdb+

Dataiku now features experimental support for reading from kdb+.

Other enhancements and fixes

Datasets

  • Snowflake: the JDBC driver and Spark connector are now preinstalled and do not need manual installation anymore

  • Snowflake: added post-connect statements

  • Snowflake: added support for Snowflake -> S3 fast-path when the target bucket mandates encryption

  • Vertica: fixed partitioning outside of the default schema

  • PostgreSQL: the builtin PostgreSQL driver has been updated to a more recent version, which notably fixes issues when importing tables from PostgreSQL 12

  • S3: It is now possible to force “path-style” rather than “virtualhost-style” S3 access. This is mainly useful for “S3-compatible” storages.

  • BigQuery: fixed ability to use “high throughput” mode for the JDBC driver

Flow

  • Added detection of changes in editable datasets, which will now properly trigger rebuilds

  • Fixed missing refresh of “Building” indicator with flow zones

  • Fixed wrong “current” flow zone remembered when browsing

Visual recipes

  • Prepare on Snowflake: fixed handling of accented column names

  • Fixed handling of “contains” formula operator on Impala when the string to match contains _

  • Fixed “Use an existing folder” on download recipe

  • Added variables expansion on “Flag rows where formula matches” processor

Machine Learning

  • The Evaluation recipe can now output the cost matrix gain

  • PMML export now supports dummy-encoded variables

  • Custom models can now access the list of feature names

  • Fixed scoring failures on the SQL engine when numerical features are stored as text

  • Text features: fixed stop words when training in containers

  • Fixed warning in Jupyter when exporting a model to a Jupyter notebook

  • Added ability to define a class inline for a custom model

  • Switched XGBoost feature importances to use the “gain” method (library default since version 0.82)

Elastic AI and Kubernetes

  • AKS: fixed node pool creation with a zero minimum number of nodes

  • AKS: added ability to select the system node pool

  • Disabling an already-disabled Kubernetes-based API deployment will not fail anymore

  • Fixed webapps on Kubernetes leaking “Deployment” objects in Kubernetes

  • Fixed possible failures deploying webapps due to invalid Kubernetes labels

  • Fixed possible failures running Spark pipelines due to invalid Kubernetes labels

  • Added support for CUDA 11 when building base images

  • Fixed validation of Hive recipes containing “UNION ALL” on HDP 3 and EMR

Collaboration

  • Fixed “Back” button when going to the catalog

  • Fixed tags filtering with spaces in tag names

  • Fixed links to DSS items when putting a wiki page on the home page

  • Fixed display of Scala notebooks in Catalog

Automation

  • The project home page now indicates when scenario triggers are disabled

  • Added ability for administrators to force the SMTP sender, preventing users from setting it

  • Performance improvements on “Automation monitoring” pages

Coding

  • Fixed handling of records containing \r in Python when using write_dataframe

  • Fixed code env rebuilding if the code env folder had been removed

  • Fixed “with_default_env” on project settings class

  • Fixed ability to delete a code env if a broken dataset exists

Charts

  • Added a safeguard against potential memory overruns when requesting too high a number of bins

  • Fixed sort with null values on PostgreSQL

Notebooks

  • SQL notebooks: added the ability to display explain plans directly in the notebook

  • Jupyter: Fixed “File > New” and “File > Copy” actions

  • Fixed renamed notebooks not appearing in “recent elements”

  • Fixed icon of SQL notebooks in “recent elements”

Misc

  • RMarkdown: fixed support for project libraries

  • Fixed erroneous behavior of the browser’s “Back” button when going to the catalog

  • Small UI improvements in multiple locations