DSS 4.2 Release notes
From DSS 4.1: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
From DSS 4.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.0 -> 4.1
From DSS 3.1: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 3.1 -> 4.0 and 4.0 -> 4.1
From DSS 3.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 3.0 -> 3.1, 3.1 -> 4.0 and 4.0 -> 4.1
From DSS 2.X: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions: see 2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 2.3, 2.3 -> 3.0, 3.0 -> 3.1, 3.1 -> 4.0 and 4.0 -> 4.1
Migration from DSS 1.X is not supported. You must first upgrade to 2.0. See DSS 2.0 Release notes
It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
DSS 4.2 is a major release, which changes some underlying workings of DSS. Automatic migration from previous versions is supported, but there are a few points that need manual attention.
Models trained with prior versions of DSS should be retrained when upgrading to 4.2 (usual limitations on retraining models and regenerating API node packages - see Upgrading a DSS instance). This includes models deployed to the flow (re-run the training recipe), models in analysis (retrain them before deploying) and API package models (retrain the flow saved model and build a new package)
After installation of the new version, R setup must be replayed
Several external libraries bundled with DSS have been bumped to major revisions. Some of these libraries include some changes that may require adaptation of your code.
As usual, remember that you should not change the version of Python libraries bundled with DSS. Instead, use Code environments.
BigQuery: added support for latest JDBC driver versions (>= 1.1.6)
Fixed error when browsing path of Google Cloud Storage datasets
Fixed explore of DB2 datasets when the compatibility mode is not MySQL
New feature: Added ability to revert the design of a prediction task to a previously trained model
Fixed issues with outlier detection in MLlib clustering
Fixed failure when training multiple MLlib clustering models at once
Fixed failure when deploying custom MLlib clustering models
Fixed excessive memory consumption on linear models
Fixed display of interactive clustering hierarchy with a high number of clusters
Fixed API node version activation when using Lasso-LARS algorithm
Added proper error message when trying to ensemble K-fold-cross-tested models (not supported)
New feature: Integration with collectd for system monitoring
Added support for Java 10
Fixed reset of HDFS connection settings when upgrading multi-user-security-enabled instances
Fixed external Elasticsearch 6 datasets
Fixed testing of Elasticsearch datasets with “Trust any SSL certificate” option
Fixed missing authorization in Jupyter that could allow users to shutdown and delete unauthorized notebooks
Fixed missing enforcement of “Freely usable by” connection permission on SQL queries written from R scripts (using dkuSQLQueryToData)
Fixed copy of Python recipes with a managed folder as output
Fixed other edge cases in copy of recipes
Performance improvements for formulas
Made it easier to write into managed folders in Multi-user-security-enabled DSS instances
Fixed automation node not taking into account the “Install Jupyter Support” flag for code environments
Fixed Python code environments on Mac (TLS issue in pip)
Fixed “Clean internal DBs” macro that could prevent running jobs from finishing
Worked around a Conda bug preventing installation of Jupyter in Conda environments
Improved support for PingFederate SSO IdP (compatibility with default behavior)
Fixed Notebooks list in “Lab”
S3: Faster files enumeration on large directories
Teradata-Hadoop sync: added support for multi-user security
Teradata-Hadoop sync: fixed distribution modes and added parallelism settings to all modes
Pivot recipe: added support for Teradata
Prepare recipe: fixed possible NPE in the “Remove column” processor when using pattern mode
Do not fail on startup if a model needs to be retrained. Instead, display an informative message
Various performance improvements
Fixed sample fetching from the catalog on Teradata tables
Preliminary support for Ubuntu 18.04
Fixed Multi-User-Security mode on SuSE 12
Security: Add “noopener noreferrer” to all links from DSS to https://dataiku.com
Security: Add directives to disable password autocompletion in various forms
DSS 4.2.0 is a major upgrade to DSS with significant new features. For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew
You can now define a column to be used as the sample weights column when training a machine-learning model.
When a sample weights column is enabled:
Most algorithms take it into account for training
All performance metrics become weighted metrics for better evaluation of your model’s performance
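The effect of a sample weights column on metrics can be sketched in plain Python. This is an illustration of what a weighted metric means, not DSS's internal implementation; DSS computes these metrics itself when the feature is enabled.

```python
# Illustration of how a sample weights column changes a performance metric.
# Plain-Python sketch; DSS computes weighted metrics internally.

def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each row contributes proportionally to its weight."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

# Uniform weights reduce to the usual accuracy: 3 of 4 rows correct -> 0.75
print(weighted_accuracy(y_true, y_pred, [1, 1, 1, 1]))  # 0.75

# Weighting the misclassified row heavily drags the metric down: 3/8 = 0.375
print(weighted_accuracy(y_true, y_pred, [1, 1, 5, 1]))  # 0.375
```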
In addition to the traditional “HDFS” dataset, DSS now supports a native “Hive” dataset.
When reading a “Hive” dataset, DSS uses HiveServer2 to access its data (compared to the direct access to the underlying HDFS files, with the traditional HDFS dataset).
This gives access to some Hive-native features that were not possible with the HDFS dataset:
Support for Hive views (even if you don’t have filesystem access to the underlying tables)
Support for ACID Hive tables
Better support for “decimal” and “date” data types
The Hive dataset can be used in all visual recipes in addition to the coding Hive recipe.
When importing tables from the Hive metastore, you can now select whether to import them as HDFS or Hive datasets.
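The difference between the two access paths can be sketched with the third-party PyHive package: a “Hive” dataset reads through a HiveServer2 connection rather than the backing HDFS files. Host, port and table name below are placeholders, and DSS manages this connection itself; the snippet only illustrates the access path.

```python
# Hedged sketch: reading a table through HiveServer2, as the new "Hive"
# dataset does, instead of reading the backing HDFS files directly.
# Host, port and table name are placeholders.

def read_via_hiveserver2(host, table, port=10000):
    from pyhive import hive  # deferred import: requires the PyHive package

    conn = hive.connect(host=host, port=port)
    cursor = conn.cursor()
    # Because the query goes through HiveServer2, it also works on Hive
    # views and ACID tables, which have no directly readable HDFS files.
    cursor.execute("SELECT * FROM {} LIMIT 10".format(table))
    return cursor.fetchall()
```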
When running DSS in multi-user-security mode (see User Isolation), you can now use impersonation features of some enterprise databases.
This gives per-user impersonation when logging into the database (i.e. connections to the database are made as the final user, not as the DSS service account), without requiring users to individually enter and store their connection credentials.
This feature is available for:
Automation nodes now get a dedicated home page that shows the state of all of your scenarios.
All machine learning model operations can now be performed using the API, and we provide a Python client for this:
Modifying their settings
Retrieving details of trained models
Deploying trained models to DSS Flow
Creating scoring recipes
See Python APIs
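The operations above can be sketched with the dataikuapi Python client. All IDs, names and the host/API key are placeholders, and the exact method names may differ by DSS version; treat this as an outline and refer to the Python APIs documentation for the authoritative interface.

```python
# Hedged sketch of driving a visual ML task from the Python client
# (dataikuapi). Host, API key and all IDs are placeholders; method names
# may differ by DSS version -- see the Python APIs documentation.

def retrain_and_deploy(host, api_key, project_key, analysis_id, mltask_id):
    import dataikuapi  # deferred import: requires the dataikuapi package

    client = dataikuapi.DSSClient(host, api_key)
    project = client.get_project(project_key)

    # Retrieve the ML task, retrain it, and inspect the trained models
    mltask = project.get_ml_task(analysis_id, mltask_id)
    mltask.start_train()
    mltask.wait_train_complete()

    for model_id in mltask.get_trained_models_ids():
        details = mltask.get_trained_model_details(model_id)
        print(model_id, details.get_raw().get("trainInfo"))

    # Deploy one trained model to the Flow (names are placeholders)
    best = mltask.get_trained_models_ids()[0]
    return mltask.deploy_to_flow(best, "my_saved_model", "my_train_dataset")
```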
Improved ability to edit metadata of items, which can now be edited directly from the Flow or objects lists
Improved tags management UI
Added ability to rename a tag
You can now select from more cropping and stretching modes for your project homes
Spark pipelines now handle more kinds and cases of Flows
Prediction scoring recipes in Spark mode can now be part of a Spark pipeline
SQL datasets can now be partitioned by multiple dimensions rather than just a single one
DSS can now read CSV files with duplicate column names
It is now possible to ignore “unterminated quoted field” errors in CSV, and keep parsing the next files
It is now possible to ignore broken compressed file errors in CSV, and keep parsing the next files
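Handling duplicate column names typically means suffixing the repeats, which the standard library makes easy to illustrate. This is a plain-Python sketch of the behavior; DSS's own renaming scheme may differ.

```python
import csv
import io

# Plain-Python illustration of reading a CSV whose header contains
# duplicate column names, by suffixing repeats. DSS's own renaming
# scheme may differ -- this only shows the general idea.

def dedup_header(header):
    seen, out = {}, []
    for name in header:
        n = seen.get(name, 0)
        out.append(name if n == 0 else "{}_{}".format(name, n))
        seen[name] = n + 1
    return out

data = io.StringIO("id,value,value\n1,a,b\n")
rows = list(csv.reader(data))
print(dedup_header(rows[0]))  # ['id', 'value', 'value_1']
```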
Added support for Elasticsearch 6
Forbid creating datasets at the root of a connection (which is very likely an error, and could lead to dropping all connection data)
Automatically disable Hive and Impala metrics engine if the dataset does not have associated metastore information
Exporting visual recipes to SQL query now takes aliases into account
Added ability to compare dates in DSS Formulas
It is now possible to disable steps
It is now possible to have steps that execute even if previous steps failed
It is now possible to import a plugin in DSS by cloning an existing Git repository
A plugin installed in DSS can now be converted to a “plugin in development” so it can be modified directly in the plugin editor
The Jupyter Notebook (providing Python, R and Scala notebooks) has been upgraded to version 5.4
This provides fixes for:
Saving plotly charts
Displaying Bokeh charts
You do not need to restart DSS anymore to take into account new Spark settings for the Jupyter notebook
Custom scoring functions can now receive the X input dataframe in addition to the actual and predicted values
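A custom scoring function that also sees the input features can, for example, score only a subset of rows. The signature and the `segment` column below are illustrative, not DSS's exact custom-metric interface.

```python
# Generic sketch of a custom scoring function that uses the input features
# X in addition to the actual/predicted values. The signature and the
# "segment" column are illustrative, not DSS's exact interface.

def score(y_true, y_pred, X):
    """Mean absolute error, computed only on rows where segment == 'A'."""
    errors = [abs(t - p)
              for t, p, seg in zip(y_true, y_pred, X["segment"])
              if seg == "A"]
    return sum(errors) / len(errors) if errors else 0.0

X = {"segment": ["A", "B", "A"]}
print(score([1.0, 2.0, 3.0], [1.5, 0.0, 3.0], X))  # 0.25
```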
SGD and SVM algorithms have been added for regression (they were already available for classification)
“For Display Only” variables are now usable in more kinds of clustering report screens
It is now possible to configure how many algorithms are trained in parallel (was previously always 2)
DSS now supports Java 9
It is now possible to customize the GC algorithm
DSS now automatically configures the Java heap with a value depending on the size of the machine
DSS now automatically uses G1 GC on Java 8 and higher
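The automatic choices above correspond to standard HotSpot JVM options. The fragment below shows those generic flags for reference only; DSS sets them itself, and these are not DSS configuration keys.

```
-Xmx8g        # heap size, chosen by DSS from the machine's memory (8g is a placeholder)
-XX:+UseG1GC  # the G1 garbage collector, used automatically on Java 8 and higher
```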
New API to create new files in development plugins
New API to download a development plugin as a Zip file
Added ability to force types in JSON output
Added more ability to automatically clear various temporary data
Fixed parsing of “year + week number” kind of dates
Fixed merge of clusters in value clustering with overlapping clusters
Fixed error when computing full sample analysis on a column which was not in the schema
Fixed models on foreign (from another project) datasets
Fixed invalid rescaled coefficients statistics for linear models
Fixed Evaluate recipe when some rows are dropped by the “Drop rows” imputation method
Fixed “Drop rows” imputation method on the API node when using optimized scoring engine
SQL datasets: Multiple issues with “date” columns in SQL have been fixed
SQL datasets: Add ability to read Oracle CLOB fields
Avro: fix reading of some Avro files with type references
S3: Fixed reading of some Gzip files that failed
Elasticsearch: on managed Elasticsearch datasets, partitioning columns for value dimensions are now typed as “keyword” on ES 5+, rather than “string”, which is deprecated in ES 5 and not supported by ES 6.
Show column renamings in the “View SQL query” section of visual recipes
Fixed partitioning sync from SQL to HDFS using Spark engine
Fixed “Concat Distinct” aggregation
Prevent failing join with DSS engine if columns have leading or trailing whitespaces
Fixed “null ordering” with DSS engine
Fixed window on range using DSS engine with nulls in ordering column
Fixed export recipe on partitioned datasets (was exporting the whole dataset)
Copying a prepare recipe now properly initializes schema on the copied dataset
Fixed Grouping recipe with Spark when renaming column and using post-filtering on renamed column
Fixed Hive recipe validation failure if the input dataset doesn’t have an associated Hive table
Fixed export of Jupyter dataframe when it contains non-ascii column names
Fixed failure to write managed folder files when files are very small
Fixed “output piping” in the Shell recipe
Added ability to process dates after the current date in the “Time Range” dependency function
Fixed building both Filesystem and SQL partitioned datasets at the same time
Fixed some cases where exports of RMarkdown reports would not display all kinds of charts.
Don’t try to use Hive or Impala for metrics if the dataset doesn’t have an associated Hive table
Fixed loss of “Additional dashboard users” and Project Status when deploying on automation node
Fixed issues with migration of webapps on Automation node