DSS 4.2 Release notes¶
Migration notes¶
Migration paths to DSS 4.2¶
From DSS 4.1: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
From DSS 4.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 4.0 -> 4.1
From DSS 3.1: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 3.1 -> 4.0 and 4.0 -> 4.1
From DSS 3.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 3.0 -> 3.1, 3.1 -> 4.0 and 4.0 -> 4.1
From DSS 2.X: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions: see 2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 2.3, 2.3 -> 3.0, 3.0 -> 3.1, 3.1 -> 4.0 and 4.0 -> 4.1
Migration from DSS 1.X is not supported. You must first upgrade to 2.0. See DSS 2.0 Release notes
How to upgrade¶
It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
Limitations and warnings¶
DSS 4.2 is a major release, which changes some underlying workings of DSS. Automatic migration from previous versions is supported, but there are a few points that need manual attention.
Retrain of machine-learning models¶
Models trained with prior versions of DSS should be retrained when upgrading to 4.2 (the usual limitations on retraining models and regenerating API node packages apply - see Upgrading a DSS instance). This includes models deployed to the Flow (re-run the training recipe), models in analysis (retrain them before deploying) and API package models (retrain the Flow saved model and build a new package).
After installation of the new version, R setup must be replayed
External libraries upgrades¶
Several external libraries bundled with DSS have been upgraded to new major versions. Some of these upgrades include changes that may require adapting your code.
As usual, remember that you should not change the version of Python libraries bundled with DSS. Instead, use Code environments.
Version 4.2.5 - May 31st, 2018¶
Machine learning¶
Fixed retraining of LASSO-LARS models
Version 4.2.4 -¶
Internal release
Version 4.2.3 - May 9th, 2018¶
Machine learning¶
New feature: Added ability to revert the design of a prediction task to a previously trained model
Fixed issues with outlier detection in MLlib clustering
Fixed failure when training multiple MLlib clustering models at once
Fixed failure when deploying custom MLlib clustering models
Fixed excessive memory consumption on linear models
Fixed display of the interactive clustering hierarchy with a high number of clusters
Fixed API node version activation when using Lasso-LARS algorithm
Added proper error message when trying to ensemble K-fold-cross-tested models (not supported)
Version 4.2.2 - April 17th, 2018¶
Datasets¶
Fixed external Elasticsearch 6 datasets
Fixed testing of ElasticSearch datasets with “Trust any SSL certificate” option
Security¶
Fixed missing authorization in Jupyter that could allow users to shut down and delete unauthorized notebooks
Fixed missing enforcement of the “Freely usable by” connection permission on SQL queries written from R scripts (using dkuSQLQueryToData)
Flow¶
Fixed copy of Python recipes with a managed folder as output
Fixed other edge cases in copy of recipes
Machine learning¶
Fixed lift curve with sample weights
Misc¶
Performance improvements for formulas
Made it easier to write into managed folders in Multi-user-security-enabled DSS instances
Fixed automation node not taking into account the “Install Jupyter Support” flag for code environments
Fixed Python code environments on Mac (TLS issue in pip)
Fixed “Clean internal DBs” macro that could prevent running jobs from finishing
Worked-around Conda bug preventing installation of Jupyter on conda environments
Improved support for PingFederate SSO IdP (compatibility with default behavior)
Fixed Notebooks list in “Lab”
Version 4.2.1 - April 3rd, 2018¶
Datasets¶
S3: Faster files enumeration on large directories
Teradata-Hadoop sync: added support for multi-user security
Teradata-Hadoop sync: fixed distribution modes and added parallelism settings to all modes
Machine learning¶
Fixed Jupyter notebooks export of models
Fixed “Redetect settings” button
Visual recipes¶
Pivot recipe: added support for Teradata
Prepare recipe: fixed possible NPE in the remove-column processor when using pattern mode
API node¶
No longer fail on startup if the model needs to be retrained; instead, display an informative message
Misc¶
Various performance improvements
Fixed sample fetching from the catalog on Teradata tables
Preliminary support for Ubuntu 18.04
Fixed Multi-User-Security mode on SuSE 12
Security: Added “noopener noreferrer” to all links from DSS to https://dataiku.com
Security: Added directives to disable password autocompletion in various forms
Version 4.2.0 - March 21st, 2018¶
DSS 4.2.0 is a major upgrade to DSS with significant new features. For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew
New features¶
Support for sample weights in visual machine learning¶
You can now define a column to be used as the sample weights column when training a machine-learning model.
When a sample weights column is enabled:
Most algorithms take it into account for training
All performance metrics become weighted metrics for better evaluation of your model’s performance
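For intuition, here is a minimal sketch (hypothetical data, not DSS code) of what this means in scikit-learn, which backs the in-memory Python engine of DSS visual machine learning: the same weight vector is passed both to training and to the metric, so the reported performance is weighted too.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    w = np.array([1.0, 1.0, 5.0, 5.0])  # hypothetical sample weights column

    clf = LogisticRegression().fit(X, y, sample_weight=w)      # weighted training
    acc = accuracy_score(y, clf.predict(X), sample_weight=w)   # weighted metric
    print(acc)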
“Hive” dataset (views and decimal support)¶
In addition to the traditional “HDFS” dataset, DSS now supports a native “Hive” dataset.
When reading a “Hive” dataset, DSS uses HiveServer2 to access its data (rather than directly accessing the underlying HDFS files, as the traditional HDFS dataset does).
This gives access to some Hive-native features that were not possible with the HDFS dataset:
Support for Hive views (even if you don’t have filesystem access to the underlying tables)
Support for ACID Hive tables
Better support for “decimal” and “date” data types
The Hive dataset can be used in all visual recipes in addition to the coding Hive recipe.
When importing tables from the Hive metastore, you can now select whether to import each one as an HDFS or Hive dataset.
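From code recipes and notebooks, a Hive dataset is read like any other dataset. A minimal sketch, assuming a Hive dataset named my_hive_dataset already exists in the project:

    import dataiku

    # Reads go through HiveServer2 rather than the underlying HDFS files
    ds = dataiku.Dataset("my_hive_dataset")
    df = ds.get_dataframe()
    print(df.dtypes)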
Impersonation on SQL databases¶
When running DSS in multi-user-security mode (see User Isolation), you can now use impersonation features of some enterprise databases.
This gives per-user impersonation when logging into the database (i.e. connections to the database are made as the final user, not as the DSS service account), without requiring users to individually enter and store their connection credentials.
This feature is available for:
Microsoft SQL Server (also added: Kerberos authentication support)
Oracle (also added: Kerberos authentication support)
Full support for BigQuery¶
DSS now supports both read and write for Google BigQuery
Dedicated automation homepage¶
Automation nodes now get a dedicated home page that shows the state of all of your scenarios.
API for managing and training machine-learning models¶
All machine-learning model operations can now be performed using the API, and we provide a Python client for this:
Creating models
Modifying their settings
Training them
Retrieving details of trained models
Deploying trained models to DSS Flow
Creating scoring recipes
See Python APIs
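A minimal sketch using the dataikuapi Python client; the host, API key, project key, dataset and target names are placeholders, and method signatures may vary slightly between DSS versions:

    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    # Create a prediction task and let DSS guess initial settings
    mltask = project.create_prediction_ml_task(
        input_dataset="customers", target_variable="churn")
    mltask.wait_guess_complete()

    # Optionally inspect and adjust the design before training
    settings = mltask.get_settings()
    settings.save()

    # Train, retrieve details, and deploy the first model to the Flow
    mltask.start_train()
    mltask.wait_train_complete()
    model_id = mltask.get_trained_models_ids()[0]
    details = mltask.get_trained_model_details(model_id)
    mltask.deploy_to_flow(model_id, model_name="churn_model",
                          train_dataset="customers")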
Other notable enhancements¶
UI and collaboration¶
Improved ability to edit metadata of items, which can now be edited directly from the Flow or from object lists
Improved tags management UI
Added ability to rename a tag
You can now select from more cropping and stretching modes for your project homes
Spark¶
Spark pipelines now handle more kinds and cases of Flows
Prediction scoring recipes in Spark mode can now be part of a Spark pipeline
Datasets¶
SQL datasets can now be partitioned by multiple dimensions, not just a single one
DSS can now read CSV files with duplicate column names
It is now possible to ignore “unterminated quoted field” errors in CSV and keep parsing the next files
It is now possible to ignore broken compressed file errors in CSV and keep parsing the next files
Added support for ElasticSearch 6
Forbid creating datasets at the root of a connection (which is very likely an error, and could lead to dropping all connection data)
Automatically disable Hive and Impala metrics engine if the dataset does not have associated metastore information
Visual recipes¶
Exporting visual recipes to SQL query now takes aliases into account
Added ability to compare dates in DSS Formulas
Machine Learning¶
Display Isolation Forest anomaly score in the ML UI
Scenarios¶
It is now possible to disable steps
It is now possible to have steps that execute even if previous steps failed
Plugins¶
It is now possible to import a plugin in DSS by cloning an existing Git repository
A plugin installed in DSS can now be converted to a “plugin in development” so it can be modified directly in the plugin editor
Jupyter Notebook¶
The Jupyter Notebook (providing Python, R and Scala notebooks) has been upgraded to version 5.4
This provides fixes for:
Saving plotly charts
Displaying Bokeh charts
You do not need to restart DSS anymore to take into account new Spark settings for the Jupyter notebook
Machine Learning¶
Custom scoring functions can now receive the X input dataframe in addition to the y_pred and y_true series (see the sketch after this list)
SGD and SVM algorithms have been added for regression (they were already available for classification)
“For Display Only” variables are now usable in more kinds of clustering report screens
It is now possible to configure how many algorithms are trained in parallel (was previously always 2)
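As a sketch of the custom scoring change above: a custom scoring function can now use columns of X, for example to weight errors by a business quantity. The exact signature expected by DSS is documented with the feature; the amount column here is hypothetical.

    import numpy as np

    def score(y_true, y_pred, X=None):
        # Cost-sensitive error: weight each misclassification by the row's amount
        errors = np.asarray(y_true) != np.asarray(y_pred)
        if X is not None:
            return -float(np.sum(errors * X["amount"].values))
        return -float(np.sum(errors))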
Java runtime¶
DSS now supports Java 9
It is now possible to customize the GC algorithm
DSS now automatically configures the Java heap with a value depending on the size of the machine
DSS now automatically uses G1 GC on Java 8 and higher
API¶
New API to create new files in development plugins
New API to download a development plugin as a Zip file
Added ability to force types in the query_to_df API
Administration¶
JSON output for the apinode-admin tool
Added more ability to automatically clear various temporary data
Notable bug fixes¶
Data preparation¶
Fixed parsing of “year + week number” date formats
Fixed merge of clusters in value clustering with overlapping clusters
Fixed error when computing full sample analysis on a column which was not in the schema
Machine Learning¶
Fixed models on foreign datasets (i.e. datasets from another project)
Fixed invalid rescaled coefficients statistics for linear models
Fixed Evaluate recipe when some rows are dropped by the “Drop rows” imputation method
Fixed “Drop rows” imputation method on the API node when using optimized scoring engine
Datasets¶
SQL datasets: Multiple issues with “date” columns in SQL have been fixed
SQL datasets: Added ability to read Oracle CLOB fields
Avro: fixed reading of some Avro files with type references
S3: Fixed reading of some Gzip files that failed
Elasticsearch: on managed Elasticsearch datasets, partitioning columns for value dimensions are now typed as keyword on ES 5+, rather than string, which is deprecated in ES 5 and not supported by ES 6
Visual recipes¶
Show column renamings in the “View SQL query” section of visual recipes
Fixed partitioning sync from SQL to HDFS using Spark engine
Fixed “Concat Distinct” aggregation
Prevented join failure with the DSS engine if columns have leading or trailing whitespace
Fixed “null ordering” with DSS engine
Fixed window on range using DSS engine with nulls in ordering column
Fixed export recipe on partitioned datasets (was exporting the whole dataset)
Copying a prepare recipe now properly initializes schema on the copied dataset
Fixed Grouping recipe with Spark when renaming column and using post-filtering on renamed column
Multi-user-security¶
Fixed various issues with HDFS managed folders in MUS mode
Coding¶
Fixed Hive recipe validation failure if the input dataset doesn’t have an associated Hive table
Fixed export of Jupyter dataframes containing non-ASCII column names
Fixed failure to write managed folder files when files are very small
Fixed “output piping” in the Shell recipe
Flow¶
Added ability to process dates after the current date in the “Time Range” dependency function
Fixed building both Filesystem and SQL partitioned datasets at the same time
Code reports¶
Fixed some cases where exports of RMarkdown reports would not display all kinds of charts
Metrics¶
Don’t try to use Hive or Impala for metrics if the dataset doesn’t have an associated Hive table
Automation¶
Fixed loss of “Additional dashboard users” and Project Status when deploying on automation node
Fixed issues with migration of webapps on Automation node