DSS 4.0 Release notes

Migration notes

Migration paths to DSS 4.0

  • From DSS 3.1: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings

  • From DSS 3.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to DSS 3.0. See 3.0 -> 3.1

  • From DSS 2.X: In addition to the following restrictions and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions: see 2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 2.3, 2.3 -> 3.0 and 3.0 -> 3.1

  • Migration from DSS 1.X is not supported. You must first upgrade to 2.0. See DSS 2.0 Release notes

How to upgrade

It is strongly recommended that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance.

Pay attention to the warnings described in Limitations and warnings.

Limitations and warnings

DSS 4.0 is a major release, which changes some underlying workings of DSS. Automatic migration from previous versions is supported, but there are a few points that need manual attention.

HiveServer2

In Hadoop settings, previous versions of DSS didn’t use the HiveServer2 component. DSS now uses and requires HiveServer2 for all interaction with Hive. HiveServer2 is included by default in all Hadoop distributions. See Hive for more information.

When migrating from previous versions, you need to set up the hostname of your HiveServer2 instance in Administration > Settings > Hadoop.

Charts on SQL or Impala

The way the charts engine is configured has been redesigned. You now first select the desired engine, and DSS will show errors if the engine is not compatible. While most charts that used to run on SQL (or Impala) will continue to do so, we recommend that you check all charts that were supposed to run on SQL, and more generally all charts that use “full” sampling on datasets.

Permissions

The permissions system has been overhauled and new permission definitions have been introduced. DSS automatically migrates permissions to the new system. We recommend that you check all permissions, both for users and API keys.

Dashboard

The new dashboard uses a new layout system, with a responsive grid instead of a fixed-size one. You might need to tweak the layout of your existing dashboards.

R

After upgrading to DSS 4, you’ll need to re-run ./bin/dssadmin install-R-integration for R to work properly.

Webapps

Since the addition of the new dashboard, Webapps have been moved to their own section in the UI. You’ll find the usual webapp editor in the “Notebooks” section of the project (“Web Apps” subtab)

For webapps that have a Python backend, make sure that the Python backend file does not contain an encoding magic comment, such as:

# -*- coding: <encoding name> -*-

or:

# coding=utf-8

The old deprecated “/datasets/getcontent” API used by webapps prior to DSS 1.0 has been removed. Very old webapps still using dataiku_load_dataset() or dataiku_dataset_object need to be migrated to the new Webapps API.

Other

  • Models trained with prior versions of DSS must be retrained when upgrading to 4.0 (usual limitations on retraining models and regenerating API node packages - see Upgrading a DSS instance). This includes models deployed to the flow (re-run the training recipe), models in analysis (retrain them before deploying) and API package models (retrain the flow saved model and build a new package)

  • After installation of the new version, R setup must be replayed

  • We now recommend using mainly personal API keys for external applications controlling DSS, rather than project or global keys. Some operations, like creating datasets or recipes, are not always possible using non-personal API keys.
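As a hedged sketch of that recommendation (the host URL, port and key value below are placeholders, not values shipped with DSS), an external application can use the public API Python client with a personal key as follows:

    import dataikuapi

    # Placeholder host/port and key: replace with your DSS URL and a personal
    # API key generated from the user's profile
    client = dataikuapi.DSSClient("http://dss-host:11200", "MY_PERSONAL_API_KEY")

    # Calls made with a personal key run with the rights of the key's owner,
    # so project and dataset operations behave as if that user performed them
    for project in client.list_projects():
        print(project["projectKey"])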

External libraries upgrades

Several external libraries bundled with DSS have been bumped to major revisions. Some of these libraries include some backwards-incompatible changes. You might need to upgrade your code.

Notable upgrades:

  • scikit-learn 0.17 -> 0.18

  • matplotlib 1.5 -> 2.0

As usual, remember that you should not change the version of Python libraries bundled with DSS.
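As an illustration of the kind of backwards-incompatible change to look for, scikit-learn 0.18 reorganized its cross-validation utilities: the sklearn.cross_validation module is deprecated in favor of sklearn.model_selection, so code written against 0.17 typically needs its imports updated. A minimal sketch on a toy dataset:

    # scikit-learn 0.17 style, deprecated in 0.18:
    # from sklearn.cross_validation import train_test_split, cross_val_score

    # scikit-learn 0.18 style:
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    print(cross_val_score(LogisticRegression(), X_train, y_train, cv=3).mean())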

Version 4.0.9 - October 3rd, 2017

Datasets

  • New feature: ElasticSearch: Add SSL support

  • Fix charts on ElasticSearch dataset

Recipes

  • Fix ‘contains’ and ‘startsWith’ operators if the expression has special characters in it.

Migrations

  • Fix migrations if there is a space in the java path

Machine Learning

  • Fix usage of “Class rebalanced” sampling with “Explicit extract from the dataset” mode

Misc

  • Small fixes in scenario API

  • New feature: Add support for Debian 9

  • SAML SSO: allow unsigned assertions in the IdP response if the response itself is signed

  • Fix dependencies installation on Red Hat 6

  • Fix download charts on Chrome >= 60

Version 4.0.8 - September 6th, 2017

Tutorials

  • Brand new “starter” tutorials (Basics, Lab & Flow, Machine Learning)

  • New tutorials on automation, deployment and SQL

Hadoop & Spark

  • Add support for Spark 2.2

  • Add support for Cloudera CDH 5.12

Flow

  • You can now specify multiple ranges of dates for partitions in the “Build” dialog

Datasets

  • ElasticSearch: Fix counting of shards when the index has fewer shards than the cluster

  • ElasticSearch: add warning if index/type not found, or if some documents were rejected

Misc

  • Security: Disable Jupyter terminals in Multi-User security mode as they are not impersonated

  • Fix “Project Activity” page that could fail to display due to timezone issues

  • Make sure the Python notebook cannot be disrupted by pre-installed Jupyter on the machine

Version 4.0.7 - August 24th, 2017

Data preparation

  • The Data preparation UI will now warn when trying to use a column that does not exist in a preparation step.

Datasets

  • Oracle: Fixed issue with the Oracle “undetermined” number type returned when doing “CASE WHEN”

  • S3: Fixed clearing of single-file datasets

  • Internal stats: fixed the “scenarios runs” view while a scenario is running

Machine Learning

  • Fixed Least Absolute Deviation and Huber loss functions for GBT regression

Hadoop/Spark

  • Make custom variables usable in Scala recipes used in a Spark pipeline

Misc

  • Fixed possible crash when aborting SQL queries from a notebook

  • Fixed ability to tune log rotation settings

  • Fixed erratic display of Flow on Firefox 55

  • Fixed the “Clean internal databases” macro

  • Improved Java GC behavior under certain kinds of memory pressure

  • Added missing system dependency for installing Sparklyr support

Version 4.0.6 - July 19th, 2017

DSS 4.0.6 contains bugfixes. For the details of what’s new in 4.0, see below.

Data preparation

  • Added new “coalesce()” function to formula language

Datasets

  • Fixed error in some specific cases of using GCS connector

  • Fixed possible job failure when building large number of partitions on a HDFS dataset

Machine learning

  • Improved display of “count vectorization” and “TF/IDF vectorization” in decision trees

Recipes

  • Fixed possible error in scoring recipe with large schemas

  • Fixed various issues with capturing of “NULL” in split recipe

Dashboards

  • Added “download” button to charts insight view

Security

  • Fixed a few wrong comments in multi-user-security setup

  • Fixed some edge cases with SPNEGO authentication

Migration

  • Fixed potential migration bug where migration from version 3.1.3 and below could fail in some specific use cases of metrics usage with “Integer out of range” errors.

  • Fixed potential migration bug where migration could fail with “timeNanos out of range” error.

Version 4.0.5 - June 22nd, 2017

DSS 4.0.5 contains both bugfixes and major new features. For the details of what’s new in 4.0, see below.

Hadoop and Spark

  • New feature: HDFS connections can now reference any kind of HDFS URL, not only paths on the default FS. This makes it possible to read s3://, s3a://, wasb://, adl:// and others through HDFS connections. Credentials are passed through additional per-connection properties (Limitation: using S3 as a HDFS filesystem is not supported on MapR)

  • New feature: Add support for LDAP authentication and SSL on Impala, add more options for custom Impala URLs

  • Fix reading S3 datasets in Spark in multi-user-security mode

  • Make sure that we properly use the S3 fast path, even in single-user-security mode

  • Fixed support for s3a:// URLs on EMR 5.5

  • Add support for custom HiveServer2 URLs

  • Fixed creation of Hive tables when complex types have nested names with special characters

  • Added warnings when trying to use invalid Hive database and table names

  • Don’t print useless warnings when reading Spark-generated Parquet files

  • Fixed Spark pipelines on EMR 5.4 and above

  • Fixed Spark pipelines with partitioned datasets in Spark 2.X

  • Fixed reading of foreign datasets in function mode in Scala recipe with Spark 2.X

  • Fixed rare issue with reading datasets in PySpark

  • Added ability to set default value for “write datasets using Impala” in Impala and visual recipes

  • Impala: LEADDIFF / LAGDIFF on an “int” column now properly generate a “bigint” column

  • Fixed processing of multi-dimension partitioned datasets with Spark

Machine Learning

  • New feature: Isolation Forest, for anomaly detection

  • New feature: Feature selection (filter, LASSO-based or tree-based) and reduction (PCA)

  • Updated H2O version, add support for H2O on Spark 2.1

  • Add support for H2O on CDH 5.9 and above

  • Clustering: fix wrong results when “Drop rows” is used for handling missing values

  • Fixed non-optimized scoring with multiple feature interactions on the same columns

  • Fixed optimized scoring with numerical derivatives

  • Fixed optimized scoring of partitioned datasets (was scoring the whole dataset)

  • Fixed SQL scoring with multiclass and impact coding

  • Fixed categorical feature interactions with non-ASCII column names

  • Properly disallow SQL scoring if there is a preparation script

  • API node: fixed enrichment with Oracle and SQLServer

  • Fixed “max features” selection in Random Forests and Extra Trees algorithms

  • Properly display actual number of trained estimators in XGBoost in case of early stopping

  • Preprocessed feature names are now displayed

  • Properly warn when export to Python notebook is not supported

  • Fix Python notebook export of XGBoost, SVM, SGD, and custom models

  • Fixed icon of the evaluation recipe

  • Fixed UI issue on tol and validation params for ANN algorithm

Datasets

  • New feature: Add support for authentication on Elasticsearch datasets

  • New feature: Beta support for Exasol

  • ElasticSearch: fixed failure with uppercase type names and type names with special characters

  • Fixed silent failures when uploading files that are rejected by a proxy

  • Don’t try to use Impala for metrics when a dataset has complex types (unsupported by Impala)

  • Fixed percentage display issues in analyze

  • Show computation errors when refreshing count of records from the dataset’s right contextual bar

  • Teradata: Fixed reading of SQL “DATE” fields

  • Let user choose whether SQL dates should be parsed as DSS dates

  • Fixed writing datasets with Excel format

  • Fixed handling of multiple “post-write” statements, when run from SQL recipes

Recipes

  • New feature: Standard deviation in grouping and window recipes

  • Add automatic translation to SQL of “and” and “or” in filter formulas

  • Grouping and Window recipes: Fix postfilter with output column name overrides

  • Invalid computed columns will not break engine selection anymore

  • Fixed copy of SparkSQL recipes

  • Fixed bad handling of NULL values in Filter and Split recipes in SQL mode (NULL values were not taken into account in “other values”)

  • Join recipe: don’t lose complex type definition on retrieved columns

  • Fixed refresh of “OK / NOK” indicator on pre and post filters on several recipes

  • Proper warning in join recipe when trying to join on a non-existing column

  • Sync from S3 to Redshift: add ability to use IAM role instead of explicit credentials

  • Fixed postfilter on window recipe on DSS engine

  • Don’t fail if invalid engines are added to the list of preferred engines

  • Make sure that the default query in Impala recipes always works out of the box, even with multiple databases

  • Impala recipe: show substitution variables even if query fails

  • SQL, Hive, Impala recipes: add variables for “database/schema”

  • Don’t use forbidden engines, even when there are only forbidden engines

  • Fixed partitioning in split recipe with SQL engine

  • Fixed UI issues in stack recipe when the same dataset is used several times

  • Fixed Hive->Impala recipe conversion

  • Fixed UI issues in “Custom Python” dependencies

Automation

  • Fix Python API to send messages from custom Python scenarios/steps

  • Fixed code editor sizing on custom Python and SQL steps

  • Add minute resolution on time-based triggers

  • A broken scenario (because its run-as user does not exist) does not impact other scenarios anymore

Notebooks

  • Added support for project variables in Scala notebooks

Data preparation

  • Show more matching column names in typeahead suggestions

Security

  • New feature: Added support for SAML SSO

  • New feature: Added support for SPNEGO SSO

  • New feature: Added ability to have expiring sessions

  • New feature: Added ability to enforce a single session per user

  • New feature: Added ability to restrict visibility of users and groups (to only the users in your groups)

  • New feature: Added ability to customize X-Frame-Options, Content-Security-Policy, X-XSS-Protection and X-Content-Type-Options headers

  • Fixed: only moderators may save non-owned dashboards

  • Fixed LDAP groups that were not available in connections security screen

  • Multi-user-security: fixed the case when UNIX user name is not the same as the Hadoop short user name

  • Multi-user-security: fixed Pyspark notebooks in some combination of Hadoop umasks and group memberships

Misc

  • Performance improvements in internal databases

  • Homepage listing does not impact other users’ performance anymore

  • Add ability to select a subset of columns in Python’s iter_rows method

  • Various UI fixes

  • Added check for Pandas version, to warn against unsupported Pandas upgrades

  • install-R-integration: added ability to override CRAN mirror

  • Fixed possible “URI too long” issue in dataset “Share” window

  • Fixed possible “URI too long” issue in plugins with “fully custom forms”

  • Check for SELinux when installing

  • Add ability to clear internal databases with a time limit

  • Webapps: add ability to disable the Python backend

  • Fixed very rare possibility of data loss when the filesystem is having issues

  • Fixed fields that were wrongly marked as mandatory in SQL connection screens

  • Fixed possible nginx crash when webapps failed to initialize

  • Fixed default todo list on new projects

Version 4.0.4 - April 27th, 2017

DSS 4.0.4 is a bugfix release. For the details of what’s new in 4.0, see below.

Datasets

  • New: Add compatibility with ElasticSearch 5.2 and 5.3

  • New: Add support for reading DATE columns in ORC files

  • New: BETA support for Snowflake database

  • New: Add support for Amazon S3 Server-Side Encryption

  • Fix failure in Azure Blob connector

  • Fix SQL splitting in PostgreSQL that could cause “No match found” error in SQL recipes

  • SQL datasets: Fix quoting of partitioning column names

Hadoop and Spark

  • New: Add support for MapR 5.2 with MEP 3.0

  • New: Add support for HDP 2.6

  • New: Add support for CDH 5.11

  • Fix a bug in direct Spark-S3 interface when using EMRFS mode with implicit credentials

  • Fix null/empty mismatch in non-HDFS datasets on Spark

Machine learning

  • New: Ability to see either rescaled or raw coefficients in regression

  • New: Add support for Vertica 8.0 AdvancedAnalytics

  • UI improvements in Lasso path analysis

  • Fix failures in grid search on regression models

Automation

  • New: Add a new view of all triggers across instance

  • Performance improvements on instance scenario views

  • Fix sort of bundles list

  • Show conflicts indicator on scenarios

Flow and recipes

  • Fix Spark pipelines when Pyspark or SparkR recipes are present (not pipelineable)

  • Truncate overly long pipeline names that can make Spark pipeline jobs fail

  • Fix naming issues in sync recipe that caused issues when an input column was named “count”

  • Fix SQL recipe failure on some databases if the query ends with a comment

Data preparation

  • Fixed ability to insert a custom projection system definition in coordinate system processor

  • Fix broken handling of “Others” columns in Pivot processor

Notebooks and webapps

  • Fixed bad redirect after creation of a webapp with a _ in the name

  • Fixed custom JDBC notebooks in Impala mode (not recommended)

Dashboard

  • Fix error when reading information of an insight whose source was deleted

  • Fix permission issue on charts for explorers

  • Fix mismatches when copying a slide

Misc

  • New: Add support for Amazon Linux 2017.03 and Ubuntu 17.04

  • Small UI fixes

  • Add ability to restore macro settings to default

  • Performance improvements on data catalog

  • More ability to tune data catalog indexing

  • Fix too strict permission check for managing exposed elements

  • Fix error on home page when projects end with _

  • Various performance and observability improvements

  • Fix load of Intercom widget on very slow networks

  • Fix dataiku.Dataset.get_config() Python API

Version 4.0.3 - March 27th, 2017

DSS 4.0.3 is a bugfix release with several new features. For the details of what’s new in 4.0, see below.

Machine learning

  • New feature: Lasso-LARS regression for automatic selection of a given number of features in a linear model

  • New feature: Ability to generate new “interaction” features by combining two existing features.

  • New feature: Partial dependency plots are now available for Random Forest and Decision Tree models (regression only)

  • Better scoring performance for models with large number of columns

  • Fix scikit-learn multiclass logistic regression in multinomial mode

  • Fix scoring of probability-aware custom models

  • Fix support of unlimited-depth tree models

Datasets

  • Fix: don’t fail when the explore sampling had partitions selected and dataset was unpartitioned

  • Azure: fix support for files with double extension (like .csv.gz)

  • Azure: fix prepare recipe when target is another filesystem

  • Fix support for Tableau export plugin

  • Always allow the “files in folder” dataset, regardless of license

  • Fix live charts on Vertica and SQL server

  • Fix computation of statistics on whole data when there are empties

  • Allow non-standard ports for SSH connections

Webapps

  • Fix ability to edit API key settings

Recipes

  • Split recipe: Fix ability to add a new dataset from the recipe settings

  • Group and Window recipes: fix editing of aggregations

  • Join recipe: fix ability to replace inputs

  • Prepare recipe: fix display of Hadoop options for MapReduce engine

Data preparation

  • Fix JSONPath extractor in “single result” mode

Automation

  • Fix SQL probes executed on Hive

  • Strong performance improvement on saving metric values with very large DSS installs

  • Fix dsscli on the automation node

  • Fix “Run notebook” step in scenarios

  • Fix “add checks” link

  • Don’t lock DSS while computing metrics from the public API

  • Fix SQL probe plugins

Administration and security

  • New feature: public API to list and unload Jupyter notebooks

  • New feature: Project leads can now allow arbitrary users to access the dashboards

  • Project administrators may now export datasets without explicit permission

  • Don’t fail if empty values are added for preferred and forbidden engines

  • Fix scenario link in background tasks monitoring

  • Show task owner in background tasks monitoring

  • Fix saving of Hive execution config keys

  • Fix display of connected users

  • Prefer using Hive or Impala for counting number of records

Performance

These fixes mostly concern responsiveness of DSS UI for very large installations (in number of users, projects, datasets, …)

  • Strong performance improvements for home page display

  • Performance improvements for flow page, datasets list, dataset page, recipe creation, analysis page

  • Improved performance for Hive metastore synchronization, especially for large Hive databases

Misc

  • Performance improvements for metastore synchronization

  • Make more things configurable for Data Catalog index

  • Fix dashboard save failure

  • Force Python not to try to connect to Internet during installation

  • Fix memory leak in scenarios that could lead to DSS crash after several days when a large number of scenarios are active

  • Improve capabilities of “Search” in objects lists

  • Fix typos and small UI issues

  • Fix possible hang while listing Jupyter notebooks

Version 4.0.2 - March 1st, 2017

DSS 4.0.2 is a bugfix release with minor new features. For the details of what’s new in 4.0, see below.

Data preparation

  • New feature: it is now possible to re-edit date parsers with Smart date.

  • Smart date: new formats are detected and guessed

  • Smart date: ignore some very unlikely formats

  • Smart date: UI improvements

  • Fixed invalid reset of filters

  • Fixed display of column popup on prepare recipe

  • Sort on non-existing column does not create empty columns anymore

Datasets

  • Fix miscounting of rows for Parquet and ORC file formats (could lead to smaller than expected samples)

  • Add mean and stddev to full-data-analysis of date columns

  • Show count and percentage of top values in full-data-analysis (numerical tab)

  • Fixed drop down to select meaning in column view

  • Various UI improvements on columns view

  • Performance improvements for metrics computation with many partitions

  • Make it possible to select port on FTP connections

Machine learning

  • Partial dependencies: fix display when feature name contains ‘:’

  • Partial dependencies: Add a text filter for features

  • Fixed number of estimators for Extra Trees

  • Add missing “partitions selection” menu in Explicit extract policies

  • Fixed computation of cluster size on MLLib

Dashboards

  • Add ability to export dataset in “dataset table” insight

Recipes

  • Add ability to change Inputs / Outputs on Sync recipe

  • Fix display of pre-filters in join recipe

  • Bugfix on join recipe creation

  • Fix partition dependencies tester with multiple partitioned datasets

  • Fixed selection of grouping keys in grouping recipe

Hadoop & Spark

  • Warn when trying to use Spark engines on HDFS datasets that are not compatible with Spark fast path

  • Faster Hive metastore synchronization for partitioned datasets with lots of partitions

  • Fix pipelining of split recipes (not pipelineable)

  • Added ability to customize HiveServer2 URL

Administration & Monitoring

  • New feature: Add a view on scenario runs in the internal stats dataset

  • Fix possible hang when reporting to a non-responding Graphite server

  • Don’t let users create connections with no name

Setup and migration

  • Migration from 3.X: Don’t force DSS engine when output is Redshift

  • Fixed ability to select LDAP as authentication source

API

  • New feature: Ability to set CORS headers on public API.

  • Fixed datasets set_metadata call

  • Fixed recipe get_recipe_and_payload in Python wrapper

Misc

  • New feature: Project consistency check is now available as a scenario step

  • New feature: Add ability to export macros results (to CSV, Excel, dataset, …)

  • New design for “Mass actions” button

  • Fix “last modified” date on analysis list

  • Work-around a Websocket deadlock in Jetty (https://github.com/eclipse/jetty.project/issues/272) that could hang DSS

  • Various performance improvements

Version 4.0.1 - February 16th, 2017

DSS 4.0.1 is a bugfix release. For the details of what’s new in 4.0, see below.

Setup

  • Fixed migration from 3.X when recipe names or dataset names contain accented characters

  • Fixed migration of 3.X instances where no action had been performed

  • Fixed incorrect exit status for failed migrations

  • Added more information to diagnosis reports

Datasets

  • Fixed BigQuery datasets

Explore

  • Added ability to do analysis on full data for date types

R

  • Implemented append mode in R recipes and notebooks

Hadoop

  • Fixed writing of map and object fields to Parquet files

  • Fixed Hive icon in Flow

Spark

  • Implemented filtering of datasets in the Spark (Python, R, Scala) API

  • Fixed ability to use foreign datasets in Spark recipes

UI

  • Added “last modified” information to visual analysis

  • Better warnings when trying to use invalid dataset names

  • No more messages when building charts when you only have read access on a project

  • Fixed creation of recipes in recipe copy

Dashboards

  • Fixed too strict permissions in Jupyter insights

  • Fixed too strict permissions in scenario run insights

Misc

  • Performance enhancements in data catalog

  • Fixed notifications on project editing

  • Verify foreign dataset permissions when building jobs and training models

  • Better error reporting for empty jobs

  • Improve code snippets here and there

  • Better audit logging for job events

  • Fix download of files bigger than 2GB from folders

Version 4.0.0 - February 13th, 2017

DSS 4.0.0 is a major upgrade to DSS with a lot of new features and major architectural changes. For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew

New features

New dashboard

DSS 4 features a completely redesigned dashboard with greatly expanded capabilities.

  • The dashboard now uses a 12-cell responsive grid and a new layout engine that makes it much easier to move content around. The dashboard UX has been strongly overhauled.

  • You can now have multiple dashboards per project. Dashboards can be either personal or public. Each dashboard can have multiple slides, and a dashboard can be put in fullscreen mode

  • Dashboard-only users, who don't have access to the full project content, can now create their own dashboards. A new authorization system lets data analysts choose which parts of the project are available to dashboard-only users.

  • A new system for publishing project items to dashboards

  • Many features have been added on published charts (show/hide axis, tooltips, legends, filters, ability for readers to set their own filters, …)

  • Many features have been added on published datasets (show/hide headers, select columns, show colorization, ability for readers to set their own filters, …)

  • Jupyter notebooks published on the dashboard can now have multiple versions, even concurrently.

  • New kinds of publishable insights have been added: saved model reports, DSS metrics, project activity info, and object activity feeds.

  • You can now add rich text, images, URLs and web pages directly on the dashboard.

For more information, see Dashboards.

Multi-user security

The regular behavior of DSS is to run as a single UNIX account on its host machine. When a DSS end-user executes a code recipe, it runs as this single UNIX user. Similarly, when a DSS end-user executes a Hadoop recipe or notebook, it runs on the cluster as the Hadoop user of the DSS server.

DSS now supports an alternate mode of deployment, called multi-user security. In this mode, DSS impersonates the end-user and runs all user-controlled code under that end-user's identity.

For more information, see User Isolation

Spark 2

DSS is now compatible with Spark 2.0. In addition, experimental support for Spark 2.1 is provided (preview only).

For more information, see DSS and Spark

Spark pipelines

When several consecutive steps in a DSS Flow (including with branches or splits) use the Spark engine, DSS now has the ability to automatically merge all of these recipes and run them as a single Spark job. This strongly boosts performance by avoiding needless writes and reads of intermediate datasets, and also alleviates Spark startup overheads.

The behavior of intermediate datasets can be configured by the user: they can be written or not (if not, only the final datasets are written).

For more information, see Spark pipelines

Sparklyr

DSS now supports integration with sparklyr, an alternative API for using Spark from R code. The sparklyr integration coexists with the SparkR integration. Both APIs are usable in recipes and notebooks.

For more information, see Usage of Spark in DSS

Interactive hierarchical clustering

DSS now features a hierarchical clustering model. It has the unique feature of being “interactive”: rather than setting a fixed number of clusters, you can edit the hierarchy of clusters after training.

For example, if DSS has chosen to keep two clusters, but by studying them you notice that the difference is not relevant to your problem, you can merge them. Conversely, if a cluster contains two subpopulations with relevant differences, you can split it to make deeper clusters.

Quick models

DSS now includes a set of pre-configured “model templates”. When you create a new model, you can now choose what kind of models you want:

  • Very explainable models (based on simple decision trees or linear formulas)

  • The most performant models, with highly cross-validated algorithms and a wide search for optimal parameters

  • Models geared toward finding the most insights in your datasets (by fitting different kinds of algorithms)

You can still set all settings of all kinds of algorithms, but quick models allow you to get started faster with common business requirements.

Distributed and in-database scoring

For most models created in DSS visual machine learning (with Python or MLLib backends), you can now run scoring recipes:

  • In distributed mode, on Spark

  • In SQL databases, without data movement

This new feature strongly improves the scalability of applying machine learning models.

Notifications & Integrations

The notifications system in DSS has been greatly overhauled to adapt better to larger teams.

  • You can now “watch” every kind of object in DSS (dataset, recipe, analysis, whole project, …) and get notified when updates are available (someone modified the recipe, a new comment has been posted, …). In your profile, you can edit which objects you watch.

  • A brand new “personal” drawer (click on your user image) shows all of your notifications, your profile, …

  • In addition to receiving your notifications in your personal drawer, each user can also choose to receive “offline” summaries (what happened on your watched objects while you were away from DSS) or daily digests (each morning, get a summary of what happened on your watched objects).

  • DSS can push notifications to third party systems. Slack and Hipchat integrations are provided and more will follow. You can also connect DSS with Github so that commit messages in DSS can close Github issues.

  • A new “activity” drawer shows all your running activities (jobs, scenarios, notebooks, webapp backends, macros, and other long tasks).

New prebuilt notebooks

DSS now comes with 4 new prebuilt notebooks for analyzing datasets:

  • Distribution analysis and statistical tests on a single numerical population

  • Distribution analysis and statistical tests on multiple population groups

  • High-dimensionality data visualization using t-SNE

  • Topic modeling using NMF and LDA

Sort in explore

The explore view (also the data preparation view in analysis and the prepare recipe) can now be sorted on one or more criteria.

This sort is only visual and applies only to the sample; the original data is not sorted.

Analyze on whole dataset in explore

The “Analyze” view in Explore can now be based on the whole dataset (in addition to the exploration sample). This is available on all dataset types and will automatically run in database, Hive or Impala depending on the type of dataset and available engines. See Analyze for more information.

New data sources

DSS can now connect to:

  • Google Cloud Storage (read and write)

  • Azure Blob Storage (read and write)

Audit trail

DSS now includes a full applicative audit trail of all activities performed by all users. With appropriate configuration, this audit trail is non-repudiable: even if a user manages to compromise DSS, traces leading up to the compromise cannot be altered.

Macros

Macros are predefined actions that allow you to automate a variety of tasks, like:

  • Maintenance and diagnostic tasks

  • Specific connectivity tasks for import of data

  • Generation of various reports, either about your data or DSS

Macros can run either manually or from a scenario. Some macros are provided as part of DSS; others can come from a plugin or be developed by you.

For example, the following macros are provided as part of DSS:

  • Generate an audit report of which connections are used

  • List and mass-delete datasets by tag filters

  • Clear internal DSS databases

  • Clear old DSS job logs

More information is available at DSS Macros

Sample and prepare memory limits

The DSS administrator can now set a maximum memory size for design samples, and for the in-memory representation of intermediate steps in a visual preparation recipe or analysis. This strongly increases the stability and resilience of DSS, especially when users try to create huge design samples.

Limits are configured in Administration > Settings > Limits

Other notable enhancements

Machine learning

  • A new faster scoring engine has been implemented. Scoring recipes and API node will be much faster. They can also run on Spark or in-database.

  • The API node can now run models trained with Spark MLLib

  • A new “evaluation” recipe allows you to evaluate the performance of a model (getting all performance metrics) on any labeled dataset, independently from the training process.

  • New algorithm: Artificial Neural networks (multi-layer perceptron) for Python backend

  • New algorithm: KNN (K-Nearest-Neighbors) for Python backend

  • New algorithm: Extra Trees for Python backend

  • Impact coding preprocessing is now available for MLLib models

  • Clustering result screens: you can now edit cluster details from all screens

  • Clustering result screens: the heatmap can now display categorical variables, provides more sorting options, and provides multiple export formats for further analysis of significant clustering variables

  • The random seed can now be fixed for clustering models

  • Many more parameters can be grid-searched

  • Custom models without probabilities are now supported

  • Improved snippet auto description of models

Sampling

New sampling modes have been introduced:

  • Exact “random count of records”: get exactly the count you asked for

  • “Last records”

  • Stratified sampling versus a target column

Note that some of these sampling methods are only available for explore, analyze and prepare recipes, not in the sampling recipe.

In addition:

  • “Random count of records” sampling is now up to 2 times faster.

  • It is now possible to define a filter within the sampling.

  • It is now possible to use “last N partitions” as a partition selection method in sampling

For more information, see:

Charts

  • Charts engine selection has been overhauled to be more predictable: you now first choose your charts engine, and can then choose compatible sampling and charts.

  • It is now possible to set the line width for line charts

Coding recipes

  • Advanced options and statements splitting capabilities have been added to the SQL Query recipe. See SQL recipes for more information. This makes it easier to do advanced things like stored procedures or CTEs in SQL recipes.

  • SQL script recipes can now automatically infer output schema, like SQL query recipes. See SQL recipes for more information

  • SparkSQL recipes can now use the global Hive metastore, alternatively to using only the local datasets. See SparkSQL recipes for more information

  • You can now disable validation of code prior to running recipes. This is useful for some kinds of recipes where validation can be very slow.

Visual recipes

  • The Sync, Filter/Sample and Split recipes can now run on Spark, Hive, SQL and Impala

  • The window recipe can now work on any kind of dataset, even if you don’t have Spark.

  • Administrators can now set preferred engines and blacklisted engines

  • The Join recipe can now be configured to automatically select all columns of some datasets, even when their schema changes.

  • The join and stack recipes can now automatically downcast columns to match types

Security

  • In addition to multi-user security and the audit trail, the permissions system has been overhauled and new permission definitions have been introduced. You can now define finer-grained group permissions at the project level. See Main project permissions.

  • More options are available for sharing items between projects, and authorizing objects on dashboards. See /security/exposed-objects and Workspaces & dashboards authorizations.

  • User profiles can now be set directly from LDAP groups.

  • The details of connections can now be made available to some groups, who can then use them in recipes

  • Connection passwords are now encrypted on disk using a reversible encryption scheme

Datasets

  • Administrators can now set preferred connections and file formats when creating new managed datasets.

  • It is now possible to import SQL tables as SQL datasets from the DSS UI, without being an administrator. Go to New dataset > Import from connection. This is also possible for Hive tables.

  • When a HDFS dataset has been imported from an existing Hive table, it is possible to “update” the definition of the dataset from the associated Hive table definition in the Hive metastore

  • The filtering infrastructure (used in filter recipe, for filtering in sample, in APIs, …) now more directly translates user-defined filters to SQL. This provides more efficient filtering in SQL and less timezone-related issues.

  • Support for ElasticSearch 5 has been added

  • It is now possible to define a dataset based on the files in a DSS managed folder.

  • The “internal stats” dataset now includes the ability to view job information and build information

  • Teradata connections can now be put in “autocommit” mode, which makes it much easier to write DDL statements, use or write stored procedures, or use third-party plugins.

  • In-database charts are now available for Teradata.

  • Teradata datasets now more often avoid exceeding the Teradata maximum row size limit

  • Sorting, searching, and mass actions have been added to the schema editor

Library editor and per project library

You can now write or add your own Python modules and packages in per-project library paths, in addition to the global “lib/python” one. Both the global and per-project “lib/python” folders can be edited directly from the DSS UI.

You can also edit a new “lib/R” folder, which can be used to import R source files.
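A minimal sketch of using a per-project Python library from a recipe, assuming a hypothetical helper module project_utils.py has been saved in the project's lib/python folder (the dataset names are placeholders):

    import dataiku

    # Hypothetical module placed in the project's lib/python folder;
    # DSS adds that folder to the Python path for this project's code
    from project_utils import normalize_columns

    # Placeholder dataset names
    df = dataiku.Dataset("my_input_dataset").get_dataframe()
    df = normalize_columns(df)
    dataiku.Dataset("my_output_dataset").write_with_schema(df)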

Hadoop & Spark

  • It is now possible to import Hive tables as HDFS datasets from the DSS UI, without being an administrator. Go to New dataset > Import from connection.

  • The Spark-Scala recipe now features a new “function” mode which allows the recipe to be part of a Spark pipeline

  • You can now run SparkSQL recipes against the global Hive metastore. Note that this disables automatic validation.

  • You can now manage multiple named Hive configurations, used to pass additional Hive parameters on recipes and notebooks

Data preparation

  • The “Round” processor can now round to a fixed number of decimal places

  • The “Pivot” processor can now keep repeated values

  • New meaning: “Currency amount” (i.e. a currency symbol and a numeric amount), with an associated processor to split currency and amount. This is particularly useful in conjunction with the existing currency converter processor

  • The holidays database has been updated and improved

  • User agent parsing has been updated and improved

Flow

  • The “Consistency check” Flow tool has been greatly enhanced. It can now check many more kinds of recipes, and perform more structural checks

  • A new “engines” Flow view lets you easily see which engine (DSS, SQL, Hadoop, Hive, Spark, …) each of your recipes runs on.

  • You can now copy recipes

  • You can now change the input / output datasets of all recipes

Version control

DSS now features the ability to roll back configuration changes from the UI. We advise great care when rolling back changes.

You can also manage “Git remotes” directly from the DSS UI, including pushing to remotes. The public API features a new method to push to remotes.

API

In addition to the previously existing project-specific and global API keys, DSS now features “personal” API keys. Personal API keys have the same rights as their owner. In some setups, creating datasets and recipes can only be done using Admin or Personal API keys.

The internal API (dataiku package) can now automatically call the public API. To obtain an API client, use dataiku.api_client().
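For instance, a short sketch of calling the public API from inside a DSS recipe or notebook (no host or API key is needed, since the client is built from the current DSS context):

    import dataiku

    # Obtain a public API client from within DSS code
    client = dataiku.api_client()

    # The returned client exposes the usual public API methods, e.g. listing projects
    for project in client.list_projects():
        print(project["projectKey"])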

The public API now includes methods for:

  • Getting and setting general DSS settings

  • Managing installed plugins

  • Monitoring and aborting “futures” (long-running tasks)

  • Getting metrics

For more information, see Public REST API

Plugins

Plugins can provide Python modules that can be imported into Python code with a dedicated API.

Webapps

Webapps now live as new top-level objects, alongside code notebooks.

Python backends have been strongly overhauled with:

  • Ability to start automatically with DSS

  • Impersonation ability

  • Automatic restart in case of crash

  • Centralized monitoring screen (Administration > Monitoring > Webapp backends)

Monitoring

Administrators now have better overviews of everything that's running in a DSS instance, with more information to relate activities to processes (pid, Jupyter kernel id, …)

Installation and setup

  • Added support for Ubuntu 16.10

  • Removed support for Ubuntu 12.04

Misc

  • New options are available for making datasets “relocatable”, easing copying and reimporting projects, while avoiding conflicts between projects. See Making relocatable managed datasets.

  • Mass actions are now available in many more locations in DSS (objects list, features screen, prepare column view, schema editor, …)

  • A lot of general performance improvements, especially for large numbers of users

  • Project export/import will now preserve timelines

  • Added rotation of nginx access logs

Notable bug fixes

Performance and stability

  • DSS could become unresponsive while deleting a dataset or a project if the remote data source was unreachable or the Hive metastore server hung. This has been fixed.

  • Browsing HDFS connections was very slow and could make DSS unresponsive. This has been fixed.

  • Performance of various UI parts with wide datasets (1000+ columns) has been strongly improved

  • With a large number of users, the notifications system could strongly slow down DSS. This has been fixed.

  • DSS could become unresponsive while testing a dataset if the remote data source was not answering. This has been fixed.

  • Fixed excessive logging in various parts of DSS

Datasets

  • It is now possible to set the MongoDB port in the UI

  • Re-added ability to append on HDFS datasets (depending on the recipe)

  • Don’t fail when a partitioned SQL dataset contains null values in partitioning column

Data preparation

  • Removing a column used for coloring output table will not cause an error anymore

  • Currency converter does not throw errors in “fixed currency” mode

  • Added val() method to handle columns with dots in formulas

  • Fixed various caching issues that led to poor performance in some cases

Recipes

  • Creating a visual recipe from a partitioned dataset will now properly respect the “Non partitioned” setting (when creating the modal)

  • Changing the name of partitioning columns, or partitioning or unpartitioning datasets, is now much better handled.

  • Various issues around cases where partitioning columns must or must not be in the schemas have been fixed. This notably fixes redispatching of partitions when writing to a HDFS dataset.

  • Scoring recipe with preparation steps using additional datasets has been fixed

  • Filtering on dates has been fixed for several databases (Oracle, Teradata, …)

  • Join recipe with contains / ignore case has been fixed for Redshift and Impala

  • Fixed “Rename columns” feature in grouping recipe

  • Fixed various issues in sampling recipe

  • Fixed “distinct” pre and post filters in Window recipe on Impala and Hive engines

Charts

  • Fixed handling of meanings for charts: setting a meaning in the explore view is now properly taken into account for charts.

  • Fixed display of hexagonal binning charts on dashboard tiles

  • Added tooltips on pivot table charts

Automation

  • An overly long trigger no longer causes other scenarios to hang

  • Fixed failures in custom Python step when too much data is returned

Hadoop & Spark

  • Using reserved Hive names like “date” as partitioning column name in HDFS datasets will not cause issues anymore

Flow

  • Fixed “propagate schema from Flow” on SQL datasets (string length issues)

  • Fixed type mismatch issues (strings instead of int) when propagating schema on some recipes

Machine learning

  • It is now possible to rename the “outliers” cluster

  • Fixed text features with MLLib when there are null values in the text column

  • Fixed updating of “Cost matrix” in ML model reports

  • Many fixes around training and scoring with “foreign” datasets (datasets from other projects)

API

  • Fixed issues in the API around creation and editing of users

Misc

  • Deleting a project now properly removes activity / timelines information for it

  • Fixed display of Python backend logs in webapps