DSS 3.0 Relase notes¶

Migration notes ¶

Warning

Migration to DSS 3.0 from a previous DSS 2.X instance requires some attention.

To migrate from DSS 1.X, you must first upgrade to 2.0. See DSS 2.0 Relase notes

Automatic migration from Data Science Studio 2.3.X is supported, with the following restrictions and warnings:

DSS 3.0 features an improved security model. The migration aims at preserving as much as possible the previously defined permissions, but we strongly encourage you to review the permissions of users and groups after migration.
DSS 3.0 now enforces the “Reader” / “Data Analyst” / “Data Scientist” roles in the DSS licensing model. You might need to adjust the roles for your users after upgrade.
DSS now includes the XGBoost library in the visual machine learning interface. If you had previously installed older versions of the XGBoost Python library (using pip), the XGBoost algorithm in the visual machine learning interface might not work
The usual limitations on retraining models and regenerating API node packages apply (see Upgrading a DSS instance for more information)
After migration, all previously scheduled jobs are disabled, to ease the “2.X and 3.X in parallel” deployment models. You’ll need to go to the scenarios pages in your projects to re-enable your previously scheduled jobs.

Automatic migration from Data Science Studio 2.0.X, 2.1.X and 2.2.X is supported, with the previous restrictions and warnings, and, in addition, the ones outlined in DSS 2.1 Relase notes, DSS 2.2 Relase notes, DSS 2.3 Relase notes

How to upgrade ¶

It is strongly recommended that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance

External libraries upgrades ¶

Several external libraries bundled with DSS have been bumped to major revisions. Some of these libraries include some backwards-incompatible changes. You might need to upgrade your code.

Notable upgrades:

Pandas 0.16-> 0.17
Scikit-learn 0.16 -> 0.17

From scheduled jobs to scenarios ¶

The 3.0 version introduces Scenarios, which replace Scheduled jobs.

Each scheduled job you had in 2.X, enabled or not, is transformed during the migration process into a simple scenario replicating the functionalities of that scheduled job:

the scenario contains a single build step to build the datasets that the scheduled job was building
the scenario contains a single time-based trigger with the same setup as the scheduled job, so that the trigger activates exactly with the same frequency and time point as the scheduled job

If the scheduled job was enabled, the time-based trigger of the corresponding scenario is enabled, and conversely. The scenarios themselves are set to inactive, so that after the migration none will run. You need to activate the scenarios (for example from the scenarios’ list), or take the opportunity to rearrange the work that the scheduled jobs were performing into a smaller number of scenarios; a single scenario can indeed launch multiple builds, waiting for a build to finish before launching the next one.

Since a scenario will execute the build corresponding to a scheduled job only when its trigger is active and the scenario itself is active, the quickest route to get the same scheduled builds as before is to activate all scenarios.

Version 3.0.5 - June 24th, 2016 ¶

This release fixes a critical bug related to Spark, plus several smaller bug fixes.

Spark ¶

Fix MLLib and Data preparation on Spark

Datasets ¶

Fix exception in JSON extractor with some specific cases of nested arrays

Machine learning ¶

Fix XGboost regression models when evaluation metrics is MAE, MAPE, EVS or MSE
Display grid search scores in regression reports

API node ¶

Fix various issues with data enrichment in “mapped” mode

Webapps ¶

Fix loading data from local/static

Recipes ¶

Fix validation of custom expressions in sample recipe

Automation ¶

Fix migration of scenarios from DSS 2.3 with partitions
Better explanations as to why some scenarios are aborted
Fix layout issues in scenario screens

Misc ¶

Fix mass tagging on Hive and Impala notebooks
Fixs on graph for job preview

Version 3.0.4 - June 16th, 2016 ¶

This release brings a lot of bug fixes and minor features for plugins.

Plugins ¶

Add ability to introduce visual separators in settings screen
Add ability to hide parameters in settings screen
Add ability to huse custom forms in settings screen

Production ¶

Add a metric for count of non null values
Add more metrics in the “data validity” probe
Expand capabilities for custom SQL aggregations
Add the ability to have custom checks in plugins
Use proxy settings for HTTP-based reporters
Fix and improve settings of the “append to dataset” reporter

SQL Notebook ¶

Make the spinner appear immediately after submitting the query
Fix error reporting issues
Fix reloading of results in multi-cells mode
Add support for variables expansion

Recipes ¶

Fix visual recipes running on Hive with multiple Hive DBs
Fix reloading of split and filtering recipe with custom variables

Machine learning ¶

Fix display of preparation step groups in model reports
Fix simple Shuffle-based cross-validation on regression models
Fix train-test split based on extract from two datasets with filter on test
Fix deploying “clustering” recipe on connections other than Filesystem
Add ability to disable XGBoost early stopping on regression

Datasets ¶

Fix renaming of datasets in the UI
Fix the Twitter dataset
Fix “Import data” modal in editable dataset
Fix reloading of schema for Redshift and other DBs

Data preparation ¶

Improved display of filters for small numerical values
Fix mass change meaning action
Add ability to mass revert to default meaning
Unselect the steps when unselecting a group
Fix UI issue on Firefox

Charts ¶

Add ability to have “external” legend on more charts
Fix several small bugs
Fix scale on charts with 2 Y-axis

Misc ¶

Fix issue with R installation on Redhat 6
Fix missing information in diagnostic tool
Fix import of projects with SQL notebooks from 2.X
Fix saving of summary info for web apps
Add dataset listing and schema fetching in web apps API

Version 3.0.3 - May 30th, 2016 ¶

DSS 3.0.3 is a bugfix release. For a summary of new features in DSS 3.0, see below.

Recipes ¶

Fix bug leading to unusable join recipe in some specific cases
Fix performance issue in code recipes with large number of columns

Metrics & Scenarios ¶

Fix history charts for points with no value
Fix possible race condition leading to considering some jobs as failed

Misc ¶

Fix various UI issues in read-only mode
Fix critical login bug
Fix “Disconnected” overlay on Monitoring page

Version 3.0.2 - May 25th, 2016 ¶

DSS 3.0.2 is a bugfix and minor enhancements release. For a summary of new features in DSS 3.0, see below.

Hadoop & Spark ¶

Preserve the “hive.query.string” Hadoop configuration key in Hive notebook
Clear error message when trying to use Geometry columns in Spark
Fix S3 support in Spark

Metrics & Checks ¶

Better performance for partitions list
Simplify and rework the way metrics are enabled and configured

Automation node & scenarios ¶

Add deletion of bundles
Remap connections in SQL notebooks
Fix scenario run URL in mails

Machine learning ¶

Fix wrongly computed multiclass metrics
Much faster multiclass scoring for MLLib
Fix multiclass AUC when only 2 classes appear in test set
Fix tooltip issues in the clustering scatter plot

API Node ¶

Fix typo in custom HTTP header that could lead to inability to parse the response
Fix the INSEE enrichment processor
Fix excessive verbosity

Data preparation ¶

Add a new processor to compute distance between geo points
Fix DateParser in multi-columns mode when some of the columns are empty
Modifying a step comment now properly unlocks the “Save” button

Visual recipes ¶

Fix split recipe on “exotic” boolean values (Yes, No, 1, 0, …)

Charts ¶

Add percentage mode on pie/donut chart

Misc ¶

Add new error reporting tools
Enforce hierarchy of files to prevent possible out-of-datadir reads
Fix support for nginx >= 1.10
Fix the ability to remove a group permission on a project

Webapps ¶

Automatically enable/disable the Save button
Warn if leaving with unsaved changes
Add history and explicit commit mode

Version 3.0.1 - May 11th 2016 ¶

DSS 3.0.1 is a bugfix release. For a summary of the major new features in DSS 3.0, see: https://www.dataiku.com/learn/whatsnew

Installation ¶

Added support for nginx >= 1.10

Connectivity ¶

Fixed “Other SQL databases” connections

Metrics & Checks ¶

Fixed ordering of partitions table
Default probes and metrics will now be enabled on migration from 2.X

Scenarios ¶

Improved description of triggers

Machine Learning ¶

Removed unapplicable parameter for MLLib
Improve explanations about target remapping in Jupyter export

Data preparation ¶

Fixed migration on groups
Multiple ColumnRenamer processors will automatically be merged

Misc ¶

Fixed display of Git diffs which could break
Fixed display of logs on Safari
Fixed tasks lists on projects
Added user-customized themes
“Read-only Analysts” can now fully view visual analysis screens
Added “project-import” and “project-export” commands to dsscli

Version 3.0.0 - May 1st 2016 ¶

DSS 3.0.0 is a major upgrade to DSS with exciting new features.

For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew

New features ¶

Automation deployment (“bundles”)¶

Dataiku DSS now comes in three flavors, called node types:

The Design node (the “classical” DSS), where you mainly design your workflows
The Automation node, where you run and automate your workflows
The API node (introduced in DSS 2.2), where you score new records in real-time using a REST API

After designing your data workflow in the design node, you can package it in a consistent artefact, called a “bundle”, which can the be deployed to the automation node.

On the automation node, you can activate, rollback and manage all versions of your bundles.

This new architecture makes it very easy to implement complex deployment use cases, with development, acceptance, preproduction and production environments.

For more information, please see our product page: http://www.dataiku.com/dss/features/deployment/

Scenarios¶

DSS has always been about rebuilding entire dataflows at once, thanks to its smart incremental reconstruction engine.

With the introduction of automation scenarios, you can now automate more complex use cases:

Building a part of the flow before another one (for partitioning purposes for example)
Automatically retraining models if they have diverged too much.

Scenarios are made up of:

Triggers, that decide when the scenario runs
Steps, the building blocks of your scenarios
Reporters, to notify the outside world.

You’ll find a lot of information in Automation scenarios, metrics, and checks

Metrics and checks¶

You can now track various advanced metrics about datasets, recipes, models and managed folders. For example:

The size of a dataset
The average of a column in a dataset
The number of invalid rows for a given meaning in a column
All performance metrics of a saved model
The number of files in a managed folder

In addition to these built-in metrics, you can define custom metrics using Python or SQL. Metrics are historized for deep insights into the evolution of your data flow and can be fully accessed through the DSS APIs.

Then, you can define automatic data checks based on these metrics, that act as automatic sanity tests of your data pipeline. For example, automatically fail a job if the average value of a column has drifted by more than 10% since the previous week.

Advanced version control¶

Git-based version control is now integrated much more tightly in DSS.

View the history of your project, recipes, scenarios, … from the UI
Write your own commit messages
Choose between automatic commit at each edit or manual commit (either by component or by project)

In addition, you can now choose between having a global Git repository or a Git repository per project

When viewing the history, you can get the diff of each commit, or compare two commits.

Team activity dashboards¶

Monitor the activity of each project thanks to our team activity dashboards.

Administrator monitoring dashboards¶

We’ve added a lot of monitoring dashboards for administrators, especially for large instances with lots of projects:

Global usage summary
Data size per connection
Tasks running on the Hadoop and Spark clusters and per database
Tasks running in the background on DSS
Authorization matrix for an overview of all effective authorizations

Other notable enhancements ¶

Project import/export¶

When exporting a project, you can now export all datasets from all connections (except partitioned datasets), saved models and managed folders. When importing the project in another DSS design node, the data is automatically reloaded.

This allows to export complete projects, including data.

When importing projects, you can also remap connections, removing the need to define connections with exactly the same name as on the source DSS instance.

Maintenance tasks¶

DSS now performs automatically several maintenance and cleanup tasks in the background.

Improved security model¶

We’ve added several new permissions for more fine-grained control. The following permissions can now be granted to each group, independently of the admin permissions:

Create projects and tutorials
Write “unsafe” code (that might be used to circumvent the permissions system)
Manage user-defined meanings

In addition, users can now create personal connections without admin intervention.

The administration UI now includes an authorization matrix for an overview of all effective authorizations

API¶

The public API includes new methods to interact with scenarios and metrics
The public API includes new methods for exporting projects

Data preparation¶

It’s now possible to delete columns based on a name pattern

Other changes ¶

DSS does not automatically grant Analyst access to the “first analysts group” when creating a project. After the creation of a project, only its creator (and the DSS administrators) can access it by default.