DSS 3.0 Release notes¶
Migration to DSS 3.0 from a previous DSS 2.X instance requires some attention.
To migrate from DSS 1.X, you must first upgrade to 2.0. See DSS 2.0 Release notes.
Automatic migration from Data Science Studio 2.3.X is supported, with the following restrictions and warnings:
DSS 3.0 features an improved security model. The migration aims to preserve the previously defined permissions as much as possible, but we strongly encourage you to review the permissions of users and groups after migration.
DSS 3.0 now enforces the “Reader” / “Data Analyst” / “Data Scientist” roles in the DSS licensing model. You might need to adjust the roles for your users after upgrade.
DSS now includes the XGBoost library in the visual machine learning interface. If you had previously installed an older version of the XGBoost Python library (using pip), the XGBoost algorithm in the visual machine learning interface might not work.
The usual limitations on retraining models and regenerating API node packages apply (see Upgrading a DSS instance for more information)
After migration, all previously scheduled jobs are disabled, to ease the “2.X and 3.X in parallel” deployment models. You’ll need to go to the scenarios pages in your projects to re-enable your previously scheduled jobs.
Automatic migration from Data Science Studio 2.0.X, 2.1.X and 2.2.X is supported, with the previous restrictions and warnings, and, in addition, the ones outlined in DSS 2.1 Release notes, DSS 2.2 Release notes, DSS 2.3 Release notes
It is strongly recommended that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance
Several external libraries bundled with DSS have been upgraded to new major revisions. Some of these upgrades include backwards-incompatible changes, so you might need to update your code.
Pandas 0.16 -> 0.17
Scikit-learn 0.16 -> 0.17
The 3.0 version introduces Scenarios, which replace Scheduled jobs.
Each scheduled job you had in 2.X, enabled or not, is transformed during migration into a simple scenario that replicates the behavior of that scheduled job:
the scenario contains a single build step to build the datasets that the scheduled job was building
the scenario contains a single time-based trigger with the same setup as the scheduled job, so that the trigger fires with exactly the same frequency and at the same time as the scheduled job did
If the scheduled job was enabled, the time-based trigger of the corresponding scenario is enabled, and vice versa. The scenarios themselves are set to inactive, so that none will run after the migration. You need to activate the scenarios (for example from the scenarios list), or take the opportunity to consolidate the work the scheduled jobs were performing into a smaller number of scenarios; a single scenario can launch multiple builds, waiting for each build to finish before launching the next one.
Since a scenario executes the build corresponding to a scheduled job only when both its trigger and the scenario itself are active, the quickest route to getting the same scheduled builds as before is to activate all scenarios.
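As a mental model, the mapping described above can be sketched in plain Python. This is illustrative only: these classes are not the DSS API, and the scheduled-job fields (`name`, `frequency`, `time`, `enabled`, `datasets`) are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TimeTrigger:
    frequency: str            # e.g. "daily", mirroring the scheduled job
    at: str                   # e.g. "02:00", the job's firing time
    enabled: bool = False

@dataclass
class Scenario:
    name: str
    steps: list = field(default_factory=list)     # build steps
    triggers: list = field(default_factory=list)
    active: bool = False                          # migrated scenarios start inactive

def migrate_scheduled_job(job):
    """Map a hypothetical 2.X scheduled-job dict onto a 3.0-style scenario."""
    trigger = TimeTrigger(job["frequency"], job["time"], enabled=job["enabled"])
    return Scenario(
        name="Scheduled job: " + job["name"],
        steps=[("build", job["datasets"])],       # one build step for the job's datasets
        triggers=[trigger],
        active=False,                             # nothing runs until you re-activate it
    )
```

Note how the trigger inherits the job's enabled flag while the scenario itself stays inactive, which is why nothing runs until you explicitly activate the scenarios.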
This release fixes a critical bug related to Spark, plus several smaller bug fixes.
Fix XGBoost regression models when the evaluation metric is MAE, MAPE, EVS or MSE
Display grid search scores in regression reports
Fix migration of scenarios from DSS 2.3 with partitions
Better explanations as to why some scenarios are aborted
Fix layout issues in scenario screens
This release brings a lot of bug fixes and minor features for plugins.
Add ability to introduce visual separators in settings screen
Add ability to hide parameters in settings screen
Add ability to use custom forms in settings screen
Add a metric for count of non null values
Add more metrics in the “data validity” probe
Expand capabilities for custom SQL aggregations
Add the ability to have custom checks in plugins
Use proxy settings for HTTP-based reporters
Fix and improve settings of the “append to dataset” reporter
Make the spinner appear immediately after submitting the query
Fix error reporting issues
Fix reloading of results in multi-cells mode
Add support for variable expansion
Fix visual recipes running on Hive with multiple Hive DBs
Fix reloading of split and filtering recipe with custom variables
Fix display of preparation step groups in model reports
Fix simple Shuffle-based cross-validation on regression models
Fix train-test split based on extract from two datasets with filter on test
Fix deploying “clustering” recipe on connections other than Filesystem
Add ability to disable XGBoost early stopping on regression
Fix renaming of datasets in the UI
Fix the Twitter dataset
Fix “Import data” modal in editable dataset
Fix reloading of schema for Redshift and other DBs
Improve display of filters for small numerical values
Fix mass change meaning action
Add ability to mass revert to default meaning
Unselect the steps when unselecting a group
Fix UI issue on Firefox
Add ability to have “external” legend on more charts
Fix several small bugs
Fix scale on charts with two Y-axes
DSS 3.0.3 is a bugfix release. For a summary of new features in DSS 3.0, see below.
Fix bug leading to unusable join recipe in some specific cases
Fix performance issue in code recipes with large number of columns
Fix history charts for points with no value
Fix possible race condition leading to considering some jobs as failed
DSS 3.0.2 is a bugfix and minor enhancements release. For a summary of new features in DSS 3.0, see below.
Preserve the “hive.query.string” Hadoop configuration key in Hive notebook
Clear error message when trying to use Geometry columns in Spark
Fix S3 support in Spark
Better performance for partitions list
Simplify and rework the way metrics are enabled and configured
Add deletion of bundles
Remap connections in SQL notebooks
Fix scenario run URL in mails
Fix wrongly computed multiclass metrics
Much faster multiclass scoring for MLlib
Fix multiclass AUC when only 2 classes appear in test set
Fix tooltip issues in the clustering scatter plot
Fix typo in custom HTTP header that could lead to inability to parse the response
Fix the INSEE enrichment processor
Fix excessive verbosity
Add a new processor to compute distance between geo points
Fix DateParser in multi-column mode when some of the columns are empty
Modifying a step comment now properly unlocks the “Save” button
Add new error reporting tools
Enforce hierarchy of files to prevent possible out-of-datadir reads
Fix support for nginx >= 1.10
Fix the ability to remove a group permission on a project
DSS 3.0.1 is a bugfix release. For a summary of the major new features in DSS 3.0, see: https://www.dataiku.com/learn/whatsnew
Fix ordering of partitions table
Default probes and metrics will now be enabled on migration from 2.X
Remove inapplicable parameter for MLlib
Improve explanations about target remapping in Jupyter export
Fix migration on groups
Multiple ColumnRenamer processors will now automatically be merged
DSS 3.0.0 is a major upgrade to DSS with exciting new features.
For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew
Automation deployment (“bundles”)¶
Dataiku DSS now comes in three flavors, called node types:
The Design node (the “classical” DSS), where you mainly design your workflows
The Automation node, where you run and automate your workflows
The API node (introduced in DSS 2.2), where you score new records in real-time using a REST API
After designing your data workflow in the design node, you can package it in a consistent artefact, called a “bundle”, which can then be deployed to the automation node.
On the automation node, you can activate, rollback and manage all versions of your bundles.
This new architecture makes it very easy to implement complex deployment use cases, with development, acceptance, preproduction and production environments.
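The version management described above (importing bundle versions, activating one, rolling back) can be sketched as a small conceptual model in plain Python. This is not the DSS API; the `BundleStore` class and its methods are invented purely to illustrate the lifecycle.

```python
class BundleStore:
    """Toy model of an automation node's bundle versions (illustrative only)."""

    def __init__(self):
        self.versions = []   # imported bundle ids, oldest first
        self.active = None   # currently active bundle id, if any

    def import_bundle(self, bundle_id):
        """Register a new bundle version on the automation node."""
        self.versions.append(bundle_id)

    def activate(self, bundle_id):
        """Make a previously imported bundle the live version."""
        if bundle_id not in self.versions:
            raise ValueError("unknown bundle: " + bundle_id)
        self.active = bundle_id

    def rollback(self):
        """Re-activate the version imported just before the active one."""
        i = self.versions.index(self.active)
        if i == 0:
            raise ValueError("no earlier version to roll back to")
        self.active = self.versions[i - 1]
```

The point of the sketch is that all versions stay available after activation, which is what makes rollback a cheap operation.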
For more information, please see our product page: http://www.dataiku.com/dss/features/deployment/
DSS has always been about rebuilding entire dataflows at once, thanks to its smart incremental reconstruction engine.
With the introduction of automation scenarios, you can now automate more complex use cases:
Building a part of the flow before another one (for partitioning purposes for example)
Automatically retraining models if they have diverged too much.
Scenarios are made up of:
Triggers, that decide when the scenario runs
Steps, the building blocks of your scenarios
Reporters, to notify the outside world.
You’ll find a lot of information in Automation scenarios, metrics, and checks.
Metrics and checks¶
You can now track various advanced metrics about datasets, recipes, models and managed folders. For example:
The size of a dataset
The average of a column in a dataset
The number of invalid rows for a given meaning in a column
All performance metrics of a saved model
The number of files in a managed folder
In addition to these built-in metrics, you can define custom metrics using Python or SQL. Metrics are historized for deep insights into the evolution of your data flow and can be fully accessed through the DSS APIs.
Then, you can define automatic data checks based on these metrics, which act as automatic sanity tests of your data pipeline. For example, you can automatically fail a job if the average value of a column has drifted by more than 10% since the previous week.
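The drift example above can be sketched as a small Python function. This is illustrative only: `check_average_drift` is not a DSS API, and in DSS you would configure such a check on the metric itself (built-in or custom Python/SQL).

```python
def check_average_drift(current_avg, previous_avg, tolerance=0.10):
    """Return "OK" or "ERROR" in the spirit of a DSS data check:
    error out when the average moved by more than `tolerance` (relative)."""
    if previous_avg == 0:
        return "ERROR"   # cannot compute a relative drift against zero
    drift = abs(current_avg - previous_avg) / abs(previous_avg)
    return "OK" if drift <= tolerance else "ERROR"
```

A check like this turns a historized metric into a pass/fail signal that a scenario can react to, for example by aborting the job.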
Advanced version control¶
Git-based version control is now integrated much more tightly in DSS.
View the history of your project, recipes, scenarios, … from the UI
Write your own commit messages
Choose between automatic commit at each edit or manual commit (either by component or by project)
In addition, you can now choose between having a global Git repository or a Git repository per project
When viewing the history, you can get the diff of each commit, or compare two commits.
Team activity dashboards¶
Monitor the activity of each project thanks to our team activity dashboards.
Administrator monitoring dashboards¶
We’ve added a lot of monitoring dashboards for administrators, especially for large instances with lots of projects:
Global usage summary
Data size per connection
Tasks running on the Hadoop and Spark clusters and per database
Tasks running in the background on DSS
Authorization matrix for an overview of all effective authorizations
When exporting a project, you can now export all datasets from all connections (except partitioned datasets), saved models and managed folders. When importing the project in another DSS design node, the data is automatically reloaded.
This allows exporting complete projects, including their data.
When importing projects, you can also remap connections, removing the need to define connections with exactly the same name as on the source DSS instance.
DSS now automatically performs several maintenance and cleanup tasks in the background.
Improved security model¶
We’ve added several new permissions for more fine-grained control. The following permissions can now be granted to each group, independently of the admin permissions:
Create projects and tutorials
Write “unsafe” code (that might be used to circumvent the permissions system)
Manage user-defined meanings
In addition, users can now create personal connections without admin intervention.
The administration UI now includes an authorization matrix for an overview of all effective authorizations
The public API includes new methods to interact with scenarios and metrics
The public API includes new methods for exporting projects
It’s now possible to delete columns based on a name pattern
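As an illustration, the matching logic behind such a deletion can be sketched in plain Python over a list of column names. The actual feature is a preparation processor in the DSS UI; `drop_columns_matching` is invented for this example.

```python
import re

def drop_columns_matching(columns, pattern):
    """Keep only the columns whose name does not match the given regex."""
    rx = re.compile(pattern)
    return [name for name in columns if not rx.search(name)]
```

For instance, the pattern `^tmp_` would remove every column whose name starts with a `tmp_` prefix while leaving the others untouched.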