DSS 3.0 Release notes¶
- Migration notes
- Version 3.0.5 - June 24th, 2016
- Version 3.0.4 - June 16th, 2016
- Version 3.0.3 - May 30th, 2016
- Version 3.0.2 - May 25th, 2016
- Version 3.0.1 - May 11th, 2016
- Version 3.0.0 - May 1st, 2016
Migration notes¶
Migration to DSS 3.0 from a previous DSS 2.X instance requires some attention.
To migrate from DSS 1.X, you must first upgrade to 2.0. See DSS 2.0 Release notes.
Automatic migration from Data Science Studio 2.3.X is supported, with the following restrictions and warnings:
- DSS 3.0 features an improved security model. The migration aims at preserving as much as possible the previously defined permissions, but we strongly encourage you to review the permissions of users and groups after migration.
- DSS 3.0 now enforces the “Reader” / “Data Analyst” / “Data Scientist” roles in the DSS licensing model. You might need to adjust the roles for your users after upgrade.
- DSS now includes the XGBoost library in the visual machine learning interface. If you had previously installed older versions of the XGBoost Python library (using pip), the XGBoost algorithm in the visual machine learning interface might not work.
- The usual limitations on retraining models and regenerating API node packages apply (see Upgrading a DSS instance for more information)
- After migration, all previously scheduled jobs are disabled, to ease the “2.X and 3.X in parallel” deployment models. You’ll need to go to the scenarios pages in your projects to re-enable your previously scheduled jobs.
Automatic migration from Data Science Studio 2.0.X, 2.1.X and 2.2.X is supported, with the previous restrictions and warnings and, in addition, those outlined in DSS 2.1 Release notes, DSS 2.2 Release notes and DSS 2.3 Release notes.
It is strongly recommended that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance
Several external libraries bundled with DSS have been bumped to major revisions. Some of these libraries include some backwards-incompatible changes. You might need to upgrade your code.
- Pandas 0.16 -> 0.17
- Scikit-learn 0.16 -> 0.17
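For example, pandas 0.17 deprecated `DataFrame.sort()` in favor of `sort_values()`, so code written against pandas 0.16 may need updating (the DataFrame below is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 3, 1]})

# pandas <= 0.16 (deprecated in 0.17, removed in later versions):
# df = df.sort("score")

# pandas >= 0.17:
df = df.sort_values("score")

print(df["name"].tolist())  # ['c', 'b', 'a']
```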
The 3.0 version introduces Scenarios, which replace Scheduled jobs.
Each scheduled job you had in 2.X, enabled or not, is transformed during the migration process into a simple scenario replicating the functionalities of that scheduled job:
- the scenario contains a single build step to build the datasets that the scheduled job was building
- the scenario contains a single time-based trigger with the same settings as the scheduled job, so that the trigger fires at exactly the same frequency and times as the scheduled job did
If the scheduled job was enabled, the time-based trigger of the corresponding scenario is enabled, and vice versa. The scenarios themselves, however, are set to inactive, so that none will run right after the migration. You need to activate the scenarios (for example, from the scenarios list), or take the opportunity to consolidate the work that the scheduled jobs were performing into a smaller number of scenarios; a single scenario can launch multiple builds, waiting for each build to finish before launching the next one.
Since a scenario executes the build corresponding to a scheduled job only when both its trigger and the scenario itself are active, the quickest route to getting the same scheduled builds as before is to activate all scenarios.
Version 3.0.5 - June 24th, 2016¶
This release fixes a critical Spark-related bug, along with several smaller issues.
- Fix XGBoost regression models when the evaluation metric is MAE, MAPE, EVS or MSE
- Display grid search scores in regression reports
- Fix migration of scenarios from DSS 2.3 with partitions
- Better explanations as to why some scenarios are aborted
- Fix layout issues in scenario screens
Version 3.0.4 - June 16th, 2016¶
This release brings a lot of bug fixes and minor features for plugins.
- Add ability to introduce visual separators in settings screen
- Add ability to hide parameters in settings screen
- Add ability to use custom forms in settings screen
- Add a metric for count of non-null values
- Add more metrics in the “data validity” probe
- Expand capabilities for custom SQL aggregations
- Add the ability to have custom checks in plugins
- Use proxy settings for HTTP-based reporters
- Fix and improve settings of the “append to dataset” reporter
- Make the spinner appear immediately after submitting the query
- Fix error reporting issues
- Fix reloading of results in multi-cells mode
- Add support for variable expansion
- Fix visual recipes running on Hive with multiple Hive DBs
- Fix reloading of split and filtering recipe with custom variables
- Fix display of preparation step groups in model reports
- Fix simple Shuffle-based cross-validation on regression models
- Fix train-test split based on extract from two datasets with filter on test
- Fix deploying “clustering” recipe on connections other than Filesystem
- Add ability to disable XGBoost early stopping on regression
- Fix renaming of datasets in the UI
- Fix the Twitter dataset
- Fix “Import data” modal in editable dataset
- Fix reloading of schema for Redshift and other DBs
- Improve display of filters for small numerical values
- Fix the mass “change meaning” action
- Add ability to mass-revert to the default meaning
- Unselect the steps when unselecting a group
- Fix UI issue on Firefox
- Add ability to have “external” legend on more charts
- Fix several small bugs
- Fix scale on charts with two Y-axes
Version 3.0.3 - May 30th, 2016¶
DSS 3.0.3 is a bugfix release. For a summary of new features in DSS 3.0, see below.
- Fix bug leading to unusable join recipe in some specific cases
- Fix performance issue in code recipes with large number of columns
- Fix history charts for points with no value
- Fix possible race condition leading to considering some jobs as failed
Version 3.0.2 - May 25th, 2016¶
DSS 3.0.2 is a bugfix and minor enhancements release. For a summary of new features in DSS 3.0, see below.
- Preserve the “hive.query.string” Hadoop configuration key in Hive notebook
- Add a clear error message when trying to use Geometry columns in Spark
- Fix S3 support in Spark
- Better performance for partitions list
- Simplify and rework the way metrics are enabled and configured
- Add deletion of bundles
- Remap connections in SQL notebooks
- Fix scenario run URL in mails
- Fix wrongly computed multiclass metrics
- Much faster multiclass scoring for MLLib
- Fix multiclass AUC when only 2 classes appear in test set
- Fix tooltip issues in the clustering scatter plot
- Fix typo in custom HTTP header that could lead to inability to parse the response
- Fix the INSEE enrichment processor
- Fix excessive verbosity
- Add a new processor to compute distance between geo points
- Fix DateParser in multi-column mode when some of the columns are empty
- Modifying a step comment now properly unlocks the “Save” button
- Add new error reporting tools
- Enforce hierarchy of files to prevent possible out-of-datadir reads
- Fix support for nginx >= 1.10
- Fix the ability to remove a group permission on a project
Version 3.0.1 - May 11th, 2016¶
DSS 3.0.1 is a bugfix release. For a summary of the major new features in DSS 3.0, see: https://www.dataiku.com/learn/whatsnew
- Fix ordering of the partitions table
- Default probes and metrics are now enabled on migration from 2.X
- Remove an inapplicable parameter for MLLib
- Improve explanations about target remapping in Jupyter export
- Fix migration of groups
- Multiple ColumnRenamer processors are now automatically merged
Version 3.0.0 - May 1st, 2016¶
DSS 3.0.0 is a major upgrade to DSS with exciting new features.
For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew
Automation deployment (“bundles”)¶
Dataiku DSS now comes in three flavors, called node types:
- The Design node (the “classical” DSS), where you mainly design your workflows
- The Automation node, where you run and automate your workflows
- The API node (introduced in DSS 2.2), where you score new records in real-time using a REST API
After designing your data workflow in the design node, you can package it in a consistent artefact, called a “bundle”, which can then be deployed to the automation node.
On the automation node, you can activate, roll back and manage all versions of your bundles.
This new architecture makes it very easy to implement complex deployment use cases, with development, acceptance, preproduction and production environments.
For more information, please see our product page: http://www.dataiku.com/dss/features/deployment/
Automation scenarios¶
DSS has always been about rebuilding entire dataflows at once, thanks to its smart incremental reconstruction engine.
With the introduction of automation scenarios, you can now automate more complex use cases:
- Building a part of the flow before another one (for partitioning purposes for example)
- Automatically retraining models if they have diverged too much.
Scenarios are made up of:
- Triggers, that decide when the scenario runs
- Steps, the building blocks of your scenarios
- Reporters, to notify the outside world.
You’ll find a lot of information in Automation scenarios, metrics, and checks.
Metrics and checks¶
You can now track various advanced metrics about datasets, recipes, models and managed folders. For example:
- The size of a dataset
- The average of a column in a dataset
- The number of invalid rows for a given meaning in a column
- All performance metrics of a saved model
- The number of files in a managed folder
In addition to these built-in metrics, you can define custom metrics using Python or SQL. Metrics are historized for deep insights into the evolution of your data flow and can be fully accessed through the DSS APIs.
Then, you can define automatic data checks based on these metrics, that act as automatic sanity tests of your data pipeline. For example, automatically fail a job if the average value of a column has drifted by more than 10% since the previous week.
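As an illustration of the kind of logic such a check encodes, here is a minimal standalone sketch of the 10% drift rule (the `check_average_drift` helper and its OK/ERROR outcomes are hypothetical illustrations, not the actual DSS check API):

```python
def check_average_drift(current_avg, previous_avg, max_drift=0.10):
    """Return 'OK' if the current average stayed within max_drift
    (relative) of the previous value, 'ERROR' otherwise.
    Hypothetical helper, not the DSS check API."""
    if previous_avg == 0:
        return "ERROR"  # relative drift is undefined against zero
    drift = abs(current_avg - previous_avg) / abs(previous_avg)
    return "OK" if drift <= max_drift else "ERROR"

print(check_average_drift(105.0, 100.0))  # 5% drift  -> OK
print(check_average_drift(120.0, 100.0))  # 20% drift -> ERROR
```

In DSS itself, a failing check of this kind can be wired into a scenario so that the job is marked as failed when the check returns an error.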
Advanced version control¶
Git-based version control is now integrated much more tightly in DSS.
- View the history of your project, recipes, scenarios, … from the UI
- Write your own commit messages
- Choose between automatic commit at each edit or manual commit (either by component or by project)
In addition, you can now choose between having a global Git repository or a Git repository per project.
When viewing the history, you can get the diff of each commit, or compare two commits.
Team activity dashboards¶
Monitor the activity of each project thanks to our team activity dashboards.
Administrator monitoring dashboards¶
We’ve added a lot of monitoring dashboards for administrators, especially for large instances with lots of projects:
- Global usage summary
- Data size per connection
- Tasks running on the Hadoop and Spark clusters and per database
- Tasks running in the background on DSS
- Authorization matrix for an overview of all effective authorizations
When exporting a project, you can now export all datasets from all connections (except partitioned datasets), saved models and managed folders. When importing the project in another DSS design node, the data is automatically reloaded.
This makes it possible to export complete projects, including their data.
When importing projects, you can also remap connections, removing the need to define connections with exactly the same name as on the source DSS instance.
DSS now automatically performs several maintenance and cleanup tasks in the background.
Improved security model¶
We’ve added several new permissions for more fine-grained control. The following permissions can now be granted to each group, independently of the admin permissions:
- Create projects and tutorials
- Write “unsafe” code (that might be used to circumvent the permissions system)
- Manage user-defined meanings
In addition, users can now create personal connections without admin intervention.
The administration UI now includes an authorization matrix for an overview of all effective authorizations.
- The public API includes new methods to interact with scenarios and metrics
- The public API includes new methods for exporting projects
- It’s now possible to delete columns based on a name pattern
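As a rough illustration of what pattern-based column deletion does, here is a standalone sketch (the `drop_columns` helper is a hypothetical stand-in, not the actual DSS processor implementation):

```python
import re

def drop_columns(columns, pattern):
    """Return the columns whose names do NOT match the given regex.
    Hypothetical helper mimicking a 'delete columns matching a
    pattern' preparation step."""
    regex = re.compile(pattern)
    return [c for c in columns if not regex.search(c)]

cols = ["id", "name", "tmp_score", "tmp_rank", "date"]
print(drop_columns(cols, r"^tmp_"))  # ['id', 'name', 'date']
```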