DSS 3.1 Release notes¶
Migration notes¶
Migration paths to DSS 3.1¶
From DSS 3.0: Automatic migration is supported, with the following restrictions and warnings.
From DSS 2.X: In addition to the restrictions and warnings below, you need to pay attention to those that apply to each intermediate version: see 2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 2.3, and 2.3 -> 3.0.
Migration from DSS 1.X is not supported. You must first upgrade to 2.0. See DSS 2.0 Release notes.
Limitations and warnings¶
The usual limitations on retraining models and regenerating API node packages apply (see Upgrading a DSS instance for more information). Note that DSS 3.1 includes a major overhaul of its machine learning stack, so machine learning models trained with previous versions of DSS will not work in DSS 3.1
How to upgrade¶
It is strongly recommended that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance
External libraries upgrades¶
Several external libraries bundled with DSS have been upgraded to new major versions. Some of these upgrades include backwards-incompatible changes, so you might need to update your code.
Notable upgrades:
ggplot 0.6 -> 0.9
pandas 0.17 -> 0.18
numpy 1.9 -> 1.10
requests 2.9 -> 2.10
Version 3.1.5 - November 21st 2016¶
Data preparation¶
Fix selection of partial column content
Fix removal of a value in a “Delete matching rows” step
Improve explanations for “Filter on invalid meaning” processor
Fix error when removing a column which was used for coloring cells
Fix unsaved changes to design sample in preparation recipes
Add reference of all processors in documentation
Flow & Recipes¶
Fix timezone issues on group and join recipes on Filesystem datasets
Fix disabling of pre-filter in visual recipes
Charts¶
Fix flickering and reset of zoom in map charts
Fix disappearing smallest bubble in scatter plot
Display an error message when trying to plot 100% stacked columns with negative values
Version 3.1.4 - October 3rd 2016¶
Hadoop & Spark¶
Add support for HDP 2.5
Add support for EMR 4.7 and 4.8
Spark writing: Faster write for Parquet by using native Spark code
Spark writing: don’t fail on invalid dates
Pig: Fix PigStorage (for CSV files) on Pig 0.14+
Fix possible hang when aborting Hive+Tez queries
Improve logging inside the hproxy process
Datasets¶
Fix Redshift support (bug introduced in 3.1.3)
Add ability to load AWS credentials from environment
Fix “COUNT” metric on Oracle
Make fetch size configurable for all SQL datasets
Several fixes for Teradata support
Machine learning¶
Fix ROC AUC computation on Jupyter export of multiclass models
H2O: bump version and fix out-of-the-box support on CDH’s Spark
Misc¶
Fix dataset export from dashboard
Add support for Markdown on custom “Homepage” messages
SQL notebook: show aborted status immediately when aborting a query
Add API to read metrics on managed folders (see the sketch after this list)
Create the underlying folder of a managed folder upon addition
Fix scrolling on API keys page
Add ability to use case-insensitive logins on LDAP
LDAP users will now be imported as readers by default
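For instance, reading these folder metrics through the Python client could look like the following minimal sketch. The host, API key, project key and folder id are placeholders, and the metric-reading methods (get_last_metric_values, get_all_ids, get_global_value) are assumptions based on the dataikuapi client; check the API documentation for your DSS version.

```python
import dataikuapi

# Placeholders: point these at your own DSS instance
client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
project = client.get_project("MYPROJECT")
folder = project.get_managed_folder("myFolderId")

# Read the last computed metric values on the managed folder
metrics = folder.get_last_metric_values()
for metric_id in metrics.get_all_ids():
    print(metric_id, metrics.get_global_value(metric_id))
```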
Version 3.1.3 - September 19th 2016¶
DSS 3.1.3 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.
Hadoop & Spark¶
Add support for MapR 5.2
Add partial support for Hive 2.1
Add ability to pass arbitrary arguments to Spark, useful for --packages
Data preparation¶
Fix random failure occurring in the “Holidays computer” processor
Fix output data of the JSONPath extractor processor
Fix date diff (reversed order)
Visual recipes¶
Fix date filtering
Data visualization¶
Add ability to use shapes in scatter plot
Minor improvements in tooltip handling
Machine Learning¶
Fix “Impute with Median” in MLlib on CDH 5.7/5.8
Fix possible failure in clustering results
Fix error in clustering recipe when filtering columns
Add configurability of max features in random forest algorithms
Misc¶
Metrics & Checks: Fix multiple SQL probes on the same datasets
Performance improvements for custom exporters
Performance improvements for Data Catalog
Performance improvements on home page
Small UI fixes in themes
Small UI improvements here and there
Update PostgreSQL driver (fixes result sets with more than 2B results)
Version 3.1.2 - August 22nd 2016¶
DSS 3.1.2 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.
ML¶
Fixed “red/green” indicator for MAPE
Improved visualization of decision trees
Warn when trying to use numerical features for Naive Bayes
Make GBT regression exportable to notebook
Fixed clustering scoring recipes migrated from 3.0
Add “Impute with median” on MLlib
Don’t fail when rejected features are not present in the scoring recipe input
Datasets¶
Configurable batch size for writing to ElasticSearch
Fixed editing of columns on editable datasets
Automation¶
Fix attachment of a dataset in the “Send message” step
Fix intermittent failures with “Make API node package” step
Add ability to directly use get_custom_variables in a custom check
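As an illustration, a custom check could read a threshold from the project’s variables. This is a minimal sketch: dataiku.get_custom_variables() is the relevant call, while the check entry-point signature and the (outcome, message) return convention shown here are assumptions.

```python
import dataiku

# Sketch of a custom check body. The entry-point signature and the
# (outcome, message) return convention are assumptions; only
# get_custom_variables() is the call this release note refers to.
def check(last_values, dataset, partition_id):
    # Resolve the project's custom variables as a dict
    variables = dataiku.get_custom_variables()
    threshold = int(variables.get("max_error_count", 100))
    # ... compare a computed metric against the variable-driven threshold ...
    return 'OK', 'Using threshold %d from project variables' % threshold
```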
Installation & Admin¶
Fixed R integration, following changes in IRKernel
Fixed “radial” layout on home page
Optional reporting on internal metrics to Graphite
Fixed “Cluster tasks” and “Per-connection data” views on Hadoop
Misc¶
Major performance improvements in various areas, especially with large numbers of projects, datasets, or users
Improved copy/paste of code from diff viewer
Tighten permissions on managed folders
Fixes for custom Scala recipe in plugin development environment
Fixed get_config call on Python API
Don’t fail on homepage with broken Jupyter notebooks
Fixed small UI issue on custom aggregations in grouping recipe
Fixed extension of export filenames
Fixed small UI issues with Chrome 52
Don’t allow the custom formula processor’s editing form to overflow
Version 3.1.1 - August 10th 2016¶
DSS 3.1.1 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.
ML¶
Fixed various errors in model status
Fixed deployment of Vertica ML models when the target is not in the dataset to score
Improved the automatically computed output schema of scoring recipes
Fixed bug when a custom evaluation function is partially defined
Improved resiliency and error messages for custom evaluation functions
Version 3.1.0 - July 27th 2016¶
DSS 3.1.0 is a major upgrade to DSS with exciting new features.
For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew
New features¶
Scala recipe and notebook¶
You can now interact with Spark using Scala, the most native language for Spark processing.
This release brings to DSS:
Spark-Scala recipes
Spark-Scala notebooks
Custom recipes (plugins) written in Scala
For more information, please see Spark-Scala recipes
H2O integration (through Sparkling-Water)¶
H2O is a distributed machine-learning library, with a wide range of algorithms and methods.
DSS now includes full support for H2O (in its “Sparkling Water” variant) in its visual machine learning interface.
For more information, please see H2O (Sparkling Water) engine
Advanced users can also leverage H2O through all Spark-based recipes and notebooks of DSS.
New DSS home page & workflow¶
The DSS home page now features:
The ability to set a customizable “status” to projects, in order to materialize your workflow (draft, production, archived, …) in DSS
The ability to filter projects by tags, status, owner, …
The ability to sort projects
A new “list” view with advanced details (contents of the project, activity monitoring, …)
A new “flow” view to study the dependencies between projects
Useful “Tips and Tricks”
Prebuilt notebooks¶
You can now use prebuilt templates for notebooks when creating a notebook from a dataset. This allows for reusable interactive analyses.
DSS 3.1 comes with 4 prebuilt notebooks for analyzing datasets:
PCA
Correlations between variables
Time series visualization and analytics
Time series forecasting
Machine learning visualizations¶
DSS now includes the following new visualizations in Machine Learning:
Decision tree(s) visualization for Decision Tree, Random Forests and Gradient Boosting
Partial dependence plots for Gradient Boosting
More custom algorithms support¶
Custom algorithms are now supported in:
Clustering (Python)
Spark MLlib Prediction (Scala)
Spark MLlib Clustering (Scala)
Custom Formats and Export¶
A brand new export mechanism has been introduced. It provides easier configuration and expands what can be supported.
It is now possible to write custom format extractors and exporters, either in Python or Java. See our plugins library for examples.
This notably provides much improved support for export to Tableau (TDE files or Tableau Server): open any data from DSS in Tableau in just 2 clicks!
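As an illustration of the exporter side, a Python custom exporter is essentially a class with open/write_row/close hooks. The sketch below follows the plugin exporter template, but treat the constructor arguments and hook names as assumptions and refer to the plugins library for authoritative examples.

```python
# exporter.py in a plugin's exporter component (minimal sketch;
# constructor arguments and hook names are assumptions)
from dataiku.exporter import Exporter

class CustomExporter(Exporter):
    def __init__(self, config, plugin_config):
        self.config = config                # settings entered by the user
        self.plugin_config = plugin_config  # plugin-level settings

    def open(self, schema):
        # Called once with the dataset schema, before any row is written
        self.columns = [c["name"] for c in schema["columns"]]
        self.rows = []

    def write_row(self, row):
        # Called once per row, values in schema order
        self.rows.append(dict(zip(self.columns, row)))

    def close(self):
        # Flush everything to the destination system here
        print("Exported %d rows" % len(self.rows))
```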
Other notable enhancements¶
Data preparation¶
New processor: date filter
New processor: compute distance between geopoints
Machine learning¶
Handling of data types has been thoroughly overhauled, resulting in better reliability in machine learning
Additional algorithms have been added to Spark MLlib
DSS now supports clustering with the Spark MLlib implementation
You can now export variable importance and coefficient data directly from the machine learning UI
When doing dummy-encoding, DSS can now remove the last dummy to avoid collinearity (especially useful for regression models). By default, DSS automatically uses the proper behavior for each algorithm (see the sketch after this list)
When doing dummy-encoding, DSS has more options for handling features with large cardinalities (clip above a number of dummies, clip after a cumulative distribution, clip below a threshold in number of records)
Much faster multiclass scoring in MLlib
In scoring recipe, it is now possible to select the input columns to retain in output
In scoring recipe, it is now possible to “unplug” the output schema from the input. This is especially useful in corner cases where the data type is incorrect
Added support for in-database machine learning on Vertica, through Vertica 7 Advanced Analytics package
Added links to original analysis from training recipe & saved models
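To see why removing one dummy matters, here is an illustrative pandas sketch (not DSS’s implementation; pandas drops the first dummy rather than the last, but the collinearity argument is the same).

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One dummy per category: the columns always sum to 1, so together
# they are collinear with a regression model's intercept term
full = pd.get_dummies(df["color"])
print(full)

# Dropping one dummy removes the redundancy; the dropped category is
# encoded as "all remaining dummies are 0"
reduced = pd.get_dummies(df["color"], drop_first=True)
print(reduced)
```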
Visual recipes¶
The join recipe now has support for more join types: Inner, Left, Right, Outer, Cross, Natural and Advanced (left with optional dedup)
The join recipe now has support for various kinds of inequality joins
Datasets & formats¶
Very large Excel files can now be opened with small memory overhead
New option for CSV and SQL: normalize doubles (i.e. always add .0 to doubles). This makes operations between doubles and integers generally more reliable
Add support for newer AWS S3 regions (like eu-central-1)
Automation (scenarios, bundles, metrics, checks)¶
Counting records on small datasets will not use Hive anymore
Custom checks (in a plugin) can now be used
Hadoop & Spark¶
It is now possible to import Hive tables as HDFS datasets from the DSS UI
You can now validate SparkSQL recipes without having to run them
Installation and setup¶
The most standard Java options can now be set directly from the install.ini file. See Advanced Java runtime configuration
DSS can now use Conda for managing its internal Python environment instead of virtualenv/pip
Enhanced the content of DSS diagnosis reports
Misc¶
You can now expose a folder or a file in a folder on the Dashboard
Error handling has been improved in numerous places. DSS will now more prominently display the actual errors, especially when using code recipes
DSS now includes a public API for interacting with recipes (see the sketch after this list)
New interaction features in plugins
The schema of a dataset can be exported (to any supported formatter) from the settings screen
Access to datasets from Python and R is much faster, especially for small datasets
SQL connectors can now use custom JDBC URLs for advanced customization
Custom variables are now available in Webapps
New default pictures for users
Lots of performance improvements, both in the backend and frontend
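A minimal sketch of using that public API from Python; the host, API key and project key are placeholders, and the exact fields returned by list_recipes() are assumptions based on the dataikuapi client.

```python
import dataikuapi

# Placeholders: point these at your own DSS instance
client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
project = client.get_project("MYPROJECT")

# Enumerate the project's recipes with their types
for recipe in project.list_recipes():
    print(recipe["name"], recipe["type"])
```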
Notable bug fixes¶
Machine Learning: Imputation with Unicode values has been fixed
Visual preparation: much faster drag & drop with Firefox
Fixed a bunch of JS errors
Visual recipes running on Hive or Impala will properly take into account the case-insensitivity of these DBs and not generate case-mismatched Parquet files anymore
Fixed possible job failures in Kerberos-secured clusters
Add multi-schema support to S3 -> Redshift syncing
Don’t forget to clear a dataset before doing a redispatch-sync
Switched to CartoDB tiles for maps