DSS 3.1 Release notes

Migration notes

Migration paths to DSS 3.1

  • From DSS 3.0: Automatic migration is supported, with the following restrictions and warnings

  • From DSS 2.X: In addition to the following restrictions and warnings, you need to pay attention to the restrictions and warnings that apply to each intermediate version: see 2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 2.3, 2.3 -> 3.0

  • Migration from DSS 1.X is not supported. You must first upgrade to 2.0. See DSS 2.0 Release notes

Limitations and warnings

  • The usual limitations on retraining models and regenerating API node packages apply (see Upgrading a DSS instance for more information). Note that DSS 3.1 includes a major overhaul of the machine learning part, so machine learning models trained with previous DSS versions will not work in DSS 3.1.

How to upgrade

It is strongly recommended that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.

For automatic upgrade information, see Upgrading a DSS instance

External libraries upgrades

Several external libraries bundled with DSS have been bumped to major revisions. Some of these libraries include some backwards-incompatible changes. You might need to upgrade your code.

Notable upgrades:

  • ggplot 0.6 -> 0.9

  • pandas 0.17 -> 0.18

  • numpy 1.9 -> 1.10

  • requests 2.9 -> 2.10

Version 3.1.5 - November 21st 2016

Data preparation

  • Fix selection of partial column content

  • Fix removal of a value in a “Delete matching rows” step

  • Improve explanations for “Filter on invalid meaning” processor

  • Fix error when removing a column which was used for coloring cells

  • Fix unsaved changes to design sample in preparation recipes

  • Add reference of all processors in documentation

Flow & Recipes

  • Fix timezone issues on group and join recipes on Filesystem datasets

  • Fix disabling of pre-filter in visual recipes

Charts

  • Fix flickering and reset of zoom in map charts

  • Fix disappearing smallest bubble in scatter plot

  • Display an error message when trying to plot 100% stacked columns with negative values

Datasets

  • Uploaded files don’t disappear anymore when going back to the “Connection” tab

  • Fix writing dates to “CSV (Hive Compatible)” format from a Python recipe

Misc

  • Fix ability to abort a project export

  • Don’t fail project imports containing data for a SQL query dataset

  • Fix UI bug in messaging channels

  • Fix R install on Mac OS X

  • Fix export to GeoJSON

Version 3.1.4 - October 3rd 2016

Hadoop & Spark

  • Add support for HDP 2.5

  • Add support for EMR 4.7 and 4.8

  • Spark writing: Faster write for Parquet by using native Spark code

  • Spark writing: don’t fail on invalid dates

  • Pig: Fix PigStorage (for CSV files) on Pig 0.14+

  • Fix possible hang when aborting Hive+Tez queries

  • Improve logging inside the hproxy process

Datasets

  • Fix Redshift support (bug introduced in 3.1.3)

  • Add ability to load AWS credentials from environment

  • Fix “COUNT” metric on Oracle

  • Make fetch size configurable for all SQL datasets

  • Several fixes for Teradata support

Machine learning

  • Fix MROC AUC computation on Jupyter export of multiclass model

  • H2O: bump version and fix out-of-the-box support on CDH’s Spark

Misc

  • Fix dataset export from dashboard

  • Add support for Markdown on custom “Homepage” messages

  • SQL notebook: show aborted status immediately when aborting a query

  • Add API to read metrics on managed folders

  • Create the underlying folder of a managed folder upon addition

  • Fix scrolling on API keys page

  • Add ability to use case-insensitive logins on LDAP

  • LDAP users will now be imported as readers by default

Version 3.1.3 - September 19th 2016

DSS 3.1.3 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.

Hadoop & Spark

  • Add support for MapR 5.2

  • Add partial support for Hive 2.1

  • Add ability to pass arbitrary arguments to Spark, useful for --packages

Datasets

  • Fix some kinds of formulas in Excel reader

Data preparation

  • Fix random failure occurring in the “Holidays computer” processor

  • Fix output data of the JSONPath extractor processor

  • Fix date diff (reversed order)

Visual recipes

  • Fix date filtering

Data visualization

  • Add ability to use shapes in scatter plot

  • Minor improvements in tooltip handling

Machine Learning

  • Fix “Impute with Median” in MLlib on CDH 5.7/5.8

  • Fix possible failure in clustering results

  • Fix error in clustering recipe when filtering columns

  • Add configurability of max features in random forest algorithms

Lab

  • Fix encoding issues in PCA notebook

Misc

  • Metrics & Checks: Fix multiple SQL probes on the same datasets

  • Performance improvements for custom exporters

  • Performance improvements for Data Catalog

  • Performance improvements on home page

  • Small UI fixes in themes

  • Small UI improvements here and there

  • Update PostgreSQL driver (fixes result sets with more than 2B results)

Version 3.1.2 - August 22nd 2016

DSS 3.1.2 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.

ML

  • Fixed “red/green” indicator for MAPE

  • Improved visualization of decision trees

  • Warn when trying to use numerical features for Naive Bayes

  • Make GBT regression exportable to notebook

  • Fixed clustering scoring recipes migrated from 3.0

  • Add Impute with median on MLLib

  • Don’t fail when rejected features are not present in the scoring recipe input

Datasets

  • Configurable batch size for writing to ElasticSearch

  • Fixed editing of columns on editable datasets

Automation

  • Fix attachment of a dataset in the “Send message” step

  • Fix intermittent failures with “Make API node package” step

  • Add ability to directly use get_custom_variables in a custom check

Installation & Admin

  • Fixed R integration, following changes in IRKernel

  • Fixed “radial” layout on home page

  • Optional reporting on internal metrics to Graphite

  • Fixed “Cluster tasks” and “Per-connection data” views on Hadoop

Misc

  • Major performance improvements in various areas, especially with large number of projects, datasets, or users

  • Improved copy/paste of code from diff viewer

  • Tighten permissions on managed folders

  • Fixes for custom Scala recipe in plugin development environment

  • Fixed get_config call on Python API

  • Don’t fail on homepage with broken Jupyter notebooks

  • Fixed small UI issue on custom aggregations in grouping recipe

  • Fixed extension of export filenames

  • Fixed small UI issues with Chrome 52

  • Don’t allow the custom formula processor’s editing form to overflow

Version 3.1.1 - August 10th 2016

DSS 3.1.1 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.

ML

  • Fixed various errors in models status

  • Fixed deployment of Vertica ML models when the target is not in the dataset to score

  • Improved the autocomputed schema as output of scoring recipes

  • Fixed bug when a custom evaluation function is partially defined

  • Improved resiliency and error messages for custom evaluation functions

Spark

  • Fixed Spark recipes on CDH

  • Fixed Scala recipes on CDH

  • Fixed SparkR recipe

  • Added the ability to have Unicode characters in Scala recipe source

Misc

  • Added Jupyter logs to diagnostic reports

  • Fixed visibility of “Clear filters” link on some themes

Version 3.1.0 - July 27th 2016

DSS 3.1.0 is a major upgrade to DSS with exciting new features.

For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew

New features

Scala recipe and notebook

You can now interact with Spark using Scala, the most native language for Spark processing.

This release brings to DSS:

  • Spark-Scala recipes

  • Spark-Scala notebooks

  • Custom recipes (plugins) written in Scala

For more information, please see Spark-Scala recipes

H2O integration (through Sparkling-Water)

H2O is a distributed machine-learning library, with a wide range of algorithms and methods.

DSS now includes full support for H2O (in its “Sparkling Water” variant) in its visual machine learning interface.

For more information, please see H2O (Sparkling Water) engine

Advanced users can also leverage H2O through all Spark-based recipes and notebooks of DSS.

New DSS home page & workflow

The DSS home page now features:

  • The ability to set a customizable “status” on projects, in order to reflect your workflow (draft, production, archived, …) in DSS

  • The ability to filter projects by tags, status, owner, …

  • The ability to sort projects

  • A new “list” view with advanced details (contents of the project, activity monitoring, …)

  • A new “flow” view to study the dependencies between projects

  • Useful “Tips and Tricks”

Prebuilt notebooks

You can now use prebuilt templates for notebooks when creating a notebook from a dataset. This allows for reusable interactive analysis.

DSS 3.1 comes with 4 prebuilt notebooks for analyzing datasets:

  • PCA

  • Correlations between variables

  • Time series visualization and analytics

  • Time series forecasting

New data sources

DSS can now connect to the following SQL databases

Machine learning visualizations

DSS now includes the following new visualizations in Machine Learning

  • Decision tree(s) visualization for Decision Tree, Random Forests and Gradient Boosting

  • Partial dependency plots for Gradient Boosting

More custom algorithms support

Custom algorithms are now supported in:

  • Python Clustering (Python)

  • Spark MLLib Prediction (Scala)

  • Spark MLLib Clustering (Scala)

Custom Formats and Export

A brand new export mechanism has been introduced. It provides easier configuration and expands what can be supported.

It is now possible to write custom format extractors and exporters, either in Python or Java. See our plugins library for examples.

This notably provides a much improved support for export to Tableau (TDE files or Tableau Server): open any data from DSS in Tableau in just 2 clicks!

Other notable enhancements

Data preparation

  • New processor: date filter

  • New processor: compute distance between geopoints
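
To illustrate what a distance computation between geopoints involves, here is a minimal sketch based on the haversine formula. It is not the DSS implementation: the function name `haversine_km` and the mean Earth radius constant are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

The haversine formula treats the Earth as a sphere, which is usually accurate enough for feature engineering; an ellipsoidal model would be needed for survey-grade precision.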

Machine learning

  • Handling of data types has been strongly overhauled, resulting in better reliability in machine learning

  • Additional algorithms have been added in Spark MLLib

  • DSS now supports clustering in the Spark MLLib implementation

  • You can now export variables importance and coefficients data directly from the machine learning UI

  • When doing dummy-encoding, DSS can now remove the last dummy to avoid collinearity (especially useful for regression models). DSS by default automatically uses the proper behavior according to the algorithm.

  • When doing dummy-encoding, DSS has more options for handling features with large cardinalities (clip above a number of dummies, clip after a cumulative distribution, clip below a threshold in number of records)

  • Much faster scoring in MLLib multiclass

  • In scoring recipe, it is now possible to select the input columns to retain in output

  • In scoring recipe, it is now possible to “unplug” the output schema from the input. This is especially useful in corner cases where the data type is incorrect

  • Added support for in-database machine learning on Vertica, through Vertica 7 Advanced Analytics package

  • Added links to original analysis from training recipe & saved models
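
The dummy-encoding options listed above (dropping the last dummy, and clipping above a number of dummies) can be sketched in plain Python. This is an illustration only, not the DSS implementation: the function `dummy_encode` and its parameters `max_dummies` and `drop_last` are hypothetical names.

```python
from collections import Counter


def dummy_encode(values, max_dummies=None, drop_last=False):
    """One-hot encode a list of categorical values.

    max_dummies: keep only the most frequent categories (clip above a number of dummies).
    drop_last:   drop one dummy so that the dummies do not always sum to 1.
    Returns (categories, rows), where each row is a list of 0/1 flags.
    """
    counts = Counter(values)
    # Sort by decreasing frequency, then by name for a deterministic order.
    categories = sorted(counts, key=lambda c: (-counts[c], c))
    if max_dummies is not None:
        categories = categories[:max_dummies]
    # Dropping is only needed when every category got a dummy; if clipping
    # removed some categories, their rows are already all zeros.
    if drop_last and len(categories) == len(counts) and len(categories) > 1:
        categories = categories[:-1]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows
```

With a full set of dummies, each row sums to 1, which is collinear with an intercept term in a linear model. Dropping one dummy removes that redundancy, which is why the option matters most for regression models.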

Visual recipes

  • The join recipe now has support for more join types: Inner, Left, Right, Outer, Cross, Natural and Advanced (left with optional dedup)

  • The join recipe now has support for various kinds of inequality joins
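
To make the join semantics concrete, here is a minimal plain-Python sketch of a left join with an inequality condition. It only illustrates the semantics and is not how DSS executes joins; `left_join` and the sample rows are hypothetical.

```python
def left_join(left, right, condition):
    """Left join two lists of dicts on an arbitrary condition(left_row, right_row).

    Every left row is kept; unmatched left rows get None for the right columns.
    Assumes left and right column names do not clash.
    """
    right_columns = sorted({key for row in right for key in row})
    result = []
    for l_row in left:
        matches = [r_row for r_row in right if condition(l_row, r_row)]
        if not matches:
            # No match: pad the right-hand columns with None.
            matches = [dict.fromkeys(right_columns)]
        for r_row in matches:
            result.append({**l_row, **r_row})
    return result


orders = [{"order_id": 1, "amount": 50}, {"order_id": 2, "amount": 500}]
tiers = [{"tier": "bronze", "min_amount": 0}, {"tier": "gold", "min_amount": 100}]

# Inequality join: match each order with every tier whose threshold it reaches.
joined = left_join(orders, tiers, lambda o, t: o["amount"] >= t["min_amount"])
```

Unlike an equality join, an inequality join can match one left row with several right rows, so the output may contain more rows than either input.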

Datasets & formats

  • Very large Excel files can now be opened with small memory overhead

  • New option for CSV and SQL: normalize doubles (i.e. always add .0 to doubles). This makes operations between doubles and integers generally more reliable

  • Add support for newer AWS S3 regions (like eu-central-1)
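
The idea behind the “normalize doubles” option can be illustrated with a short sketch. The helper `normalize_double` is hypothetical and only shows the principle of always writing a decimal part; the exact formatting rules used by DSS may differ.

```python
def normalize_double(value):
    """Format a number so that it always carries a decimal part (e.g. 3 -> "3.0")."""
    text = repr(float(value))
    # repr() already yields "3.0" for whole floats; only exponent forms such as
    # "1e+16" can lack a decimal point, so add ".0" to the mantissa there.
    if "." not in text and "e" in text:
        mantissa, exponent = text.split("e")
        text = f"{mantissa}.0e{exponent}"
    return text
```

Writing `3.0` rather than `3` keeps a column recognizable as a double when the data is read back, instead of having whole-number values look like integers.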

Automation (scenarios, bundles, metrics, checks)

  • Counting records on small datasets will not use Hive anymore

  • Custom checks (in a plugin) can now be used

Hadoop & Spark

  • It is now possible to import Hive tables as HDFS datasets from the DSS UI

  • You can now validate SparkSQL recipes without having to run them

Installation and setup

  • The most standard Java options can now be set directly from the install.ini file. See Advanced Java runtime configuration

  • DSS can now use Conda for managing its internal Python environment instead of virtualenv/pip

  • Enhanced the content of DSS diagnosis reports

Misc

  • You can now expose a folder or a file in a folder on the Dashboard

  • Error handling has been improved in numerous places. DSS will now more prominently display the actual errors, especially when using code recipes

  • DSS now includes a public API for interacting with recipes

  • New interaction features in plugins

  • The schema of a dataset can be exported (to any supported formatter) from the settings screen

  • Access to datasets from Python and R is much faster, especially for small datasets

  • SQL connectors can now use custom JDBC URLs for advanced customization

  • Custom variables are now available in Webapps

  • New default pictures for users

  • Lots of performance improvements, both in the backend and frontend

Notable bug fixes

  • Very large Excel files can now be opened with small memory overhead

  • Machine Learning: Imputation with Unicode values has been fixed

  • Visual preparation: much faster drag & drop with Firefox

  • Fixed a bunch of JS errors

  • Visual recipes running on Hive or Impala will properly take into account the case-insensitivity of these DBs and not generate case-mismatched Parquet files anymore

  • Fixed possible job failures in Kerberos-secured clusters

  • Add multi-schema support to S3 -> Redshift syncing

  • Don’t forget to clear a dataset before doing a redispatch-sync

  • Switched to CartoDB tiles for maps