DSS 2.1 Release notes

Migration notes

Warning

Migration to DSS 2.1 from DSS 1.X is not officially supported. You should first migrate to the latest 2.0.X version. See DSS 2.0 Release notes

Automatic migration from Data Science Studio 2.0.X is supported, with the following restrictions and warnings:

  • The “Auto Mirrors” feature, which had been deprecated since DSS 1.3, has been removed

  • In the Webapp builder, the deprecated APIs dataiku_load_dataset and dataiku_dataset_object have been removed

  • If you had any webapp that did not authorize any dataset, the API key associated with that webapp is no longer usable. You will need to create a new webapp if you want to authorize datasets.

  • If you previously used Impala (in notebooks or charts), you will need to reconfigure Impala connectivity. See Impala

  • Following an upgrade of the IPython notebook, graphing features must be enabled manually in older Python notebooks.

  • After upgrade, you will need to repeat the procedure to Install R integration

Getting charts in migrated Python notebooks

In DSS 1.X and 2.0, DSS used an older version of the IPython notebook that had graphing features “magically enabled”. DSS 2.1 includes a more recent version of the Jupyter notebook. This new version requires manual activation of the graphing features in the notebook.

Notebooks created directly in DSS 2.1 come with a snippet that automatically enables them.

To enable them on migrated notebooks, simply add a cell at the top of your notebook containing:

%pylab inline

Version 2.1.4 - October 29th 2015

DSS 2.1.4 is a bugfix release.

Datasets

  • Fix UI for partitioned ElasticSearch datasets

  • Fix possible issue in writing to partitioned ElasticSearch datasets

UI

  • Fix small issues in the “Run” button of recipes

Flow

  • Improve handling of datasets that are written to while being used: if a dataset is written to while being used as the input of a recipe, DSS now properly remembers the status of the source dataset at the beginning of the recipe, and will thus properly retrigger a build if the dataset was modified during the previous recipe execution

Recipes

  • Fix UI for the “Time range” dependency when the output is not partitioned

  • Improve the results of partitioning tests

Charts

  • Fix approximate computation of quantiles in boxplots

  • Fix case sensitivity issue in Vertica live-processing charts

Version 2.1.3 - October 20th 2015

DSS 2.1.3 is a bugfix release.

Installation

  • Add support for Amazon Linux 2016.x and Ubuntu 15.10

  • Improve installation of R on Red Hat / CentOS

  • Make installation of R integration on Mac OS X more robust

  • Fix publication of Jupyter notebooks to dashboard

Spark

  • Fix loading of non-Filesystem datasets (like SQL tables)

  • Fix loading of Parquet files with complex types (arrays, maps, structs)

  • Fix scoring recipes with Random Forest algorithm

Charts

  • Fix processing of charts on HDFS when Impala is not installed

  • Fix aggregation on filtered grid geo charts

  • Fix display of records on grid geo charts

  • Fix handling of “No value” on grid geo charts

  • Fix display of filtered record count on scatter plots

R

  • Fix handling of CSV files with multi-line fields

  • General cleanup

UI

  • Fix UI glitch in lists editor

  • Fix display of preparation recipe on Chrome 46

  • Fix schema update modal when a very large number of columns were modified

Version 2.1.2 - October 6th 2015

DSS 2.1.2 is a bugfix release.

Installation

  • Fix migration from 2.0.X for webapps that had 0 datasets enabled

Flow and recipes

  • Code recipes can now have a partitioned dataset as input and a folder as output

  • Fix handling of column types in the Split recipe > “Dispatch values of a column”

Datasets

  • ElasticSearch

    • Fix failure when reading indices with a very large number of shards

    • Fix handling of sampling

    • Close scroll handles on the server as soon as possible, to avoid excessive resource consumption on the server

Spark

  • Fix support for Spark 1.5.1

Machine learning

  • Fix optimization of regression models based on RMSE score

Version 2.1.1 - October 2nd 2015

DSS 2.1.1 is a bugfix release.

Warning

For migration from previous versions, please see the DSS 2.1.0 release notes

Installation

  • Fix support for Debian 7 in deps checker

  • Add support for Amazon Linux 2015.09

  • Fix migration of Webapp API keys in some cases

  • Fix migration of ColumnRenamer processor on analysis

  • Fix R automatic installation support for Debian 8

Visual preparation

  • Fix numerical filters

Spark

  • Drop rows where the target is null

  • “Drop missing” now also handles Infinity

Hadoop

  • Fix occasional failure of preparation recipes on Hadoop when starting many of them at once

UI

  • Fix scrolling in filesystem dataset schema

  • Fix aspect ratio of User pictures

  • Fix “Add checklist” button on project pages

Version 2.1.0 - September 29th 2015

DSS 2.1.0 is a major upgrade that brings a wealth of new features and improvements.

For a summary of the major new features, see: https://learn.dataiku.com/whatsnew

New features

Spark integration

DSS now features full integration with Apache Spark, the next-generation distributed analytic framework.

The integration of Spark in DSS 2.1 is pervasive and extends to all of the following features, which are now Spark-enabled:

  • Visual data preparation

  • “VisualSQL” recipes (Grouping, Joining, Stacking)

  • Guided machine learning in analysis

  • Training and prediction in Flow

  • PySpark recipe

  • SparkR recipe

  • SparkSQL recipe

  • PySpark-enabled notebook

  • SparkR-enabled notebook

All DSS data sources can be handled using Spark. As always in DSS, you can mix technologies freely, both Spark-enabled and traditional.

For more information about Spark integration, see DSS and Spark

Plugins

Plugins let you extend the features of DSS. You can add new kinds of datasets, recipes, visual preparation processors, custom formula functions, and more.

Plugins can be downloaded from the official Dataiku community site, or created by you and shared with your team.

Public API

DSS now includes a REST API to programmatically manage DSS from any HTTP-capable language.

To learn more about this API, see Public REST API
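
As a minimal sketch, here is what a call could look like from Python with the requests library. The /public/api/projects/ path, the example URL, and the authentication scheme (API key passed as the HTTP basic-auth username, with an empty password) are assumptions to verify against the Public REST API documentation:

# Minimal sketch of calling the DSS public REST API from Python.
# The endpoint path, port and auth scheme are assumptions: check the
# Public REST API documentation for your DSS version.
import requests

DSS_URL = "http://localhost:11200"   # hypothetical DSS base URL
API_KEY = "your-api-key-here"        # an API key created in DSS

# List projects: the API key is the basic-auth username, password is empty
resp = requests.get("%s/public/api/projects/" % DSS_URL, auth=(API_KEY, ""))
resp.raise_for_status()
for project in resp.json():
    print(project)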

Enhanced charts

The charts module of DSS has been vastly enhanced:

  • New chart types

    • Scatterplots

    • Horizontal bar charts

    • Pie and donut charts

    • Boxplots

    • Pivot table (text view) and colored pivot table

    • Geographic scatter map

    • “Grid” map (fixed-width aggregation grid)

    • (Experimental) 2D distribution plot

  • Brand new user experience. It is now much easier to understand what’s going on in your chart, and to switch dimensions and measures.

  • Improved presentation of date axis

  • All charts can now have semi-transparency

  • New computation mode for aggregated charts: “Percentage scale” (% of the total)

  • Tooltips now provide the ability to drill down and exclude values

New R integration and Jupyter notebook

The bundled IPython notebook has been upgraded: DSS now includes Jupyter 4.0.

This is a major new release of IPython/Jupyter, with lots of improvements and support for multiple languages.

The bundled Jupyter now comes with a built-in, brand-new R kernel, with vastly improved features over the previous R integration:

  • Syntax highlighting

  • Auto-completion (just hit Tab)

  • Much improved error handling

In addition, DSS features a new R API. See R API for more details.

Editable datasets

Editable datasets are a new kind of dataset in DSS that you can create and modify directly in the DSS UI, à la Excel or Google Spreadsheets.

They can be used, for example, to create reference tables, configuration datasets, …

Editable datasets can be initialized from any file or from another dataset.

“Managed folders”

DSS comes with a large number of supported formats, machine learning engines, and more. But sometimes you want to go further.

DSS code recipes (Python and R) can now read from and write to “Managed Folders”: handles on filesystem-hosted folders where you can store any kind of data.

DSS will not try to read or write structured data from managed folders, but they appear in the Flow and can be used for dependency computation. Furthermore, you can upload and download files from a managed folder using the DSS API.

Here are some example use cases:

  • You have some files that DSS cannot read, but you have a Python library that can read them: upload the files to a managed folder, and use your Python library in a Python recipe to create a regular dataset

  • You want to use Vowpal Wabbit to create a model. DSS does not have full-fledged integration with VW. Write a first Python recipe that has a managed folder as output, and save the VW model into it. Then write another recipe that reads the model from the same managed folder and performs predictions

  • Anything else you might think of. Thanks to managed folders, DSS can help you even when it does not know about your data.
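
As an illustration of the recipe side, here is a minimal sketch of writing a file to a managed folder from a Python recipe. The folder name "models" is hypothetical, and the dataiku.Folder / get_path() calls should be checked against the managed folders documentation for your DSS version:

# Sketch of writing an arbitrary file into a managed folder from a
# Python recipe. Assumes a managed folder named "models" (hypothetical)
# declared as the recipe output.
import os
import dataiku

folder = dataiku.Folder("models")    # handle on the managed folder
folder_path = folder.get_path()      # filesystem path of the folder

# Store any kind of data, e.g. a serialized model produced by your code
with open(os.path.join(folder_path, "model.bin"), "wb") as f:
    f.write(b"\x00\x01")             # placeholder binary content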

Impala recipe

Previously, DSS could use Impala:

  • For charts creation

  • In the Impala notebook

You can now also use Impala in a new Impala recipe to benefit from the speed of this engine for aggregations on HDFS.

Impala is also available as an engine for “VisualSQL” (Grouping, Join, Stack) recipes

Shell recipe

There are times when you just need to:

  • Run a shell command

  • Stream a dataset through a shell command (stdin/stdout)

The shell recipe lets you do just that.
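
In the streaming case, the recipe pipes the rows of the input dataset to your command’s stdin and builds the output dataset from its stdout, so any line-oriented filter works. As a hedged illustration, here is a small Python script that could serve as such a command (the CSV layout and the choice of column are assumptions):

#!/usr/bin/env python
# Example stdin/stdout filter usable as a shell recipe command.
# Assumes the dataset is streamed as CSV rows; adapt to your layout.
import csv
import sys

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)
for row in reader:
    if row:                        # skip blank lines defensively
        row[0] = row[0].upper()    # uppercase the first column
    writer.writerow(row)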

Code snippets

In all modules of DSS where you can write code, you now have the ability to insert code snippets. DSS comes with lots of useful built-in snippets, and you can also write your own and share them with your team.

Elasticsearch dataset

DSS now has full read-write support for ElasticSearch datasets.

Easter eggs

Will you find all our new easter eggs?

Other major enhancements

Multiple SQL statements support

Every module of DSS where you can write SQL code now supports multiple statements.

For example, in the SQL Query recipe, you can now add statements before the main “SELECT” statement. This can allow you to issue some SET statements to tune the optimizer, or to create stored procedures.

In other words, you can now declare and call a stored procedure in a SQL Query recipe.

This also applies to the pre-write and post-write statements. For example, you can now create multiple indexes.

Running SQL, Hive and Impala from Python and R recipes

SQL is the most pervasive way to make data analysis queries. However, advanced logic like loops and conditions is often difficult to express in SQL. There are some options, like stored procedures, but they require learning new languages.

DSS now lets you run SQL queries directly from a Python recipe. This lets you:

  • sequence several SQL queries

  • dynamically generate some new SQL queries to execute in the database

  • use SQL to obtain some aggregated data for further numerical processing in Python

To learn more, see Performing SQL, Hive and Impala queries
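
As a sketch of what this can look like in a Python recipe, assuming a SQL connection named "mydb" and a "customers" table (both hypothetical); the SQLExecutor2 class and its query_to_df helper should be verified against Performing SQL, Hive and Impala queries for your DSS version:

# Sketch of running a SQL query from a Python recipe and retrieving
# the aggregated result as a Pandas dataframe for further processing.
# SQLExecutor2 / query_to_df are assumptions to verify in the docs.
from dataiku import SQLExecutor2

executor = SQLExecutor2(connection="mydb")   # hypothetical SQL connection
df = executor.query_to_df(
    "SELECT country, COUNT(*) AS n FROM customers GROUP BY country"
)
print(df.head())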

“Repartitioning” mode

Thanks to “repartitioning” mode, you can now:

  • Start with a non-partitioned file-based dataset in which a column could act as a partitioning column

  • Create a “repartitioning-enabled” sync or prepare recipe to transform it into a partitioned dataset

  • Build all partitions of the target dataset in a single pass

“Append” mode

Several recipes and datasets now support “Append” instead of “Overwrite” in target datasets.

While we do not recommend using this for general-purpose datasets, it can be useful in some cases, for example to create a “history” dataset as the output of a recipe, writing a new line each time the recipe is run.

HDP 2.3 support

DSS is now compatible with Hadoop HDP 2.3.0

Misc

  • Visual preparation

    • Added detection of US states in Preparation recipes

    • New processor to concatenate arrays

    • The column renamer processor can now perform multiple renames

    • New mass actions to lowercase/uppercase/simplify many column names

    • Find/Replace processor can now work on all columns at once

  • Machine learning

    • Analysis: Machine Learning training does not start automatically anymore, giving you an opportunity to review the settings prior to initial training.

  • Custom formula: new random() function

  • Automatic R integration installation, no more manual steps

  • UX

    • All recipes now share a more common and more consistent layout, with the Run button always present on the most important tab.

    • When creating managed datasets, you now have more options to preconfigure the dataset based on common formats, or using known partitioning schemes

    • Better “build dataset” modal, with a clearer explanation of the build modes

Notable bug fixes

UI

  • Fixed various scrolling and drag-and-drop issues

  • Fix renaming of datasets when the name was already used

  • Fix project home display for readers

  • Fixed job preview when training a model

  • It’s now possible to choose a different SQL port

  • Improved behavior of exports

  • Fixed tagging of analysis

  • Fixed DSS disconnection when very long log lines are transferred

Charts

  • Fixed an issue with date range filters

  • Fixed a bug that could happen with numerical axis and small values

  • Fixed issue in geo chart when the value of the aggregation is 0

  • Fixed various issues in legend, especially with “Mixed columns/lines” charts

Hadoop

  • [HDP only] Fixed Hive recipes on readonly HDFS datasets

  • Fixed/Added support for native Snappy and LZO datasets

  • Fixed EXPLAIN statements in notebook for Hive

  • The command-line tool to import Hive databases now supports partitioning

SQL and VisualSQL recipes

  • Added ability to disable the “LIMIT” in statements for some operations that don’t support it (e.g. tuple mover operations on Vertica)

  • Fixed cases where the “limit” statement could get disabled

  • Fixed issues with Join recipe on Oracle

  • Grouping recipe: fixed issue with the “LAST” aggregation function

  • Fixed typing of MIN/MAX on non-string columns

  • Grouping recipe can now use faster stream engine when computing COUNT

Machine learning

  • Fixed Gradient Boosted Trees in k-fold mode with multiple losses

  • Fixed Lasso regression in AutoCV mode

  • SGD classification can now use sparse matrices

  • Fixed error when imputation was disabled for all numerical features

  • Fixed error when using feature hashing on a numerical feature

  • Fixed broken models when RMSLE can’t be computed

  • Fixed some errors in notebook export

Reliability

  • Using “Compute number of records” on large datasets does not lock the studio anymore

  • Fixed a deadlock leading to the whole notification system ceasing to work

  • Fixed the infamous “error during execution of add command” issue with the internal Git repository

  • Fixed several cases where long operations could be performed while holding a lock on the DSS configuration (like Hive validation or aborting Impala queries)

  • DSS diagnosis now includes more useful information

Datasets

  • Fixed detection of Excel dates

  • Fixed detection of UTF-8 BOMs

Misc

  • Shapefiles that require Vecmath now display a proper informative message

  • We now display proper warning messages when using non-Hive-compatible HDFS dataset names