DSS 2.1 Release notes
Migration to DSS 2.1 from DSS 1.X is not officially supported. You should first migrate to the latest 2.0.X version. See DSS 2.0 Release notes
Automatic migration from Data Science Studio 2.0.X is supported, with the following restrictions and warnings:
The “Auto Mirrors” feature, which has been deprecated since DSS 1.3, has been removed
In the Webapp builder, the deprecated API
have been removed
If you had any webapps that did not authorize any dataset, the API key associated with these webapps is no longer usable. You will need to create a new webapp if you want to authorize datasets.
If you previously used Impala (in notebooks or charts), you will need to reconfigure Impala connectivity. See Impala
Following an upgrade of the IPython notebook, graphing features must be enabled manually in older Python notebooks.
After the upgrade, you will need to follow the procedure to Install R integration again
In DSS 1.X and 2.0, DSS used an older version of the IPython notebook that had graphing features “magically enabled”. DSS 2.1 includes a more recent version of the Jupyter notebook. This new version requires manual activation of the graphing features in the notebook.
Notebooks created directly in DSS 2.1 come with a snippet that automatically enables them.
To enable them on migrated notebooks, simply add a cell at the top of your notebook containing:
DSS 2.1.4 is a bugfix release.
Fix UI for partitioned ElasticSearch datasets
Fix possible issue in writing to partitioned ElasticSearch datasets
Improve handling of datasets being written while being used: if a dataset is written while it is used as the input of a recipe, DSS now remembers the status of the source dataset at the beginning of the recipe, and properly retriggers the build if the dataset was modified during the previous recipe execution
Fix UI for the “Time range” dependency when the output is not partitioned
Improve the results of partitioning tests
DSS 2.1.3 is a bugfix release.
Add support for Amazon Linux 2016.x and Ubuntu 15.10
Improve installation of R integration on Redhat / CentOS
Make installation of R integration on Mac OS X more robust
Fix publication of Jupyter notebooks to dashboard
Fix loading of non-Filesystem datasets (like SQL tables)
Fix loading of Parquet files with complex types (arrays, maps, structs)
Fix scoring recipes with Random Forest algorithm
Fix processing of charts on HDFS when Impala is not installed
Fix aggregation on filtered grid geo charts
Fix display of records on grid geo charts
Fix handling of “No value” on grid geo charts
Fix display of filtered record count on scatter plots
DSS 2.1.2 is a bugfix release.
Code recipes can now have a partitioned dataset as input and a folder as output
Fix handling of column types in the Split recipe > “Dispatch values of a column”
Fix failure when reading indices with a very large number of shards
Fix handling of sampling
Close scroll handles on the server as soon as possible, to avoid excessive resource consumption on the server
DSS 2.1.1 is a bugfix release.
For migration from previous versions, please see the DSS 2.1.0 release notes
Fix support for Debian 7 in deps checker
Add support for Amazon Linux 2015.09
Fix migration of Webapp API keys in some cases
Fix migration of ColumnRenamer processor on analysis
Fix R automatic installation support for Debian 8
Fix occasional failure of prepare recipes on Hadoop when starting many preparation recipes at once.
DSS 2.1.0 is a major upgrade that brings a wealth of new features and improvements.
For a summary of the major new features, see: https://learn.dataiku.com/whatsnew
DSS now features full integration with Apache Spark, the next-generation distributed analytic framework.
The integration of Spark in DSS 2.1 is pervasive and extends to all of the following features, which are now Spark-enabled:
Visual data preparation
“VisualSQL” recipes (Grouping, Joining, Stacking)
Guided machine learning in analysis
Training and prediction in Flow
All DSS data sources can be handled using Spark. As always in DSS, you can mix technologies freely, both Spark-enabled and traditional
For more information about Spark integration, see DSS and Spark
Plugins let you extend the features of DSS. You can add new kinds of datasets, recipes, visual preparation processors, custom formula functions, and more.
Plugins can be downloaded from the official Dataiku community site, or created by you and shared with your team.
DSS now includes a REST API to programmatically manage DSS from any HTTP-capable language.
To learn more about this API, see Public REST API
The charts module of DSS has been vastly enhanced:
New chart types
Horizontal bar charts
Pie and donut charts
Pivot table (text view) and colored pivot table
Geographic scatter map
“Grid” map (fixed-width aggregation grid)
(Experimental) 2D distribution plot
Brand new user experience. It is now much easier to understand what’s going on in your chart, and to switch dimensions and measures.
Improved presentation of date axis
All charts can now have semi-transparency
New computation mode for aggregated charts: “Percentage scale” (% of the total)
Tooltips now provide the ability to drill-down and exclude values
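As a small illustration of the “Percentage scale” mode mentioned above, the sketch below converts aggregated values into percentages of their total. The category names and values are made up for the example, and the exact way DSS computes the scale (for instance per chart or per series) is an assumption.

```python
def to_percentage_scale(aggregates):
    """Express each aggregated value as a percentage of the overall total."""
    total = sum(aggregates.values())
    if total == 0:
        # Avoid a division by zero when every aggregate is 0.
        return {key: 0.0 for key in aggregates}
    return {key: 100.0 * value / total for key, value in aggregates.items()}


# Hypothetical aggregated measure (for example a SUM per category).
sales_by_region = {"north": 1, "south": 1, "east": 2}
percentages = to_percentage_scale(sales_by_region)
print(percentages)
```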
New R integration and Jupyter notebook
The bundled IPython notebook has been upgraded. DSS now includes Jupyter 4.0
This is a major new release of IPython/Jupyter, with lots of improvements and support for multiple languages.
The bundled Jupyter now comes with a brand new built-in R kernel, with vastly improved features over the previous R integration:
Auto-completion (just hit Tab)
Much improved error handling
In addition, DSS features a new R API. See R API for more details.
Editable datasets are a new kind of dataset in DSS, which you can directly create and modify in the DSS UI, à la Excel or Google Spreadsheets.
They can be used, for example, to create reference datasets, configuration datasets, …
Editable datasets can be imported from any file, or another dataset.
DSS comes with a large number of supported formats, machine learning engines, … But sometimes you want to do more.
DSS code recipes (Python and R) can now read from and write to “Managed Folders”, which are handles on filesystem-hosted folders where you can store any kind of data.
DSS will not try to read or write structured data from managed folders, but they appear in the Flow and can be used for dependency computation. Furthermore, you can upload files to and download files from the managed folder using the DSS API.
Here are some example use cases:
You have some files that DSS cannot read, but you have a Python library which can read them: upload the files to a managed folder, and use your Python library in a Python recipe to create a regular dataset
You want to use Vowpal Wabbit (VW) to create a model. DSS does not have full-fledged integration with VW. Make a first Python recipe that has a managed folder as output, and write the saved VW model to it. Write another recipe that reads the model from the same managed folder to make predictions
Anything else you might think of. Thanks to managed folders, DSS can help you even when it does not know about your data.
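The first use case above can be sketched roughly as follows. This is not DSS API code: a temporary directory stands in for the managed folder path, and `parse_custom_file` is a hypothetical parser for an invented “key=value” file format that plays the role of your own Python library.

```python
import os
import tempfile


def parse_custom_file(path):
    """Hypothetical parser: turn a file of 'key=value' lines into one row."""
    with open(path) as f:
        return dict(line.strip().split("=", 1) for line in f if "=" in line)


# Stand-in for the managed folder: in a recipe, this path would come from DSS.
folder_path = tempfile.mkdtemp()
for name, body in [("a.cfg", "id=1\nname=alpha\n"), ("b.cfg", "id=2\nname=beta\n")]:
    with open(os.path.join(folder_path, name), "w") as f:
        f.write(body)

# Read every file in the folder and build the rows of a regular dataset.
rows = [
    parse_custom_file(os.path.join(folder_path, name))
    for name in sorted(os.listdir(folder_path))
]
print(rows)
```

In a real recipe, the resulting rows would then be written to an output dataset rather than printed.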
Previously, DSS could use Impala:
For charts creation
In the Impala notebook
You can now also use Impala in a new Impala recipe to benefit from the speed of this engine for aggregations on HDFS.
Impala is also available as an engine for “VisualSQL” (Grouping, Join, Stack) recipes
There are times when you just need to:
Run a shell command
Stream a dataset through a shell command (stdin/stdout)
The shell recipe lets you do just that.
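To make the streaming case more concrete: a command used this way is simply a program that reads rows on its standard input and writes rows on its standard output. The sketch below shows such a filter in Python. The CSV layout and the `country` column are assumptions made for the example, not a format that DSS prescribes.

```python
import csv
import io


def uppercase_column(src, dst, column):
    """Copy CSV rows from src to dst, upper-casing the given column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = row[column].upper()
        writer.writerow(row)


# Demo with in-memory streams; a real filter would pass sys.stdin and sys.stdout.
demo_in = io.StringIO("id,country\n1,fr\n2,us\n")
demo_out = io.StringIO()
uppercase_column(demo_in, demo_out, "country")
print(demo_out.getvalue())
```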
In all modules of DSS where you can write code, you can now insert code snippets. DSS comes with lots of useful built-in snippets, and you can also write your own and share them with your team.
DSS now has full read-write support for ElasticSearch datasets
Will you find all our new easter eggs?
Multiple SQL statements support
Every module of DSS where you can write SQL code now supports multiple statements.
For example, in the SQL Query recipe, you can now add statements before the main “SELECT” statement. This can allow you to issue some SET statements to tune the optimizer, or to create stored procedures.
In other words, you can now declare and call a stored procedure in a SQL Query recipe.
This also applies to the pre-write and post-write statements. For example, you can now create multiple indexes.
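The idea of running several statements around a main “SELECT” can be illustrated outside DSS. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database; the table, index names and data are invented, and the SQL dialect of your own connection will differ.

```python
import sqlite3

# Several statements executed in order before the main query,
# including the creation of multiple indexes.
setup_statements = """
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('north', 10.0), ('south', 5.0), ('north', 2.5);
CREATE INDEX idx_sales_region ON sales (region);
CREATE INDEX idx_sales_amount ON sales (amount);
"""
main_query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"

conn = sqlite3.connect(":memory:")
conn.executescript(setup_statements)  # runs every statement of the script
rows = conn.execute(main_query).fetchall()
conn.close()
print(rows)
```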
Running SQL, Hive and Impala from Python and R recipes
SQL is the most pervasive way to write data analysis queries. However, advanced logic such as loops and conditions is often difficult to express in SQL. There are some options like stored procedures, but they require learning new languages.
DSS now lets you run SQL queries directly from a Python recipe. This lets you:
sequence several SQL queries
dynamically generate some new SQL queries to execute in the database
use SQL to obtain some aggregated data for further numerical processing in Python
To learn more, see Performing SQL, Hive and Impala queries
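The pattern described above (sequencing queries, generating SQL dynamically, then post-processing aggregates in Python) can be sketched as follows. This deliberately does not use the DSS Python API: the built-in `sqlite3` module acts as a stand-in database, and the `visits` table and its columns are invented for the example.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (day TEXT, country TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("mon", "fr", 10), ("mon", "us", 30), ("tue", "fr", 20), ("tue", "us", 40)],
)

# Dynamically generate and run one aggregation query per grouping column.
# The column names are hard-coded and trusted, so formatting them in is safe here.
totals = {}
for column in ("day", "country"):
    query = f"SELECT {column}, SUM(hits) FROM visits GROUP BY {column} ORDER BY {column}"
    totals[column] = conn.execute(query).fetchall()
conn.close()

# Further numerical processing of the aggregated data in Python.
mean_daily_hits = statistics.mean(total for _, total in totals["day"])
print(totals, mean_daily_hits)
```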
Thanks to “repartitioning” mode, you can now:
start with a non-partitioned files-based dataset where a column could act as a partitioning column
create a “repartitioning-enabled” sync or prepare recipe to transform it to a partitioned dataset
build all partitions of the target dataset in a single pass
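Conceptually, repartitioning dispatches each row to a partition according to the value of one column, reading the input only once. The sketch below shows that idea on in-memory rows; it is an illustration of the principle, not the DSS implementation, and the `date` column and sample rows are hypothetical.

```python
from collections import defaultdict


def repartition(rows, partition_column):
    """Group rows by the value of partition_column in a single pass."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_column]].append(row)
    return dict(partitions)


# Non-partitioned input where the "date" column can act as partitioning column.
rows = [
    {"date": "2015-10-01", "value": 1},
    {"date": "2015-10-02", "value": 2},
    {"date": "2015-10-01", "value": 3},
]
partitions = repartition(rows, "date")
print(sorted(partitions))
```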
Several recipes and datasets now support “Append” instead of “Overwrite” in target datasets.
While we do not recommend using this for general-purpose datasets, it can be useful in some cases, for example to create a “history” dataset as the output of a recipe, writing a new line each time the recipe is run.
HDP 2.3 support
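The “history” use case boils down to adding one row per run instead of replacing the existing content. The sketch below mimics this with a plain CSV file opened in append mode; the file location and the recorded fields (a timestamp and a row count) are assumptions chosen for the example.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def append_history_row(path, row_count):
    """Append one line to the history file instead of overwriting it."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), row_count])


# Simulate two successive runs of the same recipe.
history_path = os.path.join(tempfile.mkdtemp(), "history.csv")
append_history_row(history_path, 100)
append_history_row(history_path, 120)

with open(history_path, newline="") as f:
    history = list(csv.reader(f))
print(history)
```

Opening the file with mode `"w"` instead of `"a"` would correspond to the “Overwrite” behavior: only the last run would remain.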
DSS is now compatible with Hadoop HDP 2.3.0
Added detection of US states in Preparation recipes
New processor to concatenate arrays
The column renamer processor can now perform multiple renames
New mass actions to lowercase/uppercase/simplify many column names
Find/Replace processor can now work on all columns at once
Analysis: Machine Learning training does not start automatically anymore, giving you an opportunity to review the settings prior to initial training.
Custom formula: new random() function
Automatic R integration installation, no more manual steps
All recipes now share a more common and more consistent layout, with the Run button always present on the most important tab.
When creating managed datasets, you now have more options to preconfigure the dataset based on common formats, or using known partitioning schemes
Better build dataset modal, with clearer explanations of the build modes
Fixed various scrolling and drag-and-drop issues
Fix renaming of datasets when the name was already used
Fix project home display for readers
Fixed job preview when training a model
It’s now possible to choose a different SQL port
Improved behavior of exports
Fixed tagging of analysis
Fixed DSS disconnection when very long log lines are transferred
Fixed an issue with date range filters
Fixed a bug that could happen with numerical axis and small values
Fixed issue in geo chart when the value of the aggregation is 0
Fixed various issues in legend, especially with “Mixed columns/lines” charts
[HDP only] Fixed Hive recipes on readonly HDFS datasets
Fixed/Added support for native Snappy and LZO datasets
Fixed EXPLAIN statements in notebook for Hive
The command-line tool to import Hive databases now supports partitioning
SQL and VisualSQL recipes
Added ability to disable the “LIMIT” in statements for some operations that don’t support it (e.g. tuple mover operations on Vertica)
Fixed cases where the “limit” statement could get disabled
Fixed issues with Join recipe on Oracle
Grouping recipe: fixed issue with the “LAST” aggregation function
Fixed typing of MIN/MAX on non-string columns
Grouping recipe can now use faster stream engine when computing COUNT
Fixed Gradient Boosting Tree in kfold mode with multiple losses
Fixed Lasso regression in AutoCV mode
SGD classification can now use sparse matrices
Fixed error when imputation was disabled for all numerical features
Fixed error when using feature hashing on a numerical feature
Fixed broken models when RMSLE can’t be computed
Fixed some errors in notebook export
Using “Compute number of records” on large datasets does not lock the studio anymore
Fixed a deadlock leading to the whole notification system ceasing to work
Fixed the infamous “error during execution of add command” issue with the internal Git repository
Fixed several cases where long operations could be performed while holding a lock on the DSS configuration (like Hive validation or aborting Impala queries)
DSS diagnosis now includes more useful information
Fixed detection of Excel dates
Fixed detection of UTF-8 BOMs
Shapefiles that require Vecmath now display a proper informative message
We now display proper warning messages when using non-Hive-compatible HDFS dataset names