DSS 4.1 Release notes
- Migration notes
- Version 4.1.1 - November, 20th 2017
- Version 4.1.0 - November, 13th 2017
- New features
- For coders: Multiple code environments for Python and R
- For coders: Python 3 support
- For coders: Shiny and Bokeh
- Mass actions and view on the Flow
- Filtering on the Flow
- For coders: RMarkdown
- For coders: custom charts on the dashboard
- For data scientists: Auto-ML features
- For data scientists: Grid search in MLLib models
- For data scientists: Ensemble models
- For analysts: Pivot recipe
- For analysts: Sort recipe
- For analysts: New “Split” recipe
- For analysts: Distinct recipe
- For analysts: “Top N” recipe
- R models in API node
- Arbitrary Python or R functions in the API node
- SQL queries in the API node
- Datasets lookup in the API node
- Index external tables in DSS catalog
- Folders and uploads everywhere
- New HTTP / FTP / SCP / SFTP support
- New chart features
- Attach more things to scenario emails
- Other notable enhancements
- Edit recipes in notebook
- Code editor enhancements
- Managed folder browser
- Plugin and libraries editor
- Plugin edition for non-administrators
- Spark 2.2
- SQL code formatter
- “Concat” aggregates in grouping and window recipes
- Better support for empty datasets
- Native TDCH support
- Misc (datasets)
- Misc (webapps)
- Misc (recipes)
- Misc (charts)
- Misc (data preparation)
- Misc (administration)
- Notable bug fixes
- New features
- From DSS 4.0: Automatic migration is supported, with the restrictions and warnings described in Limitations and warnings
- From DSS 3.1: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 3.1 -> 4.0
- From DSS 3.0: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions. See 3.0 -> 3.1 and 3.1 -> 4.0
- From DSS 2.X: In addition to the restrictions and warnings described in Limitations and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions: see 2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 2.3, 2.3 -> 3.0, 3.0 -> 3.1 and 3.1 -> 4.0
- Migration from DSS 1.X is not supported. You must first upgrade to 2.0. See DSS 2.0 Release notes
It is strongly recommended that you perform a full backup of your DSS data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance.
Pay attention to the warnings described in Limitations and warnings.
DSS 4.1 is a major release, which changes some underlying workings of DSS. Automatic migration from previous versions is supported, but there are a few points that need manual attention.
If an SSH dataset had an absolute path, the migrated download recipe may fail to locate files. You will need to adjust the path in the connection versus the path in the dataset.
If you had a custom pooling configuration, please contact Dataiku Support for update instructions.
- Models trained with prior versions of DSS must be retrained when upgrading to 4.1 (usual limitations on retraining models and regenerating API node packages - see Upgrading a DSS instance). This includes models deployed to the flow (re-run the training recipe), models in analysis (retrain them before deploying) and API package models (retrain the flow saved model and build a new package)
- After installation of the new version, R setup must be replayed
- We now recommend using mainly personal API keys for external applications controlling DSS, rather than project or global keys. Some operations, like creating datasets or recipes, are not always possible using non-personal API keys.
- DSS 4.1 is compatible with Anaconda/Miniconda version 4.3.27 or later only. If your existing DSS instance is integrated with Anaconda Python, you should check your current conda version with conda -V and, if necessary, upgrade your conda installation with conda update conda.
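The version requirement above can also be checked from a script. The following is a minimal sketch and not part of DSS: the helper names parse_conda_version and conda_is_supported are hypothetical, and it assumes that conda -V reports a line of the form "conda 4.3.27".

```python
import re

MINIMUM_CONDA = (4, 3, 27)  # minimum version stated in these release notes


def parse_conda_version(output):
    """Extract a version tuple from text such as 'conda 4.3.27'.

    Assumption: the version appears as three dot-separated integers.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no version number found in %r" % output)
    return tuple(int(part) for part in match.groups())


def conda_is_supported(output, minimum=MINIMUM_CONDA):
    """Return True if the reported conda version is at least `minimum`."""
    return parse_conda_version(output) >= minimum
```

If conda_is_supported returns False for the text printed by conda -V, the conda update conda command mentioned above is the upgrade path.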
Several external libraries bundled with DSS have been bumped to major revisions. Some of these libraries include some backwards-incompatible changes. You might need to upgrade your code.
- pandas 0.18 -> 0.20
- scikit-learn 0.18 -> 0.19
As usual, remember that you should not change the version of Python libraries bundled with DSS. Instead, use Code environments.
DSS 4.1.1 is a bugfix release. For the list of new features in the 4.1 branch, see the release notes for 4.1.0 below.
- Fixed errors when using HTTP datasets as inputs of some visual recipes
- Fixed spurious warning when creating an editable dataset
- Improved error handling when creating new managed datasets
- Added experimental support for “autocommit” execution on Microsoft SQL Server
- Fixed write support in custom Python datasets
- Fixed handling of partitioned inputs in pivot recipe
- Fixed handling of partitioned outputs in split recipe with Hive engine
- Fixed reading of partitioned Parquet HDFS datasets in Spark notebook
- Fixed validation of partitioned SparkSQL recipe with Parquet HDFS inputs
- Fixed failure to synchronize Hive metastore after manual clear of dataset
- Fixed some buggy cases in Spark pipelines building
- Added default setting for better out-of-the-box experience on MUS with Spark 2.2
- Fixed computation of score for multiclass when not all classes are in test set
- Fixed selectability of models in saved model versions list
- Fixed evaluate recipe in multiclass when evaluating rows with a target not seen at training
- Fixed schema propagation across Hive recipes in HiveServer2 execution mode
- Performance enhancements in job details page
DSS 4.1.0 is a major upgrade to DSS with a lot of new features. For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew
DSS now allows you to create an arbitrary number of code environments. A code environment is a standalone and self-contained environment to run Python or R code.
Each code environment has its own set of packages. Environments are independent: you can install different packages or different versions of packages in different environments without interaction between them. In the case of Python environments, each environment may also use its own version of Python. You can for example have one environment running Python 2.7 and one running Python 3.6.
See Code environments for more information.
As a consequence of the multiple code environments support, you can now run Python 3 (3.4 to 3.6) for all your DSS code.
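Since different code environments may run different Python versions, code shared between them may need to behave identically on Python 2.7 and Python 3. The sketch below shows one common way to write such helpers with the standard library only. The function names are illustrative and nothing here is specific to DSS.

```python
import sys


def interpreter_label(version_info=None):
    """Return a label such as 'Python 3.6' (defaults to the running interpreter)."""
    if version_info is None:
        version_info = sys.version_info
    return "Python %d.%d" % (version_info[0], version_info[1])


def to_text(value, encoding="utf-8"):
    """Return `value` as text on both Python 2.7 and Python 3.

    Byte strings are decoded; text strings are returned unchanged.
    On Python 2.7, `bytes` is an alias of `str`, so plain strings are decoded.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def mean(values):
    """Average numbers with true division, whatever the Python version."""
    values = list(values)
    return sum(values) / float(len(values))
```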
You can now write web applications in DSS using the Shiny and Bokeh libraries, which are natively integrated.
You write your Shiny code as you would outside of DSS, and DSS takes care of hosting your webapp and making it available to users with access to your data project.
You write your Bokeh code as you would outside of DSS, and DSS takes care of hosting your webapp and making it available to users with access to your data project.
The UI capabilities of the DSS Flow have been strongly boosted.
You can now:
- Select multiple items (using Shift+Click, or using rectangular selections)
- Apply a large number of mass actions on multiple items, like:
  * Delete
  * Clear
  * Build
  * Add to a scenario
  * Change engines
  * Change Spark configurations
  * Change tags
  * Change dataset types and connections
  * …
- View items in the Flow using a variety of inspection layers, like:
  * By connection
  * By recipe engine
  * By partitioning scheme
  * By creation/modification date or author
  * By Spark configuration
  * …
- Copy entire parts of the Flow, either within a project or between projects
In addition to the above-mentioned mass action capabilities, you can now filter the Flow view by tags, users, types and modification dates. This lets you focus on your part of the Flow without being distracted by the rest, and is particularly useful for large projects.
Code reports allow you to write code that will be rendered as beautiful reports that you can download, attach by mail or render on the dashboard. R Markdown reports can be used to generate documents based on your project’s data.
When editing your R Markdown report in DSS, you can “build” it to generate the output document. This document can be displayed, published into the dashboard, downloaded in various document formats, or attached to emails.
See Code reports for more information.
Through the new “static insights” mechanism, it is now possible to easily display charts or other arbitrary files produced using Python or R code directly on the DSS dashboard.
This notably includes charts created with:
The capabilities of DSS to automatically optimize machine learning models have been greatly improved:
- Real-time models comparison charts to see the progress of your grid search optimization
- Support for random search
- Time-boundable search
- Interruptible and resumable search
- Plot the impact of any hyperparameter on the model’s performance and training time
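To make the ideas of random search and time-bounded search concrete, here is a minimal standard-library sketch. It is not how DSS implements its search: random_search, toy_objective and the parameter names C and gamma are hypothetical, and the objective is a stand-in for training and scoring a real model.

```python
import random
import time


def random_search(objective, space, max_trials, time_budget_s, seed=0):
    """Sample hyperparameters at random and keep the best-scoring set.

    objective: callable taking a dict of parameters and returning a score
               (higher is better).
    space: dict mapping a parameter name to a (low, high) float range.
    Stops after `max_trials` samples or once `time_budget_s` seconds have
    elapsed, whichever comes first.
    """
    rng = random.Random(seed)
    deadline = time.time() + time_budget_s
    best_params, best_score, history = None, float("-inf"), []
    for _ in range(max_trials):
        if time.time() >= deadline:
            break  # time budget exhausted
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        score = objective(params)
        history.append((params, score))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, history


def toy_objective(params):
    """Hypothetical stand-in for model training: peaks at C=1, gamma=0.1."""
    return -((params["C"] - 1.0) ** 2) - (params["gamma"] - 0.1) ** 2
```

The returned history holds every (parameters, score) pair tried, which is the kind of data a comparison chart or a per-hyperparameter impact plot would be drawn from.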
Many more parameters can now be optimized in all algorithms.
More support has been added for custom cross-validation strategies. Code samples are available for one-click setup of LeaveOneOut / LeavePOut strategies.
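For readers unfamiliar with these strategies, the sketch below enumerates leave-one-out and leave-p-out splits using only the standard library. It is an illustration of the splitting logic, not the code sample shipped with DSS; scikit-learn provides equivalent LeaveOneOut and LeavePOut classes in sklearn.model_selection.

```python
from itertools import combinations


def leave_p_out(n_samples, p):
    """Yield (train_indices, test_indices) for every test set of size p."""
    indices = range(n_samples)
    for test in combinations(indices, p):
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, list(test)


def leave_one_out(n_samples):
    """Leave-one-out is leave-p-out with p = 1."""
    return leave_p_out(n_samples, 1)
```

The number of splits grows as "n choose p", so leave-p-out is only practical on small datasets.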
The UI has been overhauled to make it clearer how to try more parameters, and to better document various options.
DSS previously supported hyperparameter optimization for Python (in-memory) models. This capability has been extended to MLLib models.
You can now select multiple models and create an ensemble model out of them.
Ensembling can be done using:
- Linear stacking (for regression models) or logistic stacking (for classification problems)
- Prediction averaging or median (for regression problems)
- Majority voting (for classification problems)
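The simpler combination rules listed above (averaging, median and majority voting) can be sketched in a few lines. This is an illustrative standard-library version, not the DSS implementation; the function names are hypothetical and stacking is not covered. Each input is a list of per-model prediction lists of equal length.

```python
from collections import Counter


def average_predictions(model_predictions):
    """Average regression predictions row by row across models."""
    return [sum(row) / float(len(row)) for row in zip(*model_predictions)]


def median_predictions(model_predictions):
    """Take the row-wise median of regression predictions across models."""
    result = []
    for row in zip(*model_predictions):
        ordered = sorted(row)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            result.append(ordered[mid])
        else:
            result.append((ordered[mid - 1] + ordered[mid]) / 2.0)
    return result


def majority_vote(model_predictions):
    """Pick the most frequent class per row.

    Assumption: ties go to whichever class Counter.most_common lists first.
    """
    return [Counter(row).most_common(1)[0][0]
            for row in zip(*model_predictions)]
```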
The pivot recipe lets you build pivot tables, with advanced control over the rows, columns and aggregations. It supports execution of the pivot on external systems, like SQL databases, Spark, Hive or Impala.
The pivot recipe supports advanced features like limiting the number of pivoted columns, multi-key pivot, …
See Pivot recipe for more information.
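As a rough illustration of what a pivot computes, the sketch below builds a small pivot table in plain Python with a sum aggregation. It is not the recipe's implementation: the function name, the choice of sum, and the max_columns rule (keep the first N distinct values seen) are assumptions made for the example.

```python
from collections import defaultdict


def pivot(rows, row_key, column_key, value_key, max_columns=None):
    """Build a pivot table that sums `value_key` per (row, column) pair.

    rows: iterable of dicts.
    max_columns: if set, keep only the first N distinct pivot values seen
                 (an illustrative way to limit the number of pivoted columns).
    Returns (sorted column names, {row value: {column value: total}}).
    """
    columns = []
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        column = row[column_key]
        if column not in columns:
            if max_columns is not None and len(columns) >= max_columns:
                continue  # skip values beyond the column limit
            columns.append(column)
        table[row[row_key]][column] += row[value_key]
    return sorted(columns), {key: dict(cells) for key, cells in table.items()}
```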
The “Top N” recipe allows you to retrieve the first and last N rows of each subset of rows sharing the same grouping key values. The rows within a subset are ordered by the columns you specify. It can be performed on any dataset in DSS, whether it’s a SQL dataset or not.
See Top N: retrieve first N rows for more information.
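The behavior described above can be pictured with a short plain-Python sketch. It is only an illustration under simplifying assumptions (a single grouping key, a single ordering column, ascending order); the function name top_n is hypothetical and this is not the recipe's implementation.

```python
from collections import defaultdict


def top_n(rows, group_key, order_key, n):
    """Return {group: (first_n_rows, last_n_rows)} after sorting each group.

    rows: iterable of dicts; rows sharing `group_key` form one subset,
    and each subset is ordered by `order_key` in ascending order.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    result = {}
    for key, members in groups.items():
        ordered = sorted(members, key=lambda row: row[order_key])
        result[key] = (ordered[:n], ordered[-n:])
    return result
```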
You can now write custom prediction models in R and expose them on the DSS API node. DSS will automatically handle deployment, distribution, high availability and scalability of your R model, written using any R package.
See Exposing a R prediction model for more information.
In addition to custom prediction models in Python or R, you can now expose arbitrary functions in the DSS API node. DSS will automatically handle deployment, distribution, high availability and scalability of your functions.
You can expose a parametrized SQL query as a DSS API node endpoint. Calling the endpoint with a set of parameters will execute the SQL query with these parameters.
The DSS API node automatically handles pooling connections to the database, high availability and scalability for execution of your query.
See Exposing a SQL query for more information
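The idea of a parametrized query, where caller-supplied values are bound to placeholders rather than pasted into the SQL text, can be shown with Python's built-in sqlite3 module. This is only a sketch: sqlite3 stands in for the real database, the orders table and function names are hypothetical, and the "?" placeholder syntax is sqlite3's, not necessarily the syntax used by the API node.

```python
import sqlite3


def query_orders(connection, customer_id, min_amount):
    """Run a parametrized query; values are bound, not string-formatted."""
    sql = ("SELECT id, amount FROM orders "
           "WHERE customer_id = ? AND amount >= ? ORDER BY id")
    return connection.execute(sql, (customer_id, min_amount)).fetchall()


def demo_connection():
    """Create an in-memory database with a hypothetical `orders` table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE orders (id INTEGER, customer_id TEXT, amount REAL)")
    connection.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "c1", 20.0), (2, "c1", 5.0), (3, "c2", 50.0)])
    return connection
```

Calling query_orders with different arguments runs the same SQL text with different bound values, which mirrors how an endpoint call supplies a set of parameters.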
The “dataset(s) lookup” endpoint offers an API for searching records in a DSS dataset by looking it up using lookup keys.
For example, if you have a “customers” dataset in DSS, you can expose a “dataset lookup” endpoint where you can pass in the email address and retrieve other columns from the matching customer.
See Exposing a lookup in a dataset for more information
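The customers example above amounts to a keyed lookup. A minimal in-memory sketch follows; the function names and sample column names are hypothetical, and a real endpoint would query the dataset's storage rather than a Python dict.

```python
def build_index(records, key):
    """Index records (dicts) by the value of `key` for constant-time lookup."""
    return {record[key]: record for record in records}


def lookup(index, key_value, columns):
    """Return the requested columns of the matching record, or None."""
    record = index.get(key_value)
    if record is None:
        return None
    return {column: record[column] for column in columns}
```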
It is now possible to scan and index DSS connections. The DSS catalog will then contain items for every table in the remote connection. You can search tables by connection, schema, name, columns, descriptions, … You can preview tables, see if they are already imported as DSS datasets, and import them easily.
Managed folders and uploaded datasets can now be stored on any “files-aware” location supported by DSS (local filesystem, HDFS, S3, GCS, Azure Blob, FTP, SFTP).
The support for these protocols has been completely overhauled:
- The old cache can now be completely bypassed.
- You can now use the Download recipe to cache files from a remote location to another location (which can be local or remote)
- SFTP datasets are now writable
DSS 4.1 comes with several major improvements to the data visualization features:
- Ability to create animated charts, animated by another dimension
- Ability to create “sub-charts”, broken down by another dimension
- More control on legend position, legend for continuous colors
- Support for displaying geometry directly on maps
- Customizable color palettes
- Diverging color palettes
- Customizable map backgrounds
- Compute charts using Hive
In DSS scenarios you can send email reports at the end of a scenario.
You can now attach to these email reports multiple items:
- The data of a dataset, in any format supported by DSS format exporters
- A file from a managed folder
- The full content of a managed folder
- An export of a RMarkdown report
- An export of a Jupyter notebook
- The log file of the scenario
A mail can have multiple attachments.
You can now easily edit a Python or R recipe (regular or Spark) in a Jupyter notebook, and go back to the recipe from the notebook.
In all places where you can edit code in DSS, you can now:
- Customize the theme of the code editor (go to your user profile)
- Customize font and font size
- Customize some key mappings
In addition, code editors now support many more features:
- Code folding
- Auto close of brackets and tags
- Multiple cursors
The browser for managed folders has been greatly enhanced and now gives you full control over the folder content, including the ability to modify it.
You can also directly unzip Zip files in a managed folder.
The plugin and libraries editor is now much more powerful, featuring multi-file editing, direct creation of all component types, and move/rename capabilities.
It is now possible for administrators to delegate to non-administrators the right to create plugins and edit code libraries.
You can now aggregate string columns by creating a concatenation of all values (or all distinct values) in the grouping and window recipes.
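The two variants of this aggregate (all values, or distinct values only) can be illustrated with a short plain-Python sketch. The function name, the comma separator and the first-seen ordering of distinct values are assumptions for the example, not a description of how the recipes generate their output.

```python
from collections import OrderedDict


def concat_by_group(rows, group_key, value_key, separator=",", distinct=False):
    """Concatenate the string values of `value_key` within each group.

    With distinct=True, repeated values are kept only once, in first-seen order.
    """
    groups = OrderedDict()
    for row in rows:
        values = groups.setdefault(row[group_key], [])
        if distinct and row[value_key] in values:
            continue  # value already collected for this group
        values.append(row[value_key])
    return OrderedDict(
        (key, separator.join(values)) for key, values in groups.items())
```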
- Until now, an empty SQL dataset (table exists but has 0 rows, or SQL query returns 0 rows) was considered as “not ready” and could not be used in the Flow. This behavior is now configurable on the dataset.
DSS includes TDCH support in the sync recipe for fast transfers:
- Teradata to HDFS
- HDFS to Teradata
This includes interaction with other Hadoop filesystems exposed as HDFS datasets (S3, …)
- Plugins can now contain “FS providers” to define new kinds of “file-aware” datasets and managed folders
- Plugins can now contain templates for webapps and code reports
- Plugins can now contain custom palettes and map backgrounds for charts
- DSS has beta support for IBM DB2.
- DSS has beta support for Exasol
- You can now check schema consistency on all files in a dataset
- Relocation settings are now available for many more types of datasets
- Checking if a SQL query dataset is ready is now much faster
- Uploaded datasets can now be partitioned
- Improved error and status reporting in datasets screens
Improvements in the webapps mechanism make it more robust to copy a project containing a webapp within a DSS instance
- You are no longer prompted to update the schema of an output dataset when it is empty (the update happens automatically)
- Added support for filter by date in SQL Server
- Ability to reorder charts
- Option not to replace missing values by 0 (notably in line charts)
- Columns of integers containing values with leading zeros will now be considered as Text
- When both integer and decimal are possible, but some values are not valid integers, DSS will now properly choose decimal
- Forced meanings are better preserved across preparation recipes, fixing some invalid “switch to numeric” behaviors
- You can now download multiple files in an HTTP dataset
- HTTP dataset now supports SSL SNI
- Other stability fixes for HTTP datasets
- Fixed intermittent “Address already in use” issue with custom Python datasets