DSS 2.3 Release notes¶
- Migration notes
- Version 2.3.5 - May 23rd, 2016
- Version 2.3.4 - April 28th, 2016
- Version 2.3.3 - April 7th, 2016
- Version 2.3.2 - March 1st, 2016
- Version 2.3.1 - February 16th, 2016
- Version 2.3.0 - February 11th, 2016
For automatic upgrade information, see Upgrading a DSS instance
Migration to DSS 2.3 from DSS 1.X is not supported. You should first migrate to the latest 2.0.X version. See DSS 2.0 Release notes
Automatic migration from Data Science Studio 2.2.X and 2.1.X is supported, with the following restrictions and warnings:
- The usual limitations on retraining models and regenerating API node packages apply (see Upgrading a DSS instance for more information)
Automatic migration from Data Science Studio 2.0.X is supported, with the previous restrictions and warnings and, in addition, the ones outlined in DSS 2.1 Release notes
DSS 2.3.5 is a bugfix and minor enhancements release. For a summary of new features in DSS 2.3.0, see below.
- Preserve the “hive.query.string” Hadoop configuration key in Hive notebook
- Clear error message when trying to use Geometry columns in Spark
- Fix wrongly computed multiclass metrics
- Much faster multiclass scoring for MLLib
- Fix multiclass AUC when only 2 classes appear in test set
- Fix tooltip issues in the clustering scatter plot
- Fix typo in custom HTTP header that could lead to inability to parse the response
- Fix the INSEE enrichment processor
- Fix excessive verbosity
- Fix DateParser in multi-columns mode when some of the columns are empty
- Modifying a step comment now properly unlocks the “Save” button
DSS 2.3.4 is a bugfix and minor enhancements release. For a summary of new features in DSS 2.3.0, see below.
- Fix “flag” processors operating in “columns pattern” mode
- Fix a UI issue with date facets
- Add a few missing country names
- Fix modulo operator on large numericals (> 2^31)
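To illustrate why values above 2^31 are a classic trouble spot for a modulo operator, the sketch below contrasts an exact modulo with one computed after narrowing the operand to a signed 32-bit integer. This is an assumption about the general class of bug, not a description of DSS internals; `to_int32` and `modulo_via_int32` are hypothetical helper names.

```python
def to_int32(value: int) -> int:
    """Wrap an integer into the signed 32-bit range, as a narrowing cast would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def modulo_via_int32(a: int, b: int) -> int:
    """Hypothetical buggy modulo: the left operand is narrowed to 32 bits first."""
    return to_int32(a) % b


large = 2**31 + 5                      # just above the signed 32-bit limit
exact = large % 10                     # Python integers are arbitrary precision
wrapped = modulo_via_int32(large, 10)  # computed on the wrapped (negative) value

print(exact, wrapped)
```

The two results differ because the narrowed operand has wrapped around to a negative number before the modulo is taken.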
DSS 2.3.3 is a bugfix release. For a summary of new features in DSS 2.3.0, see below.
- Fixed support for Hive in MapR-ecosystem versions 1601 and above
- Added support for Spark 1.6
- Fixed `select now()` in Impala notebook
- Fixed misinterpretation of numbers like “017” as octal in the “Nest” processor
- The variables will now be interpreted in the context of the current preparation, not the context of the input dataset
- Fixed UI flickering
- Fixed wrong “dirty” (i.e. not saved) state in the UI when there is a group in the recipe
- Fixed bad state for the “prefix” option of the Unnest processor
- SQL recipe: Fixed conversion from SQL notebook
- Window recipe: Fixed data types for cumulative distribution
- Stack recipe: Fixed custom schema definition in Stack recipe
- Stack recipe: Fixed postfilter on filesystem-based datasets
- Join recipe: Fixed “join on string contains” on Greenplum
- Join recipe: Fixed computed date columns on Oracle
- Join recipe: Fixed possible issue with cross-project joins on Hive
- Filter recipe: Fixed boolean conditions on filesystem-based datasets
- Group recipe: Fixed grouping in stream engine with empty values in integer columns
- Group recipe: Fixed grouping on custom key in stream engine
- Group recipe: Fixed UI issue when removing all grouping keys
- All visual recipes: Fixed filters in multi-word column names
- Sync recipe: Fixed partitioned to non-partitioned dataset creation
- All recipes: Fixed UI for the time-range partitions dependency
- Fixed mishandling of accented values, which caused them to appear as “Others” in dummy-encoded columns
- Fixed clustering scoring recipe if no name was given to any cluster
- Fixed wrong description of the metric used for classification threshold optimization
- Fixed possible migration / settings issue in Ridge regression
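The “Nest” processor fix listed above concerns strings such as “017”, which some parsers read as octal because of the leading zero. A minimal sketch of the difference, using only Python's built-in `int`; it is illustrative and does not reflect how DSS parses values:

```python
text = "017"

as_decimal = int(text, 10)  # leading zero ignored: seventeen
as_octal = int(text, 8)     # C-style octal reading: 1*8 + 7

print(as_decimal, as_octal)
```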
DSS 2.3.2 is a bugfix release. For a summary of new features in DSS 2.3.0, see below.
- Spark engine: Add support for probabilities in Random forest
- Spark engine: Improve stability of results for models
- Spark engine: Fix casting of output integers to doubles
- Python engine: Fix a code error in Jupyter notebook export
- Fix fuzzy join hanging in rare conditions
- Fix custom Python steps when there are custom variables with quotes
- Fix deploying an analysis on another dataset
- Split: Fix support for adding custom variables to output dataset
- Split: Fix UI reloading that could lead to broken recipe
- Stack: Fix on Spark engine
DSS 2.3.1 is a bugfix release. For a summary of new features in DSS 2.3.0, see below.
- Fix running on Hadoop and Spark with “Remove rows on bad meaning” processor
- Fix a case where the data quality bars were not properly updated
- Fix formatting issue in latitude/longitude for GeoIP processor
- Add a few missing countries to the Countries detector
- Default to repartitioning non-HDFS datasets in fewer partitions to avoid consuming too many file handles
- Fixed filtering of features in settings screens
- Don’t show “Export notebook” option for MLLib models
DSS 2.3.0 is a major upgrade to DSS with exciting improvements.
For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew
Visual data preparation, reloaded¶
Our visual data preparation component is a major focus of DSS 2.3, with numerous large improvements:
- You can now group, color and comment all script steps
- It is now possible to color the cells of the data table by their values, to easily locate co-occurrences, or columns changing together.
- The Python and Formula editors are now much easier to use.
- The Formula language has been strongly enriched, making it a very powerful, but still very easy to use tool. See Formula language and our new Howto for more information.
- The Quick Column View provides an immediate overview of all columns, and allows you to navigate very simply across columns.
- The header at the top of the table now always precisely tells you what you are seeing and the impact of your preparation on your data.
- Many processors can now operate on multiple columns, including all columns or all columns matching a pattern.
- It is now possible to temporarily hide some columns
- Hit Shift+V to view “long” values in a data table cell, including JSON formatting and folding
- The redesigned UI makes it much clearer to navigate your preparation steps and data table.
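As an illustration of what “all columns matching a pattern” can mean for a multi-column processor, here is a minimal sketch. It assumes the pattern is a regular expression that must match the whole column name; the exact matching semantics used by DSS are not stated in these notes, and `select_columns` is a hypothetical helper.

```python
import re


def select_columns(columns, pattern=None):
    """Return all columns, or only those whose name fully matches `pattern`."""
    if pattern is None:
        return list(columns)
    regex = re.compile(pattern)
    return [name for name in columns if regex.fullmatch(name)]


columns = ["order_id", "price_eur", "price_usd", "comment"]

all_columns = select_columns(columns)                 # "select all columns"
price_columns = select_columns(columns, r"price_.*")  # "columns matching a pattern"

print(price_columns)
```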
Schema editing and user-defined meanings¶
It is now possible to edit the schemas of datasets directly in the Exploration screen.
You can now define your own meanings, either declaratively, through values lists or through patterns. User-defined meanings are available everywhere you need them, bringing easy documentation to your data projects. For more information, see Schemas, storage types and meanings
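The two non-declarative styles mentioned above, values lists and patterns, can be pictured as simple validity checks. The sketch below is not the DSS API: every function name and the sample values are hypothetical, and the pattern is assumed to be a regular expression matching the whole value.

```python
import re


def make_values_list_meaning(allowed):
    """Meaning defined by an explicit list of accepted values."""
    allowed_set = set(allowed)
    return lambda value: value in allowed_set


def make_pattern_meaning(pattern):
    """Meaning defined by a regular expression the whole value must match."""
    regex = re.compile(pattern)
    return lambda value: regex.fullmatch(value) is not None


def valid_ratio(values, is_valid):
    """Share of non-empty values that satisfy the meaning."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return 0.0
    return sum(1 for v in non_empty if is_valid(v)) / len(non_empty)


is_ticket_status = make_values_list_meaning(["open", "pending", "closed"])
is_sku = make_pattern_meaning(r"[A-Z]{3}-\d{4}")

status_ratio = valid_ratio(["open", "closed", "", "reopened"], is_ticket_status)
sku_ratio = valid_ratio(["ABC-1234", "abc-1234"], is_sku)

print(status_ratio, sku_ratio)
```

A ratio of this kind is one way to think about how well a column fits a meaning; how DSS actually computes validity is not covered here.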
Since the very first versions, DSS has let you search within your project. Thanks to the new Data Catalog, you now have an extremely powerful instance-wide search. Even if you have dozens of projects, you’ll be able to easily find all DSS objects, with deep search (search a dataset by column name, a recipe by comments in the code, ...).
The Data Catalog provides a faceted search and fully respects applicative security.
Flow tools and views¶
The new Flow views system is an advanced productivity tool, especially designed for large projects. It provides additional “layers” onto the Flow:
- Color by tag name and easily propagate tags
- Recursively check and propagate schema changes across a Flow
- Check the consistency of datasets and recipes across a project.
New SQL / Hive / Impala notebook¶
The SQL / Hive / Impala notebook now features a “multi-cells” mechanism that lets you work on several queries at once, without having to juggle several notebooks or search endlessly in the history.
You can also now add comments and descriptions, to transform your SQL notebooks into real data stories.
- DSS now includes helper tooltips to guide you through the UI
- The Help menu now features a contextual search in our documentation
You can now automatically install plugin dependencies. Plugin authors can also declare custom installation scripts, for cases where installing a plugin is not a simple matter of installing Python or R packages.
- New processor for converting “french decimals” (1 234,23) or “US decimals” (1,234.23) to “regular decimals” (1234.23)
- New processors to clear cells on number range or with a value
- New processors to flag records on number range or with a value
- New processor to count occurrences of a string or pattern within a cell
- Added support for full names in “US State” meaning
- Added more mass actions
- (Hortonworks only) Tez sessions are now reused in the SQL notebook
- Coefficients are now displayed in Ordinary Least Squares regression
- Support for ElasticSearch 2.X
- Experimental support for append mode on Elasticsearch
- Shell: now supports variables
- Split: can now split on the result of a formula
- Can now define custom additional configuration and statements in “Visual recipes on SQL / Hive / Impala”
- The public API now includes a set of methods to interact with user-defined meanings.
- R API: now automatically handles lists in resulting dataframes
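The decimal-conversion processor listed above turns “1 234,23” or “1,234.23” into “1234.23”. A minimal sketch of that transformation follows; it is not the processor's implementation, `normalize_decimal` and its `style` argument are hypothetical, and treating non-breaking spaces as thousands separators is an assumption.

```python
def normalize_decimal(text: str, style: str) -> str:
    """Convert a French- or US-formatted number to plain decimal notation.

    style: "french" for "1 234,23", "us" for "1,234.23".
    """
    cleaned = text.strip()
    if style == "french":
        # Spaces (regular, non-breaking, narrow non-breaking) group thousands;
        # the comma is the decimal mark.
        for separator in (" ", "\u00a0", "\u202f"):
            cleaned = cleaned.replace(separator, "")
        cleaned = cleaned.replace(",", ".")
    elif style == "us":
        # Commas group thousands; the dot is already the decimal mark.
        cleaned = cleaned.replace(",", "")
    else:
        raise ValueError(f"unknown style: {style!r}")
    float(cleaned)  # raises ValueError if the result is not a valid number
    return cleaned


french_result = normalize_decimal("1 234,23", "french")
us_result = normalize_decimal("1,234.23", "us")

print(french_result, us_result)
```

The style has to be given explicitly because a value such as “1,234” is ambiguous on its own: it reads as 1.234 in the French convention and as 1234 in the US one.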
Infrastructure / Packaging¶
- Environment ulimit is now checked when starting
- DSS now checks whether server ports are busy at startup
- Counting records on SQL query datasets is now possible
- MongoDB: dates support has been fixed
- MySQL: fixed handling of null dates
- MongoDB empty columns are now properly shown
- Currency converter now works properly with input date columns
- “Step preview” doesn’t change output dataset schema anymore
- Copying a preparation recipe with a Join step now generates proper Flow
- Formula dates support has been improved
- Fixed sort by cardinality in columns view
- Hidden clustering buttons in explore view
- Exporting “Grouped XY” charts to Excel has been fixed
- Fixed issues on charts created by a “Deploy script”
- The hproxy process now starts properly if the Hive metastore is unreachable
- After a metastore failure, the hproxy now recovers properly
- Partitioned recipes on Impala engine have been fixed
- Fixed a UI bug on the confusion matrix
- Managed Folders properly appear in search results
- Grouping: drag and drop when 0 keys has been fixed
- Stack: “schema union” mode now works properly on Vertica
- Window: fixed lead/lag on dates in Vertica
- DSS now refuses to run a join recipe that would fail on filesystem datasets with quotes in column names
- Fixed various bugs related to Abort in SQL notebooks
- Fixed code samples in SQL notebooks
- Upgraded Jupyter
- Project admin permission now properly grants all other project permissions
- R API: now displays a proper error when trying to write to a non-existent dataset
- R API: fixed writing of data.table objects
Infrastructure / Packaging¶
- Java startup options are now properly set on all processes
- DSS now works properly when you have an http_proxy environment variable