DSS 1.2 Release notes

Version 1.2.4 - September 25th, 2014

Bug fixes

Data preparation

  • Fix filter returning invalid results on first call in some rare cases

  • Added Indian holidays

  • Fix an out-of-memory condition that could happen after several days of usage

  • Fix overflow with long column names in “Remove columns” processor

  • Fix caching issues for charts with Live Processing

Flow

  • Fix scrolling in recipe IO screens

  • Hive recipes and notebooks can now use “pre-init” and “post-init” scripts to deal with authorization issues

  • Fix date blacklist on non-UTC machines

  • Fix failure of Hive recipes whenever an Avro dataset existed

  • Add support for SQL schemas when syncing to Vertica

Notebooks

  • Fix scrolling issues

  • Improve handling of logs for Hive

Modeling

  • Fix the cutoff selector

Version 1.2.3 - August 26th, 2014

Please see the 1.2.0 release notes for information about migrations.

New features

  • New processor to extract information about bank holidays, school holidays and weekends in several countries

Enhancements

  • Improved performance for prediction/scoring/clustering recipes

Bug fixes

Core

  • Fixed installation on CentOS following IUS repository location change

  • Fixed SQL datasets when name contains some reserved keywords

  • Fixed various scrolling issues

  • Fixed Community Edition on OpenVZ machines

  • Improved startup scripts

Data preparation and visualization

  • Don’t use Live Processing charts for partitioned datasets

  • Fixed “Analyse” in explore view for numerical columns containing NULL values

  • Fixed Stemming option in Tokenizer processor

  • Fixed Fuzzy matching when used in Recipe

Recipes/Flow

  • Fixed compatibility issues with Pig 0.13

  • Fixed “View in Flow” in Chrome

  • Fixed “Run” button in Recipes Actions

Machine Learning

  • Fixed “Display outliers” in Clustering results scatter plot

Version 1.2.2 - August 8th, 2014

Please see the 1.2.0 release notes for information about migrations.

Enhancements

  • Prediction / Clustering scoring recipe works with datasets of any size.

  • Added write_dataframe in dataset writers.

Bug fixes

  • Fixed Hive script validator compatibility with Hive 0.13

  • Fixed abnormally high UI memory usage in some cases.

  • Fixed incorrect SQL types mapping (PostgreSQL/Vertica/Redshift).

  • Fixed issue in SQL-backed charts when the Impala JDBC driver cannot be loaded.

  • Fixed a bug preventing file selection in connection’s root folder (S3/HDFS).

  • Fixed “raw values” numerical axis in charts.

  • Unicode characters allowed in target values of prediction models.

  • Worked around Pandas bug when loading large datasets.

  • Fixed the profiling feature in clustering.

  • Scoring recipes append all columns to the output rather than only INPUT columns.

Version 1.2.1 - July 29th, 2014

Please see the 1.2.0 release notes for information about migrations.

Bug fixes

  • Fixed various UI issues.

  • Fixed charts in exported project.

  • Fixed various Unicode issues in Python & models.

  • Cleaned up temporary files generated by R recipes.

Enhancements

  • Switched to FreshDesk.

  • Better support of Hive 0.13.

  • Experimental Oracle and Sybase support.

  • Improved support of MapR and CDH5.

  • Dataiku’s Hive UDF jar is loaded in recipes and SQL notebooks by default.

Version 1.2.0 - July 21st, 2014

Important notes about migration

The automatic data migration procedure is documented in “Upgrading a DSS instance”.

As usual, we strongly recommend that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.

Automatic migration of data from Data Science Studio 1.1.X is supported, with the following restrictions and warnings:

  • Notebooks created from trained models using Impact Coding will not work anymore. Please remove the impact coding section.

  • SQL identifiers are now quoted and case preserving across the studio. This can cause issues with pre-existing mixed-case tables and columns.

    If you have mixed-case managed SQL datasets, it is recommended to lowercase the table name in the dataset configuration to avoid another table being created

    If you have mixed-case column names, you might need to recompute your table
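
The mixed-case issue described above can be illustrated with a minimal sketch. It assumes a PostgreSQL-style database, where unquoted identifiers are folded to lowercase while quoted identifiers keep their case. The helper names quote_identifier and folded_unquoted are hypothetical and are not part of DSS.

```python
def quote_identifier(name):
    """Quote a SQL identifier: wrap it in double quotes and double any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def folded_unquoted(name):
    """Name that a PostgreSQL-style database resolves for an UNQUOTED identifier."""
    return name.lower()


# Hypothetical table name configured in a managed SQL dataset.
configured_name = "MyTable"

# A table created without quoting is stored under the folded name.
stored_name = folded_unquoted(configured_name)

# With quoting, the case is preserved, so the query targets another name.
quoted_reference = quote_identifier(configured_name)

print(stored_name)
print(quoted_reference)
print(quote_identifier(stored_name))
```

Under this assumption, a table created unquoted is stored as mytable, while the quoted reference "MyTable" points to a different table. Lowercasing the table name in the dataset configuration makes the quoted reference match the existing table again.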

For migrations from Data Science Studio 1.0.X, please also see the release notes of version 1.1.0

New features

Please also see our Blog Post for more information.

General

  • DSS now features a free Community Edition which allows you to use the Studio without any time limit

  • DSS now integrates a Scheduler for automatic builds of datasets. The scheduler leverages the incremental build support in order to just recompute and rebuild predictions on new data or data that has changed.

  • Support for complex types in dataset schemas has been improved. Complex types (including nested maps, nested arrays, nested objects) will properly be represented in Hive, MongoDB, ElasticSearch and JSON

  • Many non-fatal errors that can happen while processing a job are now reported in a “Warnings” tab at the end of the job, so that they are easy to review

Insights

  • A whole new web app editor has been introduced. It features much improved ergonomics, integrated code templates and snippets, and the ability to write custom Python-based backends for even more interactive web apps.

  • Various improvements have been made on the pinboard visual appearance

  • It is now possible to choose where to display the title of insights, and whether to display insight description

Data Preparation

  • The date parser processor now supports multiple formats. Formats are tested in order until a matching format is found

  • New processor: Geocoder, based on the MapQuest or Bing API.
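
The "formats are tested in order" behavior of the date parser can be sketched as follows. This is only an illustration of the ordered-fallback idea, not the DSS implementation: it uses Python's strptime patterns, which may differ from the pattern syntax DSS accepts, and the function name parse_with_formats is hypothetical.

```python
from datetime import datetime


def parse_with_formats(value, formats):
    """Try each format in order and return the first successful parse, or None."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            # This format does not match; fall through to the next one.
            continue
    return None


# Order matters: an ambiguous value is parsed by the first matching format.
formats = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]

for raw in ["2014-07-21", "25/09/2014", "09/25/2014", "not a date"]:
    print(raw, "->", parse_with_formats(raw, formats))
```

Because the first matching format wins, a value such as 01/02/2014 is read as day/month with the list above. Putting the most specific or most common format first avoids surprises.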

Machine Learning

  • Multiclass classification has been introduced

  • Some Machine Learning algorithms can take quite a long time to run. Data Science Studio now features integration with H2O, a distributed machine learning framework. This integration brings distributed implementations of a selected set of machine learning algorithms (Random Forest, Deep Learning, Gradient Boosting Methods).

Flow and recipes

  • A new “Sampling” recipe lets you create a dataset which is a sample extracted from an input dataset.

  • The Python recipe now comes with a large number of fully documented code samples

Datasets and connection

  • New experimental support for Teradata

  • Schemas are now supported in SQL databases. All parts of the Studio that deal with SQL tables are now schema-aware

  • Support for Apache Cassandra datasets (both read and write, partitioned and unpartitioned)

Notebooks

  • The SQL / Hive / Impala notebook has been completely overhauled for better ergonomics

Visual analytics

  • Better information and choices for switching between the internal DSS engine and the native SQL engine for charts

  • When performing visualization on an HDFS dataset, if you use the whole dataset and Impala is installed, Impala will be used to compute the aggregations

  • Charts based on SQL now support Greenplum and Redshift