DSS 1.4 Release notes

Version 1.4.5 - June 23rd, 2015

Important notes about migration

Please see the release notes for 1.4.0

New features

  • Added support for Hive 1.1

Bug fixes

  • Fixed FTP listing for WS_FTPD

  • Fixed Apache log file format

  • Fixed “Last value” in grouping recipe

  • Updated CentOS IUS repository URL

Version 1.4.4 - March 30th, 2015

Important notes about migration

Please see the release notes for 1.4.0

New features and enhancements

  • Official support for Oracle has been added.

  • Amazon Linux 2015.03 is now supported.

Bug fixes

Data preparation

  • Fixed Python processors in recipes over filesystem datasets (happened when no custom variables and no contribs were used)

Datasets and formats

  • Fixed ORC files with Hive 0.14+

  • Fixed explore in the Twitter dataset

  • Improved detection of Shapefiles and fixed support for uppercase-named shapefiles

  • Fixed several issues in the SAS format parser

Version 1.4.3 - March 4th, 2015

Important notes about migration

Please see the release notes for 1.4.0

Bug fixes

Recipes

  • An issue with running grouping recipe on non-SQL datasets has been fixed

  • Fixed: running SQL script recipes on PostgreSQL could fail depending on how the “psql” binary is implemented

  • Fixed: running SQL script recipes on PostgreSQL could lose the return code, so failed recipes always appeared to succeed

  • Reading datasets in R when the schema contains comments has been fixed
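
The lost return code mentioned in the PostgreSQL fix above is a common pitfall whenever a script runner shells out to a command-line client. The following is a minimal sketch of the general pattern for propagating a child process's exit status; it is not the DSS implementation. The run_script helper is hypothetical, and a Python one-liner stands in for the “psql” binary so that the example runs anywhere.

```python
import subprocess
import sys


def run_script(command):
    """Run a command and raise if it exits with a non-zero status.

    Hypothetical helper for illustration only; it is not part of DSS.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure instead of silently reporting success.
        raise RuntimeError(
            "command failed with exit code %d: %s"
            % (result.returncode, result.stderr.strip())
        )
    return result.stdout


# Stand-in for a failing SQL client: a child process that exits with status 3.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
# Stand-in for a successful run.
succeeding = [sys.executable, "-c", "print('ok')"]

print(run_script(succeeding).strip())
```

Calling run_script(failing) raises a RuntimeError rather than returning normally, which is the behavior a caller needs in order to mark a recipe as failed.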

Datasets

  • Initial download of data in HTTP datasets has been fixed

Misc

  • The job screen now refreshes automatically as expected

  • Fixed an issue with Websockets and very long log messages

Version 1.4.2 - February 17th, 2015

Important notes about migration

Please see the release notes for 1.4.0

New features

  • A new processor has been added: “Generate combinations of numerical variables”

  • New dataset: “FTP (no cache)”, which allows both read and write on FTP and does not cache input data (see FTP)

  • DSS now supports proxies for outgoing connections (FTP, HTTP, S3, Twitter). See /installation/custom/reverse-proxies

  • Hadoop: Support for HDP 2.2 has been added

Bug fixes

  • Fixed: Python’s write_with_schema could fail on some configurations of SQL output datasets

  • Installation of plugins on Mac OS X without JDK has been fixed

  • An issue has been fixed when running Impala on a Kerberos-enabled Hadoop cluster

  • Exception reporting has been improved in the Machine Learning part

  • Several issues have been fixed in the UI for New Cassandra Dataset.

  • Support for \r as End-Of-Line marker has been added for CSV datasets
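
To illustrate what a bare \r End-Of-Line marker looks like in practice, here is a small standalone Python sketch. It does not use any DSS API, and the sample data is made up for the example.

```python
import csv

# Sample data that uses a bare carriage return ("\r") as its end-of-line marker.
raw = "id,name\r1,alpha\r2,beta\r"

# str.splitlines() treats "\r", "\n" and "\r\n" alike as line boundaries,
# so the csv module receives one record per element.
rows = list(csv.reader(raw.splitlines()))

for row in rows:
    print(row)
```

Note that this simple approach assumes no quoted field contains an embedded line break; such files need a reader that tracks quoting across lines.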

Version 1.4.1 - January 29th, 2015

Important notes about migration

Please see the release notes for 1.4.0

New features

  • A new reshaping processor has been added: “fold the keys of an array”

Bug fixes

Hadoop support

  • Fixed compatibility with MapR 4.0

  • Support for MapR 2.0 and 3.0 has been removed

Machine learning

  • Fixed failures in prediction recipes with textual features

Python

  • SQL code executed in Python recipes is now properly streamed, even for very large result sets
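
As a rough illustration of what streaming a result set means, the sketch below fetches rows in fixed-size batches instead of loading everything at once. It uses the standard sqlite3 module as a stand-in for a real SQL connection; the stream_rows helper and the numbers table are hypothetical and are not part of the DSS Python API.

```python
import sqlite3


def stream_rows(connection, query, batch_size=1000):
    """Yield rows one at a time, fetching them from the cursor in batches."""
    cursor = connection.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row


# In-memory database used purely as a stand-in for a real SQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(5000)])

# The generator hands over at most one batch of rows at a time.
total = sum(n for (n,) in stream_rows(conn, "SELECT n FROM numbers", batch_size=500))
print(total)
```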

Flow

  • Various UI bugfixes and improvements on the grouping recipe

  • A bug has been fixed on the TimeRange dependency for MONTH-based partitioning

  • A leak of file handles has been fixed in the backend <-> job communication, which could lead to “too many open files” errors

Misc

  • Support for Firefox 35.0 has been improved. However, pan and zoom in Flow on Firefox 35.0 remain slow

  • Some documentation links have been fixed

Version 1.4.0 - January 13th, 2015

Important notes about migration

The automatic data migration procedure is documented in Upgrading a DSS instance

As usual, we strongly recommend that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.

Automatic migration of data from Data Science Studio 1.3.X is supported, with the following restrictions and warnings:

  • The “dip.flow.activities.nthreads” parameter in dip.properties has been removed. Setting the number of activities is now done in the Administration pages (Settings > Build).

  • Referencing a dip.properties entry from a FS dataset path is no longer possible. Use the variables expansion system instead.

For migrations from Data Science Studio 1.2.X, please also see the release notes of version 1.3.0

New features

Mac OS X

  • DSS is now available on Mac OS X (10.9 and above).

Security

Visual data preparation

  • Transformation recipes created with the visual data preparation tool can now fully run on a Hadoop cluster. See Execution engines

  • BETA feature: geo processing processors (see Geographic processors)

      • Reverse geocoding (from GPS coordinates to city / region / country)

      • Zipcode-based geocoding (from zipcode to GPS coordinates)

Data Visualization

  • You can now export charts built with DSS as Excel documents for easy embedding

  • Several color palettes are now available for charts. Furthermore, you can add custom color palettes

  • BETA feature: geo charts (see Map Charts)

      • Geo charts allow you to aggregate a dataset based on a column containing geo coordinates.

      • Aggregation is made by administrative boundaries (city / region / country)

New data transformation recipes

  • “Grouping” recipe: new visual tool to perform grouping and aggregations (sum, avg, first, last, …)

      • Multi-key grouping and multi-aggregations

      • Integrated filtering

      • Automatically runs in-database or on-Hadoop when possible.

  • “Split” recipe: split one input dataset into several output datasets based on the value of a column or advanced rules
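
For readers unfamiliar with the terms, the sketch below shows what multi-key grouping with several aggregations (sum, avg, first, last) computes. It is a plain-Python illustration, not the recipe's implementation; the group_rows helper, the column names, and the sample rows are all made up for the example.

```python
from collections import OrderedDict

rows = [
    {"country": "FR", "year": 2014, "amount": 10.0},
    {"country": "FR", "year": 2014, "amount": 30.0},
    {"country": "US", "year": 2014, "amount": 5.0},
    {"country": "FR", "year": 2015, "amount": 7.0},
]


def group_rows(rows, keys, column):
    """Group rows on several keys and compute sum, avg, first and last of one column."""
    groups = OrderedDict()
    for row in rows:
        group_key = tuple(row[k] for k in keys)
        groups.setdefault(group_key, []).append(row[column])
    result = []
    for group_key, values in groups.items():
        out = dict(zip(keys, group_key))
        out.update({
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "first": values[0],
            "last": values[-1],
        })
        result.append(out)
    return result


grouped = group_rows(rows, keys=["country", "year"], column="amount")
for line in grouped:
    print(line)
```

Here "first" and "last" depend on the order in which rows are read, which is why they are only meaningful when the input has a defined ordering.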

Datasets

  • Support for GeoJSON files

  • Support for ESRI Shapefiles

Collaboration

  • DSS now detects and warns when several users are working on the same dataset or recipe at the same time.

  • Edit conflicts are automatically detected and avoided

Advanced usage

  • New Custom variables expansion system that allows you to use some shared and reusable variables in several parts of the Studio.

  • You can now write custom Python code for advanced partition dependencies in Flow. See Specifying partition dependencies

  • The number of concurrent running activities (builds of a partition) can now be set:

      • per-job

      • per-connection

      • globally

  • A command-line tool lets you mass-import all (or selected) tables of a Hive database as DSS datasets. See Hive

Other enhancements

Flow

  • When aborting a job, DSS tries to cancel running SQL queries and running Hadoop jobs

  • Performance improvements for computing partition dependencies on large flows

  • Partitioning variables substitution in code recipes can now either use $DKU_XXX or ${DKU_XXX} syntax
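
The two substitution syntaxes mentioned above behave like the placeholders of Python's standard string.Template class, which makes for a convenient illustration. The sketch below is only an analogy and does not show how DSS performs the substitution; the value assigned to DKU_XXX and the table names are made up.

```python
from string import Template

# Placeholder variable name taken from the note above; the value is made up.
variables = {"DKU_XXX": "2015-01-13"}

# Plain syntax: the name ends at the first character that cannot be part of it.
plain = Template("SELECT * FROM events WHERE day = '$DKU_XXX'")

# Braced syntax: needed when the variable is directly followed by name characters.
braced = Template("SELECT * FROM events_${DKU_XXX}_raw")

print(plain.substitute(variables))
print(braced.substitute(variables))
```

The braced form matters in the second query: without the braces, the trailing "_raw" would be read as part of the variable name.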

Visual data preparation

  • The “column split” processor can now either ignore or keep empty chunks

  • The “regexp extractor” processor can now extract multiple matches

  • New processor to duplicate a column

  • New processors for advanced processing of complex content (JSON arrays and objects)

  • New processor to bin the values of a numerical column

  • “Live processing” charts can now work on a set of partitions
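
To make the processor enhancements above more concrete, here is a plain-Python sketch of three of the behaviors: splitting with or without empty chunks, extracting multiple regular expression matches, and binning a numerical value. The function names and the fixed-width binning strategy are assumptions made for the example; they do not reflect the actual processor options.

```python
import math
import re


def split_column(value, separator, keep_empty=True):
    """Split a cell into chunks, optionally dropping empty chunks."""
    chunks = value.split(separator)
    return chunks if keep_empty else [c for c in chunks if c != ""]


def extract_all(value, pattern):
    """Return every non-overlapping match of the pattern, not just the first."""
    return re.findall(pattern, value)


def bin_value(value, width):
    """Return the lower bound of the fixed-width bin that contains the value."""
    return math.floor(value / width) * width


print(split_column("a,,b", ","))
print(split_column("a,,b", ",", keep_empty=False))
print(extract_all("id=12; id=34", r"\d+"))
print(bin_value(37, 10))
```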

Web apps

  • You can now write a filtering formula to select a subset of rows in the JS webapp API

  • Better syntax highlighting in the JS editor

  • Code folding in all code editors

Data visualization

  • MIN and MAX are now available as aggregations in charts

  • Legend can now be hidden

  • Smoothing of lines and areas can now be disabled

Advanced usage

  • The Jython environment for custom processors now includes the “json” python package

Recipes

  • The R API now supports sampling of the input datasets

Security

  • The “Administrator” information is now handled by groups instead of being a simple flag on the user.

Deployment

  • Added support for RHEL 7 and CentOS 7

Major bugfixes

Visual data preparation

  • Various issues around copy/paste in explore have been fixed

  • “Analyse” no longer gives invalid data when Infinity appears in the column

  • Parsing dates without any time information (only year/month/day) now properly respects the selected timezone

  • When editing a preparation recipe, the “Custom format” form of the “Smart date” feature has been fixed

Data visualization

  • Several issues around “week” and “week of year” handling for timeline axis have been fixed

SQL notebook

  • Performance with large tables has been improved

  • Error reporting has been improved

Datasets

  • Hidden / Useless files (like Hadoop success markers) are now properly ignored everywhere

  • Counting records now works on MongoDB datasets

Machine learning

  • Training recipes now properly work on partitioned datasets

  • The “heatmap” in clustering results is now properly updated when switching between models

Collaboration

  • Cropping of transparent PNGs for insight and project icons now works properly

Misc

  • An issue when changing the network of the host while DSS is running has been fixed