Installing a new DSS instance

Note

This does not apply to Mac OS X. For Mac OS X instructions, please see http://www.dataiku.com/dss/editions/community-download/ and OS X installation details.

Pre-requisites

To install Data Science Studio, you need:

  • The installation tar.gz file.
  • A system that meets the installation Requirements.
  • Optionally, root access. It is not strictly required, but you might need it to install dependencies, and it is required if you want to start Data Science Studio at machine boot time.

It is highly recommended to create a UNIX user dedicated to running the Data Science Studio software.

Data Science Studio may use up to 10 consecutive TCP ports. Only the first port needs to be opened out of the machine. It is highly recommended to firewall the other ports.
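When planning firewall rules, it can help to list the ports concerned. The following is a minimal sketch, not part of the DSS distribution: the function name dss_ports is hypothetical, and the default count of 10 is taken from the statement above.

```shell
# Print COUNT consecutive TCP ports starting at BASE, one per line.
# dss_ports is a hypothetical helper, not a DSS command.
dss_ports() {
    base=$1
    count=${2:-10}   # default of 10 assumed from the text above
    i=0
    while [ "$i" -lt "$count" ]; do
        echo $((base + i))
        i=$((i + 1))
    done
}
```

For example, `dss_ports 10000` prints the ports 10000 through 10009. The first one is the port to open to clients, and the remaining ones are candidates for firewalling.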

Installation folders

A Data Science Studio installation spans two folders:

  • The installation directory, which contains the code of Data Science Studio. This is the directory where the Data Science Studio tarball is unpacked.
  • The data directory (referred to below as “DATA_DIR”).

The data directory contains:

  • The configuration of Data Science Studio, including all user-generated configuration (datasets, recipes, insights, models, ...)
  • Log files for the server components
  • Log files of job executions
  • Various caches and temporary files
  • A Python virtual environment dedicated to running the Python components of Data Science Studio, including any user-installed supplementary packages
  • Data Science Studio startup and shutdown scripts and command-line tools

Depending on your configuration, the data directory can also contain some managed datasets. Managed datasets can also be created outside of the data directory with some additional configuration.

It is highly recommended that you reserve at least 100 GB of space for the data directory.
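The space recommendation can be checked before installing. The following is a minimal sketch using the standard df utility. The function name has_free_space is hypothetical, and the check looks at the whole filesystem that holds the given directory.

```shell
# Succeed if the filesystem holding DIR has at least MIN_GB gigabytes available.
# has_free_space is a hypothetical helper, not a DSS command.
has_free_space() {
    dir=$1
    min_gb=$2
    # df -P gives one line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space /home/dataiku 100 || echo "less than 100 GB available"`. If the data directory does not exist yet, pass its parent directory instead.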

Installation

Unpack

Unpack the tar.gz file in the location you have chosen for the installation directory, then change into the unpacked directory.

tar xzf dataiku-dss-VERSION.tar.gz
cd dataiku-dss-VERSION

Install Data Science Studio

From the user account which will be used to run Data Science Studio, enter the following command:

./installer.sh -d DATA_DIR -p PORT [-l LICENSE_FILE]

Where:

  • DATA_DIR is the location of the data directory that you want to use. If the directory already exists, it must be empty.
  • PORT is the base TCP port. Data Science Studio will use PORT, PORT+1, PORT+2 and PORT+3.
  • LICENSE_FILE is your Data Science Studio license file.
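The requirement on DATA_DIR can be verified before running the installer. The following is a minimal sketch, assuming that a data directory which does not exist yet is acceptable (as the wording above suggests). The function name data_dir_is_usable is hypothetical.

```shell
# Succeed if DIR does not exist yet, or is an existing empty directory.
# data_dir_is_usable is a hypothetical helper, not a DSS command.
data_dir_is_usable() {
    dir=$1
    # Nothing at that path yet: acceptable
    [ -e "$dir" ] || return 0
    # Something exists but it is not a directory: not acceptable
    [ -d "$dir" ] || return 1
    # ls -A lists all entries except . and .. ; empty output means empty directory
    [ -z "$(ls -A "$dir")" ]
}
```

For example, `data_dir_is_usable /home/dataiku/dss_data && ./installer.sh -d /home/dataiku/dss_data -p 10000` runs the installer only when the check passes.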

Note

If you don’t enter a license file at this point, DSS will start as a Community Edition. You can enter a license file at any time.

The installer automatically checks for any missing system dependencies. If any are missing, it gives you the command to run with superuser privileges to install them. After the dependencies are installed, you can start the Data Science Studio installer again, using the same command as above.

(Optional) Enable startup at boot time

At the end of installation, Data Science Studio will give you the optional command to run with superuser privileges to configure automatic boot-time startup:

sudo -i INSTALL_DIR/scripts/install/install-boot.sh DATA_DIR USER_ACCOUNT

Start Data Science Studio

To start Data Science Studio, run the following command:

DATA_DIR/bin/dss start

Complete installation example

The following shows a transcript from a complete installation sequence:

# Start from the home directory of user account "dataiku"
# which will be used to run the Data Science Studio
# We will install DSS using data directory: /home/dataiku/dss_data
$ pwd
/home/dataiku
$ ls -l
-rw-rw-r-- 1 dataiku dataiku 159284660 Feb  4 15:20 dataiku-dss-VERSION.tar.gz
-r-------- 1 dataiku dataiku       786 Jan 31 07:42 license.json

# Unpack distribution kit
$ tar xzf dataiku-dss-VERSION.tar.gz
$ cd dataiku-dss-VERSION

# Run installer, with data directory $HOME/dss_data and base port 10000
# This fails because of missing system dependencies
$ ./installer.sh -d /home/dataiku/dss_data -l /home/dataiku/license.json -p 10000

# Install dependencies with elevated privileges, using the command shown by the previous step
$ sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-deps.sh"

# Rerun installer script, which will succeed this time
$ ./installer.sh -d /home/dataiku/dss_data -l /home/dataiku/license.json -p 10000

# Configure boot-time startup, using the command shown by the previous step
$ sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-boot.sh" "/home/dataiku/dss_data" dataiku

# Manually start Data Science Studio, using the command shown by the installer step
$ /home/dataiku/dss_data/bin/dss start

# Connect to Data Science Studio by opening the following URL in a web browser:
#    http://HOSTNAME:10000
# Initial credentials : username = "admin" / password = "admin"

Manual dependency installation

The Data Science Studio installer includes a dependency management script, to be run with superuser privileges, which automatically installs the additional Linux packages required for your particular configuration.

In some cases, however, it might be necessary to install these dependencies manually, for instance when the person installing DSS does not have administrative privileges, or when the server does not have access to the required package repositories.

If you manually pre-installed all the dependencies that would have been selected by the automated script, you can continue installing Data Science Studio using standard procedures. If that is not the case (because you explicitly chose to leave a component missing, or you installed some component from an alternate source), you must run the DSS installer with the “-n” flag to disable the default dependency checks.

Red Hat / CentOS Linux distributions

You may need to configure the following additional repositories:

Name     Address                                      Notes
EPEL     https://fedoraproject.org/wiki/EPEL          [RedHat/CentOS v6.x] for pandoc, R
                                                      [RedHat/CentOS v7.x] for pandoc, R, nginx
nginx    http://nginx.org/en/linux_packages.html      [RedHat/CentOS v6.x] for nginx v1.6+
IUS      https://iuscommunity.org/pages/Repos.html    [RedHat/CentOS v6.x] for Python 2.7

Data Science Studio depends on the following packages:

Name                                    Notes
graphviz                                Mandatory
nginx                                   Version 1.4 or later
pandoc                                  See “Pandoc” note below
java-1.7.0-openjdk                      See “Java” note below
python27 libpng freetype libgfortran    [RedHat/CentOS v6.x] for Python 2.7 and built-in Python packages. See “Python” note below
python27-devel                          [RedHat/CentOS v6.x] See “Additional Python packages” note below
libpng12 freetype libgfortran           [RedHat/CentOS v7.x] for built-in Python packages. See “Python” note below
python-devel                            [RedHat/CentOS v7.x] See “Additional Python packages” note below
R-devel libcurl-devel readline-devel    See “R” note below

Debian / Ubuntu Linux distributions

You may need to configure the following additional repository:

Name     Address                                    Notes
nginx    http://nginx.org/en/linux_packages.html    [Debian 7.x, Ubuntu 12.04] for nginx v1.6+

Data Science Studio depends on the following packages:

Name                                                          Notes
curl graphviz                                                 Mandatory
nginx                                                         Version 1.4 or later
pandoc                                                        See “Pandoc” note below
openjdk-7-jre                                                 See “Java” note below
python2.7 libpython2.7 libpng12-0 libfreetype6 libgfortran3   For built-in Python packages. See “Python” note below
python2.7-dev                                                 See “Additional Python packages” note below
r-base-dev libcurl4-openssl-dev                               See “R” note below
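After installing dependencies manually, on either distribution family, a quick way to spot obvious gaps is to check which of the expected commands are on the PATH. The following is a minimal sketch. The function name missing_commands is hypothetical, and the command names in the example are assumptions about what the packages above provide (for instance, dot from graphviz). It does not check libraries or development headers.

```shell
# Print each command name given as argument that cannot be found on PATH.
# missing_commands is a hypothetical helper, not a DSS command.
missing_commands() {
    for cmd in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example (assumed command names, adjust to your configuration):
# missing_commands curl dot nginx pandoc java python2.7 R
```

An empty output means that all listed commands were found. Note that nginx is often installed in a system directory that is not on the PATH of unprivileged users, so a reported miss for it may be a false alarm.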

Additional notes

Pandoc

The pandoc package itself requires many dependencies, and is not strictly mandatory for DSS core functionality. If this package is not installed, the following features will not be available:

  • Creating insights from IPython notebooks
  • Downloading IPython notebooks as HTML

Java

The suggested dependency package is the platform default, but DSS can use other Java runtime environments. See Java runtime environment for details.

Python

The dependencies listed above are required to use the precompiled set of Python packages provided with DSS. This does not apply when using custom-built Python libraries. See The Python environment for details.

Additional Python packages

Installing additional Python packages which include native code requires the dependencies listed above and the system development tools (typically C/C++ compilers and headers), in addition to any package-specific dependency.

R

The dependencies listed above are only necessary to enable R integration in DSS. Note that the system development tools and additional dependencies are usually needed in order to build the required R packages.

OS X installation details

Data Science Studio can be installed on Mac OS X.

For standard desktop use, download the native OS X package at http://www.dataiku.com/dss/editions/community-download/ and install it by simple drag-and-drop into the Applications folder. This configures Data Science Studio with the following default options:

  • Installation directory: /Applications/DataScienceStudio.app/Contents/Resources/kit
  • Data directory: $HOME/Library/DataScienceStudio/dss_home
  • TCP base port: 11200

This package installs an icon in the application dock with which you can start and stop DSS.

Note

The native OS X wrapper for Data Science Studio logs its messages to OS X system logs. In case of installation failure, or of problems starting / stopping DSS, open the OS X “Console” application and type “DataScienceStudio” in the search filter to browse relevant troubleshooting messages.

OS X prerequisites

Data Science Studio can only be installed on OS X 10.9 “Mavericks” or later.
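The OS X version requirement can be checked from a terminal. The following is a minimal sketch that compares the major and minor components of two dotted version strings. The function name version_at_least is hypothetical. On OS X, the current version is printed by the system command sw_vers -productVersion.

```shell
# Succeed if version HAVE is at least version WANT, comparing only the
# major and minor components (e.g. 10.10.5 is compared as 10.10).
# version_at_least is a hypothetical helper, not a DSS command.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        # Different major versions decide the result immediately
        if (h[1] + 0 != w[1] + 0) exit !((h[1] + 0) > (w[1] + 0))
        # Same major version: compare minor versions numerically
        exit !((h[2] + 0) >= (w[2] + 0))
    }'
}

# Example, on OS X:
# version_at_least "$(sw_vers -productVersion)" 10.9 || echo "OS X 10.9 or later is required"
```

The comparison is numeric per component, so 10.10 is correctly treated as later than 10.9.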

Data Science Studio requires a 64-bit Java Development Kit (JDK) or Java Runtime Environment (JRE) version 7 or 8, installed in the standard system location. Suitable installation kits for your system can be found at http://java.com.

Data Science Studio for OS X requires the additional package pandoc in order to convert IPython notebooks to HTML pages. This package may be installed using the native installer found at https://github.com/jgm/pandoc/releases, and may also be available through the MacPorts or Homebrew package managers. If pandoc is not available, it will not be possible to create an insight from a DSS Python notebook, nor to download a Python notebook as HTML.

Advanced OS X installation

For advanced or non-standard uses, it is possible to install Data Science Studio on OS X using the Linux procedure described above, starting with the same dataiku-dss-VERSION.tar.gz installation kit. You can follow the Linux installation procedure, apart from the script installing dependencies and the script configuring DSS to start on boot.

In that mode, you keep full control over all installation parameters (directories, port, Java and Python subsystems used). However, the native widget enabling start/stop of DSS from the OS X dock is not available.