Python integration

DSS comes with native Python integration. The DSS installation phase creates an initial “builtin” Python environment, which is used to run all Python-based internal DSS operations, and is also used as a default environment to run user-provided Python code.

This builtin Python environment comes with a default set of packages, suitable for this version of DSS. These are set up by the DSS installer and updated accordingly on DSS upgrades. Starting with DSS 6.0, the builtin environment may be based on Python 2.7 or Python 3.6, to be chosen at installation time. See Initial setup of the builtin Python environment.

In addition to this builtin environment, DSS can dynamically build and manage multiple additional Python environments, to run user-provided Python code. These can be built with different versions of Python, and different sets of installed packages. See Code environments.

Installing Python packages

It is possible to install additional packages in the builtin environment. However, this is not recommended, as it can lead to package dependency conflicts with the mandatory set of packages provided by the DSS installer, and may complicate later DSS upgrades.

Managed code environments should be preferred whenever possible. See Installing Python packages for details.

Initial setup of the builtin Python environment

The builtin Python environment can be built using Python 2.7 or Python 3.6. The DSS installation kit contains all required Python packages, for both these versions of Python, but does not contain the Python interpreters themselves. These must be present on the system, and the DSS dependency installer checks this at the beginning of the installation sequence.

System dependencies

The DSS installer starts by checking that all required system dependencies are present on the system. This includes Python 2.7 (on all DSS-supported platforms) and Python 3.6 (only on Linux distributions where Python 3.6 is natively available).

If any required DSS dependency is missing, the DSS installer aborts after printing the names of the missing packages and a command line which can be run (as root) to install them.

You can force the installation to proceed without checking for missing packages by adding the -n flag to the installer script, as in:

/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -n [OTHER_OPTIONS...]

This may be necessary when you want to run DSS entirely on a single version of Python (2.7 or 3.6), without the other one being installed at all, or if you want to use a Python subsystem installed from an alternate source.
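Before deciding whether -n is needed, it can help to see which of the two interpreters are visible on PATH. The following is a minimal sketch, not the check performed by the DSS dependency installer. The function name find_interpreters is hypothetical, and shutil.which requires Python 3.3 or later.

```python
import shutil

def find_interpreters(names=("python2.7", "python3.6"), path=None):
    """Map each interpreter command name to its resolved location, or None.

    'path' is an os.pathsep-separated search path; None means the current PATH.
    """
    return {name: shutil.which(name, path=path) for name in names}

if __name__ == "__main__":
    for name, location in sorted(find_interpreters().items()):
        print("%-10s %s" % (name, location or "not found on PATH"))
```

An interpreter reported as not found here is a likely candidate for a missing-dependency error from the installer, unless -n is given.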

Note

On CentOS and RedHat 6.x, where the system’s version of Python is 2.6, Python 2.7 is pulled from the additional repository IUS (https://ius.io/).

Warning

The native libraries of the standard Python 2.7 packages shipped with DSS are built using UCS-4 Unicode characters. Make sure the default Python interpreter used by DSS has been built with --enable-unicode=ucs4. This is the default on most recent Linux distributions, but it is not the default when building Python interpreters directly from source.
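One way to check how a candidate Python interpreter was built is to inspect sys.maxunicode. The sketch below is an illustration only and is not part of DSS. The function name unicode_build is hypothetical. Run it with the interpreter you intend to use for the builtin environment.

```python
import sys

def unicode_build(maxunicode=None):
    """Return 'ucs4' for a wide (UCS-4) build and 'ucs2' for a narrow one."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow Python 2 builds report 0xFFFF; wide builds report 0x10FFFF.
    # Python 3.3 and later always report 0x10FFFF.
    return "ucs4" if maxunicode > 0xFFFF else "ucs2"

if __name__ == "__main__":
    version = "%d.%d" % (sys.version_info[0], sys.version_info[1])
    print("Python %s: %s" % (version, unicode_build()))
```

A Python 2.7 interpreter that reports ucs2 here was not built with --enable-unicode=ucs4.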

Choosing the version of Python for the builtin environment

The Python subsystem used to build the builtin environment can be controlled using the optional -P BASE_PYTHON option to the DSS installer script. Without this option, it defaults to using python2.7 as found through the PATH environment variable.

# Use 'python2.7' command from PATH
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT

# Use 'python3.6' command from PATH
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -P python3.6

# Use a specific version of Python 3.6 installed on the host
# It may be necessary to add '-n' to skip the dependency check for the system-default Python 3.6
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -P /usr/local/bin/python3.6 [-n]

# Error: Python 3.5 is not supported for the builtin Python environment
# (it can be used for managed code environments though)
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -P /usr/bin/python3.5
*** Unsupported Python version: 3.5
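To find out in advance which version a candidate BASE_PYTHON reports, you can ask the interpreter itself. This is a minimal sketch under the assumption that only the versions named above (2.7 and 3.6) are accepted. It does not reproduce the installer's own validation, and the names interpreter_version and is_supported are hypothetical.

```python
import subprocess

# Versions named in this section for the builtin environment
SUPPORTED = ("2.7", "3.6")

def interpreter_version(python_cmd):
    """Return 'MAJOR.MINOR' as reported by the given interpreter command or path."""
    out = subprocess.check_output(
        [python_cmd, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"]
    )
    return out.decode("ascii").strip()

def is_supported(version, supported=SUPPORTED):
    """Tell whether a 'MAJOR.MINOR' string is in the supported list."""
    return version in supported
```

For example, is_supported(interpreter_version("/usr/local/bin/python3.6")) is expected to be true when that path is a Python 3.6 interpreter.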

DSS uses this builtin environment to run the internal Python code that DSS itself requires. User code can run either in the builtin environment or in a code environment.

The DATA_DIR/bin/pip command can be used to list or otherwise manage the contents of the builtin Python environment. See Installing Python packages.

For test purposes, the builtin Python environment used by DSS can be launched with DATA_DIR/bin/python.

Upon DSS upgrades, the builtin Python environment is preserved, and the installer simply switches the mandatory set of packages to their new versions. To change the underlying Python subsystem (for example, to switch the builtin environment from python2.7 to python3.6), you need to rebuild the builtin Python environment as described below.

Rebuilding the builtin Python environment

It is possible to rebuild the builtin Python virtual environment, if necessary. This is the case if you moved or renamed DSS’s data directory, as Python virtual environments embed their full directory name. This may also be the case if you want to reset the virtualenv to a pristine state following installation or uninstallation of additional packages, or to change the underlying Python subsystem.
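The embedded directory name typically shows up in the first line of the scripts that a virtualenv installs. The sketch below lists scripts whose interpreter line points outside a given prefix. It assumes the standard virtualenv layout, with scripts in a bin subdirectory of DATA_DIR/pyenv. The function name stale_shebangs is hypothetical and this check is not part of DSS.

```python
import os

def stale_shebangs(bin_dir, expected_prefix):
    """Return names of scripts in bin_dir whose '#!' Python path lies outside expected_prefix."""
    prefix = expected_prefix.rstrip("/") + "/"
    stale = []
    for name in sorted(os.listdir(bin_dir)):
        path = os.path.join(bin_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            first_line = f.readline()
        if not first_line.startswith(b"#!"):
            continue  # compiled binary or script without an interpreter line
        parts = first_line[2:].split()
        if not parts:
            continue
        interpreter = parts[0].decode("utf-8", "replace")
        # Only consider interpreter lines that name a Python binary directly
        if not os.path.basename(interpreter).startswith("python"):
            continue
        if not interpreter.startswith(prefix):
            stale.append(name)
    return stale
```

Calling stale_shebangs("DATA_DIR/pyenv/bin", "DATA_DIR/pyenv") with the real, absolute data directory path returns a non-empty list when the environment still refers to a previous location.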

The builtin Python virtualenv is automatically created by the installer when it is not present. The sequence of operations to reinitialize it thus consists of removing the virtualenv and reinstalling DSS, keeping track of any local packages which you want to reinstall afterwards:

# Stop DSS
DATA_DIR/bin/dss stop
# Save the list of locally-installed packages
DATA_DIR/bin/pip freeze -l >dss-local-packages.txt
# Remove the virtualenv, keeping backup
mv DATA_DIR/pyenv DATA_DIR/pyenv.backup
# Reinstall DSS (upgrade mode), choosing the underlying base Python to use
dataiku-dss-VERSION/installer.sh -d DATA_DIR -u [-P BASE_PYTHON]
# Review and possibly edit the list of locally-installed packages
vi dss-local-packages.txt
# Reinstall local packages
DATA_DIR/bin/pip install -r dss-local-packages.txt
# Start DSS
DATA_DIR/bin/dss start
# When everything is considered stable, remove the backup
rm -rf DATA_DIR/pyenv.backup
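Before removing the backup, you may want to confirm that every package from the saved list is present again. The sketch below compares two outputs of pip freeze and reports the names that are missing from the second one. It is an illustration only: the function names are hypothetical, and name normalization is deliberately loose (lowercase, underscores treated as hyphens).

```python
def package_names(freeze_text):
    """Extract package names from 'pip freeze' output."""
    names = set()
    for line in freeze_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # skip blanks, comments and option lines such as '-e ...'
        # Keep the part before '==' (pinned version) or ' @ ' (direct reference)
        name = line.split("==")[0].split(" @ ")[0].strip()
        names.add(name.lower().replace("_", "-"))
    return names

def missing_packages(saved_text, current_text):
    """Return sorted names present in saved_text but absent from current_text."""
    return sorted(package_names(saved_text) - package_names(current_text))
```

Here saved_text would be the contents of dss-local-packages.txt and current_text the output of a new DATA_DIR/bin/pip freeze -l run after the reinstallation.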

Advanced: using a fully custom Python environment

For non-standard needs, you can force DSS to use an externally-maintained Python installation by defining the DKUPYTHONBIN environment variable for the Linux user account running DSS.

Warning

Using this mode is not officially supported and not recommended.

This variable points to the Python binary to use. It should be defined before running the installer, and for all subsequent runs of the DSS startup or management scripts. You would typically define it as follows:

$ echo "export DKUPYTHONBIN=/usr/local/bin/python" >>$HOME/.profile

When this variable is defined, the precompiled third-party Python packages shipped with DSS are not used. You must make sure that the interpreter started by $DKUPYTHONBIN contains all packages required by DSS. Please refer to the script INSTALL_DIR/scripts/install/install-python-packages.sh, found in the DSS installation directory, for this purpose.
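Because the variable must be defined for every run of the installer and of the startup or management scripts, a quick sanity check can catch a missing or mistyped value. This is a minimal sketch and not part of DSS. The function name check_dkupythonbin is hypothetical, and it only verifies that the variable points to an executable file, not that the required packages are installed.

```python
import os

def check_dkupythonbin(environ=None):
    """Return None when DKUPYTHONBIN looks usable, else a short problem description."""
    if environ is None:
        environ = os.environ
    value = environ.get("DKUPYTHONBIN")
    if not value:
        return "DKUPYTHONBIN is not set"
    if not (os.path.isfile(value) and os.access(value, os.X_OK)):
        return "DKUPYTHONBIN does not point to an executable file: %s" % value
    return None

if __name__ == "__main__":
    print(check_dkupythonbin() or "DKUPYTHONBIN looks usable")
```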

Using Anaconda Python for the builtin environment

DSS supports using Anaconda Python instead of standard system-provided Python for the builtin environment. In that mode, the DSS installer builds an Anaconda environment, containing the standard set of packages required by DSS, instead of a virtualenv-based environment, and uses it for all Python-based tasks.

Warning

Using Anaconda Python for the builtin environment is only supported with Python 2.7.

Tier 2 support: Using conda for DSS is covered by Tier 2 support.

Conda package repositories tend to be very bleeding-edge and move quickly, with frequent backwards-incompatible changes.

Various incompatibilities may occur, and Dataiku can only provide limited support for the setup and usage of conda-based DSS installations.

For these reasons, Dataiku does not generally recommend using conda for the builtin Python environment. We recommend that you only use conda if something prevents you from using the native virtualenv and R package systems.

As for virtualenv-based installations, it is possible but not recommended to manually add supplementary packages to this environment, for use in recipes and notebooks.

Note

You can install individual code environments using conda while still using regular virtualenv for the builtin environment. See Mixed conda / virtualenv support.

Prerequisites

Installation

The DSS installer switches to Anaconda mode when given the -C flag:

dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -C

It will then download all required packages/versions from the Anaconda repository (plus a few custom ones which are provided directly from the DSS installation directory), and build an Anaconda environment from them in directory DATA_DIR/condaenv.

Once an Anaconda environment is built in DATA_DIR/condaenv it is used instead of the standard virtualenv in DATA_DIR/pyenv.

Upgrading an Anaconda-based DSS installation installs the new set of required packages/versions in the DSS Anaconda environment, preserving manually-installed additional packages, or upgrading them in case of versioning conflicts.
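The precedence between the two directories described above can be pictured as follows. This sketch only mirrors the rule stated in the text (DATA_DIR/condaenv is used when present, otherwise DATA_DIR/pyenv). It is not the actual DSS selection logic, and the function name builtin_env_dir is hypothetical.

```python
import os

def builtin_env_dir(data_dir):
    """Return the environment directory expected to be in use, or None if neither exists."""
    conda_env = os.path.join(data_dir, "condaenv")
    virtualenv = os.path.join(data_dir, "pyenv")
    if os.path.isdir(conda_env):
        return conda_env  # an Anaconda environment takes precedence
    if os.path.isdir(virtualenv):
        return virtualenv
    return None
```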

Offline installation

If the DSS host has neither an outgoing Internet connection nor access to a local mirror, you can create a local repository containing the packages that DSS needs for installation. To do so, you need access to a host which has an Internet connection or can reach a local mirror. This host must also have conda installed. From this host, download DSS and run the following script:

dataiku-dss-VERSION/scripts/install/download-conda-python-packages

The script reports its progress as it runs. Once it is finished, it produces a directory called dataiku-dss-VERSION-conda-python-offline-mirror containing the packages. Copy this directory to the DSS host. Then, from the DSS host, run the following commands:

conda config --add channels file:///FULL/PATH/TO/dataiku-dss-VERSION-conda-python-offline-mirror
conda config --remove channels defaults

Then run the installer:

dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -C

Further operations

Adding / removing / listing additional packages from the DSS-managed Anaconda environment can be done using the standard conda commands:

conda list -p DATA_DIR/condaenv
conda install -p DATA_DIR/condaenv PACKAGE

Warning

  • Uninstalling / upgrading / downgrading the standard packages installed by DSS is not supported and may lead to subtle compatibility problems.
  • It is recommended to use code envs for user packages instead. See Installing Python packages.

Adding / removing / listing additional packages can also be done with the pip command, when the required packages are not available as conda packages:

DATA_DIR/bin/pip list
DATA_DIR/bin/pip install PACKAGE

For testing purposes, it is possible to run the DSS Anaconda environment outside DSS using:

DATA_DIR/bin/python

It is possible to migrate a virtualenv-based DSS installation to Anaconda mode by running the installer in “upgrade” mode with the -C flag:

dataiku-dss-VERSION/installer.sh -d DATA_DIR -u -C

It is possible to migrate an Anaconda-based DSS installation back to standard virtualenv mode by moving away the conda environment and re-running the installer in “upgrade” mode without the -C flag:

mv DATA_DIR/condaenv DATA_DIR/condaenv.BAK
dataiku-dss-VERSION/installer.sh -d DATA_DIR -u [-P BASE_PYTHON]

Mixed conda / virtualenv support

DSS can simultaneously use conda and non-conda (i.e. virtualenv) Python environments when:

  • the builtin DSS environment is installed with virtualenv (default installer option) but one wishes to build conda-based managed code environments as well
  • the builtin DSS environment is installed with conda (installer option “-C”) but one wishes to build virtualenv-based managed code environments as well

DSS builds conda environments by calling the conda command as found in the PATH environment variable of the DSS user account.

DSS builds virtualenv-based environments by looking up the corresponding Python command in PATH (i.e. python2.7, python3.5, etc.). However, this should NOT resolve to a conda-provided Python system, as these are not compatible with virtualenv.

As a consequence, one must make sure that adding conda to the DSS PATH variable does not add conda-provided Python to this PATH as well:

  • Recent versions of Anaconda/Miniconda provide a condabin subdirectory which only contains the conda command. This directory should be added to the DSS PATH instead of the Anaconda/Miniconda bin subdirectory, which typically contains python, pip and many other Python-related commands as well, and would hide the system versions of these commands:

    PATH="/PATH/TO/ANACONDA/condabin:$PATH"
    
  • For earlier versions of Anaconda/Miniconda which do not provide a condabin subdirectory, it is typically necessary to configure the PATH variable for DSS with the conda binaries after the system binaries. This way, “which python2.7” or “which python3.6” resolves to the system Python (which supports virtualenv) and not to conda Python (which does not), as in:

    PATH="$PATH:/PATH/TO/ANACONDA/bin"
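To confirm that a given PATH setting resolves the Python commands to a system interpreter rather than a conda one, you can use a check along these lines. This is a heuristic sketch and not part of DSS. It relies on the fact that a conda installation or environment contains a conda-meta directory at its root, and the function names are hypothetical. shutil.which requires Python 3.3 or later.

```python
import os
import shutil

def is_conda_python(python_path):
    """Heuristic: the interpreter lives in PREFIX/bin and PREFIX/conda-meta exists."""
    real = os.path.realpath(python_path)  # follow symlinks such as python -> python3.6
    prefix = os.path.dirname(os.path.dirname(real))
    return os.path.isdir(os.path.join(prefix, "conda-meta"))

def check_path(names=("python2.7", "python3.6"), path=None):
    """Map each command name to (resolved path or None, looks_like_conda)."""
    result = {}
    for name in names:
        found = shutil.which(name, path=path)
        result[name] = (found, bool(found) and is_conda_python(found))
    return result

if __name__ == "__main__":
    for name, (found, conda) in sorted(check_path().items()):
        status = "conda" if conda else "not conda"
        print("%-10s %s (%s)" % (name, found or "not found", status if found else "n/a"))
```

Run it from the DSS user account, with the same PATH that DSS uses. Any command flagged as conda indicates that the PATH order needs adjusting as described above.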