- Installing Python packages
- Initial setup of the builtin Python environment
- Rebuilding the builtin Python environment
- Advanced: using a fully custom Python environment
- Using Anaconda Python for the builtin environment
- Mixed conda / virtualenv support
DSS comes with native Python integration. The DSS installation phase creates an initial “builtin” Python environment, which is used to run all Python-based internal DSS operations, and is also used as a default environment to run user-provided Python code.
This builtin Python environment comes with a default set of packages, suitable for this version of DSS. These are set up by the DSS installer and updated accordingly on DSS upgrades. Starting with DSS 6.0, the builtin environment may be based on Python 2.7 or Python 3.6, to be chosen at installation time. See Initial setup of the builtin Python environment.
In addition to this builtin environment, DSS can dynamically build and manage multiple additional Python environments, to run user-provided Python code. These can be built with different versions of Python, and different sets of installed packages. See Code environments.
It is possible to install additional packages in the builtin environment. However, this is not recommended, as it can lead to package dependency conflicts with the mandatory set of packages provided by the DSS installer, and may complicate later DSS upgrades.
Managed code environments should be preferred whenever possible. See Installing Python packages for details.
The builtin Python environment can be built using Python 2.7 or Python 3.6. The DSS installation kit contains all required Python packages, for both these versions of Python, but does not contain the Python interpreters themselves. These must be present on the system, and the DSS dependency installer checks this at the beginning of the installation sequence.
The DSS installer starts by checking that all required system dependencies are present on the system. This includes Python 2.7 (on all DSS-supported platforms) and Python 3.6 (only on Linux distributions where Python 3.6 is natively available).
If any required DSS dependency is missing, the DSS installer aborts after printing the names of the missing packages and a command line which can be run (as root) to install them.
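Before launching the installer, it can be useful to see which interpreters are visible through PATH. The following is a minimal sketch for a POSIX shell; the function name `missing_interpreters` is hypothetical, and this is not the check that the DSS dependency installer itself performs.

```shell
# Hypothetical helper, not part of DSS: print every command from the argument
# list that cannot be found through PATH.
missing_interpreters() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Report which of the two interpreters named above are absent (prints nothing
# when both are found).
missing_interpreters python2.7 python3.6
```

An empty output only means that the commands exist; it says nothing about the other system packages that the installer checks.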
You can force the installation to proceed without checking for missing packages by adding the -n flag to the installer script, as in:
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -n [OTHER_OPTIONS...]
This may be necessary when you want to run DSS entirely on a single version of Python (2.7 or 3.6), without the other one being installed at all, or if you want to use a Python subsystem installed from an alternate source.
On CentOS and RedHat 6.x, where the system’s version of Python is 2.6, Python 2.7 is pulled from the additional repository IUS (https://ius.io/).
The native libraries of the standard Python 2.7 packages shipped with DSS are built using UCS-4 Unicode characters. Make sure the default Python interpreter used by DSS has been built with --enable-unicode=ucs4. This is the default on most recent Linux distributions, but it is not the default when building Python interpreters directly from source.
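One way to tell how a Python 2.7 interpreter was built is to read `sys.maxunicode`, which is 1114111 on UCS-4 builds and 65535 on UCS-2 builds. The sketch below wraps that test in a hypothetical helper named `unicode_build`; it is an illustration, not something shipped with DSS.

```shell
# Hypothetical helper: classify a value of Python's sys.maxunicode.
unicode_build() {
    case $1 in
        1114111) echo ucs4 ;;
        65535)   echo ucs2 ;;
        *)       echo unknown ;;
    esac
}

# Query the interpreter only when a python2.7 command is available on PATH.
if command -v python2.7 >/dev/null 2>&1; then
    unicode_build "$(python2.7 -c 'import sys; print(sys.maxunicode)')"
fi
```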
The Python subsystem used to build the builtin environment can be controlled using the optional -P BASE_PYTHON option to the DSS installer script. Without this option, it defaults to using python2.7 as found through the PATH environment variable.
# Use 'python2.7' command from PATH
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT

# Use 'python3.6' command from PATH
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -P python3.6

# Use a specific version of Python 3.6 installed on the host
# It may be necessary to add '-n' to skip the dependency check for the system-default Python 3.6
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -P /usr/local/bin/python3.6 [-n]

# Error: Python 3.5 is not supported for the builtin Python environment
# (it can be used for managed code environments though)
/PATH/TO/dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -P /usr/bin/python3.5
*** Unsupported Python version: 3.5
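To avoid the "Unsupported Python version" error, the version of a candidate interpreter can be checked before passing it to -P. This is a minimal sketch; both function names are hypothetical, and the accepted versions are simply the two listed on this page (2.7 and 3.6), not a list read from the installer.

```shell
# Hypothetical helper: succeed only for the versions this page lists as valid
# for the builtin environment.
builtin_version_supported() {
    case $1 in
        2.7|3.6) return 0 ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: print "MAJOR.MINOR" for an interpreter given by name or
# path. The one-liner works on both Python 2 and Python 3.
python_major_minor() {
    "$1" -c 'import sys; print("%d.%d" % sys.version_info[:2])'
}
```

A typical use would be `builtin_version_supported "$(python_major_minor /usr/local/bin/python3.6)" || echo "not usable with -P"`, where the interpreter path is only an example.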
DSS uses this builtin environment to run internal Python code necessary to the proper working of DSS. User code can run either in the builtin environment, or using a code environment.
The DATA_DIR/bin/pip command can be used to list or otherwise manage the contents of the builtin Python environment. See Installing Python packages.
For test purposes, the builtin Python environment used by DSS can be launched with
Upon DSS upgrades, the builtin Python environment is preserved, and the installer simply switches the mandatory set of packages to their new versions. In order to change the underlying Python subsystem (for example, to switch the builtin environment from python2.7 to python3.6) you need to rebuild the builtin Python environment as described below.
It is possible to rebuild the builtin Python virtual environment, if necessary. This is the case if you moved or renamed DSS’s data directory, as Python virtual environments embed their full directory name. This may also be the case if you want to reset the virtualenv to a pristine state following installation / uninstallation of additional packages, or to change the underlying Python subsystem.
The builtin Python virtualenv is automatically created by the installer when it is not present. The sequence of operations to reinitialize it thus consists in removing the virtualenv and reinstalling DSS, keeping track of any local package which you want to reinstall afterwards:
# Stop DSS
DATA_DIR/bin/dss stop

# Save the list of locally-installed packages
DATA_DIR/bin/pip freeze -l >dss-local-packages.txt

# Remove the virtualenv, keeping backup
mv DATA_DIR/pyenv DATA_DIR/pyenv.backup

# Reinstall DSS (upgrade mode), choosing the underlying base Python to use
dataiku-dss-VERSION/installer.sh -d DATA_DIR -u [-P BASE_PYTHON]

# Review and possibly edit the list of locally-installed packages
vi dss-local-packages.txt

# Reinstall local packages
DATA_DIR/bin/pip install -r dss-local-packages.txt

# Start DSS
DATA_DIR/bin/dss start

# When everything is considered stable, remove the backup
rm -rf DATA_DIR/pyenv.backup
For non-standard needs, you can force DSS to use an externally-maintained Python installation by defining the DKUPYTHONBIN environment variable for the Linux user account running the Studio.
Using this mode is not officially supported and not recommended.
This variable points to the Python binary to use. It should be defined before running the installer, and for all subsequent runs of the Studio startup or management scripts. You would typically define it as follows:
$ echo "export DKUPYTHONBIN=/usr/local/bin/python" >>$HOME/.profile
When this variable is defined, the precompiled third-party Python packages shipped with DSS are not used. You must make sure that the interpreter started by $DKUPYTHONBIN contains all packages required by DSS. Please refer to the script INSTALL_DIR/scripts/install/install-python-packages.sh, found in the DSS installation directory, for this purpose.
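Because the variable must be defined for every run of the installer and of the startup or management scripts, a small pre-flight check can catch a missing or stale value. The sketch below is an assumption-laden illustration: `check_dkupythonbin` is a hypothetical name, and it only verifies that the variable points to an executable file, not that the required packages are installed.

```shell
# Hypothetical pre-flight check, not part of DSS.
check_dkupythonbin() {
    if [ -z "${DKUPYTHONBIN:-}" ]; then
        echo "DKUPYTHONBIN is not set"
        return 1
    fi
    if [ ! -x "$DKUPYTHONBIN" ] || [ -d "$DKUPYTHONBIN" ]; then
        echo "DKUPYTHONBIN is not an executable file: $DKUPYTHONBIN"
        return 1
    fi
    echo "DKUPYTHONBIN points to $DKUPYTHONBIN"
}
```

Running `check_dkupythonbin` from the account that runs DSS confirms that the setting from the profile file is actually visible in that shell.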
DSS supports using Anaconda Python instead of standard system-provided Python for the builtin environment. In that mode, the DSS installer builds an Anaconda environment, containing the standard set of packages required by DSS, instead of a virtualenv-based environment, and uses it for all Python-based tasks.
Using Anaconda Python for the builtin environment is only supported with Python 2.7.
Tier 2 support: Using conda for DSS is covered by Tier 2 support
Conda package repositories tend to be very bleeding-edge, and move quickly, with frequent backwards-incompatible changes.
Various incompatibilities may happen, and Dataiku can only provide limited support with setup and usage of conda-based DSS setups.
For these reasons, Dataiku does not generally recommend using conda for the builtin Python environment. We recommend that you only use conda if there are reasons for which you cannot use the native virtualenv and R packages systems.
As for virtualenv-based installations, it is possible but not recommended to manually add supplementary packages to this environment, for use in recipes and notebooks.
You can install individual code environments using conda while still using regular virtualenv for the builtin environment. See Mixed conda / virtualenv support.
- You must have a 64-bit version of Anaconda (https://www.anaconda.com/distribution/) or Miniconda (https://docs.conda.io/en/latest/miniconda.html) installed on the DSS host.
- Anaconda/Miniconda are supported in version 4.3.27 or later only.
- The binary directory for Anaconda must be in the PATH for the DSS user account. In particular, the conda command must be accessible to this user. See Mixed conda / virtualenv support.
- You must have access to a repository of standard Anaconda packages, either through an outgoing Internet connection (direct or using a proxy), or through a local mirror. See Offline installation for a possible workaround.
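The version requirement in the list above can be checked with a small dotted-version comparison. This is a minimal sketch for a POSIX shell; `version_ge` is a hypothetical helper that assumes purely numeric, dot-separated fields.

```shell
# Hypothetical helper: succeed when dotted numeric version $1 is >= $2.
# Missing trailing fields are treated as 0 (so "4.3" compares as "4.3.0").
version_ge() {
    v1=$1
    v2=$2
    while [ -n "$v1" ] || [ -n "$v2" ]; do
        f1=${v1%%.*}
        f2=${v2%%.*}
        [ "${f1:-0}" -gt "${f2:-0}" ] && return 0
        [ "${f1:-0}" -lt "${f2:-0}" ] && return 1
        # Drop the field just compared, or empty the string if it was the last.
        case $v1 in *.*) v1=${v1#*.} ;; *) v1= ;; esac
        case $v2 in *.*) v2=${v2#*.} ;; *) v2= ;; esac
    done
    return 0
}
```

Since `conda --version` prints a line of the form `conda X.Y.Z`, the check could be run as `out=$(conda --version); version_ge "${out#conda }" 4.3.27 || echo "conda is too old"`.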
The DSS installer switches to Anaconda mode when given the -C option:
dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -C
It then downloads all required packages/versions from the Anaconda repository (plus a few custom ones which are provided directly from the DSS installation directory), and builds an Anaconda environment from them in directory DATA_DIR/condaenv.
Once an Anaconda environment is built in DATA_DIR/condaenv, it is used instead of the standard virtualenv in DATA_DIR/pyenv.
Upgrading an Anaconda-based DSS installation installs the new set of required packages/versions in the DSS Anaconda environment, preserving manually-installed additional packages, or upgrading them in case of versioning conflicts.
If the DSS host has neither an outgoing Internet connection nor access to a local mirror, you can create a local repository containing the packages needed by DSS to install properly. To do so, you need access to a host with an Internet connection or a local mirror. This host must have conda installed too. From this host, download DSS and run the following script:
The script reports its progress as it runs. Once finished, it produces a directory called dataiku-dss-VERSION-conda-python-offline-mirror containing the packages. Copy this directory to the DSS host. Then, from the DSS host, run the following commands:
conda config --add channels file:///FULL/PATH/TO/dataiku-dss-VERSION-conda-python-offline-mirror
conda config --remove channels defaults
Then run the installer:
dataiku-dss-VERSION/installer.sh -d DATA_DIR -p PORT -C
Adding / removing / listing additional packages from the DSS-managed Anaconda environment can be done using the standard conda command:
conda list -p DATA_DIR/condaenv
conda install -p DATA_DIR/condaenv PACKAGE
- Uninstalling / upgrading / downgrading the standard packages installed by DSS is not supported and may lead to subtle compatibility problems.
- It is recommended to use code envs for user packages instead. See Installing Python packages.
Adding / removing / listing additional packages may be done through the pip command, when the required packages are not available as conda packages:
DATA_DIR/bin/pip list
DATA_DIR/bin/pip install PACKAGE
For testing purposes, it is possible to run the DSS Anaconda environment outside DSS using:
It is possible to migrate a virtualenv-based DSS installation to Anaconda mode by running the installer in “upgrade” mode with the -C option:
dataiku-dss-VERSION/installer.sh -d DATA_DIR -u -C
It is possible to migrate an Anaconda-based DSS installation back to standard virtualenv mode by moving the conda environment away and re-running the installer in “upgrade” mode without the -C option:
mv DATA_DIR/condaenv DATA_DIR/condaenv.BAK
dataiku-dss-VERSION/installer.sh -d DATA_DIR -u [-P BASE_PYTHON]
DSS can simultaneously use conda and non-conda (i.e. virtualenv) Python environments when:
- the builtin DSS environment is installed with virtualenv (default installer option) but one wishes to build conda-based managed code environments as well
- the builtin DSS environment is installed with conda (installer option “-C”) but one wishes to build virtualenv-based managed code environments as well
DSS builds conda environments by calling the conda command as found in the PATH environment variable of the DSS user account.
DSS builds virtualenv-based environments by looking up the corresponding Python command in PATH (i.e. python3.5, etc.). However, this should NOT resolve to a conda-provided Python system, as these are not compatible with virtualenv.
As a consequence, one must make sure that adding conda to the DSS PATH variable does not add conda-provided Python to this PATH as well:
Recent versions of Anaconda/Miniconda provide a condabin subdirectory which only contains the conda command. This directory should be added to the DSS PATH instead of the Anaconda/Miniconda bin subdirectory, which typically contains pip and many other Python-related commands as well, and would hide the system versions of these commands:
For earlier versions of Anaconda/Miniconda which do not provide a condabin subdirectory, it is typically necessary to configure the PATH variable for DSS so that the conda binaries are after the system binaries, so that “which python2.7” or “which python3.6” resolve to the system Python (which supports virtualenv) and not conda Python (which does not) as in:
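Whichever layout is used, the effect of a candidate PATH value can be verified before applying it to the DSS account. The following sketch walks a PATH-style string and prints the first match for a command, much as `which` does for the current PATH. The name `resolve_in_path` is hypothetical, and any directories passed to it are the caller's own examples.

```shell
# Hypothetical helper: print the first executable named $1 found in the
# colon-separated directory list $2, or fail if there is none.
resolve_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $2; do
        # An empty PATH element conventionally means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

For instance, `resolve_in_path python3.6 "$PATH"` should print a system location rather than a path inside the Anaconda/Miniconda installation, while `resolve_in_path conda "$PATH"` should still succeed.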