The Python environment

Adding Python packages to the default environment

The default Data Science Studio installation builds a Python virtual environment which contains all packages required for Data Science Studio operation.

If you need to install additional third-party Python packages (to make them available to notebooks and recipes), you must use the command DATA_DIR/bin/pip, where DATA_DIR is the Studio data directory.

$ DATA_DIR/bin/pip list
$ DATA_DIR/bin/pip install PKG

As usual with Python package installation on Linux, you may need to install additional system dependencies if the target Python packages include native code. In particular, you may need the system development tools (“build-essential” on Debian/Ubuntu, “@Development tools” on RedHat/CentOS) and the Python interpreter header files (“python-dev” on Debian/Ubuntu, “python27-devel” on RedHat/CentOS 6.x, “python-devel” on RedHat/CentOS 7.x).

Warning

Using the system’s pip command will not work. Data Science Studio’s Python environment is fully isolated.
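
To confirm that a given Python process runs inside an isolated virtual environment rather than the system installation, a quick check such as the following can help. This is a minimal sketch that relies only on standard interpreter attributes; the function name in_virtualenv is illustrative and not part of Data Science Studio.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Classic virtualenv releases set sys.real_prefix on the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.base_prefix differ from sys.prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("interpreter: %s" % sys.executable)
    print("prefix: %s" % sys.prefix)
    print("virtual environment: %s" % in_virtualenv())
```

Run with DATA_DIR/bin/python, this script would be expected to report a virtual environment; run with the system interpreter, it typically would not.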

In addition to the above, you can add locally-managed Python code and resources in the directory DATA_DIR/lib/python. This directory is created but left empty by the Data Science Studio installer, and is included in the Python search path for both notebooks and recipes. You can use it to deploy additional Python modules that your code uses but that are not managed by pip.
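
One way to verify from a notebook or recipe that a directory is on the Python search path is sketched below. The helper name is_on_search_path is hypothetical, and the directory passed to it is whatever DATA_DIR/lib/python resolves to on your installation.

```python
import os
import sys


def is_on_search_path(directory, search_path=None):
    """Return True if `directory` is one of the entries of the module search path."""
    if search_path is None:
        search_path = sys.path
    target = os.path.realpath(directory)
    # An empty entry in sys.path stands for the current working directory.
    return any(
        os.path.realpath(entry or os.getcwd()) == target
        for entry in search_path
    )
```

Paths are compared after resolving symbolic links and relative components, so the check does not depend on how the directory was spelled in either place.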

Note

The additional Python packages installed by DATA_DIR/bin/pip or added to DATA_DIR/lib/python are preserved by DSS upgrades.

The default Python environment setup

Data Science Studio requires a Python 2.7 interpreter. The standard DSS installation checks that the distribution's default packages for Python 2.7 are present; if they are missing, the dependency installation phase installs them.

Note

On CentOS and RedHat 6.x, where the system’s version of Python is 2.6, Python 2.7 is pulled from the additional repository IUS (http://iuscommunity.org/pages/Repos.html).

The installation script locates the Python interpreter to use by looking up python2.7 in the standard PATH. It then proceeds to build a Python virtual environment on top of this interpreter, containing the standard Python packages shipped with Data Science Studio.
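
The lookup described above follows the usual PATH resolution rule: the first matching executable wins. The sketch below illustrates that rule only; it is not the installer's actual code, and the function name find_on_path is an assumption made for this example.

```python
import os


def find_on_path(name, path=None):
    """Return the first executable called `name` found in `path`, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        # An empty PATH entry conventionally means the current directory.
        candidate = os.path.join(directory or os.curdir, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Prints the full path of python2.7, or None if it is not on the PATH.
    print(find_on_path("python2.7"))
```

If several Python 2.7 installations are present, this shows which one a PATH-based lookup would select for the account running it.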

Data Science Studio uses this virtual environment to run all Python code, including IPython notebooks and Python dataset manipulation recipes.

The DATA_DIR/bin/pip command can be used to list or otherwise manage the contents of this virtual environment, as described above.

For testing purposes, you can start the Python interpreter of the virtual environment used by DSS with DATA_DIR/bin/python.

Warning

The native libraries of the standard Python packages shipped with DSS are built using UCS-4 Unicode characters. Make sure the default Python interpreter used by DSS has been built with --enable-unicode=ucs4. This is the default on most recent Linux distributions, but it is not the default when building Python interpreters directly from source.
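
A Python 2.7 interpreter reports its Unicode build through sys.maxunicode, so the requirement above can be checked before installation. The following is a minimal sketch; the function name unicode_build is illustrative.

```python
import sys


def unicode_build(maxunicode=None):
    """Classify an interpreter as 'ucs4' or 'ucs2' from its sys.maxunicode value."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow (UCS-2) builds of Python 2 report 0xFFFF; wide (UCS-4) builds report 0x10FFFF.
    return "ucs4" if maxunicode > 0xFFFF else "ucs2"


if __name__ == "__main__":
    print(unicode_build())
```

Run this with the interpreter that DSS will use. Note that the check is only meaningful on Python 2: Python 3.3 and later always report 0x10FFFF.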

Using a custom Python environment

For non-standard needs, you can force Data Science Studio to use an externally-maintained Python 2.7 installation by defining the DKUPYTHONBIN environment variable for the Linux user account running the Studio.

This variable points to the Python binary to use. It should be defined before running the installer, and for all subsequent runs of the Studio startup or management scripts. You would typically define it as follows:

$ echo "export DKUPYTHONBIN=/usr/local/bin/python" >>$HOME/.profile

When this variable is defined, the precompiled third-party Python packages shipped with DSS are not used. You must make sure that the interpreter started by $DKUPYTHONBIN contains all packages required by DSS. Please refer to the script INSTALL_DIR/scripts/install/install-python-packages.sh, found in the Data Science Studio installation directory, for this purpose.
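
Once the custom interpreter has been set up, you may want to confirm that it can import the packages DSS needs. The sketch below shows one way to do so. The helper name missing_modules is hypothetical, and the module names in the example are placeholders: the authoritative requirements are those in install-python-packages.sh.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported by this interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Placeholder names for illustration only; replace them with the
    # packages required by install-python-packages.sh.
    candidates = ["json", "sqlite3"]
    print(missing_modules(candidates))
```

Run it with the interpreter that $DKUPYTHONBIN points to; an empty list means every listed module could be imported.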