Using unmanaged GKE clusters

Setup

Create your GKE cluster

To create a Google Kubernetes Engine (GKE) cluster, follow the Google Cloud Platform (GCP) documentation on creating a GKE cluster. We recommend that you allocate at least 16GB of memory for each cluster node. More memory may be required if you plan on running very large in-memory recipes.
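
For illustration, a minimal cluster creation command might look like the following sketch; the cluster name, region, and machine type are placeholders, and e2-standard-4 nodes provide 16 GB of memory:

gcloud container clusters create my-gke-cluster \
    --region us-central1 \
    --machine-type e2-standard-4 \
    --num-nodes 2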

You can configure the memory allocation per container and per namespace in Dataiku DSS by defining multiple containerized execution configurations.

Prepare your local gcloud, docker, and kubectl commands

Follow the GCP documentation to ensure the following on your local machine (where DSS is installed); a consolidated example follows this list:

  • The gcloud command has the appropriate permissions and scopes to push to the Google Artifact Registry (GAR) service.

  • The kubectl command is installed and can interact with the cluster. This can be achieved by running the gcloud container clusters get-credentials your-gke-cluster-name command.

  • The docker command is installed and can build images and push them to GAR. The latter can be enabled by running the gcloud auth configure-docker command.
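
For example, assuming a cluster named my-gke-cluster in region us-central1 (all names are placeholders):

# Authenticate and fetch cluster credentials
gcloud auth login
gcloud container clusters get-credentials my-gke-cluster --region us-central1
# Check that kubectl can reach the cluster
kubectl get nodes
# Allow docker to push to Artifact Registry in that region
gcloud auth configure-docker us-central1-docker.pkg.dev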

Note

Cluster management has been tested with Kubernetes versions 1.23 through 1.32. There are no known issues with other Kubernetes versions.

Create base images

Build the base image by following these instructions.
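
In a standard DSS installation, this amounts to running the following from the DSS data directory (a minimal sketch; see the base image documentation for all options):

./bin/dssadmin build-base-image --type container-exec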

Create the execution configuration

Go to Administration > Settings > Containerized execution, and add a new execution configuration of type “Kubernetes”.

  • Configure the GAR repository URL to use, e.g. <region>-docker.pkg.dev/my-gcp-project/my-registry. The repository must already exist; see the example after this list.

  • Finish by clicking Push base images.
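
If the target GAR repository does not exist yet, you can create it beforehand with gcloud; a minimal sketch, assuming placeholder project, repository, and location names:

gcloud artifacts repositories create my-registry \
    --repository-format=docker \
    --location=us-central1 \
    --project=my-gcp-project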

You’re now ready to run recipes and ML models in GKE.

Using GPUs

GCP provides GPU-enabled instances with NVIDIA GPUs. Using GPUs for containerized execution requires the following steps.

Building an image with CUDA support

The base image that is built by default does not have CUDA support and cannot use NVIDIA GPUs.

CUDA support can be added to an image by:

  • installing CUDA system-wide (in /usr/local/cuda/) in the base image (see below)

  • installing CUDA system-wide in the code env image using container runtime additions

  • installing CUDA in the code env (in /opt/dataiku/code-env/) by requiring CUDA libraries (including nvidia-cuda-runtime)

To enable CUDA system-wide in the base image, add the --with-cuda option to the command line:

./bin/dssadmin build-base-image --type container-exec --with-cuda

We recommend that you give this image a specific tag using the --tag option and keep the default base image “pristine”. We also recommend that you include the DSS version number in the image tag.

./bin/dssadmin build-base-image --type container-exec --with-cuda --tag dataiku-container-exec-base-cuda:X.Y.Z

where X.Y.Z is your DSS version number.

Note

  • This image contains CUDA 11.8 and cuDNN 8.7 by default on AlmaLinux 9. You can use --cuda-version X.Y to specify another DSS-provided version (9.0, 10.0, 10.1, 10.2, 11.0, 11.2, and 11.8 are available on AlmaLinux 8; 11.8 only on AlmaLinux 9). If you require other CUDA versions, you have to create a custom image.

  • Depending on which CUDA version is installed in the base image, you will need to use the corresponding TensorFlow version.

Warning

After each upgrade of DSS, you must rebuild all base images and update code envs.

Then create a new containerized execution configuration dedicated to running GPU workloads. If you specified a tag for the base image, enter it in the “Base image tag” field.

Enable GPU support on the cluster

Follow the GCP documentation on how to create a GKE cluster with GPU accelerators. You can also create a GPU-enabled node pool in an existing cluster.

Be sure to run the “DaemonSet” installation procedure, which takes several minutes to complete.
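
As an illustration, the following sketch adds a GPU-enabled node pool to an existing cluster and then installs the NVIDIA driver DaemonSet for COS node images; the cluster name, region, machine type, and accelerator type are placeholders, and you should take the manifest URL matching your node image from the GCP documentation:

gcloud container node-pools create gpu-pool \
    --cluster my-gke-cluster \
    --region us-central1 \
    --machine-type n1-standard-8 \
    --accelerator type=nvidia-tesla-t4,count=1

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml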

Add a custom reservation

For your containerized execution tasks to run on nodes with GPUs, and for GKE to configure the CUDA driver on your containers, the corresponding pods must be created with a custom limit (in Kubernetes parlance). This declares that the pods require a specific resource type beyond the standard ones (CPU and memory).

You must configure this limit in the containerized execution configuration. To do this:

  • In the “Custom limits” section, add a new entry with key nvidia.com/gpu and value 1 (to request 1 GPU).

  • Add the new entry and save your settings. You can then verify GPU availability as shown below.
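
As a quick sanity check, nodes with working drivers report a non-zero nvidia.com/gpu capacity:

kubectl describe nodes | grep -i "nvidia.com/gpu"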

Deploy

You can now deploy your GPU-based recipes and models.