Spark on Kubernetes

DSS is compatible with Spark on Kubernetes starting with Spark 2.4. Only the “client” deployment mode is supported; the “cluster” deployment mode is not.

DSS can work “out of the box” with Spark on Kubernetes, meaning that you can simply add the relevant options to your Spark configuration. DSS also features a “managed Spark on Kubernetes” mode, which is often necessary for multi-user-security deployments.

Out of the box

Guided installation procedures

The precise steps to follow for Spark-on-Kubernetes depend on which managed Kubernetes offering you are using and which cloud storage you want to use.

Please contact your Dataiku Sales Engineer or Customer Success Manager for more details on our guided installation procedures.

Main steps

Configure DSS

You first need to configure DSS to use your Spark 2.4 installation.

If you are using a Hadoop distribution that natively ships Spark 2.4, this is handled during the “Install Spark integration” phase. Otherwise, you may need to point DSS to your custom Spark installation. See Setting up Spark integration for more details.

Build your Docker images

Follow the Spark documentation to build Docker images from your Spark distribution and push them to your repository.

Note that depending on which cloud storage you want to connect to, it may be necessary to modify the Spark Dockerfiles. See our guided installation procedures for more details.
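
For reference, Spark 2.4 distributions ship a bin/docker-image-tool.sh helper for this step. Below is a minimal sketch of driving it from Python; the Spark home, registry and tag are placeholders for your environment:

    # Minimal sketch: build and push the Spark images using the
    # docker-image-tool.sh helper shipped with Spark 2.4 distributions.
    # SPARK_HOME, REGISTRY and TAG are placeholders for your environment.
    import subprocess

    SPARK_HOME = "/opt/spark-2.4"          # path to your Spark distribution
    REGISTRY = "registry.example.com/dss"  # your Docker repository
    TAG = "spark-2.4-dss"                  # tag to reference in the Spark config

    for action in ("build", "push"):
        subprocess.run(
            ["%s/bin/docker-image-tool.sh" % SPARK_HOME,
             "-r", REGISTRY, "-t", TAG, action],
            check=True,
        )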

Create the Spark configuration

Create a named Spark configuration (see Spark configurations), and set at least the following keys (illustrated in the sketch after this list):

  • spark.master: k8s://https://IP_OF_YOUR_K8S_CLUSTER
  • spark.kubernetes.container.image: the full name, including tag, of the image that you pushed to your repository
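
For illustration only, here is what these settings amount to in a standalone PySpark session (in DSS, you set the keys in the named Spark configuration instead); the API server URL and image name are placeholders:

    # Sketch only: the equivalent settings in a standalone PySpark session.
    # In DSS, set these keys in the named Spark configuration instead.
    # The API server URL and image name below are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("k8s://https://203.0.113.10:6443")    # k8s:// + your API server
        .config("spark.kubernetes.container.image",
                "registry.example.com/dss:spark-2.4-dss")
        .config("spark.submit.deployMode", "client")  # only client mode is supported
        .getOrCreate()
    )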

Implications for multi-user-security deployments

When running in multi-user-security, the Spark driver process runs as the impersonated end-user. Thus, the interaction between Spark and Kubernetes also runs as the impersonated end-user.

This requires that each impersonated end-user has credentials to access the Kubernetes cluster, i.e. a ~/.kube/config file with proper credentials for the cluster. While such a deployment is entirely possible, it is not the typical case.
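
As a quick sanity check, a user can verify their credentials with the official kubernetes Python client; this is only a sketch, not part of DSS:

    # Sketch: verify that ~/.kube/config holds working credentials for the
    # cluster. Requires the `kubernetes` package; not part of DSS itself.
    from kubernetes import client, config

    config.load_kube_config()  # reads ~/.kube/config by default
    pods = client.CoreV1Api().list_namespaced_pod("default")  # or the namespace your jobs use
    print("Credentials OK, %d pods visible" % len(pods.items))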

To make it easier to run Spark on Kubernetes with multi-user-security, DSS features a “managed Spark on Kubernetes” mode.

Managed mode

In the “managed Spark on Kubernetes” mode, DSS can automatically generate temporary service accounts for each job, pass these temporary credentials to the Spark job, and delete the temporary service account after the job is complete.

In Kubernetes, the granularity of security is the namespace: if a service account has the right to create pods in a namespace, it can theoretically gain all privileges on that namespace. Therefore, it is recommended to use one namespace per user (or one namespace per team). The “managed Spark on Kubernetes” mode can automatically create dynamic namespaces and associate service accounts with namespaces. This requires that the account running DSS has credentials on the Kubernetes cluster that allow it to create namespaces.

One-namespace-per-user setup

  • In the Spark configuration, enable the “Managed K8S configuration” checkbox
  • In “Target namespace”, enter something like dss-ns-${dssUserLogin}
  • Enable “Auto-create namespace”
  • Set Authentication mode to “Create service accounts dynamically”

Each time a user U starts a job that uses this particular Spark configuration, DSS will (see the sketch after this list):

  • Create the dss-ns-U namespace if it does not already exist
  • Create a temporary service account, and grant it rights limited to dss-ns-U
  • Retrieve the secret of this service account and pass it to the Spark driver
  • The Spark driver uses this secret to create and manage pods in the dss-ns-U namespace (it does not have access to any other namespace)
  • At the end of the job, destroy the temporary service account
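
The following sketch illustrates this sequence with the kubernetes Python client, for a hypothetical user jdoe. It is not DSS’s actual implementation, only an illustration of the Kubernetes API calls involved:

    # Conceptual sketch of the sequence above using the `kubernetes` Python
    # client. This is NOT DSS's implementation, only an illustration of the
    # Kubernetes API calls involved, for a hypothetical user "jdoe".
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    rbac = client.RbacAuthorizationV1Api()
    ns = "dss-ns-jdoe"  # dss-ns-${dssUserLogin}, resolved

    # 1. Create the namespace if it does not already exist
    core.create_namespace({"metadata": {"name": ns}})

    # 2. Create a temporary service account in that namespace
    core.create_namespaced_service_account(ns, {"metadata": {"name": "dss-job-sa"}})

    # 3. Grant it rights limited to the namespace
    rbac.create_namespaced_role(ns, {
        "metadata": {"name": "dss-job-role"},
        "rules": [{"apiGroups": [""],
                   "resources": ["pods", "services", "configmaps"],
                   "verbs": ["create", "get", "list", "watch", "delete"]}],
    })
    rbac.create_namespaced_role_binding(ns, {
        "metadata": {"name": "dss-job-binding"},
        "subjects": [{"kind": "ServiceAccount", "name": "dss-job-sa",
                      "namespace": ns}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role", "name": "dss-job-role"},
    })

    # 4. The service account's token secret is then read and passed to the
    #    Spark driver, which creates executor pods in `ns` and nowhere else.

    # 5. At the end of the job, the temporary service account is destroyed
    core.delete_namespaced_service_account("dss-job-sa", ns)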

One-namespace-per-team setup

  • In the Spark configuration, enable the “Managed K8S configuration” checkbox
  • In “Target namespace”, enter something like ${adminProperty:k8sNS}
  • Set Authentication mode to “Create service accounts dynamically”

Then, for each user, you need to set an “admin property” named k8sNS, containing the name of the team namespace to use for this user. This can be automated through the public API, as sketched below. The per-job behavior is otherwise the same as in the one-namespace-per-user setup above.
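
For example, a minimal sketch using the dataikuapi package; the host, API key, logins and team namespace are placeholders, and the admin_properties accessor assumes a recent API version (check the documentation of your DSS release):

    # Sketch: set the k8sNS admin property for the members of a team via the
    # DSS public API. Host, API key, logins and namespace are placeholders;
    # the admin_properties accessor assumes a recent dataikuapi version.
    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "ADMIN_API_KEY")

    for login in ("jdoe", "asmith"):  # members of the "analytics" team
        settings = client.get_user(login).get_settings()
        settings.admin_properties["k8sNS"] = "dss-ns-analytics"
        settings.save()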

With this setup, there is a fixed number of namespaces, so you don’t need to auto-create namespaces. The account running Dataiku only needs full access to these namespaces in order to create service accounts in them. This can be useful if you don’t have the ability to create namespaces. However, this leaves open the possibility that a skilled hostile user attacks other Spark jobs running in the same namespace.