Setting up Spark integration

There are four main ways to set up Spark in Dataiku:

  • If you are using a Dataiku Cloud Stacks installation, Spark on Elastic AI clusters is already set up and ready to use. No further action is needed.

  • If you are doing a custom installation with Elastic AI, the Elastic AI setup configures and enables Spark on Elastic AI clusters.

  • If you are doing a custom installation with Hadoop, Spark will be available through your Hadoop cluster. Please see Spark for more details.

  • You can use “Unmanaged Spark on Kubernetes”, described in the rest of this page.

Unmanaged Spark on Kubernetes

Warning

This is a highly custom setup. We recommend that you use Dataiku Elastic AI capabilities instead.

The precise steps to follow for Spark-on-Kubernetes depend on which managed Kubernetes offering you are using and which cloud storage you want to use.

We strongly recommend that you use managed K8S clusters instead.

The rest of this page provides indicative instructions for non-managed deployments.

Main steps

Configure DSS

You first need to configure DSS to use your Spark 2.4 installation.

Build your Docker images

Follow the Spark documentation to build Docker images from your Spark distribution and push them to your repository.

Note that depending on which cloud storage you want to connect to, it may be necessary to modify the Spark Dockerfiles. See our guided installation procedures for more details.
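As a rough illustration of the build-and-push step, the sketch below assembles the command lines for the bin/docker-image-tool.sh script that ships with Spark distributions. The Spark home path, repository name, and tag are hypothetical placeholders, and the helper name image_tool_commands is made up for this example. It only builds the argument lists and does not run anything; check the Spark documentation for the options that apply to your Spark version.

```python
import shlex
from typing import List


def image_tool_commands(spark_home: str, repo: str, tag: str) -> List[List[str]]:
    """Return the 'build' and 'push' invocations of Spark's docker-image-tool.sh.

    spark_home, repo and tag are placeholders chosen by the caller; nothing is
    executed here, the commands are only assembled as argument lists.
    """
    tool = f"{spark_home}/bin/docker-image-tool.sh"
    base = [tool, "-r", repo, "-t", tag]
    return [base + ["build"], base + ["push"]]


def as_shell_line(cmd: List[str]) -> str:
    """Render an argument list as a single, safely quoted shell line."""
    return " ".join(shlex.quote(part) for part in cmd)


# Hypothetical values, for illustration only.
commands = image_tool_commands("/opt/spark", "registry.example.com/dss", "2.4.0")
for cmd in commands:
    print(as_shell_line(cmd))
```

If you modified the Spark Dockerfiles for your cloud storage, apply those changes before running the build command.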

Create the Spark configuration

Create a named Spark configuration (see Spark configurations), and set at least the following keys:

  • spark.master: k8s://https://IP_OF_YOUR_K8S_CLUSTER

  • spark.kubernetes.container.image: the tag of the image that you pushed to your repository
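The two keys above can be sketched as a small key/value mapping. This is a minimal example under stated assumptions: the API server address and the image reference are hypothetical placeholders, and the helper names spark_k8s_conf and to_submit_args are made up for illustration. In DSS you would enter these keys in the named Spark configuration; the --conf rendering is shown only because it is the equivalent form that spark-submit accepts.

```python
from typing import Dict, List


def spark_k8s_conf(api_server: str, image: str) -> Dict[str, str]:
    """Build the minimal set of keys for Spark on Kubernetes.

    api_server is the host (and optional port) of the Kubernetes API server;
    image is the image reference that was pushed to the repository.
    """
    return {
        "spark.master": f"k8s://https://{api_server}",
        "spark.kubernetes.container.image": image,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the mapping as spark-submit style '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical values, for illustration only.
conf = spark_k8s_conf("203.0.113.10:6443", "registry.example.com/dss/spark:2.4.0")
print(to_submit_args(conf))
```

Other keys (for example, for authentication or cloud storage access) may be required depending on your cluster and are not covered by this sketch.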