Setting up Spark integration

There are four major ways to set up Spark in Dataiku:

  • If you are using a Dataiku Cloud Stacks installation, Spark on Elastic AI clusters is already set up and ready to use; no further action is needed

  • If you are doing a custom installation with Elastic AI, the installation process configures and enables Spark on Elastic AI clusters

  • If you are doing a custom installation with Hadoop, Spark will be available through your Hadoop cluster. Please see Spark for more details.

  • Using “Unmanaged Spark on Kubernetes”

Unmanaged Spark on Kubernetes

Warning

This is a very custom setup. We recommend that you leverage Dataiku Elastic AI capabilities instead.

The precise steps to follow for Spark-on-Kubernetes depend on which managed Kubernetes offering you are using and which cloud storage you want to use.

We strongly recommend that you use managed K8S clusters instead.

The rest of this page provides indicative instructions for non-managed deployments.

Main steps

Configure DSS

You first need to configure DSS to use your Spark 3.4 installation.

Build your Docker images

Follow the Spark documentation to build Docker images from your Spark distribution and push them to your repository.

Note that depending on which cloud storage you want to connect to, it may be necessary to modify the Spark Dockerfiles. See our guided installation procedures for more details.
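As a sketch, the Spark distribution ships a `docker-image-tool.sh` helper for this step. The registry address and tag below are placeholders; substitute your own repository, and note that your registry may require a `docker login` first.

```shell
# Run from the root of your unpacked Spark distribution.
# registry.example.com/myrepo and the v3.4.1 tag are placeholders.

# Build the Spark container image (add -p/-R flags if you also
# need the PySpark or SparkR images).
./bin/docker-image-tool.sh -r registry.example.com/myrepo -t v3.4.1 build

# Push the built image(s) to your repository.
./bin/docker-image-tool.sh -r registry.example.com/myrepo -t v3.4.1 push
```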

Create the Spark configuration

Create a named Spark configuration (see Spark configurations), and set at least the following keys:

  • spark.master: k8s://https://IP_OF_YOUR_K8S_CLUSTER

  • spark.kubernetes.container.image: the tag of the image that you pushed to your repository
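Put together, the named Spark configuration would contain entries along these lines (the cluster address and image reference are placeholders for the values from your own setup):

```
spark.master                      k8s://https://IP_OF_YOUR_K8S_CLUSTER
spark.kubernetes.container.image  registry.example.com/myrepo/spark:v3.4.1
```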