Managed Spark on K8S

Following this procedure gives you a full installation of DSS with Spark on Kubernetes, able to interact natively with S3, WASB, ADLS and GS (i.e. the cloud storage services of the three major cloud providers).
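
As an illustration, these are the URI schemes that such jobs can read and write; all bucket, container and account names below are hypothetical:

```text
s3://my-bucket/path/                                      Amazon S3
wasbs://container@myaccount.blob.core.windows.net/path/   Azure Blob Storage (WASB)
abfss://container@myaccount.dfs.core.windows.net/path/    Azure Data Lake Storage gen2 (ADLS)
gs://my-bucket/path/                                      Google Cloud Storage (GS)
```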

Initial setup

  • Download the dataiku-dss-spark-standalone binary from your usual Dataiku DSS download site
  • Download the dataiku-dss-hadoop3-standalone-libs-generic binary from your usual Dataiku DSS download site
  • Stop DSS
./bin/dss stop
  • Run Hadoop integration to get Parquet and ORC support
./bin/dssadmin install-hadoop-integration -standalone generic-hadoop3 -standaloneArchive /PATH/TO/dataiku-dss-hadoop3-standalone-libs-generic...tar.gz
  • Run Spark integration
./bin/dssadmin install-spark-integration -standaloneArchive /PATH/TO/dataiku-dss-spark-standalone....tar.gz
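
As a minimal pre-flight check before running the two integration commands, you can verify that both downloaded archives are in place. The paths below keep the placeholders from the steps above; substitute the actual versioned filenames you downloaded:

```shell
# Check that both standalone archives exist before installing.
# Paths are placeholders: replace them with the real download locations.
for archive in \
    /PATH/TO/dataiku-dss-hadoop3-standalone-libs-generic...tar.gz \
    /PATH/TO/dataiku-dss-spark-standalone....tar.gz
do
    if [ -f "$archive" ]; then
        echo "found: $archive"
    else
        echo "missing: $archive"
    fi
done
```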

Build container images

To run Spark workloads on Kubernetes, you need to build Docker images for the executors.

./bin/dssadmin build-container-exec-base-spark-image

Set up the named configurations

  • Go to Administration > Spark

  • Repeat the following operations for each named Spark configuration that you want to run on Kubernetes

    • Enable “Managed Spark on K8S”

    • Enter the image registry URL (see Running in containers for more details)

    • Dataiku recommends creating one namespace per user:

      • Set dssns_${dssUserLogin} as namespace
      • Enable “auto-create namespace”
    • Set “Authentication mode” to “Create service accounts dynamically”

  • Save

  • Click on “Push base images”
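
The namespace pattern above is expanded per user when a job is submitted. A minimal sketch of that substitution, using a hypothetical user login jdoe:

```shell
# dssUserLogin stands in for the DSS user variable; "jdoe" is hypothetical
dssUserLogin="jdoe"
namespace="dssns_${dssUserLogin}"
echo "$namespace"   # prints dssns_jdoe
```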

Use Kubernetes

Each Spark activity that is configured to use one of the K8S-enabled Spark configurations automatically runs on Kubernetes. For more information about Kubernetes clusters, see Running in containers.
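
Under the hood, a K8S-enabled named configuration translates into standard Spark-on-Kubernetes properties for each job. An illustrative (not exhaustive) fragment, with hypothetical values for the API server, namespace and registry:

```text
spark.master                                             k8s://https://my-api-server:6443
spark.kubernetes.namespace                               dssns_jdoe
spark.kubernetes.container.image                         my-registry.example.com/dku-exec-base-spark:latest
spark.kubernetes.authenticate.driver.serviceAccountName  <dynamically created service account>
```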