Managed Spark on K8S

Following this procedure gives you a full installation of DSS with Spark on Kubernetes, able to natively interact with S3, WASB, ADLS and GS (that is, the cloud storage services of the three major cloud providers).

Note that you can use managed Spark on K8S even with unmanaged K8S clusters.

Initial setup

  • Download the dataiku-dss-spark-standalone binary from your usual Dataiku DSS download site
  • Download the dataiku-dss-hadoop-standalone-libs-generic-hadoop3 binary from your usual Dataiku DSS download site
  • Stop DSS
./bin/dss stop
  • Run Hadoop integration to get Parquet and ORC support
./bin/dssadmin install-hadoop-integration -standaloneArchive /PATH/TO/dataiku-dss-hadoop3-standalone-libs-generic...tar.gz
  • Run Spark integration
./bin/dssadmin install-spark-integration -standaloneArchive /PATH/TO/dataiku-dss-spark-standalone....tar.gz -forK8S
  • Start DSS
./bin/dss start
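
Because the sequence above stops DSS before running the integrations, it can help to confirm that both downloaded archives are actually present first. The following is a minimal sketch, not part of the official procedure: check_archives is a hypothetical helper name, and the archive paths in the usage comment are placeholders for your real download locations.

```shell
#!/bin/sh
# check_archives is a hypothetical helper (not a DSS command): it returns
# non-zero if any of the given paths is not an existing regular file.
check_archives() {
  for archive in "$@"; do
    if [ ! -f "$archive" ]; then
      echo "missing archive: $archive" >&2
      return 1
    fi
  done
  return 0
}

# Possible usage (HADOOP_ARCHIVE and SPARK_ARCHIVE are placeholder variables):
#   check_archives "$HADOOP_ARCHIVE" "$SPARK_ARCHIVE" && ./bin/dss stop
```

Checking before the stop avoids taking DSS down only to find that a download is missing or was saved elsewhere.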

Build container images

To run Spark workloads on Kubernetes, you need to build Docker images for the executors:

./bin/dssadmin build-base-image --type spark

For more details on building and customizing base images, see Setting up (Kubernetes) and Customization of base images.

Set up the named configurations

  • Go to Administration > Spark

  • Repeat the following operations for each named Spark configuration that you want to run on Kubernetes

    • Enable “Managed Spark on K8S”

    • Enter the image registry URL (see Running in containers for more details)

    • Dataiku recommends creating one namespace per user:

      • Set dssns-${dssUserLogin} as namespace
      • Enable “auto-create namespace”
    • Set “Authentication mode” to “Create service accounts dynamically”

  • When deploying on AWS EKS, set “Image pre-push hook” to “Enable push to ECR”

  • Save

  • Click on “Push base images”
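
To illustrate the per-user namespace pattern from the steps above, the sketch below shows what dssns-${dssUserLogin} would expand to for a given login, and checks the result against the Kubernetes rule that a namespace name must be a lowercase RFC 1123 label of at most 63 characters. Both function names are hypothetical, and the sketch does not reproduce any sanitization that DSS itself may apply to logins.

```shell
#!/bin/sh
# Hypothetical helper: plain substitution of the login into the
# "dssns-${dssUserLogin}" pattern (no DSS-side sanitization is modeled).
dss_namespace() {
  printf 'dssns-%s\n' "$1"
}

# Hypothetical helper: succeeds only if the name is a valid Kubernetes
# namespace name (lowercase alphanumerics and '-', starting and ending
# with an alphanumeric character, at most 63 characters).
is_valid_namespace() {
  name=$1
  [ "${#name}" -le 63 ] || return 1
  printf '%s\n' "$name" | LC_ALL=C grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
}

# Possible usage:
#   ns=$(dss_namespace alice)
#   is_valid_namespace "$ns" && echo "$ns is a valid namespace name"
```

A check of this kind can help spot logins (for example, ones containing uppercase letters or dots) whose raw expansion would not be a valid namespace name.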

Use Kubernetes

Each Spark activity that is configured to use one of the K8S-enabled Spark configurations automatically runs on Kubernetes. For more information about Kubernetes clusters, see Running in containers.
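
One way to confirm that a Spark activity is really running on Kubernetes is to list the executor pods while the job is active. The sketch below makes two assumptions: the namespace follows the dssns-${dssUserLogin} pattern recommended above, and the executor pods carry the spark-role=executor label that Spark on Kubernetes normally sets. list_spark_executors and the KUBECTL override variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list Spark executor pods in a user's namespace.
# Assumes the "dssns-<login>" namespace pattern and the spark-role=executor
# pod label. KUBECTL can be overridden to point at another kubectl binary.
list_spark_executors() {
  login=$1
  "${KUBECTL:-kubectl}" get pods \
    --namespace "dssns-${login}" \
    --selector spark-role=executor
}

# Possible usage, while a Spark job started by user "alice" is running:
#   list_spark_executors alice
```

If your configurations use a different namespace, adjust the --namespace value accordingly.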