Managed Spark on K8S¶
Following this deployment guide will give you a full installation of DSS with Spark on Kubernetes, able to interact natively with S3, WASB, ADLS and GS (i.e. the cloud storage services of the three major cloud providers).
Please note that you can use managed Spark on K8S even if you use unmanaged K8S clusters.
Initial setup¶
- Download the dataiku-dss-spark-standalone binary from your usual Dataiku DSS download site
- Download the dataiku-dss-hadoop-standalone-libs-generic-hadoop3 binary from your usual Dataiku DSS download site
- Stop DSS
./bin/dss stop
- Run Hadoop integration to get Parquet and ORC support
./bin/dssadmin install-hadoop-integration -standaloneArchive /PATH/TO/dataiku-dss-hadoop-standalone-libs-generic-hadoop3...tar.gz
- Run Spark integration
./bin/dssadmin install-spark-integration -standaloneArchive /PATH/TO/dataiku-dss-spark-standalone....tar.gz -forK8S
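Put together, the initial setup steps above can be sketched as a single script. This is an illustration, not an official installer: `DSS_DATADIR` is an assumed variable pointing at your DSS data directory, and the archive paths are the placeholders from the steps above.

```shell
#!/bin/sh
# Sketch of the full initial setup sequence; run as the DSS service user.
# DSS_DATADIR is an assumption -- point it at your DSS data directory.
set -e
DSS_DATADIR="${DSS_DATADIR:?set DSS_DATADIR to your DSS data directory}"
cd "$DSS_DATADIR"

# Stop DSS before running the integrations
./bin/dss stop

# Hadoop integration (Parquet/ORC support), then Spark integration for K8S
./bin/dssadmin install-hadoop-integration -standaloneArchive /PATH/TO/dataiku-dss-hadoop-standalone-libs-generic-hadoop3...tar.gz
./bin/dssadmin install-spark-integration -standaloneArchive /PATH/TO/dataiku-dss-spark-standalone....tar.gz -forK8S

# Restart DSS once both integrations are installed
./bin/dss start
```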
Build container images¶
In order to run Spark workloads on Kubernetes, you need to build Docker images for the executors.
./bin/dssadmin build-base-image --type spark
For more details on building and customizing base images, please see Setting up (Kubernetes) and Customization of base images.
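After the build completes, you can check that the base image is present in the local Docker daemon before pushing it to your registry. The exact repository name is an assumption here; DSS prints the actual image name at the end of the build.

```shell
# Lists locally built Spark base images; the repository name may differ
# depending on your DSS version (assumption: it contains "spark").
docker images | grep -i spark
```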
Set up the named configurations¶
Go to Administration > Spark
Repeat the following operations for each named Spark configuration that you want to run on Kubernetes:
Enable “Managed Spark on K8S”
Enter the image registry URL (See Running in containers for more details)
Dataiku recommends creating a namespace per user:
- Set dssns-${dssUserLogin} as the namespace
- Enable “auto-create namespace”
Set “Authentication mode” to “Create service accounts dynamically”
When deploying on AWS EKS, set “Image pre-push hook” to “Enable push to ECR”
Save
Click on “Push base images”
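As an illustration of the per-user namespace recommendation above, the ${dssUserLogin} variable is expanded once per user, so each user's Spark workloads land in their own namespace. The logins alice and bob below are hypothetical:

```shell
# Hypothetical user logins, showing how the "dssns-${dssUserLogin}"
# pattern expands to one namespace per user.
for dssUserLogin in alice bob; do
  echo "dssns-${dssUserLogin}"
done
# prints:
# dssns-alice
# dssns-bob
```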
Use Kubernetes¶
Each Spark activity that is configured to use one of the K8S-enabled Spark configurations will automatically run on Kubernetes. For more information about Kubernetes clusters, see Running in containers.
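To confirm that a Spark activity is actually running on the cluster, you can watch for executor pods in the relevant namespace. This assumes kubectl is configured against the same cluster DSS uses, and that the per-user namespace pattern above is in place (alice is a hypothetical user login):

```shell
# Assumes kubectl targets the cluster DSS submits to, and the
# "dssns-${dssUserLogin}" namespace pattern (alice is hypothetical).
kubectl get pods --namespace dssns-alice
```

Executor pods appear while the Spark activity is running and are cleaned up when it finishes.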