Unmanaged Spark on Kubernetes¶
The precise steps to follow for Spark-on-Kubernetes depend on which managed Kubernetes offering you are using and which cloud storage you want to use.
We strongly recommend that you use managed Spark-on-K8S and managed K8S clusters instead.
The rest of this page provides indicative instructions for non-managed deployments.
Main steps¶
Configure DSS¶
You first need to configure DSS to use your Spark 2.4 installation.
If you are using a Hadoop distribution that comes natively with Spark 2.4, this is handled during the “Install Spark integration” phase. Otherwise, you may need to point DSS to your custom Spark installation. See Setting up (without Kubernetes) for more details.
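For a standalone (non-Hadoop) Spark installation, the integration is typically set up with the dssadmin tool. The following is a minimal sketch assuming a standalone Spark 2.4 archive; the archive path is a placeholder, and the exact option names may differ between DSS versions, so check the setup documentation for your release.

    # Run from the DSS data directory, with DSS stopped
    ./bin/dss stop
    # Point DSS at a standalone Spark 2.4 archive (path is a placeholder)
    ./bin/dssadmin install-spark-integration -standaloneArchive /path/to/spark-2.4.x-bin-hadoop2.7.tgz
    ./bin/dss start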
Build your Docker images¶
Follow the Spark documentation to build Docker images from your Spark distribution and push them to your repository.
Note that depending on which cloud storage you want to connect to, it may be necessary to modify the Spark Dockerfiles. See our guided installation procedures for more details.
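As an illustration, building and pushing the images with the docker-image-tool.sh script shipped in the Spark distribution typically looks like the following. The repository name and tag are placeholders; adapt them to your registry. If you modified the Dockerfiles for your cloud storage, rebuild and push after each change.

    # Run from the root of the Spark distribution
    ./bin/docker-image-tool.sh -r my-registry.example.com/spark -t 2.4.3-dss build
    ./bin/docker-image-tool.sh -r my-registry.example.com/spark -t 2.4.3-dss push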
Create the Spark configuration¶
Create a named Spark configuration (see Spark configurations), and set at least the following keys:
spark.master: k8s://https://IP_OF_YOUR_K8S_CLUSTER
spark.kubernetes.container.image: the tag of the image that you pushed to your repository
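For example, assuming the hypothetical registry and tag used in the build sketch above, the named Spark configuration could contain keys such as the following (the API server address and port are placeholders for your cluster):

    spark.master: k8s://https://203.0.113.10:6443
    spark.kubernetes.container.image: my-registry.example.com/spark/spark:2.4.3-dss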
Security for user-isolation deployments¶
When running with the User Isolation Framework, the Spark driver process runs as the impersonated end-user. Thus, the interaction between Spark and Kubernetes also runs as the impersonated end-user.
This requires that each impersonated end-user has credentials to access the Kubernetes cluster. While this deployment is entirely possible, it is not typical: each user needs a ~/.kube/config
file with valid credentials for the Kubernetes cluster.
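As a purely illustrative sketch, such a per-user ~/.kube/config might look like the following. All names and addresses are placeholders, and the token-based authentication shown here is only one possibility; the actual mechanism depends on how your cluster is set up.

    apiVersion: v1
    kind: Config
    clusters:
    - name: my-cluster
      cluster:
        server: https://IP_OF_YOUR_K8S_CLUSTER
        certificate-authority-data: <base64-encoded cluster CA>
    users:
    - name: enduser
      user:
        token: <per-user bearer token>
    contexts:
    - name: my-cluster-enduser
      context:
        cluster: my-cluster
        user: enduser
    current-context: my-cluster-enduser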
To make it easier to run Spark on Kubernetes with User Isolation Framework, DSS features a “managed Spark on Kubernetes” mode. For details and setup examples, please see our reference architecture.