Reference architecture: managed compute on GKE and storage on GCS¶
Overview¶
This architecture document explains how to deploy:
- A DSS instance running on a Google Compute Engine (GCE) virtual machine
- Dynamically-spawned Google Kubernetes Engine (GKE) clusters for computation (Python and R recipes/notebooks, in-memory visual ML, visual and code Spark recipes, Spark notebooks)
- The ability to store data in Google Cloud Storage (GCS)
Security¶
The dssuser needs to be authenticated on the GCE machine hosting DSS with a GCP Service Account that has sufficient permissions to:
- manage GKE clusters
- push Docker images to Google Container Registry (GCR)
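For illustration, a service account with these permissions could be provisioned with gcloud along the following lines; the account and project names are placeholders, not values mandated by DSS:

    # Placeholder names: adapt the service account and project to your environment
    gcloud iam service-accounts create dss-service-account \
        --display-name "DSS service account"
    # Manage GKE clusters
    gcloud projects add-iam-policy-binding your-gcp-project \
        --member "serviceAccount:dss-service-account@your-gcp-project.iam.gserviceaccount.com" \
        --role roles/container.admin
    # Push Docker images to GCR (GCR stores image layers in GCS buckets)
    gcloud projects add-iam-policy-binding your-gcp-project \
        --member "serviceAccount:dss-service-account@your-gcp-project.iam.gserviceaccount.com" \
        --role roles/storage.admin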
Main steps¶
Prepare the instance¶
Set up a CentOS 7 GCE machine and make sure that:
- you select the right Service Account
- you set the access scope to “read + write” for the Storage API
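For instance, a machine meeting these requirements can be created from the command line; the instance name, zone and machine type below are illustrative placeholders:

    gcloud compute instances create dss-host \
        --zone us-east1-b \
        --machine-type n1-standard-8 \
        --image-family centos-7 --image-project centos-cloud \
        --service-account dss-service-account@your-gcp-project.iam.gserviceaccount.com \
        --scopes storage-rw  # "read + write" access scope for the Storage API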
Install and configure Docker CE
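On CentOS 7, Docker CE is typically installed from Docker's own repository, for example:

    sudo yum install -y yum-utils
    sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
    sudo yum install -y docker-ce docker-ce-cli containerd.io
    sudo systemctl enable --now docker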
Install kubectl
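One common way to install kubectl is to fetch the latest stable release binary, e.g.:

    curl -LO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl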
Set up a non-root user account for the dssuser
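For example (membership in the docker group is assumed so that DSS can build and push images without root):

    sudo useradd dssuser
    # Allow dssuser to run Docker commands (build/push base images) without sudo
    sudo usermod -aG docker dssuser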
Install DSS¶
- Download DSS, together with the “generic-hadoop3” standalone Hadoop libraries and standalone Spark binaries.
- Install DSS, see Installing DSS
- Set up the Hadoop and Spark integrations, see Setting up Hadoop and Spark integration
- Build the base container-exec Docker image, see Setting up (Kubernetes)
- Build the base Spark-container-exec Docker image, see Managed Spark on K8S
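A condensed sketch of these steps, run as dssuser; version strings, paths and the port are placeholders to adapt:

    tar xzf dataiku-dss-VERSION.tar.gz
    dataiku-dss-VERSION/installer.sh -d /home/dssuser/dss_data -p 11000
    # Stop DSS before running the integration steps, restart it afterwards
    /home/dssuser/dss_data/bin/dss stop
    /home/dssuser/dss_data/bin/dssadmin install-hadoop-integration \
        -standaloneArchive /path/to/dataiku-dss-hadoop3-standalone-libs-generic-VERSION.tar.gz
    /home/dssuser/dss_data/bin/dssadmin install-spark-integration \
        -standaloneArchive /path/to/dataiku-dss-spark-standalone-VERSION.tar.gz
    # Build the base images for containerized execution and Spark-on-K8S
    /home/dssuser/dss_data/bin/dssadmin build-base-image --type container-exec
    /home/dssuser/dss_data/bin/dssadmin build-base-image --type spark
    /home/dssuser/dss_data/bin/dss start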
Set up the containerized execution configuration in DSS¶
- Create a new “Kubernetes” containerized execution configuration
- Set gcr.io/your-gcp-project as the “Image registry URL”
- Push base images
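Pushing to gcr.io from the DSS host requires Docker to be authenticated against GCR; with the gcloud credential helper this is, for instance:

    # Register gcloud as a Docker credential helper for *.gcr.io registries
    gcloud auth configure-docker

After this, the “Push base images” action can upload the images built earlier.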
Set up Spark and the metastore in DSS¶
- Create a new Spark configuration and enable “Managed Spark-on-K8S”
- Set gcr.io/your-gcp-project as the “Image registry URL”
- Push base images
- Set metastore catalog to “Internal DSS catalog”
Set up GCS connections¶
- Set up as many GCS connections as required, with appropriate credentials and permissions
- Make sure that “GS” is selected as the HDFS interface
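For example, a bucket for DSS-managed datasets could be created and opened to the service account as follows; the bucket name and location are placeholders:

    gsutil mb -l us-east1 gs://your-dss-bucket
    # Grant the DSS service account read/write access to objects in the bucket
    gsutil iam ch \
        serviceAccount:dss-service-account@your-gcp-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
        gs://your-dss-bucket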
Install GKE plugin¶
Install the GKE plugin
Create a new “GKE connections” preset and fill in:
- the GCP project key
- the GCP zone
Create a new “Node pools” preset and fill in:
- the machine type
- the number of nodes
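To choose a machine type for the preset, the types available in your zone can be listed, e.g.:

    gcloud compute machine-types list --zones us-east1-b --filter "name~n1-standard"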
Create your first cluster¶
- Create a new cluster, select “create GKE cluster” and enter the desired name
- Select the previously created presets and create the cluster
- Cluster creation takes around 5 minutes
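Once creation finishes, the cluster can be checked from the DSS host; the cluster name and zone below are placeholders, and fetching kubectl credentials manually is only needed if they are not already configured:

    gcloud container clusters list
    # Fetch kubectl credentials for the new cluster, then inspect its nodes
    gcloud container clusters get-credentials your-cluster-name --zone us-east1-b
    kubectl get nodes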
Use your cluster¶
- Create a new DSS project and configure it to use your newly-created cluster
- You can now perform all Spark operations over Kubernetes
- GCS datasets that are built will sync to the local DSS metastore
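While a containerized or Spark job is running, the corresponding pods can be observed on the cluster, for instance:

    # Driver and executor pods appear while recipes and notebooks run on the cluster
    kubectl get pods -w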