Reference architecture: managed compute on GKE and storage on GCS¶
Overview¶
This architecture document explains how to deploy:
A DSS instance running on a Google Compute Engine (GCE) virtual machine
Dynamically-spawned Google Kubernetes Engine (GKE) clusters for computation (Python and R recipes/notebooks, in-memory visual ML, visual and code Spark recipes, Spark notebooks)
The ability to store data in Google Cloud Storage (GCS)
Security¶
The dssuser needs to be authenticated on the GCE machine hosting DSS with a GCP Service Account that has sufficient permissions to:
manage GKE clusters
push Docker images to Google Container Registry (GCR)
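As a rough sketch, these permissions could be granted to the Service Account with IAM role bindings along the following lines (the project ID and service account name are placeholders; your organization may prefer narrower custom roles):

    # Allow the Service Account to manage GKE clusters (names are placeholders)
    gcloud projects add-iam-policy-binding your-gcp-project \
        --member="serviceAccount:dss-node@your-gcp-project.iam.gserviceaccount.com" \
        --role="roles/container.admin"

    # Allow pushing Docker images to the GCR storage bucket of the project
    gcloud projects add-iam-policy-binding your-gcp-project \
        --member="serviceAccount:dss-node@your-gcp-project.iam.gserviceaccount.com" \
        --role="roles/storage.admin"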
Main steps¶
Prepare the instance¶
Set up a CentOS 7 GCE machine and make sure that:
you select the right Service Account
you set the access scope to “read + write” for the Storage API
Install and configure Docker CE
Install kubectl
Set up a non-root user account for the dssuser
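For illustration, creating such an instance from the command line could look like the following (names, zone and machine type are placeholders; the important parts are the Service Account and the storage read/write scope):

    # Create a CentOS 7 GCE instance bound to the DSS Service Account (placeholders)
    gcloud compute instances create dss-node \
        --zone=europe-west1-b \
        --machine-type=n1-standard-8 \
        --image-family=centos-7 --image-project=centos-cloud \
        --service-account=dss-node@your-gcp-project.iam.gserviceaccount.com \
        --scopes=storage-rw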
Install DSS¶
Download DSS, together with the “generic-hadoop3” standalone Hadoop libraries and standalone Spark binaries.
Install DSS, see Installing and setting up
Build base container-exec and Spark images, see Initial setup
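A condensed sketch of these steps, with placeholder version numbers, paths and port (the Installing and setting up and Initial setup pages are authoritative, and flag names may vary between releases):

    # Install DSS into a data directory (placeholder version, path and port)
    tar xzf dataiku-dss-VERSION.tar.gz
    dataiku-dss-VERSION/installer.sh -d /home/dssuser/dss_data -p 11000

    # Add the standalone "generic-hadoop3" Hadoop libraries and standalone Spark
    /home/dssuser/dss_data/bin/dssadmin install-hadoop-integration \
        -standaloneArchive /path/to/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-VERSION.tar.gz
    /home/dssuser/dss_data/bin/dssadmin install-spark-integration \
        -standaloneArchive /path/to/dataiku-dss-spark-standalone-VERSION.tar.gz

    # Build the base images for containerized execution and Spark-on-K8S
    /home/dssuser/dss_data/bin/dssadmin build-base-image --type container-exec
    /home/dssuser/dss_data/bin/dssadmin build-base-image --type spark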
Set up the containerized execution configuration in DSS¶
Create a new “Kubernetes” containerized execution configuration
Set gcr.io/your-gcp-project as the “Image registry URL”
Push base images
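DSS pushes the base images to gcr.io through the local Docker daemon, so the dssuser must be able to authenticate Docker against GCR. Assuming the gcloud credential helper is used, this can be set up and verified as follows:

    # Register gcloud as a Docker credential helper for gcr.io
    gcloud auth configure-docker

    # After “Push base images”, the pushed images should appear in the project registry
    gcloud container images list --repository=gcr.io/your-gcp-project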
Set up Spark and the metastore in DSS¶
Create a new Spark configuration and enable “Managed Spark-on-K8S”
Set gcr.io/your-gcp-project as the “Image registry URL”
Push base images
Set metastore catalog to “Internal DSS catalog”
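Under the hood, managed Spark-on-K8S runs Spark with Kubernetes as the master and points executors at the pushed base image. Conceptually, the resulting configuration resembles the following Spark properties (the endpoint, image name and namespace are illustrative placeholders; DSS manages these values for you):

    # Illustrative Spark-on-K8S properties managed by DSS (values are placeholders)
    spark.master=k8s://https://GKE_CLUSTER_ENDPOINT
    spark.kubernetes.container.image=gcr.io/your-gcp-project/dku-spark-base:latest
    spark.kubernetes.namespace=default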
Set up GCS connections¶
Set up as many GCS connections as required, with the appropriate credentials and permissions
Make sure that “GS” is selected as the HDFS interface
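The credentials used by each GCS connection need read/write access to the target buckets. If you rely on the instance’s Service Account, a bucket-level IAM binding along these lines may be sufficient (bucket and account names are placeholders):

    # Grant the Service Account read/write access to objects in a bucket (placeholders)
    gsutil iam ch \
        serviceAccount:dss-node@your-gcp-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
        gs://your-dss-bucket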
Install GKE plugin¶
Install the GKE plugin
Create a new “GKE connections” preset and fill in:
the GCP project key
the GCP zone
Create a new “Node pools” preset and fill in:
the machine type
the number of nodes
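The presets roughly correspond to the parameters you would pass to gcloud when creating a cluster by hand, for example (all values are placeholders):

    # Manual equivalent of the plugin presets, for comparison (placeholders)
    gcloud container clusters create dss-cluster \
        --project your-gcp-project \
        --zone europe-west1-b \
        --machine-type n1-standard-8 \
        --num-nodes 3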
Create your first cluster¶
Create a new cluster, select “create GKE cluster” and enter the desired name
Select the previously created presets and create the cluster
Cluster creation takes around 5 minutes
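Once the cluster is up, you can verify it from the DSS machine. A quick check, assuming kubectl is on the PATH and that the cluster name and zone match what you entered:

    # Fetch kubeconfig credentials for the new cluster and list its nodes (placeholders)
    gcloud container clusters get-credentials dss-cluster --zone europe-west1-b
    kubectl get nodes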
Use your cluster¶
Create a new DSS project and configure it to use your newly-created cluster
You can now perform all Spark operations over Kubernetes
GCS datasets that are built will sync to the local DSS metastore
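While a containerized or Spark workload is running, you can watch the corresponding pods being scheduled on the cluster, for example:

    # Executor and job pods appear while DSS workloads run on the cluster
    kubectl get pods --all-namespaces -w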