Reference architecture: managed compute on GKE and storage on GCS


This architecture document explains how to deploy:

  • A DSS instance running on a Google Compute Engine (GCE) virtual machine

  • Dynamically-spawned Google Kubernetes Engine (GKE) clusters for computation (Python and R recipes/notebooks, in-memory visual ML, visual and code Spark recipes, Spark notebooks)

  • The ability to store data in Google Cloud Storage (GCS)


The dssuser needs to be authenticated on the GCE machine hosting DSS with a GCP Service Account that has sufficient permissions to:

  • manage GKE clusters

  • push Docker images to Google Container Registry (GCR)
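The permissions above can be granted with IAM role bindings. The following is a minimal sketch; the project ID, service account name, and the choice of the broad `roles/container.admin` and `roles/storage.admin` roles are illustrative assumptions (narrower custom roles may suffice in your environment):

```shell
# Hypothetical project and service account -- substitute your own values
PROJECT_ID="my-gcp-project"
SA="dss-sa@${PROJECT_ID}.iam.gserviceaccount.com"

# Allow the service account to manage GKE clusters
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SA}" \
  --role="roles/container.admin"

# Allow it to push Docker images to GCR
# (GCR stores image layers in GCS buckets, hence a storage role)
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SA}" \
  --role="roles/storage.admin"
```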

Main steps

Prepare the instance

  • Set up a CentOS 7 GCE machine and make sure that:

    • you select the right Service Account

    • you set the access scope to “read + write” for the Storage API

  • Install and configure Docker CE

  • Install kubectl

  • Set up a non-root user account for the dssuser
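The instance preparation steps above can be sketched as follows on CentOS 7. This assumes sudo access; the Docker repository URL is the standard Docker CE one, and the `dssuser` account name matches the one used throughout this document:

```shell
# Install and enable Docker CE from the official Docker repository
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo \
  https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce
sudo systemctl enable --now docker

# Install kubectl (here via the Cloud SDK; a direct download from the
# Kubernetes release page works as well)
sudo gcloud components install kubectl

# Create the non-root dssuser account and let it use Docker
sudo useradd dssuser
sudo usermod -aG docker dssuser
```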

Install DSS

  • Download DSS, together with the “generic-hadoop3” standalone Hadoop libraries and standalone Spark binaries.

  • Install DSS, see Installing and setting up

  • Build base container-exec and Spark images, see Initial setup
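The install and image-build steps can be sketched as below. The version, data directory, and port are placeholders to adjust to your environment; the `dssadmin build-base-image` invocations follow the standard DSS CLI, but verify the exact options against your DSS version's documentation:

```shell
# Illustrative version, data directory, and port -- adjust as needed
DSS_VERSION="X.Y.Z"
DATA_DIR="/home/dssuser/dss_data"

# Install DSS (run as dssuser)
./dataiku-dss-${DSS_VERSION}/installer.sh -d "${DATA_DIR}" -p 11000

# Build the base images for containerized execution and Spark-on-K8S
"${DATA_DIR}/bin/dssadmin" build-base-image --type container-exec
"${DATA_DIR}/bin/dssadmin" build-base-image --type spark
```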

Set up the containerized execution configuration in DSS

  • Create a new “Kubernetes” containerized execution configuration

  • Set your GCR registry (for example, gcr.io/&lt;your-gcp-project&gt;) as the “Image registry URL”

  • Push base images

Set up Spark and the metastore in DSS

  • Create a new Spark configuration and enable “Managed Spark-on-K8S”

  • Set your GCR registry (for example, gcr.io/&lt;your-gcp-project&gt;) as the “Image registry URL”

  • Push base images

  • Set metastore catalog to “Internal DSS catalog”

Set up GCS connections

  • Set up as many GCS connections as required, with appropriate credentials and permissions

  • Make sure that “GS” is selected as the HDFS interface

Install GKE plugin

  • Install the GKE plugin

  • Create a new “GKE connections” preset and fill in:

    • the GCP project key

    • the GCP zone

  • Create a new “Node pools” preset and fill in:

    • the machine type

    • the number of nodes

Create your first cluster

  • Create a new cluster, select “create GKE cluster” and enter the desired name

  • Select the previously created presets and create the cluster

  • Cluster creation takes around 5 minutes
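Once DSS reports the cluster as created, you can verify it from the instance. The cluster name and zone below are hypothetical; use the values you entered in the presets:

```shell
# Fetch kubeconfig credentials for the new cluster
gcloud container clusters get-credentials my-dss-cluster --zone us-east1-b

# All nodes should report STATUS "Ready"
kubectl get nodes
```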

Use your cluster

  • Create a new DSS project and configure it to use your newly-created cluster

  • You can now perform all Spark operations over Kubernetes

  • GCS datasets that are built will sync to the local DSS metastore