Reference architecture: manage compute on AKS and storage on ADLS gen2¶
Overview¶
This architecture document explains how to deploy:
A DSS instance running on an Azure virtual machine
Dynamically-spawned Azure Kubernetes Service (AKS) clusters for computation (Python and R recipes/notebooks, in-memory visual ML, visual and code Spark recipes, Spark notebooks)
The ability to store data in Azure Data Lake Storage (ADLS) gen2
Security¶
We assume that all operations described here are done within a common Azure Resource Group (RG), in which the Service Principal (SP) you are using has sufficient permissions to:
Manage AKS clusters
Push Docker images to Azure Container Registry (ACR)
Read from and write to ADLS gen2
Main steps¶
Prepare the instance¶
Set up a CentOS 7 Azure VM in your target RG
Install and configure Docker CE
Install kubectl
Set up a non-root user to serve as the dssuser
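Before moving on to the DSS installation, it can help to confirm that the tools installed above are reachable on the PATH of the dssuser. The following is a minimal sketch, assuming the executables are named `docker` and `kubectl`; the helper name `missing_tools` is hypothetical and not part of DSS.

```python
import shutil


def missing_tools(tools):
    """Return the entries of `tools` that are not found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names for Docker CE and kubectl.
    missing = missing_tools(["docker", "kubectl"])
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("docker and kubectl are both available")
```

Run it as the dssuser rather than as root, since the two accounts may not share the same PATH.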
Install DSS¶
Download DSS, together with the “generic-hadoop3” standalone Hadoop libraries and standalone Spark binaries.
Install DSS (see Installing and setting up)
Build the base container-exec and Spark images (see Initial setup)
Set up containerized execution configuration in DSS¶
Create a new “Kubernetes” containerized execution configuration
Set your-cr.azurecr.io as the “Image registry URL”
Push base images
Set up Spark and metastore in DSS¶
Create a new Spark configuration and enable “Managed Spark-on-K8S”
Set your-cr.azurecr.io as the “Image registry URL”
Push base images
Set metastore catalog to “Internal DSS catalog”
Set up ADLS gen2 connections¶
Set up as many Azure blob storage connections as required, with appropriate credentials and permissions
Make sure that “ABFS” is selected as the HDFS interface
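For context, the Hadoop ABFS driver addresses ADLS gen2 data with URIs of the form `abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>`. The sketch below only illustrates that URI scheme; it does not show how DSS configures the connection. The function name and the account, filesystem, and path values are hypothetical.

```python
def abfs_uri(account, filesystem, path="", secure=True):
    """Build an ABFS URI for a path inside an ADLS gen2 filesystem (container)."""
    # "abfss" is the TLS-secured variant of the "abfs" scheme.
    scheme = "abfss" if secure else "abfs"
    clean_path = path.lstrip("/")
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{clean_path}"


if __name__ == "__main__":
    # Hypothetical storage account and filesystem names.
    print(abfs_uri("mystorageaccount", "datalake", "/projects/sales"))
```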
Install AKS plugin¶
Install the AKS plugin
Create a new “AKS connection” preset and fill in:
the Azure subscription ID
the tenant ID
the client ID
the password (client secret)
Create a new “Node pools” preset and fill in:
the machine type
the default number of nodes
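The subscription ID, tenant ID, and client ID of a Service Principal are GUIDs, so a quick format check can catch copy-paste mistakes before the values go into the “AKS connection” preset. This is a minimal sketch using only the Python standard library; the dictionary keys (`subscription_id`, `tenant_id`, `client_id`) and the sample values are hypothetical and do not reflect the plugin's internal field names.

```python
import uuid


def is_guid(value):
    """Return True if `value` is a GUID in the canonical 8-4-4-4-12 hex form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def invalid_fields(preset):
    """Return the names of the ID fields in `preset` that are not well-formed GUIDs."""
    id_fields = ("subscription_id", "tenant_id", "client_id")
    return [name for name in id_fields if not is_guid(preset.get(name, ""))]


if __name__ == "__main__":
    # Placeholder values; replace them with the IDs of your own Service Principal.
    preset = {
        "subscription_id": "12345678-1234-1234-1234-123456789abc",
        "tenant_id": "87654321-4321-4321-4321-cba987654321",
        "client_id": "not-a-guid",
    }
    print(invalid_fields(preset))
```

The client secret is deliberately left out of the check, since it has no fixed format.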
Create your first cluster¶
Create a new cluster, select “Create AKS cluster” and enter the desired name
Select the previously created presets
In the “Advanced options” section, type an IP range for the Service CIDR (e.g. 10.0.0.0/16) and an IP address for the DNS IP (e.g. 10.0.0.10).
Click on “Start/attach”. Cluster creation takes between 5 and 10 minutes.
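The Service CIDR and DNS IP entered in the advanced options have to be consistent: AKS expects the DNS IP to lie inside the Service CIDR, and the first address of that range is reserved for the Kubernetes API service. The following sketch checks a pair of values with the standard `ipaddress` module; the function name `check_service_network` is hypothetical, and the check does not verify that the range avoids overlapping your virtual network.

```python
import ipaddress


def check_service_network(service_cidr, dns_ip):
    """Return a list of problems with the Service CIDR / DNS IP pair (empty if none)."""
    # strict=True rejects values such as 10.0.0.5/16 that have host bits set.
    network = ipaddress.ip_network(service_cidr, strict=True)
    address = ipaddress.ip_address(dns_ip)
    problems = []
    if address not in network:
        problems.append(f"{dns_ip} is not inside {service_cidr}")
    elif address == network.network_address + 1:
        problems.append(f"{dns_ip} is the first address of {service_cidr}, which is reserved")
    return problems


if __name__ == "__main__":
    # Example values taken from the step above.
    print(check_service_network("10.0.0.0/16", "10.0.0.10"))
```

With the example values from the step above, the returned list is empty, meaning no problem was detected.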
Use your cluster¶
Create a new DSS project and configure it to use your newly-created cluster
You can now perform all Spark operations over Kubernetes
ADLS gen2 datasets that are built will sync to the local DSS metastore.