Reference architecture: manage compute on AKS and storage on ADLS gen2¶
Overview¶
This architecture document explains how to deploy:
- A DSS instance running on an Azure virtual machine
- Dynamically-spawned Azure Kubernetes Service (AKS) clusters for computation (Python and R recipes/notebooks, in-memory visual ML, visual and code Spark recipes, Spark notebooks)
- The ability to store data in Azure Data Lake Storage (ADLS) gen2
Security¶
We assume that all operations described here are done within a common Azure Resource Group (RG), in which the Service Principal (SP) you are using has sufficient permissions to:
- Manage AKS clusters
- Push Docker images to Azure Container Registry (ACR)
- Read from and write to ADLS gen2 (example role assignments are sketched below)
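For reference, a minimal sketch of such role assignments with the Azure CLI; <APP_ID>, <SUBSCRIPTION_ID>, <RESOURCE_GROUP> and <STORAGE_ACCOUNT> are placeholders, and your organization may prefer narrower built-in roles:

```bash
# Illustrative only: grant the Service Principal broad rights on the
# resource group (covers AKS and ACR management).
az role assignment create \
  --assignee <APP_ID> \
  --role Contributor \
  --scope /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>

# Data-plane access to ADLS gen2 on the storage account.
az role assignment create \
  --assignee <APP_ID> \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>
```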
Main steps¶
Prepare the instance¶
- Set up a CentOS 7 Azure VM in your target RG
- Install and configure Docker CE
- Install kubectl
- Set up a non-root user account, the dssuser, which will run DSS (a sketch of these preparation steps follows this list)
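A minimal sketch of these preparation steps, assuming a fresh CentOS 7 VM with root access; it follows the standard Docker CE and kubectl installation procedures:

```bash
# Install Docker CE from the official Docker repository
yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y docker-ce
systemctl enable --now docker

# Install kubectl (latest stable release)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
install -m 0755 kubectl /usr/local/bin/kubectl

# Create the non-root account that will run DSS, with access to Docker
useradd dssuser
usermod -aG docker dssuser
```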
Install DSS¶
- Download DSS, together with the “generic-hadoop3” standalone Hadoop libraries and standalone Spark binaries.
- Install DSS, see Installing DSS
- Setup Hadoop and Spark integrations, see Setting up Hadoop and Spark integration
- Build the base container-exec Docker image, see Setting up (Kubernetes)
- Build the base Spark container-exec Docker image, see Managed Spark on K8S (a command-line sketch of this sequence follows this list)
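A command-line sketch of this sequence, assuming the archives were downloaded to the VM; VERSION, the port, and all paths are placeholders, and the linked documentation pages remain the reference for the exact procedure:

```bash
# Run as dssuser. Install DSS into a data directory, listening on port 11000.
./dataiku-dss-VERSION/installer.sh -d /home/dssuser/dss_data -p 11000

# Standalone Hadoop ("generic-hadoop3") and Spark integrations
/home/dssuser/dss_data/bin/dssadmin install-hadoop-integration \
    -standaloneArchive /path/to/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-VERSION.tar.gz
/home/dssuser/dss_data/bin/dssadmin install-spark-integration \
    -standaloneArchive /path/to/dataiku-dss-spark-standalone-VERSION.tar.gz

# Base images for containerized execution and for Spark-on-K8S
/home/dssuser/dss_data/bin/dssadmin build-base-image --type container-exec
/home/dssuser/dss_data/bin/dssadmin build-base-image --type spark
```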
Set up the containerized execution configuration in DSS¶
- Create a new “Kubernetes” containerized execution configuration
- Set your-cr.azurecr.io as the “Image registry URL”
- Push base images (authenticate Docker against ACR first; see the sketch below)
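DSS pushes the base images from the local Docker daemon, which must be authenticated against ACR; one way to do this, assuming the Azure CLI is installed on the VM:

```bash
# Authenticate the local Docker daemon against ACR so that DSS can push
# the base images. "your-cr" is the registry name used above.
az acr login --name your-cr
```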
Set up Spark and the metastore in DSS¶
- Create a new Spark configuration and enable “Managed Spark-on-K8S”
- Set your-cr.azurecr.io as the “Image registry URL”
- Push base images (you can verify the pushed images as sketched below)
- Set metastore catalog to “Internal DSS catalog”
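To confirm that both base images reached the registry, you can list its repositories; the repository names themselves depend on your DSS configuration:

```bash
# List repositories in the registry to check that the base images were pushed
az acr repository list --name your-cr --output table
```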
Set up ADLS gen2 connections¶
- Set up as many Azure Blob Storage connections as required, with appropriate credentials and permissions
- Make sure that “ABFS” is selected as the HDFS interface (an example ABFS URI is shown below)
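With ABFS selected, data is addressed through URIs of the following form, where the container, storage account, and path are illustrative:

```
abfss://<container>@<storage-account>.dfs.core.windows.net/<path/to/data>
```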
Install AKS plugin¶
- Install the AKS plugin
- Create a new “AKS connection” preset and fill in (the sketch after this list shows how to retrieve these values with the Azure CLI):
  - the Azure subscription ID
  - the tenant ID
  - the client ID
  - the password (client secret)
- Create a new “Node pools” preset and fill in:
  - the machine type
  - the default number of nodes
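A sketch of how these values can be retrieved or created with the Azure CLI; the Service Principal name and scope are placeholders, and you may already have an SP from the Security section above:

```bash
# Subscription and tenant IDs for the current Azure login
az account show --query "{subscriptionId:id, tenantId:tenantId}" --output table

# Creating a Service Principal also yields the client ID (appId) and the
# password (client secret); record them for the preset.
az ad sp create-for-rbac \
  --name dss-aks-sp \
  --role Contributor \
  --scopes /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>
```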
Create your first cluster¶
- Create a new cluster, select “Create AKS cluster” and enter the desired name
- Select the previously created presets
- In the “Advanced options” section, type an IP range for the Service CIDR (e.g. 10.0.0.0/16) and an IP address for the DNS IP (e.g. 10.0.0.10).
- Click on “Start/attach”. Cluster creation takes between 5 and 10 minutes (a roughly equivalent Azure CLI call is sketched below for reference).
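The plugin performs cluster creation through the Azure APIs; for reference, a roughly equivalent Azure CLI call looks like the following, where names, node counts, VM sizes, and IP ranges are illustrative:

```bash
az aks create \
  --resource-group <RESOURCE_GROUP> \
  --name my-dss-cluster \
  --node-count 3 \
  --node-vm-size Standard_DS3_v2 \
  --service-principal <APP_ID> \
  --client-secret <CLIENT_SECRET> \
  --service-cidr 10.0.0.0/16 \
  --dns-service-ip 10.0.0.10 \
  --generate-ssh-keys
```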
Use your cluster¶
- Create a new DSS project and configure it to use your newly-created cluster
- You can now perform all Spark operations over Kubernetes (a quick verification sketch follows this list)
- ADLS gen2 datasets that are built are synchronized to the local DSS metastore
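Once the cluster is attached, you can verify it from the DSS VM using the kubeconfig managed by DSS; the pod listing is only populated while jobs are running:

```bash
# Nodes of the newly created AKS cluster
kubectl get nodes

# Containerized-execution and Spark executor pods appear here while jobs run
kubectl get pods
```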