Reference architecture: managed compute on EKS with Glue and Athena

Overview

This architecture document explains how to deploy:

  • DSS running on an EC2 machine
  • Computation (Python and R recipes, Python and R notebooks, in-memory visual ML, visual Spark recipes, coding Spark recipes, Spark notebooks) running over dynamically-spawned EKS clusters
  • Data assets produced by DSS synced to the Glue metastore catalog
  • Ability to use Athena as the engine for running visual recipes, SQL notebooks and charts
  • Security handled by multiple sets of AWS connection credentials

Security

The dssuser needs to have an AWS access key pair (access key ID and secret access key) configured on the EC2 machine in order to manage EKS clusters. These credentials need all permissions required to interact with EKS.

These AWS credentials will not be accessible to DSS users.

End users use dedicated AWS access key pairs to access S3 data.

Main steps

Prepare the instance

  • Set up a CentOS 7 EC2 machine
  • Make sure that the EC2 machine has the “default” security group assigned
  • Install and configure Docker CE
  • Install kubectl
  • Install the aws command line client
  • Set up a non-root user account for the dssuser
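The preparation steps above can be sketched as follows. Package sources, the kubectl version, and the user name are assumptions to adjust for your environment:

```shell
# Install and configure Docker CE (CentOS 7 repository from Docker)
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce
sudo systemctl enable --now docker

# Install kubectl (pick a version compatible with your EKS clusters)
curl -LO "https://dl.k8s.io/release/v1.21.0/bin/linux/amd64/kubectl"
sudo install -m 0755 kubectl /usr/local/bin/kubectl

# Install the AWS command line client (v2)
curl -o awscliv2.zip "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"
unzip awscliv2.zip && sudo ./aws/install

# Set up a non-root user for the dssuser, allowed to use Docker
sudo useradd dssuser
sudo usermod -aG docker dssuser
```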

Set up connectivity to AWS

  • As the dssuser, run aws configure to set up AWS credentials private to the dssuser. These AWS credentials require:
    • Authorization to push to ECR
    • Full control on EKS
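As a sketch, the credentials can be configured and then sanity-checked from the command line (the IAM policy granting ECR push and EKS control is managed separately in AWS):

```shell
# Run as the dssuser so the credentials are stored under ~dssuser/.aws
aws configure   # prompts for access key ID, secret key, default region and output format

# Quick checks that the credentials are valid and carry the expected permissions
aws sts get-caller-identity                       # confirms which IAM identity is in use
aws ecr get-authorization-token >/dev/null && echo "ECR access OK"
aws eks list-clusters                             # requires EKS permissions
```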

Install DSS

Set up container configuration in DSS

  • Create a new container config, of type K8S
  • Set ACCOUNT.dkr.ecr.REGION.amazonaws.com as the Image URL
  • Set the pre-push hook to “Enable ECR”
  • Push base images
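For reference, the Image URL has the shape shown below (the account ID and region are placeholders), and the “Enable ECR” pre-push hook performs the equivalent of an ECR Docker login:

```shell
# Image URL shape: ACCOUNT.dkr.ecr.REGION.amazonaws.com, for example:
#   123456789012.dkr.ecr.eu-west-1.amazonaws.com
# The "Enable ECR" pre-push hook is roughly equivalent to:
aws ecr get-login-password --region eu-west-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com
```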

Set up Spark and the metastore in DSS

  • Enable “Managed Spark on K8S” in Spark configurations in DSS
  • Set ACCOUNT.dkr.ecr.REGION.amazonaws.com as the Image URL
  • Set the pre-push hook to “Enable ECR”
  • Push base images
  • Set metastore catalog to “Glue”

Set up S3 connections

  • Set up as many S3 connections as required, with credentials and appropriate permissions
  • Make sure that “S3A” is selected as the HDFS interface
  • Enable synchronization on the S3 connections, and select a Glue database
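To sanity-check the metastore side, the Glue databases visible to the configured credentials can be inspected with the AWS CLI; the database name below is a placeholder:

```shell
# List the Glue databases available for synchronization
aws glue get-databases --query 'DatabaseList[].Name' --output text

# List the tables already present in a given database ("dss_db" is a placeholder)
aws glue get-tables --database-name dss_db --query 'TableList[].Name'
```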

Set up Athena connections

  • For each S3 connection, set up an Athena connection
  • Configure the Athena connection to obtain its credentials from the corresponding S3 connection
  • Associate the S3 connection with its corresponding Athena connection

Install EKS plugin

  • Install the EKS plugin
  • Create a new preset for “Connection”, leaving all fields empty
  • Create a new preset for “Networking settings”
    • Enter the identifiers of two subnets in the same VPC as the DSS instance
    • Enter the same security group identifiers as the DSS instance
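The subnet and security group identifiers for the networking preset can be gathered with the AWS CLI; the instance ID below is a placeholder for the DSS EC2 machine:

```shell
# Find the VPC of the DSS instance
VPC_ID=$(aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].VpcId' --output text)

# List candidate subnets in that VPC
aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" \
  --query 'Subnets[].[SubnetId,AvailabilityZone]' --output table

# List the security groups in that VPC
aws ec2 describe-security-groups --filters "Name=vpc-id,Values=$VPC_ID" \
  --query 'SecurityGroups[].[GroupId,GroupName]' --output table
```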

Create your first cluster

  • Go to Admin > Clusters, Create cluster
  • Select “Create EKS cluster”, enter a name
  • Select the predefined connection and networking settings
  • Select the node pool size
  • Create the cluster
  • Cluster creation takes around 15-20 minutes
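Once DSS reports the cluster as started, it can be verified from the command line; the cluster name is a placeholder:

```shell
# Check the cluster status on the AWS side (expect "ACTIVE")
aws eks describe-cluster --name my-eks-cluster --query 'cluster.status'

# Check that the node pool has joined the cluster (nodes should be Ready)
kubectl get nodes
```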

Use it

  • Configure a project to use this cluster
  • You can now perform all Spark operations over Kubernetes
  • Datasets that you build will be synchronized to the Glue metastore
  • You can now create SQL notebooks on Athena
  • You can create SQL recipes over the S3 datasets; these will run on Athena
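As a rough sketch of what happens behind an Athena-backed query, an equivalent statement can be issued directly with the AWS CLI; the database, table and result bucket names are placeholders:

```shell
# Run a query against a Glue-synchronized table; results land in the S3 output location
aws athena start-query-execution \
  --query-string 'SELECT COUNT(*) FROM dss_db.my_dataset' \
  --result-configuration OutputLocation=s3://my-athena-results/
```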

Detailed steps

Warning

These steps are provided for information only. Dataiku does not guarantee their accuracy, nor that they are still up to date.