Reference architecture: managed compute on EKS with Glue and Athena

Warning

Use this reference already if you are not using a Dataiku Cloud Stacks for AWS installation. When using a Dataiku Cloud Stacks for AWS installation, these steps are already taken care of.

Overview

This architecture document explains how to deploy:

  • DSS running on an EC2 machine

  • Computation (Python and R recipes, Python and R notebooks, in-memory visual ML, visual Spark recipes, coding Spark recipes, Spark notebooks) running over dynamically-spawned EKS clusters

  • Data assets produced by DSS synced to the Glue metastore catalog

  • Ability to use Athena as engine for running visual recipes, SQL notebooks and charts

  • Security handled by multiple sets of AWS connection credentials

For information on calculating costs of AWS services, please see the AWS pricing documentation here: https://aws.amazon.com/pricing/.

Architecture diagram

  • DSS software deployed to EC2 instances inside customer subscription

  • DSS uses Glue as a metastore, and Athena for interactive SQL queries, against data stored in customers’s own S3*

  • DSS uses EKS for containerized Python, R and Spark data processing and Machine Learning, as well as API service deployment

  • DSS uses EMR as a Data Lake for in-cluster Hive and Spark processing

  • DSS uses Redshift for in-database SQL processing

  • DSS integrates with SageMaker, Rekognition ad Comprehend ML & AI services

Dataiku DSS Architecture on AWS

Security

The dssuser needs to have an AWS keypair installed on the EC2 machine in order to manage EKS clusters. The AWS keypair needs all associated permissions to interact with EKS.

This AWS keypair will not be accessible to DSS users.

End-users use dedicated AWS keypairs to access S3 data.

For information on IAM please see the AWS documentation at https://aws.amazon.com/iam/faqs/

Main steps

Prepare the instance

  • Setup a Centos 7 EC2 machine

  • Make sure that the EC2 machine has the “default” security group assigned

  • Install and configure Docker CE

  • Install kubectl

  • Install the aws command line client

  • Setup a non-root user for the dssuser

Setup connectivity to AWS

  • As the dssuser, run aws configure to setup AWS credentials private to the dssuser. These AWS credentials require:

    • Authorization to push to ECR

    • Full control on EKS

Install DSS

  • Download DSS, together with the “generic-hadoop3” standalone Hadoop libraries and standalone Spark binaries

  • Install DSS, see Custom Dataiku install on Linux

  • Build base container-exec and Spark images, see Initial setup

Setup container configuration in DSS

  • Create a new container config, of type K8S

  • Set ACCOUNT.dkr.ecr.REGION.amazonaws.com as the Image URL

  • Set the pre-push hook to “Enable ECR”

  • Push base images

Setup Spark and Metastore in DSS

  • Enable “Managed Spark on K8S” in Spark configurations in DSS

  • Set ACCOUNT.dkr.ecr.REGION.amazonaws.com as the Image URL

  • Set the pre-push hook to “Enable ECR”

  • Push base images

  • Set metastore catalog to “Glue”

Setup S3 connections

  • Setup as many S3 connections as required, with credentials and appropriate permissions

  • Make sure that “S3A” is selected as the HDFS interface

  • Enable synchronization on the S3 connections, and select a Glue database

Setup Athena connections

  • For each S3 connection, setup an Athena connection

  • Setup the Athena connection to get credentials from the corresponding S3 connection

  • Setup the S3 connection to be associated to the corresponding Athena connection

Install EKS plugin

  • Install the EKS plugin

  • Create a new preset for “Connection”, leaving all empty

  • Create a new preset for “Networking settings”

    • Enter the identifiers of two subnets in the same VPC as DSS

    • Enter the same security group identifiers as DSS

Create your first cluster

  • Go to Admin > Clusters, Create cluster

  • Select “Create EKS cluster”, enter a name

  • Select the predefined connection and networking settings

  • Select the node pool size

  • Create the cluster

  • Cluster creation takes around 15-20 minutes

Use it

  • Configure a project to use this cluster

  • You can now perform all Spark operations over Kubernetes

  • Datasets built will sync to the Glue metastore

  • You can now create SQL notebooks on Athena

  • You can create SQL recipes over the S3 datasets, this will use Athena

Warning