Dynamic AWS EMR clusters
Tier 2 support: Management of dynamic EMR clusters is provided through a plugin and is covered by Tier 2 support.
Although DSS supports EMR, we strongly advise against setting up new deployments with EMR.
DSS does not support the latest EMR versions, security options are limited, and our experience is that EMR deployments are associated with a higher complexity and tend to generate higher administration workloads.
We recommend that you use a fully Elastic AI infrastructure based on EKS. Please see Elastic AI computation, or get in touch with your Dataiku Customer Success Manager, Technical Account Manager or Sales Engineer for more information and to study the best options.
DSS can create and manage multiple EMR clusters, allowing you to easily scale your workloads across multiple clusters, use clusters dynamically for some scenarios, …
For more information on dynamic clusters and the usage of a dynamic cluster for a scenario, please see Multiple Hadoop clusters.
Support for dynamic EMR clusters is provided through the “EMR Dynamic clusters” plugin. You will need to install this plugin in order to use this feature.
As with other kinds of multi-cluster setups, the server that runs DSS needs to have the client libraries for the proper Hadoop distribution. In this case, your server needs the EMR client libraries for the EMR version you will use. Dataiku provides a ready-to-use AMI that already includes the required EMR client libraries.
The previous requirement implies that the server that will run DSS and start the EMR clusters cannot be an edge node of a different kind of cluster. For example, you cannot manage dynamic EMR clusters from a DSS machine primarily connected to a MapR cluster.
When working with multiple EMR clusters, all clusters should run the same EMR version. If running different versions, some incompatibilities may occur.
It is not possible to create secure dynamic clusters.
We strongly recommend that you use our “dataiku-emrclient” AMI which contains everything required for EMR support.
This AMI is named dataiku-emrclient-EMR_VERSION-BUILD_DATE, where EMR_VERSION is the EMR version with which it is compatible, and BUILD_DATE is its build date in YYYYMMDD format.
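The naming convention can be checked mechanically. The following is a minimal sketch, assuming the name always follows the pattern described above; the `EmrClientAmi` type and `parse_ami_name` function are hypothetical helpers written for this illustration, not part of DSS or of the plugin.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class EmrClientAmi(NamedTuple):
    """Hypothetical holder for the two parts encoded in the AMI name."""
    emr_version: str
    build_date: date


# dataiku-emrclient-EMR_VERSION-BUILD_DATE, with BUILD_DATE as YYYYMMDD.
_AMI_NAME = re.compile(r"^dataiku-emrclient-(\d+(?:\.\d+)*)-(\d{8})$")


def parse_ami_name(name: str) -> EmrClientAmi:
    """Split an AMI name into its EMR version and its build date."""
    match = _AMI_NAME.match(name)
    if match is None:
        raise ValueError(f"not a dataiku-emrclient AMI name: {name!r}")
    version, stamp = match.groups()
    # strptime also rejects impossible dates such as 20221340.
    return EmrClientAmi(version, datetime.strptime(stamp, "%Y%m%d").date())


ami = parse_ami_name("dataiku-emrclient-5.30.2-20220126")
print(ami.emr_version, ami.build_date.isoformat())  # 5.30.2 2022-01-26
```

Parsing the date rather than keeping it as a string makes it easy to compare two builds of the image for the same EMR version.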
At the time of writing, the latest version of this AMI supports EMR 5.30.2 (“dataiku-emrclient-5.30.2-20220126”). It is available in the following AWS regions:
eu-west-1 (AMI id: ami-0a3edce0134083c4f)
us-east-1 (AMI id: ami-0cda7552753449447)
us-west-1 (AMI id: ami-0c4d7c638ac4a9097)
us-west-2 (AMI id: ami-00846b24187673dd2)
This AMI can be copied to other regions as desired.
EMR versions earlier than 5.30 are based on the Amazon Linux 1 Linux distribution, which is not supported by DSS any more. Although “dataiku-emrclient” images are still available for these EMR versions, they should not be used for new deployments.
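As a rough illustration of this version cutoff, the sketch below compares EMR versions numerically. It only encodes the lower bound stated above (5.30); it does not know which newer EMR versions DSS supports, and the function names are hypothetical.

```python
from typing import Tuple

# Lower bound taken from the text: earlier EMR versions run on Amazon Linux 1.
MIN_EMR_VERSION = (5, 30)


def emr_version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '5.30.2' into (5, 30, 2) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_amazon_linux_1_based(emr_version: str) -> bool:
    """True for EMR versions earlier than 5.30, which should not be used for new deployments."""
    return emr_version_tuple(emr_version)[:2] < MIN_EMR_VERSION


print(is_amazon_linux_1_based("5.29.0"))  # True
print(is_amazon_linux_1_based("5.30.2"))  # False
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where "5.9.0" would sort after "5.30.2".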
Dataiku may periodically rebuild this image to incorporate new updates or support new EMR versions. It is recommended to check for its latest versions, as follows:
using the AWS EC2 console: select the “AMIs” display in the leftmost column, select “Public images” in the drop-down menu at the left of the search field, and search for “dataiku-emrclient”.
using the AWS CLI:
aws ec2 --region eu-west-1 describe-images \
    --owners 067063543704 \
    --filters 'Name=name,Values=dataiku-emrclient-*' \
    --query 'Images[].[ImageId,Name]' \
    --output table
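Once the image list is available, picking the most recent build can be automated. The sketch below assumes the list has been exported as JSON pairs of image id and image name; the `newest_image` helper is hypothetical, and the second entry of the sample data is a made-up placeholder used only to have something to compare against.

```python
import json
from typing import List, Sequence


def newest_image(pairs: Sequence[Sequence[str]]) -> List[str]:
    """Return the [image_id, name] pair whose name carries the latest BUILD_DATE."""
    def build_date(pair: Sequence[str]) -> str:
        # The name ends in -YYYYMMDD, and that format sorts correctly as text.
        return pair[1].rsplit("-", 1)[-1]

    return list(max(pairs, key=build_date))


# Sample input: the first entry comes from the region list above,
# the second one is a placeholder invented for this example.
sample = json.loads("""
[
  ["ami-0a3edce0134083c4f", "dataiku-emrclient-5.30.2-20220126"],
  ["ami-00000000000000000", "dataiku-emrclient-5.30.2-20210101"]
]
""")

print(newest_image(sample))
```

The comparison looks at the build date only, so it is meant for a list that has already been narrowed down to the EMR version you intend to run.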
The AMI does not include DSS. You need to install DSS using the regular DSS installation procedure.
The user account running DSS must have the required credentials in order to create EMR clusters.
The two main ways to accomplish this are:
Make sure that your machine has an IAM role that grants sufficient rights to create EMR clusters
Make sure that your ~/.aws/credentials file has valid credentials. This can be achieved by running aws login prior to starting DSS.
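For the second option, a quick pre-flight check can confirm that the credentials file contains a usable profile before DSS is started. This is a minimal sketch, assuming the standard INI layout of the AWS credentials file with a `[default]` profile; the `has_static_credentials` function is a hypothetical helper, and it only checks that the keys are present, not that AWS accepts them.

```python
import configparser
from pathlib import Path

DEFAULT_CREDENTIALS_FILE = Path.home() / ".aws" / "credentials"


def has_static_credentials(path=DEFAULT_CREDENTIALS_FILE, profile: str = "default") -> bool:
    """Check that the given profile defines both an access key id and a secret key."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is ignored and simply yields no sections
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(section.get("aws_secret_access_key"))


if __name__ == "__main__":
    print("credentials file looks usable:", has_static_credentials())
```

Run it as the user account that runs DSS, since the home directory (and therefore the credentials file) is specific to that account.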
From Administration > Plugins, install the “EMR dynamic clusters” plugin.
Most of the time, when using dynamic EMR clusters, you will store all inputs and outputs of your flows on S3. Access to S3 from DSS and from the EMR cluster is done through EMRFS.
Go to Administration > Connections, and add a new HDFS connection
Enter “s3://your-bucket” or “s3://your-bucket/prefix” as the root path URI
Unless the DSS host has implicit access to the bucket through its IAM role or default credentials in ~/.aws/credentials, define connection-level S3 credentials in “Extra Hadoop conf”:
Add a property called “fs.s3.awsAccessKeyId” with your AWS access key id
Add a property called “fs.s3.awsSecretAccessKey” with your AWS secret key
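The values entered in this connection can be prepared and sanity-checked ahead of time. The sketch below is illustrative only: the two property names come from the steps above, while `check_root_uri` and `s3_credential_properties` are hypothetical helpers that do not call any DSS API.

```python
from typing import Dict
from urllib.parse import urlparse


def check_root_uri(uri: str) -> str:
    """Validate a root path URI of the form s3://your-bucket or s3://your-bucket/prefix."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://bucket[/prefix], got {uri!r}")
    return uri.rstrip("/")


def s3_credential_properties(access_key_id: str, secret_access_key: str) -> Dict[str, str]:
    """Build the two key/value pairs to add under "Extra Hadoop conf"."""
    return {
        "fs.s3.awsAccessKeyId": access_key_id,
        "fs.s3.awsSecretAccessKey": secret_access_key,
    }


print(check_root_uri("s3://your-bucket/prefix/"))
for key in s3_credential_properties("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_KEY"):
    print(key)
```

The check only accepts the `s3://` scheme because that is the scheme used in the root path examples above.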
Go to Administration > Clusters and click “Create cluster”
In “Type”, select “EMR cluster (create cluster)” and give a name to your new cluster. You are taken to the “managed cluster” configuration page, where you will set all of your EMR cluster settings.
The minimal settings that you need to set are:
Your AWS region (leave empty to use the same as the EC2 node running DSS)
The EC2 instance type for the master and worker nodes
The total number of instances you require (for N instances, there will be 1 master, plus N-1 workers in the CORE group)
The version of EMR you want to use. Beware, this should be consistent with the AMI you used.
The VPC Subnet identifier in which you want to create your EMR cluster. This should be in the same VPC as the one the DSS machine is running in. Leave empty to use the same subnet as the EC2 node running DSS.
The security groups to associate to all of the cluster machines. Make sure to add security groups that grant full access between the DSS host and the EMR cluster members.
Click on “Start/Attach”. Your EMR cluster is created. This phase generally lasts 5 to 10 minutes. When the progress modal closes, you have a working EMR cluster, and an associated DSS dynamic cluster definition that can talk to it.
In any project, go to Settings > Cluster, and select the identifier you gave to the EMR cluster. Any recipe or Hive notebook running in this project will now use your EMR cluster.
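To make the minimal settings listed above easier to review before filling in the form, they can be written down as a small data structure. This is a sketch under stated assumptions: `EmrClusterSettings` and its field names are hypothetical and do not correspond to the plugin's internal parameter names, the instance type and security group id are placeholders, and whether master and workers take separate instance types is assumed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EmrClusterSettings:
    """Hypothetical checklist of the minimal settings; not a DSS API object."""
    master_instance_type: str
    worker_instance_type: str
    total_instances: int
    emr_version: str                  # should match the EMR version of the AMI
    security_groups: List[str] = field(default_factory=list)
    region: Optional[str] = None      # None: same region as the EC2 node running DSS
    subnet_id: Optional[str] = None   # None: same subnet as the EC2 node running DSS

    def node_counts(self) -> Tuple[int, int]:
        """Return (master nodes, CORE nodes): 1 master and N-1 CORE nodes."""
        if self.total_instances < 1:
            raise ValueError("a cluster needs at least the master node")
        return 1, self.total_instances - 1


settings = EmrClusterSettings(
    master_instance_type="m5.xlarge",        # placeholder instance type
    worker_instance_type="m5.xlarge",
    total_instances=4,
    emr_version="5.30.2",
    security_groups=["sg-0123456789abcdef0"],  # placeholder id
)
print(settings.node_counts())  # (1, 3)
```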
Go to Administration > Clusters > Your cluster and click “Stop/Detach” to destroy the EMR cluster and release resources.
Note that the DSS cluster definition itself remains, allowing you to recreate the EMR cluster at a later time. Projects that are configured to use this cluster while it is in “Stopped/Detached” state will fail.
For a fully elastic approach, you can create EMR clusters at the beginning of a sequence of scenarios, run the scenarios and then destroy the EMR cluster, fully automatically.
Please see Multiple Hadoop clusters for more information. In the “Setup Cluster” scenario step, you will need to enter the EMR cluster configuration details.
In addition to the basic “Start” and “Stop”, the Dynamic EMR clusters plugin provides the ability to scale up and down an attached EMR cluster.
Go to the “Actions” tab of your cluster, and select the “Scale” action. You will have to specify the target number of instances in the CORE and TASK groups. We recommend that you never scale down the CORE group (which contains a small HDFS needed for cluster operations), and instead scale up and down the TASK group.
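The recommendation to leave the CORE group untouched can be expressed as a small planning rule. The following is a minimal sketch, not the plugin's logic: `plan_scale` is a hypothetical helper that, given the current CORE size and a desired number of worker instances, keeps CORE constant and puts the difference in the TASK group.

```python
from typing import Dict


def plan_scale(current_core: int, desired_workers: int) -> Dict[str, int]:
    """Compute target group sizes without ever shrinking the CORE group."""
    if current_core < 0 or desired_workers < 0:
        raise ValueError("instance counts cannot be negative")
    # CORE holds the small HDFS needed for cluster operations, so it stays as is;
    # any capacity beyond it is provided by TASK instances.
    task = max(0, desired_workers - current_core)
    return {"core": current_core, "task": task}


print(plan_scale(current_core=3, desired_workers=10))  # {'core': 3, 'task': 7}
print(plan_scale(current_core=3, desired_workers=2))   # {'core': 3, 'task': 0}
```

When fewer workers are requested than the CORE group already has, the sketch returns zero TASK instances rather than reducing CORE.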
You can scale up/down a cluster as part of a scenario:
Add an “Execute macro” step
Select the “Scale cluster up/down” step
Enter the DSS cluster identifier, either directly, or using a variable - the latter case is required if you set up your cluster as part of the scenario.
Select the settings
In this kind of setup you will generally scale up at the beginning of the scenario and scale down at the end. You will need two steps for that. Make sure to select “Run this step > Always” for the “scale down” step. This way, even if your scenario fails, the scale down operation will be executed.
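The role of “Run this step > Always” is comparable to a `finally` clause in code: the scale down runs whether or not the work in between succeeds. The sketch below illustrates that pattern only; `scaled_up` is a hypothetical context manager and `scale` is a stand-in callable supplied by the caller, not a DSS or plugin function.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def scaled_up(scale: Callable[[int], None], busy_task_nodes: int,
              idle_task_nodes: int = 0) -> Iterator[None]:
    """Scale the TASK group up, run the body, then always scale back down."""
    scale(busy_task_nodes)
    try:
        yield
    finally:
        # Equivalent in spirit to "Run this step > Always" on the scale down step.
        scale(idle_task_nodes)


calls = []
try:
    with scaled_up(calls.append, busy_task_nodes=8):
        raise RuntimeError("simulated scenario failure")
except RuntimeError:
    pass

print(calls)  # [8, 0]
```

Even though the body raises an error, the recorded calls show that the scale down to the idle size still happened.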
In addition to the basic settings outlined above, the Dynamic EMR Clusters plugin provides some advanced settings:
Path for logs