Amazon Elastic MapReduce

Warning

Deprecated: Support for Amazon EMR is deprecated and will be removed in a future DSS version.

Although DSS still supports EMR, we strongly advise against setting up new deployments with EMR.

DSS does not support the latest EMR versions, security options are limited, and in our experience EMR deployments tend to be more complex and to generate higher administration workloads.

We recommend that you use a fully Elastic AI infrastructure based on EKS. Please see Elastic AI computation, or get in touch with your Dataiku Customer Success Manager, Technical Account Manager or Sales Engineer for more information and to study the best options.

Supported versions

DSS supports EMR versions 5.18 to 5.30.

DSS is not compatible with EMR 5.31 / 5.32 and later.

DSS is not compatible with EMR version 6.x.

Security

  • DSS can connect to kerberized EMR clusters using the standard procedure.

  • User isolation is not supported on EMR.

Deployment scenarios

Let DSS dynamically manage one or several EMR clusters

See Dynamic AWS EMR clusters

Connect DSS to an existing EMR cluster

DSS can be connected to an EMR cluster using the standard Hadoop integration procedure, provided the underlying host is configured as a client to the cluster.
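
As an illustration, a minimal sketch of (re)running the standard DSS Hadoop integration on such a client host, driven from Python, could look as follows. The data directory path is an assumption to adapt to your installation.

    # Minimal sketch: rerun the standard DSS Hadoop integration on a host that is
    # already configured as a client of the EMR cluster.
    import subprocess

    DSS_DATADIR = "/data/dataiku/dss_data"  # hypothetical DSS data directory

    # Stop DSS, run the Hadoop integration, then start DSS again.
    subprocess.run([f"{DSS_DATADIR}/bin/dss", "stop"], check=True)
    subprocess.run([f"{DSS_DATADIR}/bin/dssadmin", "install-hadoop-integration"], check=True)
    subprocess.run([f"{DSS_DATADIR}/bin/dss", "start"], check=True)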

DSS running on one of the cluster nodes

You can install DSS on one of the EMR cluster nodes. In that case, no specific additional steps are required: just follow the regular Hadoop installation steps.

DSS can be directly installed on the EMR master node. Note that to ensure this host has enough resources (memory and CPU) to run the EMR roles and DSS concurrently, you should configure a larger instance type than the default one proposed by the EMR deployment tool.
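
For example, a hedged boto3 sketch of requesting a cluster with an oversized master node could look as follows; the instance types, counts, release label, roles and region are assumptions to adapt to your environment.

    # Minimal sketch (boto3): request an EMR cluster whose master node is large
    # enough to host both the EMR master roles and DSS.
    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")

    response = emr.run_job_flow(
        Name="dss-emr-cluster",
        ReleaseLabel="emr-5.30.0",               # a DSS-supported EMR release
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.4xlarge",  # larger than the default, to leave room for DSS
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 4,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Cluster id:", response["JobFlowId"])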

DSS can also be installed on a dedicated EMR worker node, typically one configured to contribute only the minimally acceptable amount of resources to the cluster, so that it remains essentially available to DSS.

Warning

EMR cluster nodes are volatile and only have volatile disks by default. Installing DSS on these volatile disks means that you will lose all work in DSS if your cluster is stopped, or if your cluster node restarts for any reason.

You must manually attach an EBS volume to the cluster node and install DSS on this volume. After the cluster is restarted, you will need to reattach the EBS volume to a node of the new cluster, rerun the Hadoop integration and restart DSS from there (this can easily be automated through EMR bootstrap actions).
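
For illustration, a hedged boto3 sketch of reattaching such a volume from outside the cluster could look as follows; the volume id, instance id and device name are placeholders, and mounting the volume plus restarting DSS still has to happen on the node itself (for example through a bootstrap action).

    # Minimal sketch (boto3): reattach a persistent EBS volume holding the DSS data
    # directory to a node of the freshly recreated cluster.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    ec2.attach_volume(
        VolumeId="vol-0123456789abcdef0",   # the EBS volume carrying the DSS data directory
        InstanceId="i-0123456789abcdef0",   # the new cluster node chosen to host DSS
        Device="/dev/sdf",
    )

    # Wait until the volume is attached before mounting it on the node.
    ec2.get_waiter("volume_in_use").wait(VolumeIds=["vol-0123456789abcdef0"])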

EMR worker nodes may not have the full client configuration installed, and may in particular be missing the contents of /etc/spark/conf/, which can be copied over from the master node.

DSS outside of the cluster

DSS can be deployed on a regular EC2 instance, not part of the EMR cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying EMR libraries and cluster configuration from the cluster master to the EC2 instance running DSS.

This operation is not covered by the official EMR documentation, nor officially supported by Dataiku. Please contact Dataiku for more information.

This deployment mode has the advantage of allowing you to completely shut down your cluster and recreate another one. You will then just need to resync some configuration.

Connect DSS to multiple existing EMR clusters

Follow the steps outlined above. You will then be able to declare multiple static Hadoop clusters in DSS (see Multiple Hadoop clusters) with the associated configuration keys.

Using EMRFS

Most of the time, when using dynamic EMR clusters, you will store all inputs and outputs of your flows on S3. Access to S3 from DSS and from the EMR cluster is done through EMRFS, an HDFS-compatible filesystem implementation backed by S3:

  • Go to Administration > Connections, and add a new HDFS connection

  • Enter s3://your-bucket or s3://your-bucket/prefix as the root path URI

  • Optionally, enter S3 credentials as described below

In addition to this recommended way of using EMRFS (i.e. through HDFS connections pointing to s3://), the basic “S3 datasets” are also compatible with EMRFS: set the “HDFS interface” connection parameter to “Amazon EMRFS”. This allows Spark jobs to read and write these S3 datasets directly. However, it does not give access to Hadoop-specific file formats like Parquet and ORC, which is why HDFS connections remain the recommended way of dealing with EMRFS.
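
As a quick illustration, once the EMR Spark client configuration is available on the DSS host, s3:// paths resolve through EMRFS and can be read directly from Spark. The bucket, prefix, dataset name and file format below are assumptions.

    # Minimal sketch (PySpark): read data stored under the HDFS (EMRFS) connection's
    # root path URI; the s3:// scheme resolves through EMRFS on an EMR-integrated host.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emrfs-check").getOrCreate()

    # Hypothetical dataset path under the connection root (s3://your-bucket/prefix)
    df = spark.read.parquet("s3://your-bucket/prefix/my_dataset")
    df.show(5)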

EMRFS credentials

If both the EC2 instance hosting DSS and the EMR nodes have an IAM role which grants access to the required S3 bucket, you do not have to configure explicit credentials for DSS to access EMRFS-hosted datasets.
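
As a quick sanity check of such an IAM-role-based setup, a hedged boto3 sketch relying only on the instance profile could look as follows; the bucket name and prefix are placeholders.

    # Minimal sketch (boto3): when the EC2 instance has an IAM role granting access
    # to the bucket, boto3 picks up the instance-profile credentials automatically,
    # so no explicit keys are needed.
    import boto3

    s3 = boto3.client("s3")  # credentials come from the instance profile
    resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="prefix/", MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"])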

Otherwise, you can specify S3 credentials at the connection level by defining the following properties in “Extra Hadoop conf.” (refer to the EMRFS documentation for details about available properties):

  • Add a property called fs.s3.awsAccessKeyId with your AWS access key id

  • Add a property called fs.s3.awsSecretAccessKey with your AWS secret key

These properties will be used by DSS whenever accessing files within this connection, and will be passed to Hadoop/Spark jobs as required.
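
For reference, this is a hedged sketch of how the same two properties translate into Hadoop configuration for a standalone Spark job (DSS handles this for you when the connection is used); the key values and dataset path are placeholders, not real credentials.

    # Minimal sketch (PySpark): pass the EMRFS credential properties through the
    # spark.hadoop.* prefix so they end up in the job's Hadoop configuration.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("emrfs-credentials")
        .config("spark.hadoop.fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")    # placeholder
        .config("spark.hadoop.fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY")   # placeholder
        .getOrCreate()
    )

    df = spark.read.csv("s3://your-bucket/prefix/my_dataset", header=True)
    df.printSchema()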

Warning

In order to protect credentials, DSS will only pass these additional properties to Spark jobs from users who are allowed to read the connection details, as set in the connection security settings.

When this is not allowed, Spark jobs will fall back to reading and writing the datasets through the DSS backend, which may cause a significant loss of performance and scalability.