Amazon Elastic MapReduce

Supported versions

DSS supports EMR versions 5.7 to 5.12.

Security

  • Multi-user security is not supported on EMR
  • While DSS should be able to connect to an EMR cluster secured with Kerberos, this kind of deployment has not been validated. Deploying Kerberos security on an EMR cluster is a fully manual and complex operation.

Deployment scenarios

Connect DSS to an existing EMR cluster

This is the relevant scenario if you have an existing EMR cluster, managed outside of DSS, and simply want DSS to use it.

DSS running on one of the cluster nodes

You can install DSS on one of the EMR cluster nodes. In that case, no specific additional steps are required: just follow the regular Hadoop installation steps.

Warning

EMR cluster nodes are volatile and only have volatile disks by default. Installing DSS on these volatile disks means that you will lose all work in DSS if your cluster is stopped, or if your cluster node restarts for any reason.

You must manually attach an EBS volume to the cluster node and install DSS on this volume. If the cluster is recreated, you'll need to reattach the EBS volume to a node of the new cluster, rerun the Hadoop integration, and restart DSS from there.
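
If you script this recovery, a minimal sketch using boto3 could look like the following; the volume ID, instance ID, device name and region are hypothetical placeholders.

    import boto3

    # Hypothetical identifiers -- replace with your own volume, instance and device
    VOLUME_ID = "vol-0123456789abcdef0"
    INSTANCE_ID = "i-0123456789abcdef0"
    DEVICE = "/dev/sdf"

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Attach the EBS volume holding the DSS data directory to the new cluster node
    ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=INSTANCE_ID, Device=DEVICE)

    # Wait until the volume is attached before mounting it, rerunning the
    # Hadoop integration and restarting DSS
    waiter = ec2.get_waiter("volume_in_use")
    waiter.wait(VolumeIds=[VOLUME_ID])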

DSS outside of the cluster

DSS can be deployed on a regular EC2 instance, not part of the EMR cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying EMR libraries and cluster configuration from the cluster master to the EC2 instance running DSS.

This operation is neither officially documented by Amazon EMR nor officially supported by Dataiku. Please contact Dataiku for more information.

This deployment mode has the advantage of allowing you to completely shut down your cluster and recreate another one. You will then just need to resync some configuration.
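
As a starting point for such a deployment, a hedged sketch using boto3 to locate the cluster master (the node whose Hadoop libraries and configuration the edge node needs copies of) could look like this; the cluster ID and region are hypothetical placeholders.

    import boto3

    CLUSTER_ID = "j-0123456789ABC"  # hypothetical EMR cluster id

    emr = boto3.client("emr", region_name="us-east-1")

    # The master node holds the configuration directories (/etc/hadoop/conf,
    # /etc/hive/conf, ...) that must be copied to the edge node running DSS
    instances = emr.list_instances(ClusterId=CLUSTER_ID, InstanceGroupTypes=["MASTER"])
    master = instances["Instances"][0]
    print("Master private DNS:", master["PrivateDnsName"])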

Connect DSS to multiple existing EMR clusters

Follow the steps outlined above. You will then be able to declare multiple static Hadoop clusters in DSS (see Multiple Hadoop clusters) with the associated configuration keys.
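
If you prefer to declare these clusters programmatically, a sketch using the dataikuapi Python client could look like the following; the host, API key and cluster name are hypothetical, and the exact settings structure is an assumption (inspect a cluster created through the UI to see the real keys).

    import dataikuapi

    # Hypothetical DSS URL and admin API key
    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_ADMIN_API_KEY")

    # Declare an additional static Hadoop cluster
    cluster = client.create_cluster("emr-cluster-2", cluster_type="manual")

    # Fill in the per-cluster configuration keys, then save
    settings = cluster.get_settings()
    raw = settings.get_raw()  # assumed accessor; returns the settings dict
    # raw["params"][...] = ...  # assumed structure -- check a UI-created cluster
    settings.save()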

Using EMRFS

Most of the time, when using dynamic EMR clusters, you will store all inputs and outputs of your flows on S3. Access to S3 from DSS and from the EMR cluster goes through EMRFS, an HDFS-compatible filesystem implementation backed by S3.

  • Go to Administration > Connections, and add a new HDFS connection
  • Enter “s3://your-bucket” as the Path
  • Add a property called “fs.s3.awsAccessKeyId” with your AWS access key id
  • Add a property called “fs.s3.awsSecretAccessKey” with your AWS secret key
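
The same connection can also be created programmatically. A minimal sketch using the dataikuapi Python client is shown below; the host, API key and connection name are hypothetical, and the exact params structure is an assumption (dump an existing HDFS connection to see the real keys).

    import dataikuapi

    # Hypothetical DSS URL and admin API key
    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_ADMIN_API_KEY")

    # Mirrors the manual steps above; the params layout is an assumption
    params = {
        "root": "s3://your-bucket",
        "dkuProperties": [
            {"name": "fs.s3.awsAccessKeyId", "value": "YOUR_ACCESS_KEY_ID"},
            {"name": "fs.s3.awsSecretAccessKey", "value": "YOUR_SECRET_KEY"},
        ],
    }
    connection = client.create_connection("emrfs-s3", type="HDFS", params=params)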

In addition to this recommended way of working with EMRFS (i.e., through HDFS connections pointing to “s3://”), the basic “S3 datasets” are also compatible with EMRFS: set the “HDFS interface” connection parameter to “Amazon EMRFS”. Spark jobs can then access these S3 datasets directly. However, this does not give access to Hadoop-specific file formats like Parquet and ORC, so HDFS connections remain the recommended way of dealing with EMRFS.
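
As an illustration of the Spark case, a minimal PySpark recipe reading and writing S3 datasets through EMRFS could look like this; the dataset names and the filtered column are hypothetical.

    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Executors read the underlying S3 data directly through EMRFS
    input_dataset = dataiku.Dataset("my_s3_input")    # hypothetical dataset name
    df = dkuspark.get_dataframe(sqlContext, input_dataset)

    # Write back to another S3 dataset; "amount" is a hypothetical column
    output_dataset = dataiku.Dataset("my_s3_output")
    dkuspark.write_with_schema(output_dataset, df.filter(df["amount"] > 0))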