Amazon Elastic MapReduce

Supported versions

DSS supports EMR versions 4.7 to 5.7.

Warning

EMR 5.8 (latest version as of October 2017) is not supported. Trying to run DSS on EMR 5.8 will fail.
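
The supported range can be checked before installing DSS. The following is a minimal sketch, assuming the cluster's release label follows the usual `emr-X.Y.Z` form. The helper name `is_supported_emr_release` is hypothetical and is not part of DSS.

```python
import re

# Supported range stated above: EMR 4.7 to 5.7, inclusive.
MIN_SUPPORTED = (4, 7)
MAX_SUPPORTED = (5, 7)


def is_supported_emr_release(release_label):
    """Return True if a label such as 'emr-5.7.0' is within the supported range.

    Hypothetical helper for illustration only; not part of DSS.
    """
    match = re.match(r"^emr-(\d+)\.(\d+)(?:\.\d+)?$", release_label.strip())
    if match is None:
        raise ValueError("unrecognized EMR release label: %r" % release_label)
    # Compare (major, minor) tuples; the patch level is ignored.
    version = (int(match.group(1)), int(match.group(2)))
    return MIN_SUPPORTED <= version <= MAX_SUPPORTED


if __name__ == "__main__":
    for label in ("emr-4.7.0", "emr-5.7.0", "emr-5.8.0"):
        print(label, is_supported_emr_release(label))
```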

Security

  • Multi-user security is not supported on EMR.
  • DSS should be able to connect to a Kerberos-secured EMR cluster, but this kind of deployment has not been validated. Deploying Kerberos security on an EMR cluster is a complex, fully manual operation.

Connecting DSS to EMR

See “Deployment scenarios” for more details.

DSS running on one of the cluster nodes

You can install DSS on one of the EMR cluster nodes. In that case, no additional steps are required: simply follow the regular Hadoop installation steps.

Warning

EMR cluster nodes are volatile and only have volatile disks by default. Installing DSS on these volatile disks means that you will lose all work in DSS if your cluster is stopped, or if your cluster node restarts for any reason.

You must manually attach an EBS volume to the cluster node and install DSS on this volume. After the cluster restarts, you will need to reattach the EBS volume to a node of the new cluster, rerun the Hadoop integration, and restart DSS from there.
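
The reattachment step can be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) is used. The function name `reattach_dss_volume`, the volume and instance IDs, and the device name `/dev/sdf` are illustrative assumptions. The volume and the target instance must be in the same Availability Zone.

```python
def reattach_dss_volume(ec2_client, volume_id, instance_id, device="/dev/sdf"):
    """Attach the EBS volume holding DSS to a node of the new cluster.

    ec2_client is expected to be a boto3 EC2 client, i.e. boto3.client("ec2").
    The default device name is an assumption; use one that is free on the node.
    Returns the attachment state reported by EC2 at request time.
    """
    response = ec2_client.attach_volume(
        VolumeId=volume_id,
        InstanceId=instance_id,
        Device=device,
    )
    # Block until EC2 reports the volume as "in-use" (attached).
    ec2_client.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
    return response["State"]


# Example call (hypothetical IDs):
#   import boto3
#   reattach_dss_volume(boto3.client("ec2"), "vol-0123456789abcdef0",
#                       "i-0123456789abcdef0")
```

This only covers the attachment itself. Mounting the volume, rerunning the Hadoop integration and restarting DSS still have to be done on the node afterwards.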

DSS outside of the cluster

DSS can be deployed on a regular EC2 instance that is not part of the EMR cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying EMR libraries and cluster configuration from the cluster master to the EC2 instance running DSS.

This operation is neither officially documented by EMR nor officially supported by Dataiku. Please contact Dataiku for more information.

EMRFS support

S3 dataset connections are compatible with EMRFS. The “HDFS interface” connection parameter should be set to “Amazon EMRFS” instead of the default “Hadoop S3A” for Spark jobs to directly access these datasets using EMRFS.
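
For illustration, the two interfaces correspond to different URI schemes: on EMR, `s3://` paths are served by EMRFS, while Hadoop S3A uses `s3a://`. The sketch below is not how DSS builds paths internally. The mapping from the connection setting to a scheme and the helper name `dataset_uri` are assumptions made for the example.

```python
# Assumed mapping from the "HDFS interface" setting to a URI scheme.
# On EMR, "s3://" is handled by EMRFS; Hadoop S3A uses "s3a://".
SCHEME_BY_INTERFACE = {
    "Amazon EMRFS": "s3",
    "Hadoop S3A": "s3a",
}


def dataset_uri(hdfs_interface, bucket, path):
    """Build the URI a Spark job would read for a given interface setting.

    Hypothetical helper for illustration only; not part of DSS.
    """
    try:
        scheme = SCHEME_BY_INTERFACE[hdfs_interface]
    except KeyError:
        raise ValueError("unknown HDFS interface: %r" % hdfs_interface)
    return "%s://%s/%s" % (scheme, bucket, path.lstrip("/"))


if __name__ == "__main__":
    # "my-bucket" and the path are placeholder values.
    print(dataset_uri("Amazon EMRFS", "my-bucket", "/data/events"))
    print(dataset_uri("Hadoop S3A", "my-bucket", "/data/events"))
```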

Deployment scenarios

One of the main advantages of EMR over other Hadoop distributions is the ability to easily shut down or scale the cluster in a few minutes, depending on the load. Here are some deployment scenarios for working with DSS on EMR while keeping the elasticity of EMR.
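
Scaling and shutdown can be driven through the EMR API. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and a cluster that uses instance groups. The function names and all IDs are hypothetical. Shutting the cluster down while DSS runs on one of its nodes also stops DSS, as described in the warning above.

```python
def resize_instance_group(emr_client, cluster_id, instance_group_id, instance_count):
    """Set the number of instances in one instance group of an EMR cluster.

    emr_client is expected to be a boto3 EMR client, i.e. boto3.client("emr").
    """
    if instance_count < 0:
        raise ValueError("instance_count must be non-negative")
    emr_client.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[
            {"InstanceGroupId": instance_group_id, "InstanceCount": instance_count}
        ],
    )


def shut_down_cluster(emr_client, cluster_id):
    """Terminate the EMR cluster; data on its volatile disks is lost."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])


# Example calls (hypothetical IDs):
#   import boto3
#   emr = boto3.client("emr")
#   resize_instance_group(emr, "j-XXXXXXXXXXXXX", "ig-XXXXXXXXXXXXX", 10)
#   shut_down_cluster(emr, "j-XXXXXXXXXXXXX")
```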

TODO