Amazon Elastic MapReduce

Supported versions

DSS supports EMR versions 4.7 to 5.4.

Note

Later 5.x versions may work, as they rarely introduce incompatible changes.
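
As a rough illustration of the version policy above, the following sketch classifies an EMR release label. It assumes labels of the form `emr-X.Y.Z`; the function names and the three status strings are hypothetical and are not part of DSS.

```python
import re

# Range taken from the text above: EMR 4.7 through 5.4 are supported.
MIN_SUPPORTED = (4, 7)
MAX_SUPPORTED = (5, 4)


def parse_release_label(label):
    """Extract (major, minor) from an EMR release label such as 'emr-5.4.0'."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized EMR release label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def support_status(label):
    """Return 'supported', 'may work' (later 5.x), or 'unsupported'."""
    version = parse_release_label(label)
    if MIN_SUPPORTED <= version <= MAX_SUPPORTED:
        return "supported"
    # Later 5.x releases are not validated but may work.
    if version[0] == 5 and version > MAX_SUPPORTED:
        return "may work"
    return "unsupported"
```

A check like this could run before attaching DSS to a cluster, so that an unvalidated release is flagged early rather than discovered through a failing job.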

Security

  • Multi-user security is not supported on EMR.
  • DSS should be able to connect to a Kerberos-secured EMR cluster, but this kind of deployment has not been validated. Deploying Kerberos security on an EMR cluster is a complex, fully manual operation.

Connecting DSS to EMR

DSS running on one of the cluster nodes

You can install DSS on one of the EMR cluster nodes. In that case, no additional EMR-specific steps are required: follow the regular Hadoop installation steps.

DSS outside of the cluster

DSS can also be deployed on a regular EC2 instance that is not part of the EMR cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying the EMR libraries and the cluster configuration from the cluster master to the EC2 instance running DSS.

This operation is not officially documented by EMR.
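
Since the copy procedure is not officially documented, the following is only a sketch of how the copy step might be scripted. The directory list, the host name, and the helper name are assumptions made for illustration; the actual set of libraries and configuration directories depends on the EMR release and must be checked on the master node. The script only builds the `rsync` commands and does not run them.

```python
import shlex

# Hypothetical list of directories to mirror from the master node.
# These paths are assumptions for illustration only; check what your
# EMR release actually installs before copying anything.
ASSUMED_DIRS = ["/etc/hadoop/conf", "/etc/spark/conf"]


def rsync_commands(master_host, dirs=ASSUMED_DIRS, user="hadoop"):
    """Build (but do not run) one rsync command per directory."""
    commands = []
    for d in dirs:
        # Trailing slashes make rsync copy the directory contents in place.
        src = f"{user}@{master_host}:{d}/"
        commands.append(["rsync", "-a", "-e", "ssh", src, f"{d}/"])
    return commands


if __name__ == "__main__":
    # Placeholder host name; replace with the real master address.
    for cmd in rsync_commands("emr-master.example.internal"):
        print(shlex.join(cmd))
```

Printing the commands first leaves room to review them before anything is copied onto the EC2 instance running DSS.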

EMRFS support

S3 dataset connections are compatible with EMRFS. To let Spark jobs access these datasets directly through EMRFS, set the “HDFS interface” connection parameter to “Amazon EMRFS” instead of the default “Hadoop S3A”.
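
To illustrate what the choice of interface changes in practice: EMRFS serves the `s3://` URI scheme on EMR, while the Hadoop S3A connector uses `s3a://`. The sketch below only shows that mapping. The function name and the bucket and key values are hypothetical, and this is not a description of how DSS builds paths internally.

```python
# URI scheme used by each Hadoop filesystem implementation for S3.
# "Amazon EMRFS" and "Hadoop S3A" are the option labels quoted in the text.
SCHEME_BY_INTERFACE = {
    "Amazon EMRFS": "s3",
    "Hadoop S3A": "s3a",
}


def s3_uri(interface, bucket, key):
    """Build the URI a Hadoop or Spark job would use for the given interface."""
    scheme = SCHEME_BY_INTERFACE[interface]
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    print(s3_uri("Amazon EMRFS", "my-bucket", "/data/events.csv"))
    print(s3_uri("Hadoop S3A", "my-bucket", "/data/events.csv"))
```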

HDFS vs S3

Deployment scenarios

One of the main advantages of EMR over other Hadoop distributions is the ability to easily shut down or scale the cluster in a few minutes, depending on the load. Here are some deployment scenarios for working with DSS on EMR while keeping the elasticity of EMR.

TODO