Amazon Elastic MapReduce
DSS supports EMR versions 4.7 to 5.4.
Later 5.x versions may work, as they rarely introduce incompatible changes.
- Multi-user security is not supported on EMR.
- While DSS should be able to connect to an EMR cluster that is secured with Kerberos, this kind of deployment has not been validated. Deploying Kerberos security on an EMR cluster is a complex, fully manual operation.
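As an illustration of the version range stated at the top of this page, the following minimal sketch classifies an EMR release as supported (4.7 to 5.4), possibly working (later 5.x), or unsupported. The function names and the three labels are hypothetical and are not part of DSS. The sketch accepts either a bare version such as `5.4.0` or an EMR release label such as `emr-5.4.0`.

```python
# Hypothetical helper, not part of DSS: encodes the version range from this page.
SUPPORTED_MIN = (4, 7)  # lowest supported EMR version
SUPPORTED_MAX = (5, 4)  # highest validated EMR version


def parse_version(version):
    """Return (major, minor) from a string such as '5.4.0' or 'emr-5.4.0'.

    The string must contain at least a major and a minor component.
    """
    cleaned = version.strip().lower()
    if cleaned.startswith("emr-"):
        cleaned = cleaned[len("emr-"):]
    parts = cleaned.split(".")
    return int(parts[0]), int(parts[1])


def classify_emr_version(version):
    """Classify a version as 'supported', 'may work' or 'unsupported'."""
    major_minor = parse_version(version)
    if SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX:
        return "supported"
    # Later 5.x releases are not validated but rarely break compatibility.
    if major_minor[0] == 5 and major_minor > SUPPORTED_MAX:
        return "may work"
    return "unsupported"
```

For example, `classify_emr_version("emr-5.8.0")` falls in the "may work" category, because it is a 5.x release later than 5.4.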
Connecting DSS to EMR
DSS running on one of the cluster nodes
You can install DSS on one of the EMR cluster nodes. In that case, no additional steps are required: follow the regular Hadoop installation steps.
DSS outside of the cluster
DSS can be deployed on a regular EC2 instance, not part of the EMR cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying EMR libraries and cluster configuration from the cluster master to the EC2 instance running DSS.
This operation is not officially documented by EMR.
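Because the copy step is not officially documented, the following is only a rough sketch of how it could be scripted. It builds `rsync` commands that pull caller-supplied directories from the cluster master over SSH. The function name, the host name, and the example directories are assumptions made for illustration. The actual list of libraries and configuration directories to copy depends on the EMR release and is not specified here.

```python
import shlex


def build_sync_commands(master_host, directories, ssh_user="hadoop"):
    """Build one rsync command per directory to mirror from the master.

    Each directory is copied to the same path on the local machine.
    The commands are returned as argument lists and are not executed.
    """
    commands = []
    for directory in directories:
        # A trailing slash makes rsync copy the directory contents.
        path = directory.rstrip("/") + "/"
        source = "{}@{}:{}".format(ssh_user, master_host, path)
        commands.append(["rsync", "-a", "-e", "ssh", source, path])
    return commands


if __name__ == "__main__":
    # Illustrative values only: adapt the host and directories to your cluster.
    example_dirs = ["/etc/hadoop/conf", "/etc/spark/conf"]
    for cmd in build_sync_commands("emr-master.example.internal", example_dirs):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Returning the commands instead of running them makes it possible to review them first, which seems prudent for an undocumented procedure.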
S3 dataset connections are compatible with EMRFS. The “HDFS interface” connection parameter should be set to “Amazon EMRFS” instead of the default “Hadoop S3A” for Spark jobs to directly access these datasets using EMRFS.
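To make the difference between the two settings concrete: on EMR, paths using the `s3://` scheme are served by EMRFS, while `s3a://` paths are served by the Hadoop S3A connector. The sketch below maps each “HDFS interface” value to the corresponding URI scheme. The mapping table and the `s3_uri` helper are hypothetical. How DSS builds these paths internally is not described here, so treat this as an assumption about the resulting URIs rather than a description of the DSS implementation.

```python
# Hypothetical mapping from the "HDFS interface" setting to a URI scheme.
URI_SCHEME_BY_INTERFACE = {
    "Amazon EMRFS": "s3",   # handled by EMRFS on EMR clusters
    "Hadoop S3A": "s3a",    # handled by the Hadoop S3A connector
}


def s3_uri(bucket, key, hdfs_interface="Hadoop S3A"):
    """Return the URI a Spark job would use for an object in S3.

    Raises KeyError if hdfs_interface is not one of the known values.
    """
    scheme = URI_SCHEME_BY_INTERFACE[hdfs_interface]
    return "{}://{}/{}".format(scheme, bucket, key.lstrip("/"))
```

With a hypothetical bucket `my-bucket` and key `data/part-0.csv`, the “Amazon EMRFS” setting yields `s3://my-bucket/data/part-0.csv`, and the default yields `s3a://my-bucket/data/part-0.csv`.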
HDFS vs S3
One of the main advantages of EMR over other Hadoop distributions is the ability to easily shut down or scale the cluster in a few minutes, depending on the load. Here are some deployment scenarios for working with DSS on EMR while keeping the elasticity of EMR.