Please first get familiar with the documentation on using EMR with DSS: https://doc.dataiku.com/dss/latest/hadoop/distributions/emr.html
This document provides a procedure for setting up an EC2 instance so that it can be used as an "edge node" for one or more EMR clusters.
This allows you to implement the "Connection to existing EMR clusters / DSS outside of the cluster" scenario.
Note: This procedure also technically permits the "Let DSS dynamically manage one or several EMR clusters" scenario; however, for that scenario we recommend using the AMI that Dataiku provides for this purpose, referenced in the "Dynamic EMR clusters" documentation page.
Important note: This operation is not covered by the official EMR documentation, nor is it officially supported by Dataiku. To the best of our knowledge, this procedure is complete and there are no known issues with it at the time of writing. Furthermore, several customers currently use this kind of setup. It is however not part of the official DSS reference documentation.
This procedure has been tested with EMR 5.10 to 5.13 and 5.18 to 5.19 (versions 5.14 to 5.17 may also work but have not been qualified).
This procedure assumes good familiarity with AWS, Amazon Linux, cloud-init, yum and Hadoop.
This procedure requires that you already have an EMR cluster up and running, with SSH access to the master node.
Start with an Amazon Linux instance with the same base version as the one used in the target version of EMR. This means Amazon Linux 2017.09 for EMR 5.10 to EMR 5.13, and Amazon Linux 2018.03 for EMR 5.18 to EMR 5.19.
Freeze the Amazon Linux version against updates by commenting out releasever=latest in /etc/yum.conf.
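For example (a sketch; verify the exact line present in your image before editing):
sudo sed -i 's/^releasever=latest/#releasever=latest/' /etc/yum.conf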
NB: it might be necessary to start the instance with a cloud-init stanza that disables the application of security patches on first boot:
#cloud-config
repo_upgrade: none
Adjust the initial set of packages:
yum install java-1.8.0-openjdk-devel R
yum remove java-1.7.0-openjdk-\*
Copy the emr-applications repository definition and its signing key from the EMR master (an scp example follows this list):
/etc/yum.repos.d/emr-apps.repo
/var/aws/emr/repoPublicKey.txt
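For example, a sketch assuming SSH access to the master as the hadoop user, with a hypothetical EMR_MASTER variable holding its address:
EMR_MASTER=ip-10-1-2-3.ec2.internal   # hypothetical; use your master's address
scp "hadoop@$EMR_MASTER:/etc/yum.repos.d/emr-apps.repo" /tmp/
scp "hadoop@$EMR_MASTER:/var/aws/emr/repoPublicKey.txt" /tmp/
sudo install -D -m 644 /tmp/emr-apps.repo /etc/yum.repos.d/emr-apps.repo
sudo install -D -m 644 /tmp/repoPublicKey.txt /var/aws/emr/repoPublicKey.txt   # -D creates the missing /var/aws/emr directory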
Install the following Hadoop and EMRFS packages (using yum; a combined command is shown after this list):
hadoop-client
hadoop-lzo
spark-core
spark-python
spark-R
spark-datanucleus
hive
hive-hcatalog
pig
tez
openssl-devel
emrfs
emr-*
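For example, as a single command (a sketch; the package set may vary slightly between EMR versions):
sudo yum install hadoop-client hadoop-lzo spark-core spark-python spark-R \
    spark-datanucleus hive hive-hcatalog pig tez openssl-devel emrfs 'emr-*'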
Copy the Hadoop and EMRFS configuration files from the EMR master (using scp / rsync; a loop example follows this list):
/etc/hadoop/conf
/etc/hive/conf
/etc/pig/conf
/etc/spark/conf
/etc/tez/conf
/etc/zookeeper/conf
/usr/share/aws/emr/emrfs/conf
/usr/share/aws/emr/security # If present - TLS certificates
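For example, a sketch reusing the hypothetical EMR_MASTER variable from above and assuming root on the edge node can SSH to the master as the hadoop user (the parent directories were created by the packages installed earlier; drop the last entry if the master has no /usr/share/aws/emr/security):
for d in /etc/hadoop/conf /etc/hive/conf /etc/pig/conf /etc/spark/conf \
         /etc/tez/conf /etc/zookeeper/conf \
         /usr/share/aws/emr/emrfs/conf /usr/share/aws/emr/security; do
    sudo rsync -aK "hadoop@$EMR_MASTER:$d/" "$d/"   # -K treats symlinked conf directories on the receiver as directories
done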
Create (mkdir -p) and make world-writable with the sticky bit (chmod 1777) the various scratch directories used by the EMR configuration, as in the loop shown after this list (the exact set may change a bit between EMR versions):
/mnt/s3 # fs.s3.buffer.dir
/mnt1/s3 # fs.s3.buffer.dir
/mnt/var/lib/hadoop/tmp # hadoop.tmp.dir
/mnt/tmp # java.io.tmpdir
/var/log/hive/user # hive.log.dir
/var/log/pig # pig.logfile
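For example:
for d in /mnt/s3 /mnt1/s3 /mnt/var/lib/hadoop/tmp /mnt/tmp /var/log/hive/user /var/log/pig; do
    sudo mkdir -p "$d"     # create the directory and any missing parents
    sudo chmod 1777 "$d"   # world-writable, with the sticky bit
done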
Recent versions of EMRFS issue a number of warnings, including a Java stack trace, upon initialization, because they cannot read EMR configuration data from the local host. These warnings can be ignored, and the stack trace can be suppressed by editing /var/aws/emr/userData.json so that it contains an empty JSON object ({}).
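For example:
echo '{}' | sudo tee /var/aws/emr/userData.json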
Recent versions of EMR issue warnings on Spark startup due to a missing variable definition in the Spark Streaming logger. This can be silenced by commenting out the log4j.logger.org.apache.spark.streaming line in /etc/spark/conf/log4j.properties.
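For example (a sketch, assuming the line is not already commented out):
sudo sed -i 's/^log4j\.logger\.org\.apache\.spark\.streaming/#&/' /etc/spark/conf/log4j.properties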
At this point, client connectivity to the cluster can be checked using the standard command-line tools (hdfs, yarn, hive, pyspark).
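For example (assuming the configuration copied above points the clients at the cluster):
hdfs dfs -ls /               # list the HDFS root of the cluster
yarn node -list              # enumerate the cluster's NodeManagers
hive -e 'show databases;'    # check Hive connectivity
pyspark                      # interactive Spark shell, on YARN per the copied configuration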
DSS integration with Hadoop and Spark can then be set up using the standard procedure.