Howto: Configure an EC2 instance as an EMR edge node

1. Overview

Please first get familiar with the documentation on using EMR with DSS: https://doc.dataiku.com/dss/latest/hadoop/distributions/emr.html

This document provides a procedure for setting up an EC2 instance so that it can be used as an "edge node" for one or more EMR clusters.

This allows you to implement the "Connection to existing EMR clusters / DSS outside of the cluster" scenario.

Note: This procedure also technically permits the "Let DSS dynamically manage one or several EMR clusters" scenario. For that scenario, however, we recommend using the AMI that Dataiku provides for this purpose, referenced on the "Dynamic EMR clusters" documentation page.

Important note: This operation is neither officially documented by AWS for EMR nor officially supported by Dataiku. To the best of our knowledge, this procedure is complete and there are no known issues with it at the time of writing. Furthermore, several customers currently use this kind of setup. It is, however, not part of the official DSS reference documentation.

2. Procedure

Initial instance creation

NB: it might be necessary to start the instance with a cloud-init stanza that disables the application of security patches on first boot:

    #cloud-config
    repo_upgrade: none
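
For example, this stanza can be passed as user data when launching the instance with the AWS CLI. The AMI ID, instance type, key pair and security group below are placeholders to adapt to your environment:

    # Launch the edge node with automatic security updates disabled on first boot.
    # cloud-init.yml contains the #cloud-config stanza above.
    aws ec2 run-instances \
        --image-id ami-xxxxxxxxxxxxxxxxx \
        --instance-type m5.xlarge \
        --key-name my-keypair \
        --security-group-ids sg-xxxxxxxxxxxxxxxxx \
        --user-data file://cloud-init.yml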

Copy the repository

Copy the emr-applications repository definition and its signing key from the EMR master.
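
A possible sketch of the copy, assuming the usual locations on the EMR master (an emr-apps.repo file under /etc/yum.repos.d/ and the repoPublicKey.txt signing key it references); check the actual file names on your master, as they can vary between EMR releases. "emr-master" stands for the master's hostname or IP:

    # Run from the edge node, with ssh access to the master as the "hadoop" user.
    # Repository definition:
    scp hadoop@emr-master:/etc/yum.repos.d/emr-apps.repo /tmp/emr-apps.repo
    sudo cp /tmp/emr-apps.repo /etc/yum.repos.d/
    # Signing key, copied to the path referenced by the gpgkey= entry of the repo file:
    sudo mkdir -p /var/aws/emr
    scp hadoop@emr-master:/var/aws/emr/repoPublicKey.txt /tmp/repoPublicKey.txt
    sudo cp /tmp/repoPublicKey.txt /var/aws/emr/repoPublicKey.txt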

Install packages

Install the following Hadoop and EMRFS packages using yum (an example command follows the list):

    hadoop-client
    hadoop-lzo
    spark-core
    spark-python
    spark-R
    spark-datanucleus
    hive
    hive-hcatalog
    pig
    tez
    openssl-devel
    emrfs
    emr-*
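
For example:

    sudo yum install -y \
        hadoop-client hadoop-lzo \
        spark-core spark-python spark-R spark-datanucleus \
        hive hive-hcatalog pig tez \
        openssl-devel emrfs 'emr-*'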

Copy configuration

Copy the Hadoop and EMRFS configuration files from the EMR master, using scp or rsync (a sketch follows the list):

    /etc/hadoop/conf
    /etc/hive/conf
    /etc/pig/conf
    /etc/spark/conf
    /etc/tez/conf
    /etc/zookeeper/conf
    /usr/share/aws/emr/emrfs/conf
    /usr/share/aws/emr/security     # If present - TLS certificates
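
One possible way to do this with rsync over ssh, run from the edge node ("emr-master" again stands for the master's hostname or IP):

    # Pull each configuration directory from the master into the same path locally.
    # Assumes ssh access to the master as the "hadoop" user; adapt with -e "ssh -i <key>" if needed.
    for d in /etc/hadoop/conf /etc/hive/conf /etc/pig/conf \
             /etc/spark/conf /etc/tez/conf /etc/zookeeper/conf \
             /usr/share/aws/emr/emrfs/conf /usr/share/aws/emr/security; do
        sudo mkdir -p "$d"
        sudo rsync -a hadoop@emr-master:"$d/" "$d/"
    done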

Create working directories

Create (mkdir -p) and make world-writable (chmod 1777) the various scratch directories used by the EMR configuration (this list may vary slightly between EMR versions; an example command follows the list):

    /mnt/s3                    # fs.s3.buffer.dir
    /mnt1/s3                   # fs.s3.buffer.dir
    /mnt/var/lib/hadoop/tmp    # hadoop.tmp.dir
    /mnt/tmp                   # java.io.tmpdir
    /var/log/hive/user         # hive.log.dir
    /var/log/pig               # pig.logfile
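
For example:

    # Create the scratch directories and make them world-writable with the sticky bit set.
    for d in /mnt/s3 /mnt1/s3 /mnt/var/lib/hadoop/tmp /mnt/tmp \
             /var/log/hive/user /var/log/pig; do
        sudo mkdir -p "$d"
        sudo chmod 1777 "$d"
    done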

Miscellaneous configuration patches

Recent versions of EMRFS issue a number of warnings, including a Java stack trace, upon initialization, because they cannot read EMR configuration data from the local host. These warnings can be ignored, and the stack trace can be suppressed by editing the file /var/aws/emr/userData.json so that it contains an empty JSON object ({}).
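
For example:

    # Replace (or create) the file with an empty JSON object.
    sudo mkdir -p /var/aws/emr
    echo '{}' | sudo tee /var/aws/emr/userData.json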

Recent versions of EMR issue warnings on Spark startup due to a missing variable definition in the Spark Streaming logger. These can be silenced by commenting out the log4j.logger.org.apache.spark.streaming line in /etc/spark/conf/log4j.properties.
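
For example, with sed:

    # Comment out the Spark Streaming logger definition.
    sudo sed -i 's/^log4j\.logger\.org\.apache\.spark\.streaming/#&/' /etc/spark/conf/log4j.properties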

Test

At this point, client connectivity to the cluster can be checked using the standard command-line tools (hdfs, yarn, hive, pyspark).
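
For example (adjust paths and database names to your environment):

    hdfs dfs -ls /               # HDFS access
    yarn node -list              # YARN ResourceManager connectivity
    hive -e 'show databases;'    # Hive metastore connectivity
    pyspark                      # Spark shell; try spark.range(10).count(), then exit()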

Setup DSS

DSS integration with Hadoop and Spark can then be set up using the standard procedure.
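
That is, run the Hadoop and Spark integration scripts from the DSS data directory, as the user running DSS (see the DSS reference documentation for the exact steps and options):

    # With DSS stopped:
    ./bin/dss stop
    ./bin/dssadmin install-hadoop-integration
    ./bin/dssadmin install-spark-integration
    ./bin/dss start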