Amazon Elastic MapReduce
DSS supports EMR versions 5.18 to 5.24 (later versions may work but have not yet been qualified).
EMR clusters configured with AWS Glue Data Catalog for the Hive metastore are not supported.
- DSS can connect to kerberized EMR clusters using the standard procedure.
- Multi-user security is not supported on EMR.
DSS can be connected to an EMR cluster using the standard Hadoop integration procedure, provided the underlying host is configured as a client to the cluster.
You can install DSS on one of the EMR cluster nodes. In that case, no specific additional steps are required: just follow the regular Hadoop installation steps.
DSS can be directly installed on the EMR master node. Note that in order to ensure this host has enough resources (memory and CPU) to concurrently run EMR roles and DSS, you should configure a larger instance than the default one proposed by the EMR deployment tool.
DSS can also be installed on a dedicated EMR worker node, typically one which has been configured to provide the minimally acceptable amount of resources to the cluster, in order to be entirely available to DSS.
EMR cluster nodes are volatile and only have volatile disks by default. Installing DSS on these volatile disks means that you will lose all work in DSS if your cluster is stopped, or if your cluster node restarts for any reason.
You must manually attach an EBS volume to the cluster node and install DSS on this volume. After a restart of the cluster, you will need to reattach the EBS volume to a node of the new cluster, rerun Hadoop integration and restart DSS from there (this can be easily automated through EMR bootstrap actions).
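The reattachment step can be scripted. The following is a minimal sketch, assuming the boto3 EC2 client; the function name, the volume and instance identifiers and the device name are hypothetical placeholders. Mounting the volume, rerunning Hadoop integration and restarting DSS are not covered here.

```python
def reattach_dss_volume(ec2_client, volume_id, instance_id, device="/dev/sdf"):
    """Attach the EBS volume holding the DSS installation to a node of the new cluster.

    ec2_client is expected to behave like boto3.client("ec2").
    The default device name is an assumption; pick one that is free on your instance.
    """
    response = ec2_client.attach_volume(
        VolumeId=volume_id,
        InstanceId=instance_id,
        Device=device,
    )
    # Block until EC2 reports the volume as attached ("in-use").
    ec2_client.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
    return response


# Example call (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   reattach_dss_volume(boto3.client("ec2"), "vol-0123456789abcdef0", "i-0123456789abcdef0")
```

Passing the client as an argument keeps the function easy to exercise with a stub before running it against a real account.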
EMR worker nodes may not have the full client configuration installed, and may in particular be missing the contents of /etc/spark/conf/. This can be copied over from the master node.
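One possible way to perform this copy is sketched below, assuming rsync over SSH is available on both nodes and that the master accepts SSH logins as the hadoop user. The helper names and the host name are hypothetical, and writing to /etc/spark/conf/ typically requires root privileges on the worker node.

```python
import shlex
import subprocess


def spark_conf_copy_command(master_host, user="hadoop", conf_dir="/etc/spark/conf/"):
    """Build an rsync command that pulls the Spark client configuration from the master."""
    # The trailing slash on the source copies the directory contents, not the directory itself.
    source = f"{user}@{master_host}:{conf_dir}"
    return ["rsync", "-a", source, conf_dir]


def copy_spark_conf(master_host, **kwargs):
    """Run the copy; raises CalledProcessError if rsync fails."""
    cmd = spark_conf_copy_command(master_host, **kwargs)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)


# Example call (host name is a placeholder):
#   copy_spark_conf("master-node.example.internal")
```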
DSS can be deployed on a regular EC2 instance, not part of the EMR cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying EMR libraries and cluster configuration from the cluster master to the EC2 instance running DSS.
This operation is neither officially documented by EMR nor officially supported by Dataiku. Please contact Dataiku for more information.
This deployment mode has the advantage of allowing you to completely shut down your cluster and recreate another one. You will then just need to resync some configuration.
Most of the time, when using dynamic EMR clusters, you will store all inputs and outputs of your flows on S3. Access to S3 from DSS and from the EMR cluster is done through EMRFS, which is an HDFS variant:
- Go to Administration > Connections, and add a new HDFS connection
- Enter s3://your-bucket/prefix as the root path URI
- Optionally, enter S3 credentials as described below
In addition to this recommended way of working with EMRFS (i.e., through HDFS connections pointing to an s3:// root path URI), the basic “S3 datasets” are also compatible with EMRFS. For these, the “HDFS interface” connection parameter should be set to “Amazon EMRFS”. This allows Spark jobs to run directly on S3 datasets and access them directly. However, it does not give access to Hadoop-specific file formats such as Parquet and ORC. Using HDFS connections remains the recommended way of working with EMRFS.
If both the EC2 instance hosting DSS and the EMR nodes have an IAM role which grants access to the required S3 bucket, you do not have to configure explicit credentials for DSS to access EMRFS-hosted datasets.
Otherwise, you can specify S3 credentials at the connection level by defining the following properties in “Extra Hadoop conf.” (refer to the EMRFS documentation for details about available properties):
- Add a property called fs.s3.awsAccessKeyId with your AWS access key ID
- Add a property called fs.s3.awsSecretAccessKey with your AWS secret key
These properties will be used by DSS whenever accessing files within this connection, and will be passed to Hadoop/Spark jobs as required.
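As an illustration, the two properties can be assembled as name/value pairs before being entered in the connection settings. This is only a sketch: the helper names are hypothetical, and reading the values from the standard AWS environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY is an assumption, not something DSS requires.

```python
import os


def emrfs_credential_properties(env=os.environ):
    """Return the two "Extra Hadoop conf." properties as a name -> value mapping."""
    return {
        "fs.s3.awsAccessKeyId": env["AWS_ACCESS_KEY_ID"],
        "fs.s3.awsSecretAccessKey": env["AWS_SECRET_ACCESS_KEY"],
    }


def redacted(properties):
    """Copy of the mapping that is safe to print: the secret key value is masked."""
    return {
        name: ("****" if name == "fs.s3.awsSecretAccessKey" else value)
        for name, value in properties.items()
    }


# Example call with placeholder values instead of the real environment:
#   props = emrfs_credential_properties(
#       {"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "example-secret"}
#   )
#   print(redacted(props))
```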
In order to protect credentials, DSS will only pass these additional properties to Spark jobs from users who are allowed to read the connection details, as set in the connection security settings.
When this is not allowed, Spark jobs will fall back to reading and writing the datasets through the DSS backend, which may significantly reduce performance and scalability.