Multiple Hadoop clusters

DSS can connect to several “Hadoop clusters”, meaning:

  • Several YARN resource managers

  • Several HiveServer2 servers

  • Several sets of Impala servers

Warning

Apart from leveraging Dynamic EMR clusters or Dynamic Dataproc clusters, we strongly advise against using multiple Hadoop clusters with “traditional” Hadoop clusters, given the very high associated complexity.

This capability to connect to multiple clusters does not include connecting to multiple Hadoop filesystems, which is covered by Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS).

Concepts

Builtin cluster

A Hadoop cluster is mainly defined by its client-side configuration, usually found in /etc/hadoop/conf, which indicates (among other things) the address of the YARN resource manager.

When Hadoop integration is set up in DSS, DSS will use a “system-level” Hadoop configuration, which is put in the classpath of DSS.

In DSS, this is called the “builtin cluster”, whose configuration is accessible in Administration > Settings > Hadoop.

The builtin cluster also has associated Hive, Impala and Spark configurations defined in the respective Administration > Settings screens.

Additional clusters

In addition to the builtin cluster, you can define additional clusters in the Administration > Clusters screen.

Each additional cluster is defined by:

  • A set of Hadoop configuration keys that indicate how to connect to the YARN resource manager of the additional cluster

  • HiveServer2 connection details for the additional cluster

  • Impala connection details for the additional cluster

  • A set of Spark configuration keys specific to the additional cluster

Managed dynamic clusters

In addition to “static” additional clusters, for which you have to define all the connection settings, DSS has a notion of “managed dynamic clusters”. Through a plugin installed in DSS, dynamic clusters can be created by DSS, configured as additional clusters, and shut down through DSS.

This capability is most often used in cloud deployments, either using the cloud provider’s native managed Hadoop cluster offering, or using dynamic Hadoop clusters created directly on the cloud provider’s virtual machines.

Use an additional cluster

Each project defines whether recipes/notebooks/… of this project run against the builtin cluster or one of the additional clusters.

Per-scenario additional clusters

In addition to project-level definition of which cluster to use, a scenario can:

  • Create a dynamic cluster (for example an EMR cluster)

  • Execute all of its steps on the dynamic cluster

  • Shut down the dynamic cluster at the end

This allows you to have a minimal cluster, or no cluster at all, for the “design” of the project, and to spawn clusters dynamically for the execution of scenarios, leading to a fully elastic resource usage approach. This capability is most often used on automation nodes.

Restrictions

  • When using multiple Hadoop clusters, all clusters must use the same Hadoop distribution. For example, it is not supported to have the builtin cluster running Cloudera and an additional cluster running Hortonworks

  • All clusters must either be unsecured, or secured using the same Kerberos realm (DSS will only use the principal and keytab of the builtin cluster)

  • User mappings must be similar between clusters

  • Running multiple Spark versions (for example one cluster with Spark 1.6 and one cluster with Spark 2.2) is not supported

  • Running multiple MapR clusters is not validated and not supported

Define an additional static cluster

This assumes that DSS is already properly connected and set up to work with the primary cluster.

  • Go to Administration > Clusters and create a new cluster

  • Give an identifier to your new cluster

The configuration of an additional cluster is divided into a number of sections. For each section, you need to choose whether you want to inherit the settings of the builtin cluster, or override them for this particular cluster.

Hadoop

In this section, you’ll always override the “Hadoop config keys” section. Here, you must enter the keys used to define the YARN addresses.

These settings will be passed to:

  • Data preparation jobs running on MapReduce engine

  • Pig recipes

Although this varies, you’ll usually need to define the following keys (an example follows the list):

  • fs.defaultFS usually pointing to the native HDFS of the cluster (used notably for various staging directories), i.e. something like hdfs://namenodeaddress:8020/

  • yarn.resourcemanager.address pointing to the host/port of your resource manager, i.e. something like resourcemanageraddress:8032
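For instance, assuming a hypothetical additional cluster whose NameNode runs on namenode2.example.com and whose resource manager runs on resourcemanager2.example.com (these hostnames are purely illustrative), the overridden keys could be entered as key/value pairs along these lines:

    fs.defaultFS = hdfs://namenode2.example.com:8020/
    yarn.resourcemanager.address = resourcemanager2.example.com:8032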

Warning

These Hadoop configuration keys are not passed to Spark jobs. See below.

Other settings are for advanced usage only.

Hive

In this section, you’ll always override “connection settings” to point to the HiveServer2 of your additional cluster. Refer to Hive for more information about configuring this.

If your builtin cluster does not use “HiveServer2” as the default recipe engine, override “creation settings” and select HiveServer2.

Other settings are for advanced usage only.

Impala

In this section, you’ll always override “connection settings” to point to the impalad nodes of your additional cluster. Refer to Impala for more information about configuring this.

Other settings are for advanced usage only.

Spark

Note

Only the “yarn” master in “client” deployment mode really makes sense here.

In this section, you’ll need to add configuration keys that point Spark to the YARN resource manager of your additional cluster. Note that the configuration keys defined in the “Hadoop” section do not apply to Spark, as Spark uses different keys.

Choose to override runtime config.

The first section, “Config keys added to all configurations”, contains configuration keys that will be added to all Spark named configurations defined at the global level. The second section, “Configurations”, contains configuration keys that are added only to a single Spark named configuration.

You cannot add new Spark named configurations at the additional cluster level.

You’ll often need to add the following keys, as shown in the example after this list (note the spark.hadoop prefix used to pass Hadoop configuration keys to Spark):

  • spark.hadoop.fs.defaultFS usually pointing to the native HDFS of the cluster (used notably for various staging directories), i.e. something like hdfs://namenodeaddress:8020/

  • spark.hadoop.yarn.resourcemanager.address pointing to the host/port of your resource manager, i.e. something like resourcemanageraddress:8032

  • spark.hadoop.yarn.resourcemanager.scheduler.address pointing to the host/port of the scheduler part of your resource manager, i.e. something like resourcemanageraddress:8030
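For instance, reusing the hypothetical hostnames from the Hadoop section above (illustrative values only), the keys added to the runtime configuration could look like:

    spark.hadoop.fs.defaultFS = hdfs://namenode2.example.com:8020/
    spark.hadoop.yarn.resourcemanager.address = resourcemanager2.example.com:8032
    spark.hadoop.yarn.resourcemanager.scheduler.address = resourcemanager2.example.com:8030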

Other settings are for advanced usage only.

Add a dynamic additional cluster

For this, you first need to install a plugin that provides management capabilities for a type of dynamic cluster.

  • Go to Administration > Clusters, and create a new cluster

  • Select the cluster type from the dropdown, and give an identifier

You are taken to the configuration page for this cluster. Settings will depend on the plugin that you used. When the settings are done, click the “Start/Attach” button. The plugin creates or attaches to the dynamic cluster, and sets up all required configuration keys. The cluster is then usable.

When you don’t need the cluster anymore, click the “Stop/Detach” button to release the associated resources.

Depending on the plugin, you may also have specific actions available, such as retrieving information about the dynamic cluster, or scaling it up and down.

Use a specific or dynamic cluster for scenarios

A common use case is to use, for running one or multiple scenarios:

  • either a specific static cluster (i.e. a cluster already defined in the DSS settings, but not the default cluster of the project)

  • or a dynamic cluster, created for the scenario and shut down after the end of the scenario, for fully elastic approaches

Use a specific static cluster

For this, you’ll use the variables expansion mechanism of DSS.

Instead of writing a cluster identifier as the contextual cluster to use at the project level, you can use the syntax ${variable_name}. At runtime, DSS will use the cluster denoted by the variable_name variable.

Your scenario will then use a scenario-scoped variable to define the cluster to use for the scenario.

For example, suppose you want to use the cluster regular1 for the “design” of the project and all non-scenario-related activities, and the fast2 cluster for a scenario.

Set up your project as follows:

  • Cluster: ${clusterForScenario}

  • Default cluster: regular1

With this setup, when the clusterForScenario variable is not defined (which will be the case outside of the scenario), DSS will fall back to regular1.

In your scenario, add an initial step “Define scenario variables”, and use the following JSON definition:

{
        "clusterForScenario" : "fast2"
}

The steps of the scenario will then execute on the fast2 cluster.

Use a dynamic cluster

The idea here is to:

  • Create a dynamic cluster

  • Put the identifier of the dynamically-created cluster into a variable

  • Then use the variables expansion mechanism defined above

For example, suppose you want to use the cluster regular1 for the “design” of the project and all non-scenario-related activities, and a dynamically-created cluster for a scenario.

Set up your project as follows:

  • Cluster: ${clusterForScenario}

  • Default cluster: regular1

With this setup, when the clusterForScenario variable is not defined (which will be the case outside of the scenario), DSS will fall back to regular1.

In your scenario, add an initial step “Setup cluster”:

  • Select the cluster type you want to create (depending on the plugin you are using)

  • Fill in the configuration form (depending on the plugin you are using)

  • Set clusterForScenario as the “Target variable”

When the step runs, DSS creates the cluster, and sets the id of the newly created cluster in the clusterForScenario variable. Given the project config, the steps of the scenario will automatically execute on the dynamically-created cluster.

At the end of the scenario (regardless of whether the scenario succeeded or failed), DSS automatically stops the dynamic cluster (you can override this behavior in the scenario settings).

Warning

If DSS unexpectedly stops while the scenario is running, the cluster resources will keep running on your cloud provider. We recommend that you set up monitoring of the cloud resources created by DSS.

Permissions

Each cluster has an owner and groups who are granted access levels on the cluster:

  • Use cluster to be able to select the cluster and use it in a project

  • Operate cluster to be able to modify cluster settings

  • Manage cluster users to be able to manage the permissions of the cluster

In addition, each group can be granted global permissions to:

  • Create clusters and manage the clusters they created

  • Manage all clusters, including the ones they are not explicitly granted access to