Dynamic Google Dataproc clusters
- Prerequisites and limitations
- Create your first cluster
- Using dynamic Dataproc clusters for scenarios
- Cluster actions
- Advanced settings
Experimental feature: Management of dynamic clusters is provided through a plugin and is supported on a best-effort basis.
DSS can create and manage multiple Dataproc clusters, so that you can, for example, scale your workloads across several clusters or use clusters dynamically for some scenarios.
For more information on dynamic clusters and the usage of a dynamic cluster for a scenario, please see Multiple Hadoop clusters.
Support for dynamic clusters is provided through the “Dataproc Multicluster” plugin. You will need to install this plugin in order to use this feature.
- As with other kinds of multi-cluster setups, the server that runs DSS needs to have the client libraries for the proper Hadoop distribution. In this case, your server needs to have the Dataproc client libraries for the version of Dataproc you will use. Dataiku provides a marketplace image that allows you to perform such an installation.
- When working with multiple clusters, all clusters should run the same Dataproc version. If running different versions, some incompatibilities may occur.
We strongly recommend that you use the marketplace image “dataiku-edge” which contains everything required for Dataproc support.
Dataiku will periodically rebuild this image to incorporate new updates and support new Dataproc versions.
The service account attached to the instance running DSS MUST have the following IAM roles to run properly:
- Dataproc Editor
- Dataproc Service Agent
Otherwise the user account running DSS must have the required credentials in order to create Dataproc clusters.
Most of the time, when using dynamic Dataproc clusters, you will store all inputs and outputs of your flows on GCS. Access to GCS from DSS and from the Dataproc cluster is done through the GCS connector for Hadoop, which is why the connection is declared in DSS as an HDFS connection.
- Go to Administration > Connections, and add a new HDFS connection
- Enter “gs://your-bucket” or “gs://your-bucket/prefix” as the root path URI
Unless the DSS host has implicit access to the bucket, either through its IAM role or through default credentials referenced by the GOOGLE_APPLICATION_CREDENTIALS environment variable, define connection-level credentials in “Extra Hadoop conf”:
- Add a property called “spark.hadoop.google.cloud.auth.service.account.enable” with value true
- Add a property called “spark.hadoop.google.cloud.auth.service.account.json.keyfile” with the path to your keyfile on the server
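As a rough illustration of the connection settings above, the following Python sketch checks a root path URI and assembles the two “Extra Hadoop conf” properties. The helper names (parse_gcs_root, build_extra_hadoop_conf) and the example keyfile path are hypothetical and are not part of DSS; only the property names and the gs:// URI shapes come from this page.

```python
from urllib.parse import urlparse


def parse_gcs_root(uri):
    """Split a gs:// root path URI into (bucket, prefix).

    Hypothetical helper: it only checks the URI shape described above.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(
            "expected gs://your-bucket or gs://your-bucket/prefix, got %r" % uri
        )
    # The prefix is optional; strip the surrounding slashes if present.
    return parsed.netloc, parsed.path.strip("/")


def build_extra_hadoop_conf(keyfile_path):
    """Return the two connection-level properties as a plain dict.

    The keys are the property names listed above; the value of the second
    one is whatever keyfile path applies on your DSS server.
    """
    return {
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile_path,
    }


if __name__ == "__main__":
    print(parse_gcs_root("gs://your-bucket/prefix"))
    # "/path/to/keyfile.json" is a placeholder, not a real location.
    for key, value in build_extra_hadoop_conf("/path/to/keyfile.json").items():
        print(key, "=", value)
```

The dict is only a convenient way to review the key/value pairs before typing them into the connection settings; DSS itself does not read this structure.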
Go to Administration > Clusters and click “Create cluster”.
In “Type”, select “Dataproc cluster (create cluster)” and give a name to your new cluster. You are taken to the “managed cluster” configuration page, where you will set all of your Dataproc cluster settings.
The minimal settings that you need to set are:
- The Google Cloud project ID in which to create the Dataproc cluster
- Your region
- The instance type for the master and worker nodes
- The total number of instances you require (there will be 1 master and N-1 workers in the CORE group)
- The version of Dataproc you want to use. Beware, this should be consistent with the image you used for the DSS host.
- The VPC subnet identifier in which you want to create your Dataproc cluster. This should be in the same VPC as the one the DSS machine is running in. Leave empty to use the same subnet as the instance running DSS.
- The security groups to associate to all of the cluster machines. Make sure to add security groups that grant full access between the DSS host and the Dataproc cluster members.
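The checklist above can be sketched as a small pre-flight check. The key names below (project_id, region, and so on) are hypothetical labels chosen for this example and do not claim to match the plugin's actual field names; the split between one master and the remaining nodes follows the description above.

```python
# Hypothetical names for the minimal settings listed above.
REQUIRED_SETTINGS = (
    "project_id",
    "region",
    "instance_type",
    "total_instances",
    "dataproc_version",
)


def missing_settings(settings):
    """Return the required keys that are absent or empty, in checklist order."""
    return [key for key in REQUIRED_SETTINGS if settings.get(key) in (None, "")]


def node_split(total_instances):
    """Split a total instance count into (masters, other nodes).

    Per the description above, a cluster of N instances has 1 master
    and N-1 other nodes.
    """
    if total_instances < 1:
        raise ValueError("a cluster needs at least one instance")
    return 1, total_instances - 1


if __name__ == "__main__":
    draft = {"project_id": "my-project", "region": "europe-west1"}
    print("still missing:", missing_settings(draft))
    print("node split for 5 instances:", node_split(5))
```

The subnet is deliberately left out of the required keys because it may be left empty.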
Click on “Start/Attach”. Your Dataproc cluster is created. This phase generally lasts 5 to 10 minutes. When the progress modal closes, you have a working Dataproc cluster, and an associated DSS dynamic cluster definition that can talk to it.
In any project, go to Settings > Cluster, and select the identifier you gave to the Dataproc cluster. Any recipe or Hive notebook running in this project will now use your Dataproc cluster.
Go to Administration > Clusters > Your cluster and click “Stop/Detach” to destroy the Dataproc cluster and release resources.
Note that the DSS cluster definition itself remains, allowing you to recreate the Dataproc cluster at a later time. Jobs in projects that are configured to use this cluster will fail while it is in the “Stopped/Detached” state.
For a fully elastic approach, you can create Dataproc clusters at the beginning of a sequence of scenarios, run the scenarios and then destroy the Dataproc cluster, fully automatically.
Please see Multiple Hadoop clusters for more information. In the “Setup Cluster” scenario step, you will need to enter the Dataproc cluster configuration details.
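The create, run, destroy sequence described above follows a familiar pattern: acquire the cluster, do the work, and release the cluster whatever happens. Here is a minimal Python sketch of that control flow. The three callables are hypothetical stand-ins for scenario steps and do not call any DSS or Google API.

```python
def run_with_dynamic_cluster(start_cluster, scenarios, stop_cluster):
    """Create a cluster, run each scenario on it, then always destroy it.

    start_cluster() returns a cluster identifier, each scenario is a
    callable taking that identifier, and stop_cluster(cluster_id)
    releases the resources. All three are placeholders for this sketch.
    """
    cluster_id = start_cluster()
    results = []
    try:
        for scenario in scenarios:
            results.append(scenario(cluster_id))
    finally:
        # Runs even if a scenario raised, so the cluster is not left running.
        stop_cluster(cluster_id)
    return results


if __name__ == "__main__":
    log = []
    outcome = run_with_dynamic_cluster(
        start_cluster=lambda: log.append("start") or "demo-cluster",
        scenarios=[lambda cid: "built on " + cid],
        stop_cluster=lambda cid: log.append("stop " + cid),
    )
    print(outcome, log)
```

In DSS itself, this sequencing is expressed with scenario steps rather than Python code; the sketch only shows why the teardown belongs in a block that always runs.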
In addition to the basic “Start” and “Stop”, the Dynamic Dataproc clusters plugin provides the ability to scale up and down an attached Dataproc cluster.
Go to the “Actions” tab of your cluster, and select the “Scale” action. You will have to specify the target number of instances in the CORE and TASK groups. We recommend that you never scale down the CORE group (which contains a small HDFS needed for cluster operations), and instead scale up and down the TASK group.
You can also scale a cluster up or down as part of a scenario:
- Add an “Execute macro” step
- Select the “Scale cluster up/down” macro
- Enter the DSS cluster identifier, either directly or using a variable. The latter is required if you set up your cluster as part of the scenario.
- Select the settings
In this kind of setup you will generally scale up at the beginning of the scenario and scale down at the end. You will need two steps for that. Make sure to select “Run this step > Always” for the “scale down” step. This way, even if your scenario fails, the scale down operation will be executed.
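To see why “Run this step > Always” matters, consider the toy model below of a step sequence in which a failed step causes later steps to be skipped unless they are marked as always-run. This is a simplified illustration written for this page, not the DSS scenario engine; the run_steps function and the step tuples are hypothetical.

```python
def run_steps(steps):
    """Run (name, action, always) steps in order.

    After a failure, only steps flagged always=True still run.
    Returns (names of the steps that were attempted, whether any failed).
    """
    attempted = []
    failed = False
    for name, action, always in steps:
        if failed and not always:
            continue  # skipped because an earlier step failed
        attempted.append(name)
        try:
            action()
        except Exception:
            failed = True
    return attempted, failed


def failing_build():
    raise RuntimeError("build failed")


if __name__ == "__main__":
    steps = [
        ("scale up", lambda: None, False),
        ("build", failing_build, False),
        ("notify", lambda: None, False),
        ("scale down", lambda: None, True),  # "Run this step > Always"
    ]
    print(run_steps(steps))
```

With the last flag set to False instead, the “scale down” step would be skipped after the failed build and the cluster would stay scaled up.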