Dynamic Google Dataproc clusters

Warning

Deprecated: Support for Dataproc is deprecated and will be removed in a future DSS version.

Although DSS has been successfully tested on Dataproc, we do not recommend setting up new deployments with Dataproc.

DSS does not support the latest Dataproc versions, security options are limited, and our experience is that Dataproc deployments are associated with a higher complexity and tend to generate higher administration workloads.

We recommend that you use a fully Elastic AI infrastructure based on GKE. Please see Elastic AI computation, or get in touch with your Dataiku Customer Success Manager, Technical Account Manager or Sales Engineer for more information and to study the best options.

DSS can create and manage multiple Dataproc clusters, allowing you to easily scale your workloads across multiple clusters, dynamically create and scale clusters for specific scenarios, and more.

For more information on dynamic clusters and the usage of a dynamic cluster for a scenario, please see Multiple Hadoop clusters.

Support for dynamic clusters is provided through the “Dataproc clusters” DSS plugin. You will need to install this plugin in order to use this feature.

Prerequisites and limitations

  • As for other kinds of multi-cluster setups, the server that runs DSS needs to have the client libraries for the proper Hadoop distribution. In this case, your server needs the Dataproc client libraries for the version of Dataproc you will use.

  • When working with multiple clusters, all clusters should run the same Dataproc version. Running different versions may lead to incompatibilities.

Create your first cluster

Before running

The Google Compute Engine (GCE) VM that hosts DSS is associated with a given service account, which MUST have the following IAM roles to run properly:

  • Dataproc Editor

  • Dataproc Service Agent

Otherwise, the user account running DSS must have the required credentials to create Dataproc clusters.
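
If you are unsure which identity and project the DSS host will use, you can check them from the server before creating clusters. Below is a minimal sketch using the google-auth Python package (assumed to be installed on the server); the attributes available depend on the credential type.

    # Minimal sketch: check which GCP credentials and project the DSS host resolves.
    # Assumes the google-auth package is installed on the server.
    import google.auth

    credentials, project = google.auth.default()
    print("Default GCP project:", project)
    # For GCE / service-account credentials this attribute is usually present;
    # it may read "default" until a token has actually been fetched.
    print("Service account:", getattr(credentials, "service_account_email", "n/a"))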

Install the plugin

From the “Plugins” section of the Administration panel, go to the store and install the Dataproc clusters plugin.

Define GCS connections

Most of the time, when using dynamic Dataproc clusters, you will store all inputs and outputs of your flows on GCS. Access to GCS from DSS and from the Dataproc cluster is done through the GCS connector, via HDFS connections using gs:// paths.

  • Go to Administration > Connections, and add a new HDFS connection

  • Enter “gs://your-bucket” or “gs://your-bucket/prefix” as the root path URI

Unless the DSS host has implicit access to the bucket through its service account (IAM role) or through default credentials (the GOOGLE_APPLICATION_CREDENTIALS environment variable), define connection-level credentials in “Extra Hadoop conf” (see the access check sketched after the list):

  • Add a property called “spark.hadoop.google.cloud.auth.service.account.enable” with value true

  • Add a property called “spark.hadoop.google.cloud.auth.service.account.json.keyfile” with the path to your keyfile on the DSS server
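
Before wiring the connection, it can be useful to check that the keyfile (or the default credentials) can actually reach the bucket from the DSS server. Below is a minimal sketch using the google-cloud-storage Python package, assuming it is installed; the bucket name and keyfile path are placeholders.

    # Minimal sketch: verify that the credentials can list the target GCS bucket.
    # The bucket name and keyfile path below are placeholders.
    from google.cloud import storage

    # Either rely on the VM service account / GOOGLE_APPLICATION_CREDENTIALS...
    client = storage.Client()
    # ...or point explicitly at the keyfile referenced in the connection settings:
    # client = storage.Client.from_service_account_json("/path/to/keyfile.json")

    blobs = list(client.list_blobs("your-bucket", max_results=5))
    print("Access OK, sample objects:", [b.name for b in blobs])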

Create the cluster and configure it

Go to Administration > Clusters and click “Create cluster”.

In “Type”, select “Dataproc cluster (create cluster)” and give a name to your new cluster. You are taken to the “managed cluster” configuration page, where you will set all of your Dataproc cluster settings.

You will then need to fill in the following fields (an illustrative cluster specification is sketched after the list):

  • The GCP project ID in which to build the Dataproc Cluster

  • The cluster name on GCP

  • The Dataproc version to use (MUST be the same Linux distribution and version as your GCE VM)

  • The desired region/zone

  • The network tags you want to attach to your cluster nodes

  • The instance type for the master and worker nodes

  • The total number of worker nodes you require
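
For reference, these fields map onto a standard Dataproc cluster specification. The sketch below, written with the google-cloud-dataproc Python client, shows roughly what such a specification looks like. It is purely illustrative of how the fields fit together (the plugin performs the actual creation for you), and all names, regions, machine types and counts are placeholders.

    # Illustrative sketch of a Dataproc cluster spec matching the fields above.
    # This is not what the plugin runs internally; all values are placeholders.
    from google.cloud import dataproc_v1

    project_id = "my-gcp-project"
    region = "europe-west1"

    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster_spec = {
        "project_id": project_id,
        "cluster_name": "my-dss-dataproc",
        "config": {
            "gce_cluster_config": {
                "zone_uri": "europe-west1-b",
                "tags": ["dss-dataproc"],                     # network tags
            },
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 4, "machine_type_uri": "n1-standard-4"},
            "software_config": {"image_version": "1.5"},      # Dataproc version
        },
    }

    operation = cluster_client.create_cluster(
        project_id=project_id, region=region, cluster=cluster_spec
    )
    print(operation.result())  # blocks until the cluster is up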

Click on “Start/Attach”. Your Dataproc cluster is created. This phase generally lasts 5 to 10 minutes. When the progress modal closes, you have a working Dataproc cluster, and an associated DSS dynamic cluster definition that can talk to it.

Use your cluster

In any project, go to Settings > Cluster, and select the identifier you gave to the Dataproc cluster. Any recipe or Hive notebook running in this project will now use your Dataproc cluster.

Stop the cluster

Go to Administration > Clusters > Your cluster and click “Stop/Detach” to destroy the Dataproc cluster and release resources.

Note that the DSS cluster definition itself remains, allowing you to recreate the Dataproc cluster at a later time. Jobs in projects that are configured to use this cluster will fail while it is in the “Stopped/Detached” state.
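
Both the start/attach and stop/detach operations can also be scripted through the DSS public API. The sketch below assumes the dataikuapi package and a DSS version whose cluster handles expose start() and stop(); the host URL, API key and cluster identifier are placeholders.

    # Minimal sketch: start/attach and stop/detach the managed cluster via the DSS API.
    # Host URL, API key and cluster identifier are placeholders; start()/stop() are
    # assumed to be available on cluster handles in your DSS version.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    cluster = client.get_cluster("my-dataproc")  # the DSS cluster identifier

    cluster.start()   # create/attach the Dataproc cluster
    # ... run your jobs ...
    cluster.stop()    # destroy the Dataproc cluster and release its resources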

Using dynamic Dataproc clusters for scenarios

For a fully elastic approach, you can create Dataproc clusters at the beginning of a sequence of scenarios, run the scenarios and then destroy the Dataproc cluster, fully automatically.

Please see Multiple Hadoop clusters for more information. In the “Setup Cluster” scenario step, you will need to enter the Dataproc cluster configuration details.

Cluster actions

In addition to the basic “Start” and “Stop”, the Dataproc clusters plugin provides the ability to scale up and down an attached Dataproc cluster.

Manual run

Go to the “Actions” tab of your cluster and select the “Scale cluster up or down” macro. You can then choose the new desired number of worker nodes, with the option of creating spot (preemptible) nodes.

As part of a scenario

You can scale a cluster up or down as part of a scenario:

  • Add an “Execute macro” step

  • Select the “Scale cluster up or down” macro

  • Enter the DSS cluster identifier, either directly or using a variable; the latter is required if you set up your cluster as part of the scenario (see the example below).

  • Select the settings

In this kind of setup, you will generally scale up at the beginning of the scenario and scale down at the end, which requires two macro steps. Make sure to select “Run this step > Always” for the scale-down step, so that the scale-down operation is executed even if the scenario fails.
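
For example, if the “Setup Cluster” step earlier in the scenario stores the identifier of the newly created cluster in a variable (the name dataprocCluster is purely illustrative), the “Scale cluster up or down” steps can reference it as ${dataprocCluster} instead of a hard-coded identifier.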

Advanced settings

In addition to the basic settings outlined above, the Dataproc clusters plugin provides some advanced settings:

Hive metastore

  • Metastore database type (transient or persisted)

  • Databases to use

Labels

You can add labels to your cluster. Variable expansion is not supported here.