Google Cloud Dataproc

DSS has been successfully tested on Google Cloud Dataproc 1.2.

Warning

Deprecated: Support for Dataproc is deprecated and will be removed in a future DSS version.

Although DSS has been successfully tested on Dataproc, we do not recommend setting up new deployments with Dataproc.

DSS does not support the latest Dataproc versions, security options are limited, and our experience is that Dataproc deployments are more complex and tend to generate a heavier administration workload.

We recommend that you use a fully Elastic AI infrastructure based on GKE. Please see Elastic AI computation, or get in touch with your Dataiku Customer Success Manager, Technical Account Manager or Sales Engineer for more information and to study the best options.

Security

  • User isolation is not supported on Dataproc.

  • While DSS should be able to connect to a Kerberos-secured Dataproc cluster, this kind of deployment has not been validated. Deploying Kerberos security on a Dataproc cluster is a fully manual and complex operation.

Known limitations

  • The Spark-Scala notebook cannot be used. The PySpark notebook is supported.

Connecting DSS to Cloud Dataproc

DSS running on one of the cluster nodes

You can install DSS on one of the Dataproc cluster nodes. In that case, no specific additional steps are required: just follow the regular Hadoop installation steps.

Warning

Cloud Dataproc cluster nodes are volatile and only have volatile disks by default. Installing DSS on these volatile disks means that you will lose all work in DSS if your cluster is stopped, or if your cluster node restarts for any reason.

You must manually attach a GCE persistent disk (PD) to the cluster node and install DSS on this PD. After a cluster restart, you’ll need to reattach the PD to a node of the new cluster, rerun Hadoop integration, and restart DSS from there.
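The recovery sequence after a cluster restart can be sketched as an ordered list of commands. This is only an illustration under assumptions, not a validated procedure: the instance name, disk name, zone, mount point and DSS data directory are hypothetical placeholders, the disk is assumed to already hold the filesystem on which DSS was installed, and the DSS commands are assumed to be run as the DSS user on the node. The sketch only builds and prints the commands; it does not execute anything.

```python
import shlex


def build_recovery_plan(instance, disk, zone, mount_point, datadir):
    """Return the ordered recovery commands as argv lists (nothing is executed)."""
    return [
        # From a machine with the gcloud CLI: reattach the persistent disk
        # to a node of the new cluster, exposing it under a known device name.
        ["gcloud", "compute", "instances", "attach-disk", instance,
         "--disk", disk, "--device-name", disk, "--zone", zone],
        # On the node: mount the disk (assumed to already contain a filesystem).
        ["sudo", "mount", f"/dev/disk/by-id/google-{disk}", mount_point],
        # On the node, as the DSS user: rerun the Hadoop integration.
        [f"{datadir}/bin/dssadmin", "install-hadoop-integration"],
        # On the node, as the DSS user: start DSS again.
        [f"{datadir}/bin/dss", "start"],
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    plan = build_recovery_plan(
        instance="my-cluster-m",
        disk="dss-pd",
        zone="europe-west1-b",
        mount_point="/mnt/dss",
        datadir="/mnt/dss/data",
    )
    for argv in plan:
        print(shlex.join(argv))
```

Printing the plan rather than running it lets you review each command and adapt names and paths to your own project before executing anything by hand.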

DSS outside of the cluster

Warning

This deployment mode has not been tested by Dataiku.

DSS can be deployed on a regular GCE instance that is not part of the Dataproc cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying Dataproc libraries and cluster configuration from the cluster master to the GCE instance running DSS.

This operation is neither documented by Google nor supported by Dataiku.