Google Cloud Dataproc
DSS has been successfully tested on Google Cloud Dataproc 1.2.
Warning
Although DSS has been successfully tested on Dataproc, we do not recommend setting up new deployments with Dataproc.
DSS does not support the latest Dataproc versions, security options are limited, and in our experience Dataproc deployments are more complex and tend to generate higher administration workloads.
We recommend that you use a fully Elastic AI infrastructure based on GKE. Please see Elastic AI computation, or get in touch with your Dataiku Customer Success Manager, Technical Account Manager or Sales Engineer for more information and to study the best options.
Security
User isolation is not supported on Dataproc.
While DSS should be able to connect to a Kerberos-secured Dataproc cluster, this kind of deployment has not been validated. Deploying Kerberos security on a Dataproc cluster is a complex and fully manual operation.
Known limitations
The Spark-Scala notebook cannot be used. The PySpark notebook is supported.
Connecting DSS to Cloud Dataproc
DSS running on one of the cluster nodes
You can install DSS on one of the Dataproc cluster nodes. In that case, no specific additional steps are required: just follow the regular Hadoop installation steps.
Warning
Cloud Dataproc cluster nodes are volatile and only have volatile disks by default. Installing DSS on these volatile disks means that you will lose all work in DSS if your cluster is stopped, or if your cluster node restarts for any reason.
You must manually attach a GCE persistent disk (PD) to the cluster node and install DSS on this PD. After the cluster restarts, you’ll need to reattach the PD to a node of the new cluster, rerun Hadoop integration, and restart DSS from there.
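The recovery sequence after a cluster restart can be sketched as an ordered command plan. This is a minimal illustration, not a validated procedure: the disk name, node name, zone, mount point, device name and data directory are all hypothetical placeholders, and the use of `dssadmin install-hadoop-integration` followed by `dss start` from the DSS data directory is an assumption that should be checked against the Hadoop installation steps for your DSS version. The script only builds and prints the commands; it does not execute anything.

```python
import shlex


def recovery_plan(disk, node, zone, mount_point, data_dir, device_name="dss-data"):
    """Build the ordered steps to bring DSS back after the cluster is recreated.

    Each step is a (where, command) pair: "workstation" commands need the
    gcloud CLI, "node" commands are meant to run on the cluster node itself.
    All argument values are placeholders supplied by the caller.
    """
    # A disk attached with --device-name=X is exposed on the VM under
    # /dev/disk/by-id/google-X.
    device = f"/dev/disk/by-id/google-{device_name}"
    return [
        ("workstation", [
            "gcloud", "compute", "instances", "attach-disk", node,
            f"--disk={disk}", f"--device-name={device_name}", f"--zone={zone}",
        ]),
        ("node", ["sudo", "mkdir", "-p", mount_point]),
        ("node", ["sudo", "mount", "-o", "discard,defaults", device, mount_point]),
        # Assumed DSS commands: rerun Hadoop integration, then start DSS.
        ("node", [f"{data_dir}/bin/dssadmin", "install-hadoop-integration"]),
        ("node", [f"{data_dir}/bin/dss", "start"]),
    ]


def render(plan):
    """Format each step as a shell-safe line prefixed with where it runs."""
    return [f"[{where}] {shlex.join(cmd)}" for where, cmd in plan]


if __name__ == "__main__":
    plan = recovery_plan(
        disk="dss-pd",                # hypothetical persistent disk name
        node="mycluster-m",           # hypothetical node of the new cluster
        zone="europe-west1-b",        # hypothetical zone
        mount_point="/mnt/dss",       # hypothetical mount point
        data_dir="/mnt/dss/data",     # hypothetical DSS data directory on the PD
    )
    for line in render(plan):
        print(line)
```

Keeping the plan as data rather than running it directly makes it easy to review each step before executing it by hand or feeding it to `subprocess.run`.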
DSS outside of the cluster
Warning
This deployment mode has not been tested by Dataiku.
DSS can be deployed on a regular GCE instance that is not part of the Dataproc cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying Dataproc libraries and cluster configuration from the cluster master to the GCE instance running DSS.
This operation is neither documented by Google nor supported by Dataiku.
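Because no official procedure exists, the following is only a rough sketch of what the configuration-copy half of such a setup might look like. It assumes a single-master cluster whose master VM is named `<cluster>-m`, and that the client configuration lives in `/etc/hadoop/conf`, `/etc/spark/conf` and `/etc/hive/conf` on that master; both assumptions should be verified on the actual cluster. The cluster name, zone and destination directory are hypothetical, and the Dataproc libraries themselves are deliberately left out because the source does not say which ones are needed. The script only builds the commands and prints them.

```python
import posixpath
import shlex

# Assumed locations of client configuration on the Dataproc master.
CONFIG_DIRS = ("/etc/hadoop/conf", "/etc/spark/conf", "/etc/hive/conf")


def config_copy_commands(cluster, zone, dest_root, config_dirs=CONFIG_DIRS):
    """Build commands that mirror config directories from the master VM.

    Meant to be run on the GCE instance hosting DSS. For each remote
    directory, the local parent is created first, then the directory is
    copied into it recursively with `gcloud compute scp`.
    """
    master = f"{cluster}-m"  # assumed master name for a single-master cluster
    commands = []
    for remote_dir in config_dirs:
        # Mirror the remote layout under dest_root, e.g. <dest_root>/etc/hadoop/
        local_parent = dest_root.rstrip("/") + posixpath.dirname(remote_dir)
        commands.append(["mkdir", "-p", local_parent])
        commands.append([
            "gcloud", "compute", "scp", "--recurse", f"--zone={zone}",
            f"{master}:{remote_dir}", local_parent + "/",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical cluster name, zone and destination directory.
    for cmd in config_copy_commands("analytics", "us-central1-a", "/opt/dataproc-client"):
        print(shlex.join(cmd))
```

Any copied configuration becomes stale when the cluster is recreated, so a copy like this would have to be repeated after every cluster restart.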