Google Cloud Dataproc

Warning

DSS does not officially support Google Cloud Dataproc. Integration is provided on a best-effort basis.

DSS has been successfully tested on Google Cloud Dataproc 1.2.

Security

  • Multi-user security is not supported on Dataproc.
  • While DSS should be able to connect to a Kerberos-secured Dataproc cluster, this kind of deployment has not been validated. Deploying Kerberos security on a Dataproc cluster is a complex, fully manual operation.

Known limitations

  • The Spark-Scala notebook cannot be used. The PySpark notebook is supported.

Connecting DSS to Cloud Dataproc

DSS running on one of the cluster nodes

You can install DSS on one of the Dataproc cluster nodes. In that case, no specific additional steps are required: just follow the regular Hadoop installation steps.

Warning

Cloud Dataproc cluster nodes are volatile and only have volatile disks by default. Installing DSS on these volatile disks means that you will lose all work in DSS if your cluster is stopped, or if your cluster node restarts for any reason.

You must manually attach a GCE persistent disk (PD) to the cluster node and install DSS on this PD. After the cluster restarts, you’ll need to reattach the PD to a node of the new cluster, rerun Hadoop integration, and restart DSS from there.
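The recovery steps above can be sketched as a small Python helper that only builds the commands, so they can be reviewed before anything is run. This is a minimal sketch, not a validated procedure: the instance name, disk name, zone, and DSS data directory path are hypothetical placeholders, and the step of mounting the disk on the node is left out because the device name and mount point depend on your setup.

```python
import shlex


def attach_disk_command(instance, disk, zone):
    """Command to attach an existing persistent disk to a cluster node.

    Run it from a machine where the gcloud CLI is configured.
    All three arguments are placeholders for your own resource names.
    """
    return [
        "gcloud", "compute", "instances", "attach-disk", instance,
        "--disk", disk,
        "--zone", zone,
    ]


def dss_reintegration_commands(dss_datadir):
    """Commands to run on the node, as the DSS user, once the disk is mounted.

    dss_datadir is the DSS data directory located on the persistent disk
    (hypothetical path in the example below).
    """
    return [
        # Rerun Hadoop integration against the new cluster.
        [f"{dss_datadir}/bin/dssadmin", "install-hadoop-integration"],
        # DSS is not running on a freshly attached disk, so start it.
        [f"{dss_datadir}/bin/dss", "start"],
    ]


def render(commands):
    """Return the commands as shell-quoted lines, for review or copy-paste."""
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    # Hypothetical names: adapt them to your cluster.
    steps = [attach_disk_command("my-cluster-m", "dss-data-pd", "us-central1-a")]
    steps += dss_reintegration_commands("/mnt/dss-pd/dss_data")
    for line in render(steps):
        print(line)
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere. If you use Spark, you would presumably also need to rerun Spark integration before starting DSS.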

DSS outside of the cluster

Warning

This deployment mode has not been tested by Dataiku.

DSS can be deployed on a regular GCE instance, not part of the Dataproc cluster itself. This kind of deployment is called an “edge node” deployment. It requires copying Dataproc libraries and cluster configuration from the cluster master to the GCE instance running DSS.

This operation is neither documented by Google nor supported by Dataiku.
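As an illustration only of the copy step mentioned above, the following Python sketch builds `gcloud compute scp` commands that pull directories from the cluster master onto the GCE instance running DSS. Everything specific here is an assumption: the master instance name, the zone, the staging directory, and above all the list of remote directories, which this page does not specify and which you would have to determine for your Dataproc image. Copying files this way does not by itself produce a working edge node.

```python
import posixpath


def copy_from_master_commands(master, zone, remote_dirs, staging_root):
    """Build commands that copy directories from the master to this instance.

    The remote layout is mirrored under staging_root so that directories
    sharing a basename (for example two "conf" directories) do not collide.
    """
    commands = []
    for remote_dir in remote_dirs:
        remote_dir = remote_dir.rstrip("/")
        # Local parent directory that will receive the copied directory.
        local_parent = posixpath.join(
            staging_root, posixpath.dirname(remote_dir).lstrip("/")
        )
        commands.append(["mkdir", "-p", local_parent])
        commands.append([
            "gcloud", "compute", "scp", "--recurse",
            "--zone", zone,
            f"{master}:{remote_dir}",
            local_parent,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical values: the directories below are assumptions about where
    # a cluster keeps its Hadoop and Spark client configuration.
    for command in copy_from_master_commands(
        master="my-cluster-m",
        zone="us-central1-a",
        remote_dirs=["/etc/hadoop/conf", "/etc/spark/conf"],
        staging_root="/opt/dataproc-staging",
    ):
        print(" ".join(command))
```

The function only returns the command lists; nothing is executed, so the output can be inspected and adapted before use.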