Microsoft Azure HDInsight

Warning

Experimental feature: DSS does not officially support Microsoft Azure HDInsight. Integration is provided on a best-effort basis.

Azure HDInsight is a fully managed offering that provides Hadoop and Spark clusters, and related technologies, on the Microsoft Azure cloud. HDInsight is a cloud distribution of the Hadoop components based on the Hortonworks Data Platform (HDP), with a default filesystem configured on either Azure Blob Storage or Azure Data Lake.

HDInsight makes it easy, fast, and cost-effective to process massive amounts of data.

Tested versions

DSS has been tested on the following HDInsight configurations:

  • HDInsight “Hadoop” clusters, versions 3.5 and 3.6
  • HDInsight “Spark” clusters, versions 3.5 (Spark 1.6 and Spark 2.0) and 3.6 (Spark 2.1)

Security

  • Connecting DSS to a domain-joined HDInsight cluster is not supported
  • Multi-user security is not supported on HDInsight

Connecting Dataiku DSS to Azure HDInsight

DSS running on an edge node managed by HDInsight

Running Dataiku DSS on an edge node created and managed directly by the Azure HDInsight cluster is the recommended deployment mode. In this case, HDInsight creates and configures the edge node itself, and Dataiku DSS is installed on this edge node.

Warning

Azure HDInsight managed edge nodes are not persistent. If the HDInsight cluster is stopped or restarted for any reason, all Dataiku DSS data and configuration files will be lost. Please make sure to perform very frequent backups of your DSS installation, and to store them outside of the edge node, to mitigate this risk.

Azure HDInsight managed edge nodes are not visible from the Azure Resource Manager, and thus cannot leverage Azure persistent disks or other Azure native tools to perform automated backups.
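
Since backups have to be taken from the edge node itself, they can be scripted. The following Python sketch is one possible approach, not an official Dataiku procedure: it archives a DSS data directory into a timestamped tar.gz file. The function name and all paths are hypothetical, and the sketch does not cover copying the archive to durable storage outside the edge node, which is the step that actually protects against data loss. Please refer to the DSS backup documentation for what a consistent backup must include.

```python
import tarfile
import time
from pathlib import Path


def backup_dss_datadir(data_dir: Path, backup_dir: Path) -> Path:
    """Archive data_dir into a timestamped .tar.gz file under backup_dir.

    backup_dir must be located outside of data_dir, otherwise the archive
    would try to include itself.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"dss-backup-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store entries relative to the data directory name, not its full path.
        tar.add(str(data_dir), arcname=data_dir.name)
    return archive
```

A call such as backup_dss_datadir(Path("/path/to/dss_data"), Path("/path/to/backups")) could then be scheduled (for instance with cron), followed by an upload of the resulting archive to storage that outlives the cluster.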

One-click deployment

It is possible to install Dataiku DSS directly from the HDInsight configuration panel in the Azure Portal, either for new or existing clusters. The one-click installation procedure can be found under the HDInsight “optional Applications” menu, and is also accessible directly from the Azure Marketplace.

Using an Azure Resource Manager (ARM) template

For more control over the deployment options (for instance, to adjust the size of the edge node VM or the DSS version to deploy), it is also possible to create the edge node and install DSS by using the underlying ARM template directly. An example of this ARM template can be found in this GitHub repository. This template can be deployed directly from GitHub, or the content of the azuredeploy.json file can be copied into your own templates and adjusted as needed.

Manual DSS installation

For custom needs, it is possible to provision an empty HDInsight edge node using an ARM template and install DSS on it using the standard installation procedure.

Note

It is necessary to add the following stanza to the Installation configuration file, for compatibility with the HDInsight reverse proxy:

[server]
websocket_permessage_deflate = false
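
If this setting needs to be applied by a provisioning script rather than by hand, a minimal Python sketch could look like the following. It assumes that the installation configuration file is an INI file reachable at a path you provide (the function name and path are hypothetical), and it uses only the standard configparser module. Note that configparser does not preserve comments when it rewrites a file.

```python
import configparser
from pathlib import Path


def disable_permessage_deflate(install_ini: Path) -> None:
    """Ensure '[server] websocket_permessage_deflate = false' is present."""
    parser = configparser.ConfigParser()
    # read() silently ignores a missing file, which then counts as empty.
    parser.read(install_ini)
    if not parser.has_section("server"):
        parser.add_section("server")
    parser.set("server", "websocket_permessage_deflate", "false")
    # Rewrite the whole file; existing sections and keys are kept,
    # but comments in the original file are lost.
    with install_ini.open("w") as fh:
        parser.write(fh)
```

Changes to the installation configuration file normally take effect only after the DSS configuration has been regenerated and DSS restarted; please check the DSS installation documentation for the exact procedure.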

DSS running outside of HDInsight

Warning

This deployment mode is officially supported neither by Microsoft nor by Dataiku.

It is possible to configure access to an Azure HDInsight cluster when DSS is running on a regular Azure VM (not managed by the HDInsight cluster itself). This approach requires installing the proper HDInsight libraries and configuration files on the Azure VM hosting DSS.

This procedure is not documented by Azure. Please contact Dataiku should you need more information.

Using Dataiku DSS on Azure HDInsight

Accessing Dataiku DSS on managed HDInsight edge nodes

After DSS is installed on a managed HDInsight edge node with the Azure Marketplace one-click install, it is accessible through an HTTPS link, which can be retrieved from the “Applications” pane of the Azure portal page for the cluster.

When installing DSS on an edge node using an ARM template, it is necessary to configure a reverse proxy entry for it using an httpsEndpoint property in this template. This property defines the URL through which DSS will be accessible after installation, and defaults to https://CLUSTERNAME-dss.apps.azurehdinsight.net when using Dataiku-provided templates.
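
To illustrate how the default URL is formed, the following Python sketch builds it from the cluster name. Only the https://CLUSTERNAME-dss.apps.azurehdinsight.net pattern comes from the text above; the function name and the configurable suffix parameter are hypothetical and introduced here for illustration.

```python
def default_dss_url(cluster_name: str, suffix: str = "dss") -> str:
    """Return the reverse-proxy URL for a given HDInsight cluster name.

    'suffix' is a hypothetical parameter; the Dataiku-provided templates
    are described as using 'dss'.
    """
    return f"https://{cluster_name}-{suffix}.apps.azurehdinsight.net"


print(default_dss_url("mycluster"))
# prints: https://mycluster-dss.apps.azurehdinsight.net
```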

Operating Dataiku DSS when using managed edge nodes

The HDInsight edge node hosting Dataiku DSS can be accessed via ssh, using the SSH endpoint and credentials shown in the Azure portal (please refer to the HDInsight documentation for details).

All regular DSS operations can be performed through this ssh session:

  • maintenance: starting or stopping DSS, accessing the logs, performing manual backups…
  • upgrading DSS: it is possible to install a new release of Dataiku DSS after the initial deployment. Connect to the edge node host via ssh and follow the regular upgrade procedure.
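
As a rough sketch of such a session, the Python snippet below assembles an ssh command line that runs a DSS maintenance command on the edge node. The host name pattern (EDGENODE.CLUSTERNAME-ssh.azurehdinsight.net), the user name, the edge node name and the DSS data directory path are all assumptions made for illustration; the actual SSH endpoint should be taken from the Azure portal. The remote command calls the DSS control script (bin/dss in the data directory) with its status action.

```python
import shlex


def edge_node_ssh_command(user, edge_node, cluster, remote_command=None):
    """Build the argument list for an ssh call to an HDInsight edge node.

    The host name pattern used here is an assumption; check the Azure
    portal for the real endpoint of your cluster.
    """
    host = f"{edge_node}.{cluster}-ssh.azurehdinsight.net"
    argv = ["ssh", f"{user}@{host}"]
    if remote_command is not None:
        # ssh runs this string through the remote user's shell.
        argv.append(remote_command)
    return argv


# Hypothetical names and paths, for illustration only.
command = edge_node_ssh_command(
    "sshuser", "dss-edge", "mycluster", "/path/to/dss_data/bin/dss status"
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a terminal. Omitting remote_command yields a plain interactive login.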

Interacting with cluster storage

The cluster primary storage (WASB- or ADLS-based, defined when the cluster is created) is accessible from DSS under the standard predefined HDFS connections (hdfs_root and hdfs_managed).

Dataiku DSS can interact with additional Azure Blob Storage containers to read and write datasets. Please refer to the Azure Blob Storage as Hadoop filesystem documentation to get started. It is similarly possible to connect to Azure Data Lake Store by configuring additional HDFS connections using the adl://... scheme instead of wasb://....
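
For reference, the root URIs entered in such additional HDFS connections usually follow the standard Hadoop conventions for these two schemes. The Python sketch below shows how they are composed; the helper names, account names, container name and paths are hypothetical.

```python
def wasb_uri(container: str, account: str, path: str = "/") -> str:
    """Hadoop URI for a path in an Azure Blob Storage container."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adl_uri(account: str, path: str = "/") -> str:
    """Hadoop URI for a path in an Azure Data Lake Store account."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


print(wasb_uri("data", "myaccount", "/datasets/sales"))
# prints: wasb://data@myaccount.blob.core.windows.net/datasets/sales
print(adl_uri("mylake", "/datasets"))
# prints: adl://mylake.azuredatalakestore.net/datasets
```

Access to these locations also requires the matching credentials to be available in the Hadoop configuration, which this sketch does not cover.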