Microsoft Azure HDInsight¶
Experimental feature: DSS does not officially support Microsoft Azure HDInsight. Integration is provided on a best-effort basis.
Azure HDInsight is a fully-managed offering that provides Hadoop and Spark clusters, and related technologies, on the Microsoft Azure cloud. HDInsight is a cloud distribution of the Hadoop components based on the Hortonworks Data Platform (HDP), with a default filesystem configured either in Azure Blob Storage or Azure Data Lake.
HDInsight makes it easy, fast, and cost-effective to process massive amounts of data.
DSS has been tested on the following HDInsight configurations:
- HDInsight “Hadoop” clusters version 3.5 and 3.6
- HDInsight “Spark” clusters version 3.5 (Spark 1.6 and 2.0) and 3.6 (Spark 2.1 to 2.3)
- HDInsight “Spark” clusters version 4.0 (Spark 2.3 and 2.4) (experimental support, with cluster configuration adjustments described below)
The following are not supported:
- Connecting DSS to a domain-joined HDInsight cluster (Enterprise Security Package)
- Multi-user security on HDInsight
Connecting Dataiku DSS to Azure HDInsight¶
DSS running on an edge node managed by HDInsight¶
Running Dataiku DSS on an edge node created and managed directly by the Azure HDInsight cluster is the recommended deployment mode. In this case, HDInsight creates and configures the edge node itself, and Dataiku DSS can be installed on this edge node.
Azure HDInsight managed edge nodes are not persistent. If the HDInsight cluster is stopped or restarted for any reason, all Dataiku DSS data and configuration files will be lost. Please make sure to perform very frequent backups of your DSS installation to mitigate this risk.
Azure HDInsight managed edge nodes are not visible from the Azure resource manager and thus cannot leverage Azure persistent disks or other Azure native tools to perform automated backups.
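Since backups have to be driven from the edge node itself, a small script run on a schedule is one way to approach them. The following is a minimal sketch, not a Dataiku-provided tool: the function name and the archive naming are hypothetical, and the script only builds a timestamped archive of a directory. Copying the archive off the edge node (for example to the cluster storage) and any DSS-specific consistency precautions are left out and should follow the regular DSS backup documentation.

```python
import tarfile
import time
from pathlib import Path


def archive_data_dir(data_dir, dest_dir):
    """Write a timestamped .tar.gz of data_dir into dest_dir and return its path.

    data_dir: directory to archive (assumed here to be the DSS data directory).
    dest_dir: directory receiving the archive; it must not be inside data_dir.
    """
    data_dir = Path(data_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"dss-backup-{stamp}.tar.gz"

    # Store entries under the directory's own name rather than its full path.
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(data_dir), arcname=data_dir.name)
    return archive
```

The returned path can then be handed to whatever transfer step moves the archive to persistent storage.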
It is possible to install Dataiku DSS directly from the HDInsight configuration panel in the Azure Portal, either for new or existing clusters. The one-click installation procedure can be found under the HDInsight “optional Applications” menu, and is also accessible directly from the Azure Marketplace.
Using an Azure Resource Manager (ARM) template¶
For more control over the deployment options (for instance, to adjust the size of the edge node VM or the DSS version to deploy), it is also possible to create the edge node and install DSS by using the underlying ARM template directly. An example of this ARM template can be found in this GitHub repository. The template can be deployed directly from GitHub, or the content of the azuredeploy.json file can be copied into your own templates and adjusted as needed.
Manual DSS installation¶
It is necessary to add the following stanza to the Installation configuration file, for compatibility with the HDInsight reverse proxy:
[server]
websocket_permessage_deflate = false
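When the installation is scripted, the stanza can be added programmatically rather than by hand. The sketch below is one possible approach under stated assumptions: `ensure_setting` is a hypothetical helper, and it relies on Python's `configparser`, which drops any comments present in the file when writing it back. It operates on the file content as text so that reading and writing the actual file is left to the caller.

```python
import configparser
import io


def ensure_setting(ini_text, section, key, value):
    """Return ini_text with `key = value` set in `[section]`.

    The section is created if missing; existing settings are kept.
    Note: comments in the original text are not preserved.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(ini_text)

    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, key, value)

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Example: add the HDInsight reverse proxy setting to an existing configuration.
updated = ensure_setting(
    "[server]\nport = 11200\n",
    "server",
    "websocket_permessage_deflate",
    "false",
)
```

Applying the function twice yields the same text, so it is safe to run on every deployment.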
DSS running outside of HDInsight¶
This deployment mode is not officially supported by either Microsoft or Dataiku.
It is possible to configure access to an Azure HDInsight cluster when DSS is running on a regular Azure VM (not managed by the HDInsight cluster itself). This approach requires installing the proper HDInsight libraries and configuration files on the Azure VM hosting DSS.
This procedure is not documented by Azure. Please contact Dataiku should you need more information.
Using Dataiku DSS on Azure HDInsight¶
Accessing Dataiku DSS on managed HDInsight edge nodes¶
After DSS is installed on a managed HDInsight edge node with the Azure marketplace one-click install, it is accessible through an HTTPS link which can be retrieved from the “Applications” pane of the Azure portal page for the cluster.
When installing DSS on an edge node using an ARM template, it is necessary to configure a reverse proxy entry for it using a property in this template. This property defines the URL through which DSS will be accessible after installation, and defaults to https://CLUSTERNAME-dss.apps.azurehdinsight.net when using Dataiku-provided templates.
Operating Dataiku DSS when using managed edge nodes¶
The HDInsight edge node hosting Dataiku DSS can be accessed via ssh (please refer to the HDInsight documentation and to the information provided in the Azure portal). This is typically done with:
It is possible to perform all regular DSS operations through it:
- maintenance: starting or stopping DSS, accessing the logs, performing manual backups…
- upgrading DSS: it is possible to install a new release of Dataiku DSS after the initial deployment. Connect to the edge node host via ssh and follow the regular upgrade procedure.
Interacting with cluster storage¶
The cluster primary storage (WASB- or ADLS-based, defined when the cluster is created) is accessible from DSS under the standard predefined HDFS connections (hdfs_root and hdfs_managed).
Dataiku DSS can interact with additional Azure Blob Storage containers to read and write datasets. Please refer to the Azure Blob Storage as Hadoop filesystem documentation to get started. It is similarly possible to connect to Azure Data Lake Store by configuring additional HDFS connections using the adl://... scheme instead of the Azure Blob Storage one.
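To make the two URI shapes concrete, here is a small sketch that builds root paths for additional HDFS connections. The helper names, account names and container names are hypothetical; the URI layouts follow the usual Hadoop conventions for Azure Blob Storage (`wasb://` or `wasbs://`) and Azure Data Lake Store (`adl://`), and should be checked against the storage accounts actually attached to the cluster.

```python
def wasb_uri(container, account, path="/", secure=True):
    """Build a Hadoop URI for a path in an Azure Blob Storage container."""
    scheme = "wasbs" if secure else "wasb"  # wasbs is the TLS variant
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adl_uri(account, path="/"):
    """Build a Hadoop URI for a path in an Azure Data Lake Store account."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


# Hypothetical names, for illustration only.
print(wasb_uri("datasets", "mystorageaccount", "/dss/managed"))
print(adl_uri("mydatalake", "/dss/managed"))
```

Either string can serve as the root path URI of an additional HDFS connection, provided the cluster already holds credentials for the target account.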
Connecting DSS to HDInsight 4.0¶
DSS is compatible with HDInsight 4.0 clusters configured with an Azure Data Lake Gen2 (ABFS) filesystem.
DSS is not directly compatible with the default Hive security model deployed on HDInsight 4.0 clusters configured with an Azure Storage (WASB) filesystem, as:
- DSS expects to be able to create external Hive table definitions for its HDFS dataset files
- On HDInsight 4.0, Hive is configured to run with user impersonation disabled (where all file access from HiveServer2 is done with the “hive” user account) and storage-based security enabled (which checks that HDFS directories underlying Hive tables are writable by this account), in effect forbidding the creation of tables pointing to HDFS directories owned by the DSS user account.
It is however possible to use DSS Hive integration with HDInsight 4.0 / WASB by switching the Hive security mode in one of the following ways:
Disable storage-based authorization in Hive (reverting to the default mode used in HDInsight 3.0):
Using Ambari or custom cluster configuration directives, define:
hive.security.metastore.authorization.manager = org.apache.hadoop.hive.ql.security.authorization.MetaStoreAuthzAPIAuthorizerEmbedOnly
Enable user impersonation in HiveServer2 and the Hive metastore:
Using Ambari or custom cluster configuration directives, define:
hive.server2.enable.doAs = true
hive.metastore.execute.setugi = true
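After changing these settings, it can be useful to confirm which mode the cluster configuration actually reflects. The sketch below is an illustration only: the function names are hypothetical, it assumes the standard Hadoop `*-site.xml` layout (`<property>` elements holding `<name>` and `<value>`), and it compares the authorization manager by exact string match, which will not recognize a value listing several managers.

```python
import xml.etree.ElementTree as ET

# Value given above for disabling storage-based authorization.
EMBED_ONLY = (
    "org.apache.hadoop.hive.ql.security.authorization."
    "MetaStoreAuthzAPIAuthorizerEmbedOnly"
)


def read_hadoop_properties(xml_text):
    """Return {name: value} from a Hadoop-style *-site.xml document."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def hive_workaround(props):
    """Return which of the two modes described above is configured, or None."""
    manager = props.get("hive.security.metastore.authorization.manager")
    if manager == EMBED_ONLY:
        return "storage-based authorization disabled"
    do_as = props.get("hive.server2.enable.doAs", "").lower() == "true"
    set_ugi = props.get("hive.metastore.execute.setugi", "").lower() == "true"
    if do_as and set_ugi:
        return "user impersonation enabled"
    return None
```

A typical use is to pass the content of the cluster's hive-site.xml to `read_hadoop_properties` and check that `hive_workaround` does not return `None` before enabling Hive integration in DSS.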