Setup

Initial setup

In the rest of this document:

  • dssuser means the UNIX user that runs the DSS software
  • DATADIR means the data directory in which DSS runs

Prerequisites and required information

Please read the Prerequisites and limitations documentation carefully, and check that you have all of the required information.

The most important parts here are:

  • Having a keytab for the dssuser (a quick way to verify it is sketched after this list)
  • Having administrator access to the Hadoop cluster
  • Having root access to the local machine
  • Having an initial list of end-user groups allowed to use the impersonation mechanisms.
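
For instance, you can verify the dssuser keytab with standard Kerberos tools (the keytab path and realm below are illustrative):

% klist -kt /home/dssuser/dssuser.keytab
% kinit -kt /home/dssuser/dssuser.keytab dssuser@EXAMPLE.COM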

Perform a regular DSS installation

Note

It is possible to set up Spark integration after setting up multi-user security, but this requires more manual work, so we strongly recommend that you start by setting up Spark integration.

Configure your Hadoop cluster

Note

This part must be performed by the Hadoop administrator. A restart of your Hadoop cluster may be required.

You now need to allow the dssuser user to impersonate the users in all of the end-user groups that you previously identified.

This is done by adding the hadoop.proxyuser.dssuser.groups and hadoop.proxyuser.dssuser.hosts configuration keys to your Hadoop configuration (core-site.xml). These respectively specify the groups whose members DSS is allowed to impersonate, and the hosts from which DSS is allowed to impersonate these users.

The hadoop.proxyuser.dssuser.groups parameter should be set to a comma-separated list containing:

  • A list of end-user groups which collectively contain all DSS users
  • The group with which the hive user creates its files (generally: hive on Cloudera, hadoop on HDP)
  • In addition, on Cloudera, the group with which the impala user creates its files (generally: impala)

Alternatively, this parameter can be set to * to allow DSS to impersonate all cluster users (effectively disabling this extra security check).

The hadoop.proxyuser.dssuser.hosts parameter should be set to the fully-qualified host name of the server on which DSS is running. Alternatively, this parameter can be set to * to allow all hosts (effectively disabling this extra security check).
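
For reference, the resulting entries in core-site.xml could look like the following (the group names and host name are illustrative):

<property>
  <name>hadoop.proxyuser.dssuser.groups</name>
  <value>dss_users,hive</value>
</property>
<property>
  <name>hadoop.proxyuser.dssuser.hosts</name>
  <value>dss.example.com</value>
</property>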

Make sure Hadoop configuration is properly propagated to all cluster hosts and to the host running DSS. Make sure that all relevant Hadoop services are properly restarted.
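
For example, you can check that a given host sees the new values with:

% hdfs getconf -confKey hadoop.proxyuser.dssuser.groups
% hdfs getconf -confKey hadoop.proxyuser.dssuser.hosts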

With Cloudera Manager

(NB: This information is given for informational purposes only. Please refer to the official Cloudera documentation for your Cloudera version.)

  • In Cloudera Manager, navigate to HDFS > Configuration and search for “proxyuser”
  • Add two new keys in the “Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml” section.
    • Name: hadoop.proxyuser.dssuser.groups
    • Value: comma-separated list of the Hadoop groups of your end users, plus hive and impala
    • Name: hadoop.proxyuser.dssuser.hosts
    • Value: fully-qualified DSS host name, or *
  • Save changes
  • At the top of the HDFS page, click the “Stale configuration: restart needed” icon, then click “Restart Stale Services” and “Restart now”

With Ambari

(NB: This information is given for informational purposes only. Please refer to the official Hortonworks documentation for your HDP version.)

  • In Ambari, navigate to HDFS > Configs > Advanced, and search for “proxyuser”
  • In “Custom core-site”, add two new properties:
    • Key: hadoop.proxyuser.dssuser.groups
    • Value: comma-separated list of Hadoop groups of your end users, plus hadoop
    • Key: hadoop.proxyuser.dssuser.hosts
    • Value: fully-qualified DSS host name, or *
  • Save changes, enter a description
  • On the “Restart required” warning that appears, click “Restart”, then “Restart all affected”

Cloudera additional setup for Impala

If you plan on using Impala, you must perform an additional setup step, because Impala does not use the regular proxyuser mechanism.

  • In Cloudera Manager, go to Impala > Configuration
  • In the Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve) setting, add: authorized_proxy_user_config=dssuser=enduser_group_1,enduser_group_2,...

Initialize multi-user security

  • As dssuser, stop DSS:
% cd DATADIR
% ./bin/dss stop
  • As root, run, from DATADIR:
# ./bin/dssadmin install-impersonation dssuser

Please pay attention to the messages emitted by this procedure. In particular, you might need to manually add a snippet to your sudoers configuration.

As root, edit the DATADIR/security/security-config.ini file. In the [users] section, fill in the allowed_user_groups setting with the list of UNIX groups that your end users belong to. Only users belonging to these groups will be allowed to use the local code impersonation mechanism.
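
For example, assuming your end users belong to a UNIX group named dss_users (an illustrative name; see the comments in the file itself for the exact list syntax if you have several groups):

[users]
allowed_user_groups = dss_users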

Cloudera additional setup (Impala)

If you want to use Impala, you need to install the Cloudera Impala JDBC Driver.

Download the driver from the Cloudera Downloads website. You should obtain a Zip file named impala_jdbc_VERSION.zip, which contains two more Zip files. Unzip the “JDBC 4.1” version of the driver (the “JDBC 4” version will not work).

Copy the ImpalaJDBC41.jar file to the lib/jdbc folder of DSS. Beware: you must not copy the other JARs.
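
For example (the archive names are illustrative and vary with the driver version):

% unzip impala_jdbc_VERSION.zip
% unzip Cloudera_ImpalaJDBC41_VERSION.zip
% cp ImpalaJDBC41.jar DATADIR/lib/jdbc/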

Configure filesystem access on the DSS folders

You need to ensure that all end-user groups have read-only access to:

  • The DSS datadir (including all parent folders)
  • The DSS installation directory (including all parent folders)

dssadmin install-impersonation automatically sets 711 permissions on the DSS datadir, but you might need to ensure proper access to the parent folders.
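
For example, you can inspect the permissions of each path component, and open up traversal on parent folders if needed (the paths below are illustrative):

% namei -m /data/dataiku/dss_data
% chmod o+x /data /data/dataiku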

Configure identity mapping

  • As dssuser, start DSS
  • Log in as a DSS administrator, and go to Administration > Settings > Security
  • DSS comes preconfigured with simple identity mapping rules (one-to-one on both users and groups).
  • You can adjust these rules if needed. For more information, see Concepts
  • Save settings if needed

Setup Hive and Impala access

  • Go to Administration > Settings > Hadoop
  • Fill in the HiveServer2 host and principal if needed, as described in Connecting to secure clusters
  • Fill in the “Hive user” setting with the name of the user running HiveServer2 (generally: hive)
  • Switch “Default execution engine” to “HiveServer2”

Cloudera additional setup (Impala)

  • Go to Administration > Settings > Hadoop
  • Fill in the Impala hosts and principal if needed, as described in Connecting to secure clusters
  • Fill in the “Impala user” setting with the name of the user running impalad (generally: impala)
  • Check the “Use Cloudera Driver” setting

Hive metastore storage-based authorization

If your Hive metastore is configured to use storage-based authorization (which is enabled by default on HDP), you also need to enable the “Write ACL in datasets” option in the Hive settings.

Initialize ACLs on HDFS connections

Go to the settings of the hdfs_managed connection and click “Resync Root permissions”.

If you have other HDFS connections, do the same thing for them.
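
To double-check a connection, you can inspect the permissions and ACLs of its root path from a shell (the path below is illustrative; use the root path configured in the connection):

% hdfs dfs -ls -d /user/dataiku/dss_managed
% hdfs dfs -getfacl /user/dataiku/dss_managed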

Test

  • Grant at least one of your user groups the right to create projects
  • Log in as an end user
  • Create a project with key PROJECTKEY
  • As a Hive administrator, create a database named dataiku_PROJECTKEY and use Sentry or Ranger to grant the end-user group the right to use this database. Details on how to do that, as well as alternative deployment modes, are in Operations; a Sentry-based sketch follows this list
  • As the end user in DSS, check that you can:
    • Create external HDFS datasets
    • Create prepare recipes writing to HDFS datasets
    • Synchronize datasets to the Hive metastore
    • Create Hive recipes to write new HDFS datasets
    • Use Hive notebooks
    • Create Python recipes
    • Use Python notebooks
    • Create Spark recipes
    • If you have Impala, create Impala recipes
    • If you have Impala, use Impala notebooks
    • Create visual recipes and use all available execution engines
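
As an illustration of the database creation step above, with Sentry the commands could look like the following, run in a beeline session as a Sentry administrator (the role and group names are illustrative):

CREATE DATABASE dataiku_PROJECTKEY;
CREATE ROLE dss_projectkey;
GRANT ALL ON DATABASE dataiku_PROJECTKEY TO ROLE dss_projectkey;
GRANT ROLE dss_projectkey TO GROUP your_enduser_group;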