Configuring Hadoop security

Data Science Studio can connect to Hadoop clusters running in secure mode, where cluster users need to be authenticated by Kerberos in order to be authorized to use cluster resources.

When configured to use Hadoop security, Data Science Studio logs in to Kerberos upon startup, using a preconfigured identity (Kerberos principal) and a secret key stored in a local file (Kerberos keytab). Upon success, this initial authentication phase returns Kerberos credentials suitable for use with the Hadoop cluster.

Data Science Studio then uses these credentials whenever it needs to access Hadoop resources. This includes reading and writing HDFS files, running DSS preparation scripts over the cluster, running Pig or Hive recipes, accessing the Hive metastore, and using Hive or Impala notebooks.

As the credentials returned by the Kerberos login phase typically have a limited lifetime, Data Science Studio periodically renews them as long as it is running, in order to keep continuous access to the Hadoop cluster.

Warning

Data Science Studio uses its own Kerberos credentials to access all Hadoop resources, regardless of the currently logged-in DSS user.

As a consequence, granting a given user access to Data Science Studio indirectly gives this user access to all cluster resources reachable through the DSS Kerberos identity. Make sure this is compatible with your cluster security policy, and design the set of Hadoop permissions granted to the DSS Kerberos identity accordingly.

Setting up the DSS Kerberos account

The first steps in configuring Hadoop security support consist of setting up the Kerberos account which DSS will use to access cluster resources (a sample command sketch follows the list below):

  • Create a Kerberos principal (user or service account) for this DSS instance in your Kerberos account database. You can choose any principal name for this, according to your local account management policy.

    Typical values include dataiku@MY.KERBEROS.REALM and dataiku/HOSTNAME@MY.KERBEROS.REALM, where dataiku is the name of the Unix user account used by DSS, MY.KERBEROS.REALM is the uppercase name of your Kerberos realm, and HOSTNAME is the fully-qualified name of the Unix server hosting DSS.

  • Create a Kerberos keytab for this account, and store it in a file accessible only to Data Science Studio.

  • Configure your Hadoop cluster to authorize this principal to access the cluster resources required for DSS operation, including:

    • read-write access to the HDFS directories used as managed dataset repositories (typically: /user/dataiku)
    • read-only access to any additional HDFS directories containing datasets
    • read-write access to the Hive metastore database used by DSS (typically named dataiku)
    • permission to launch map-reduce jobs
  • Install the Kerberos client software and configuration files on the DSS Unix server so that processes running on it can find and contact the Kerberos authentication service. In particular, the kinit Unix command must be in the execution PATH of the DSS user account, and must be functional.
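
For reference, the following sketch shows one way to perform these steps on a cluster using MIT Kerberos. It is an example only: the kadmin.local invocations, the principal name dataiku@MY.KERBEROS.REALM, the keytab path /home/dataiku/dataiku.keytab and the hostnames are assumptions to adapt to your environment, and the HDFS commands require HDFS administrator credentials.

root@kdchost# # On the KDC: create the DSS principal with a random key, and export a keytab
root@kdchost# kadmin.local -q "addprinc -randkey dataiku@MY.KERBEROS.REALM"
root@kdchost# kadmin.local -q "ktadd -k /tmp/dataiku.keytab dataiku@MY.KERBEROS.REALM"
root@kdchost# # Copy the keytab to the DSS server, for example with scp
root@kdchost# scp /tmp/dataiku.keytab dsshost:/home/dataiku/dataiku.keytab

root@dsshost# # On the DSS server: make the keytab readable only by the DSS Unix user
root@dsshost# chown dataiku /home/dataiku/dataiku.keytab
root@dsshost# chmod 600 /home/dataiku/dataiku.keytab
root@dsshost# # Create the managed dataset repository in HDFS, owned by the DSS account
root@dsshost# hdfs dfs -mkdir -p /user/dataiku
root@dsshost# hdfs dfs -chown dataiku /user/dataiku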

You can check the above steps by attempting to access HDFS using the DSS Kerberos credentials, as follows:

root@dsshost# # Open a session on the DSS Unix server using the DSS Unix user account
root@dsshost# su - dataiku
dataiku@dsshost> # Log in to Kerberos using the DSS principal and keytab
dataiku@dsshost> kinit -k -t DSS_KEYTAB_FILE DSS_KERBEROS_PRINCIPAL
dataiku@dsshost> # Check the Kerberos credentials obtained above
dataiku@dsshost> klist
dataiku@dsshost> # Attempt to read DSS's HDFS home directory using the Kerberos credentials
dataiku@dsshost> hdfs dfs -ls /user/dataiku
dataiku@dsshost> # Log out of the Kerberos session
dataiku@dsshost> kdestroy

Configuring DSS for Hadoop security

You enable Hadoop security support in Data Science Studio by adding the following lines to the global configuration file DATA_DIR/config/dip.properties:

hadoop.security.kerberos = true
hadoop.kerberos.principal = DSS_KERBEROS_PRINCIPAL
hadoop.kerberos.keytab = ABSOLUTE_PATH_TO_DSS_KEYTAB_FILE
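
Before restarting, you can optionally verify that the configured principal matches the keytab contents. This is a quick sanity check, assuming the MIT Kerberos client tools are installed: klist -k -t lists the principals stored in a keytab file, and the list should include the value of hadoop.kerberos.principal.

dataiku@dsshost> klist -k -t ABSOLUTE_PATH_TO_DSS_KEYTAB_FILE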

You need to restart Data Science Studio for this configuration update to be taken into account:

dataiku@dsshost> DATA_DIR/bin/dss restart

Configuring Kerberos credentials periodic renewal (optional)

When Data Science Studio logs in to the Kerberos authentication service using its keytab, it typically receives credentials with a limited lifetime. In order to retain permanent access to the Hadoop cluster, Data Science Studio continuously renews these credentials by logging in again to the Kerberos service, on a configurable periodic basis.

The default renewal period is one hour, which should be compatible with most Kerberos configurations (where credential lifetimes are typically on the order of one day). You can however adjust this behaviour through two additional configuration keys, to add to the same file DATA_DIR/config/dip.properties:

# Kerberos login period, in seconds - default 1 hour
hadoop.kerberos.ticketRenewPeriod = 3600
# Delay after which to retry a failed login, in seconds - default 5 minutes
hadoop.kerberos.ticketRetryPeriod = 300
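
Note that the renewal period must remain well below the maximum ticket lifetime enforced by your Kerberos realm, otherwise the credentials may expire before they are renewed. As a quick check, you can log in manually with the DSS keytab and inspect the validity dates reported by klist (its Expires column shows when the credentials lapse):

dataiku@dsshost> kinit -k -t DSS_KEYTAB_FILE DSS_KERBEROS_PRINCIPAL
dataiku@dsshost> klist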

Configuration file example

The following shows a complete Data Science Studio configuration file, with Hadoop security enabled:

dataiku@dsshost> cat DATA_DIR/config/dip.properties
mail.smtp.host = localhost
hadoop.security.kerberos = true
hadoop.kerberos.principal = dataiku@MY.KERBEROS.REALM
hadoop.kerberos.keytab = /home/dataiku/dataiku.keytab
hadoop.kerberos.ticketRenewPeriod = 1800
hadoop.kerberos.ticketRetryPeriod = 60