Interaction with Spark

Architecture

When running a Spark job as user A in DSS:

  • DSS acquires Hadoop delegation tokens on behalf of A
  • DSS starts the Spark driver as user A, using the sudo mechanism, and passes it the Hadoop delegation tokens
  • The Spark driver can then start its executors as A, using its YARN delegation tokens.

Thus, DSS in multi-user-security mode only supports the yarn-client deployment mode of Spark. Running against a standalone master or in local mode is not recommended, because it is the YARN application manager that is responsible for renewing the delegation tokens.

DSS does not support the yarn-cluster mode.
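
The mode constraints above can be summarized as a small check. This is only an illustrative sketch, not part of DSS: the function name `classify_spark_mode` and its return labels are hypothetical. It assumes the usual Spark master strings (`yarn-client`, `yarn-cluster`, `local[...]`, `spark://...`) and treats `yarn` combined with a `client` or `cluster` deploy mode as equivalent to the legacy `yarn-client` and `yarn-cluster` values.

```python
def classify_spark_mode(master, deploy_mode="client"):
    """Classify a Spark master setting against the constraints described above.

    Hypothetical helper, for illustration only. Returns one of:
    "supported", "unsupported", "not recommended", "unknown".
    """
    master = master.strip().lower()
    deploy_mode = deploy_mode.strip().lower()

    # yarn-client: the driver runs next to DSS, the executors run on YARN
    if master == "yarn-client" or (master == "yarn" and deploy_mode == "client"):
        return "supported"

    # yarn-cluster: the driver itself would run inside YARN
    if master == "yarn-cluster" or (master == "yarn" and deploy_mode == "cluster"):
        return "unsupported"

    # local mode or standalone master: no YARN application manager
    # is present to renew the delegation tokens
    if master.startswith("local") or master.startswith("spark://"):
        return "not recommended"

    return "unknown"
```

For example, `classify_spark_mode("yarn-client")` yields `"supported"`, while `classify_spark_mode("local[4]")` yields `"not recommended"`.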

Hive Metastore

On Hadoop, it is possible to restrict access to the Hive metastore server so that only HiveServer2 can talk to it.

In a “regular” setup, any user can authenticate (using Kerberos) to the metastore server and issue DDL commands. If the metastore is secured, only HiveServer2 may do so.

Spark does not use HiveServer2: when you create a HiveContext in Spark, it always talks directly to the Hive metastore. Thus, when the Hive metastore is configured for restricted access, Spark's access to the metastore fails. This has the following consequences:

  • Using a HiveContext in Spark code recipes fails (SQLContext remains available)
  • Using table definitions from the Hive metastore in Spark code recipes is not possible (including SparkSQL)
  • Some visual recipes fail when run on Spark, since they require HiveContext-only features
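
Since a plain SQLContext remains available, one possible pattern in a Spark code recipe is to load data by path and register it as a temporary table, so that no table definition is looked up in the Hive metastore. The sketch below is an assumption-laden illustration rather than a DSS-recommended method: the helper name `query_without_metastore`, the path and the table name are hypothetical, and it assumes the Spark 1.x PySpark API (`sqlContext.read.parquet`, `DataFrame.registerTempTable`, `sqlContext.sql`).

```python
def query_without_metastore(sql_context, parquet_path, temp_name, query):
    """Run a SparkSQL query on files read by path (hypothetical helper).

    The data is loaded directly from storage and registered as a temporary
    table in the context's own catalog, so the query does not rely on any
    table definition stored in the Hive metastore.
    """
    # Read the files directly instead of resolving a metastore table
    df = sql_context.read.parquet(parquet_path)
    # Make the DataFrame visible to SQL under a temporary name
    df.registerTempTable(temp_name)
    # The query must only reference temporary tables registered this way
    return sql_context.sql(query)


# Usage in a PySpark recipe (path and names are placeholders):
#   from pyspark import SparkContext
#   from pyspark.sql import SQLContext
#   sc = SparkContext()
#   sqlContext = SQLContext(sc)   # not HiveContext
#   result = query_without_metastore(
#       sqlContext, "/hypothetical/path/events", "events",
#       "SELECT COUNT(*) FROM events")
```

The user running the job still needs read permission on the underlying files.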