Cloudera CDH

CDH includes Spark and Impala.

DSS supports CDH 5.9 to 5.15.

Security

  • Connecting to secure clusters is fully supported
  • Multi-user security is supported with Sentry

DSS regular security and Sentry

When using DSS in regular security mode to connect to a Sentry-secured cluster, you need to make some configuration adjustments. See DSS and Hive for more information.

Scala notebook

CDH’s packaging of Spark 1.6 replaces some of the libraries normally used by Spark with older versions. This makes the Spark version bundled with CDH incompatible with the Spark-Scala notebook of DSS.

The only way to get Spark-Scala notebooks working with Spark 1.6 on CDH is to perform a standalone Spark installation. Note that you’ll need to add some configuration keys to your standalone Spark to make it work with YARN.

S3 datasets and Spark 2

The CDH version of Spark 2 repackages some libraries, causing some incompatibilities with DSS S3 code.

Trying to access S3 datasets with CDH Spark 2 raises errors such as `AmazonS3Exception: AWS authentication requires a valid Date or x-amz-date header`.

To work around this issue, add the following configuration key to your Spark configurations:

  • Key: `spark.driver.extraClassPath`
  • Value: `INSTALL_DIR/lib/ivy/common-run/joda-time-2.9.2.jar` (replace INSTALL_DIR with the full path to your DSS installation directory)
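
As an illustration only, the following Python sketch shows how the value could be derived from an installation directory. The function name `joda_classpath_setting` and the example path `/opt/dataiku-dss` are hypothetical and not part of DSS; only the key name and the relative JAR path come from the text above.

```python
from pathlib import PurePosixPath

# Relative path of the joda-time JAR inside the DSS installation directory,
# as given in the workaround above.
JODA_JAR = "lib/ivy/common-run/joda-time-2.9.2.jar"


def joda_classpath_setting(install_dir):
    """Return the Spark configuration key/value pair for the workaround.

    install_dir is the full path to the DSS installation directory.
    """
    jar = PurePosixPath(install_dir) / JODA_JAR
    return {"spark.driver.extraClassPath": str(jar)}


# "/opt/dataiku-dss" is a made-up example path; use your own INSTALL_DIR.
for key, value in joda_classpath_setting("/opt/dataiku-dss").items():
    print(f"Key: {key}")
    print(f"Value: {value}")
```

The printed pair can then be copied into the Spark configuration. The sketch only builds the string and does not check that the JAR exists at that location.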

Impala

On CDH 5.15, if Kerberos is enabled, the following error can appear: `Server impala/X.X.X.X@KERBEROS_DOMAIN not found in Kerberos database`, where X.X.X.X is an IP address. This is a known Impala bug (https://issues.apache.org/jira/browse/IMPALA-7298). To work around it, set `rdns=true` in your Kerberos configuration, as documented in the above link.
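
To check how `rdns` is currently set, a small script along the following lines may help. It is a simplified sketch, not a full parser for the Kerberos configuration format: it assumes the setting lives in the `[libdefaults]` section and that the file is commonly found at `/etc/krb5.conf`, which may differ on your system. The function name `rdns_setting` is hypothetical.

```python
import re

# Values treated as "true" by this sketch; a simplification of what
# Kerberos itself accepts.
TRUE_VALUES = {"true", "yes", "on", "1"}


def rdns_setting(krb5_conf_text):
    """Return the rdns value found in [libdefaults] as a bool.

    Returns None when the section contains no explicit rdns entry.
    """
    section = None
    value = None
    for raw in krb5_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            section = header.group(1)
            continue
        if section == "libdefaults":
            match = re.fullmatch(r"rdns\s*=\s*(\S+)", line)
            if match:
                value = match.group(1).lower() in TRUE_VALUES
    return value


if __name__ == "__main__":
    # Assumed default location; adjust the path for your system.
    try:
        with open("/etc/krb5.conf") as handle:
            print("rdns:", rdns_setting(handle.read()))
    except OSError as error:
        print("Could not read krb5.conf:", error)
```

A result of `False` means the file explicitly disables reverse DNS lookups, which is the situation the workaround addresses. `None` means no explicit entry was found in `[libdefaults]`.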