Setting up Spark integration

For an introduction to Spark support in DSS, see DSS and Spark.

Warning

Spark support in DSS is not restricted to Hadoop. You can install Spark and the Spark integration in DSS without a Hadoop cluster.

However, optimal performance will only be achieved by using HDFS, Hive, S3, Azure Storage and Google Cloud Storage datasets.

It is therefore highly recommended that you use Spark mainly for HDFS, Hive, S3, Azure Storage and Google Cloud Storage datasets and install the Hadoop integration.

Dataiku DSS supports Spark versions 1.6, 2.0 to 2.3, and 2.4 (experimental).

Note

DSS has experimental support for Databricks. Please contact your Dataiku Sales Engineer or Customer Success Manager for more information.

Spark provided by your Hadoop distribution

This section applies if a supported Spark version (1.6, 2.0 to 2.3, or the experimental 2.4) is included in your Hadoop distribution.

  • Go to the Dataiku DSS data directory
  • Stop DSS
./bin/dss stop
  • Run the setup
./bin/dssadmin install-spark-integration
  • Start DSS
./bin/dss start

Verify the installation

Go to the Administration > Settings section of DSS. The Spark tab must be available.

Running managed Spark on Kubernetes

If you plan to run Spark on Kubernetes using the Dataiku managed Spark on Kubernetes capabilities, see the managed Spark on Kubernetes documentation (kubernetes/managed).

Manual Spark setup

Warning

Dataiku cannot provide full support for manual Spark setups. This information is provided as best-effort only.

Prepare your Spark environment

If a supported Spark version is not included in your Hadoop distribution, you can download pre-built Spark binaries for the relevant Hadoop version. Do not choose the “Pre-built with user-provided Hadoop” packages: they do not include Hive support, which DSS needs for advanced SparkSQL features.

You’ll then need to configure this Spark installation to point it to your existing Hadoop installation.

  • If you are using CDH or MapR, copy conf/spark-env.sh.template to a new executable file conf/spark-env.sh and set HADOOP_CONF_DIR to the location of your Hadoop configuration directory (typically /etc/hadoop/conf).
  • For HDP, see their tutorial.
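The CDH / MapR step above could be scripted roughly as follows. This is a minimal sketch, not an official Dataiku procedure: the function name configure_spark_env is hypothetical, and both the Spark home and the Hadoop configuration directory are assumptions about your layout.

```shell
#!/usr/bin/env bash
# Sketch: create conf/spark-env.sh from the template and point it at Hadoop.
# Both arguments are assumptions about your layout; adjust them as needed.
configure_spark_env() {
    local spark_home="$1"                           # e.g. /opt/myspark
    local hadoop_conf_dir="${2:-/etc/hadoop/conf}"  # typical default location
    local env_file="$spark_home/conf/spark-env.sh"

    # Start from the template shipped with Spark, unless the file already exists.
    if [ ! -f "$env_file" ]; then
        cp "$spark_home/conf/spark-env.sh.template" "$env_file" || return 1
    fi
    chmod +x "$env_file"

    # Append the setting only once, so the function can be re-run safely.
    if ! grep -q '^export HADOOP_CONF_DIR=' "$env_file"; then
        echo "export HADOOP_CONF_DIR=$hadoop_conf_dir" >> "$env_file"
    fi
}
```

For example, configure_spark_env /opt/myspark /etc/hadoop/conf would prepare the installation used in the rest of this section.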

Test your Spark installation by going to the Spark directory and running:

./bin/spark-shell --master yarn-client

After a little while (and possibly a lot of log messages), you should see a Scala prompt, preceded by the message “SQL context available as sqlContext” (on Spark 2.x, the message is “Spark session available as 'spark'”). Type in the following test code:

sc.parallelize(Seq(1, 2, 3)).sum()

You should then see some more log messages and the expected result, 6.0 (sum() returns a Double). Type :quit to exit.

Set up Spark integration with DSS

Here, we assume that you installed and configured Spark in the /opt/myspark folder.

  • Go to the Dataiku DSS data directory
  • Stop DSS
./bin/dss stop
  • Run the setup
./bin/dssadmin install-spark-integration -sparkHome /opt/myspark
  • Start DSS
./bin/dss start
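The stop / setup / start sequence can also be wrapped in a small script. The sketch below is only an illustration: the function name install_spark_integration is hypothetical, and restarting DSS even when the setup step fails is a design choice of this sketch, not documented DSS behavior. Only the commands shown in the steps above are used.

```shell
#!/usr/bin/env bash
# Sketch: stop DSS, run the Spark integration setup, then restart DSS.
# Arguments: the DSS data directory, and optionally the Spark home (manual setup).
install_spark_integration() {
    local data_dir="$1"
    local spark_home="${2:-}"
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$data_dir" || exit 1
        ./bin/dss stop || exit 1
        if [ -n "$spark_home" ]; then
            ./bin/dssadmin install-spark-integration -sparkHome "$spark_home"
        else
            ./bin/dssadmin install-spark-integration
        fi
        status=$?
        # Restart DSS even if the setup failed, then report the setup status.
        ./bin/dss start || exit 1
        exit "$status"
    )
}
```

Called with only the data directory, the function runs the setup without -sparkHome, which matches the case where Spark is provided by your Hadoop distribution.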

Verify the installation

Go to the Administration > Settings section of DSS. The Spark tab must be available.

Additional topics

Caveat for RedHat / CentOS 6.x clusters

Using PySpark from DSS requires that the cluster executor nodes have access to a Python 2.7 interpreter. On RedHat / CentOS 6.x systems this may not be the case as the system’s default Python is 2.6 (and cannot be upgraded).

Make sure that an additional Python 2.7 is available on all cluster members, and specify its location through an additional argument to the install-spark-integration command, as follows:

./bin/dssadmin install-spark-integration [-sparkHome SPARK_HOME] -pysparkPython PATH_TO_PYTHON2.7_ON_CLUSTER_EXECUTORS
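Before passing a path to -pysparkPython, it can help to confirm that the interpreter at that path really is Python 2.7. The following is a minimal sketch with a hypothetical helper name, is_python27; it checks a single machine, so it would have to be run on each executor node by whatever means you normally use.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the given interpreter exists and reports Python 2.7.
is_python27() {
    local python_bin="$1"
    # The path must point to an executable file.
    [ -x "$python_bin" ] || return 1
    # Ask the interpreter itself for its major.minor version.
    "$python_bin" -c 'import sys; sys.exit(0 if sys.version_info[:2] == (2, 7) else 1)'
}
```

For example, is_python27 /usr/local/bin/python2.7 && echo ok, where the path is an assumption about where the additional interpreter was installed.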

Metastore security

Spark requires direct access to the Hive metastore to run jobs using a HiveContext (as opposed to a SQLContext) and to access table definitions in the global metastore from Spark SQL.

Some Hadoop installations restrict access to the Hive metastore to a limited set of Hadoop accounts (typically, ‘hive’, ‘impala’ and ‘hue’). For Spark SQL to fully work from DSS, you must make sure that the DSS user account is authorized as well. This is typically done by adding a group that contains the DSS user account to the Hadoop configuration key hadoop.proxyuser.hive.groups.

Note

On Cloudera Manager, this configuration is accessible through the Hive Metastore Access Control and Proxy User Groups Override entry of the Hive configuration.
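To check whether a group is already listed in hadoop.proxyuser.hive.groups, a small helper along these lines may be useful. It is a sketch under assumptions: the function name group_is_allowed and the group name in the usage note are hypothetical, and the value read on a client machine may differ from the one the metastore service actually uses.

```shell
#!/usr/bin/env bash
# Sketch: check whether a group appears in a comma-separated proxy-user group list.
# $1: value of hadoop.proxyuser.hive.groups, $2: group containing the DSS user.
group_is_allowed() {
    local allowed_groups group
    # Remove whitespace so "hive, dss" and "hive,dss" are treated alike.
    allowed_groups="$(printf '%s' "$1" | tr -d '[:space:]')"
    group="$2"
    case ",$allowed_groups," in
        ",*,")         return 0 ;;  # wildcard value: every group is allowed
        *",$group,"*)  return 0 ;;  # exact match on one list entry
        *)             return 1 ;;
    esac
}
```

On a node with the Hadoop client configuration, the current value can be read with hdfs getconf -confKey hadoop.proxyuser.hive.groups and passed as the first argument, for example group_is_allowed "$(hdfs getconf -confKey hadoop.proxyuser.hive.groups)" dss-users.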

Configure Spark logging

Spark has DEBUG logging enabled by default. When reading non-HDFS datasets, this causes Spark to log the entire content of the datasets through the “org.apache.http.wire” logger.

We strongly recommend that you modify the Spark logging configuration to switch the org.apache.http.wire logger to the INFO level. Please refer to the Spark documentation for information about how to do this.
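One possible way to apply this recommendation is sketched below, assuming a Spark version that reads conf/log4j.properties in log4j 1.x syntax (the case for the Spark 1.6 to 2.4 releases listed on this page). The function name quiet_http_wire_logger is hypothetical, and whether jobs launched from DSS pick up this particular file depends on your setup, so treat this as an illustration rather than the definitive procedure.

```shell
#!/usr/bin/env bash
# Sketch: set the org.apache.http.wire logger to INFO in Spark's log4j config.
# Assumes a Spark build that reads conf/log4j.properties (log4j 1.x syntax).
quiet_http_wire_logger() {
    local spark_home="$1"
    local props="$spark_home/conf/log4j.properties"

    # Start from the template shipped with Spark if no config exists yet.
    if [ ! -f "$props" ]; then
        cp "$spark_home/conf/log4j.properties.template" "$props" || return 1
    fi

    # Drop any earlier setting for this logger, then append the INFO level.
    grep -v '^log4j\.logger\.org\.apache\.http\.wire=' "$props" > "$props.tmp" || true
    echo 'log4j.logger.org.apache.http.wire=INFO' >> "$props.tmp"
    mv "$props.tmp" "$props"
}
```

For example, quiet_http_wire_logger /opt/myspark updates the installation used in the manual setup section; the function can be re-run without duplicating the setting.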