Setting up Hadoop integration

Data Science Studio is able to connect to a Hadoop cluster and to:

  • Read and write HDFS datasets
  • Run Hive queries and scripts
  • Run Impala queries
  • Run Pig scripts
  • Run preparation recipes on Hadoop

In addition, if you set up Spark integration, you can:

  • Run SparkSQL queries
  • Run preparation, join, stack and group recipes on Spark
  • Run PySpark & SparkR scripts
  • Train & use Spark MLlib models

Data Science Studio supports the following Hadoop distributions:

  • Cloudera’s Distribution Including Apache Hadoop (CDH) versions 5.3 to 5.7
  • MapR Converged Data Platform versions 4.1 to 5.1
  • Hortonworks Data Platform (HDP) versions 2.1 to 2.4

For more information about Hadoop and DSS, see DSS and Hadoop.

Running Data Preparation recipes on Hadoop is only supported if the cluster runs a compatible version of Java, which might not be the case for older installations.

Prerequisites

The host running DSS needs client access to the cluster (it may host cluster roles such as a datanode, but it does not need to). Getting client access to the cluster normally involves installing:

  • the Hadoop client libraries (Java jars) matching the Hadoop distribution running on the cluster.
  • the Hadoop configuration files so that client processes (including DSS) can find and connect to the cluster.

Both of the above operations are typically best done through your cluster manager interface, by adding the DSS machine to the set of hosts managed by the cluster manager and configuring “client” or “gateway” roles for it. If this is not possible, installing the client libraries usually consists of installing software packages from your Hadoop distribution, and the configuration files can typically be downloaded from the cluster manager interface or simply copied from another server connected to the cluster. See the documentation of your cluster distribution.

The above should be done at least for the HDFS and YARN/MapReduce subsystems, and optionally for Hive and Pig if you plan to use these with DSS.
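
You can quickly check that the client configuration is in place. The path below assumes the conventional “/etc/hadoop/conf” client configuration directory; your distribution may use a different location:

# These files should exist and point to your cluster
ls /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/yarn-site.xml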

You may also need to set up a writable HDFS home directory for DSS (typically “/user/dataiku”) if you plan to store DSS datasets in HDFS, as well as a writable Hive metastore database (default “dataiku”) so that DSS can create Hive table definitions for the datasets it creates on HDFS.
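
As an illustration, the following commands create such a home directory and database, assuming the DSS service account is “dataiku” and the default names are kept; adapt them to your installation:

# Run as an HDFS superuser (often "hdfs"); adjust user and database names as needed
hdfs dfs -mkdir -p /user/dataiku
hdfs dfs -chown dataiku /user/dataiku
hive -e "CREATE DATABASE IF NOT EXISTS dataiku"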

Testing Hadoop connectivity prior to installation

A prerequisite is to have the “hadoop” binary in your PATH. To test it, simply run:

hadoop version

It should display version information for your Hadoop distribution.

You can check HDFS connectivity by running the following command from the DSS account:

hdfs dfs -ls /
# Or the following alternate form for older installations, and MapR distributions
hadoop fs -ls /

Hive support requires the “hive” binary to be in your PATH. You can check Hive connectivity by running the following command from the DSS account:

hive -e "show databases"

If it succeeds, and lists the databases declared in your global Hive metastore, your Hive installation is correctly set up for Data Science Studio to use it.

Setting up DSS Hadoop integration

Data Science Studio checks for Hadoop connectivity at installation time, and automatically configures Hadoop integration if possible. In that case, there is nothing more to do.

You can configure or reconfigure DSS Hadoop integration at any later time:

  • Go to the Data Science Studio data directory
cd DATADIR
  • Stop Data Science Studio:
./bin/dss stop
  • Run the setup script
./bin/dssadmin install-hadoop-integration
  • Restart Data Science Studio
./bin/dss start

Warning

You should reconfigure Hadoop integration using the above procedure whenever your cluster installation changes, such as after an upgrade of your cluster software.

Test it

To test HDFS connectivity, try to create an HDFS dataset:

[Screenshot: creating a new HDFS dataset]

Note

If the Hadoop HDFS button does not appear, Data Science Studio has not properly detected your Hadoop installation.

You can then select the “hdfs_root” connection (which gives access to the whole HDFS hierarchy), click the Browse button, and verify that you can see your HDFS data.

Configure it if needed

Upon first setup of DSS Hadoop integration, a number of configuration parameters are defined with default values. You can review and adjust them in Data Science Studio “administration” screens:

  • HDFS connections:

    By default, two HDFS connections are defined: “hdfs_root”, for read-only access to the entire HDFS filesystem, and “hdfs_managed”, to store DSS-generated datasets. You can edit these default connections, in particular their HDFS root path and Hive database name, to match your installation. You can delete them or define additional ones as needed. A quick write-access check is sketched after this list.

  • Impala connectivity:

    Impala connectivity settings are defined in the “Settings / Impala” administration screen. You will probably need to adjust at least the list of Impala servers for your cluster, as this parameter cannot be detected automatically.
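
To verify that DSS will be able to write under the root path configured for the managed connection, you can run a quick write test from the DSS account. The path below is only an example; use the root path shown in your “hdfs_managed” connection settings:

# Replace the path with your managed connection's root
hdfs dfs -mkdir -p /user/dataiku/dss_managed_datasets
hdfs dfs -touchz /user/dataiku/dss_managed_datasets/.dss_write_test
hdfs dfs -rm /user/dataiku/dss_managed_datasets/.dss_write_test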

Hive connectivity

To run Hive recipes, the Hive packages must be installed on the host running Data Science Studio, and a properly configured “hive” command-line client must be available to the DSS user account. A Hive Server or HiveServer2 is neither required nor used by DSS.

If Hive was already installed when you installed Data Science Studio, connectivity to Hive is automatically set up.

If you installed Hive after installing Data Science Studio, you need to perform a new setup. See Setting up DSS Hadoop integration.

Main metastore

If enabled, Data Science Studio can create tables for HDFS datasets in the global Hive metastore, i.e. the metastore that is used when the “hive” command is launched without arguments. These tables are defined in the database namespace configured in the corresponding HDFS connection.
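
Once DSS has built a few HDFS datasets, you can check from the DSS account that the corresponding table definitions appear in the metastore. The database name below assumes the default “dataiku”; use the one configured in your HDFS connection:

hive -e "SHOW TABLES IN dataiku"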

Note

It is strongly recommended that the default metastore use the “Shared metastore” deployment mode.

Pig connectivity

To run Pig recipes, you need to have the Pig packages installed on the host running Data Science Studio, and the “pig” command must be in your PATH.
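
A quick way to check this from the DSS account is shown below; it should print your distribution’s Pig version (this only verifies the client installation, not cluster connectivity):

pig -version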

If Pig was already installed when you installed Data Science Studio, connectivity to Pig is automatically set up.

If you installed Pig after installing Data Science Studio, you need to perform a new setup. See Setting up DSS Hadoop integration.

Impala connectivity

Data Science Studio connects to Impala through a JDBC connection to one of the servers configured in the “Settings / Impala” administration screen.

Hive connectivity is mandatory for Impala use, as Impala connections use the Hive JDBC driver, and Impala table definitions are stored in the Hive metastore.
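
If the “impala-shell” client happens to be installed on the DSS host, you can independently verify that an Impala daemon answers queries (DSS itself does not use impala-shell, only JDBC). The hostname below is a placeholder for one of your configured Impala servers:

# Replace impala-host.example.com with one of your Impala servers
impala-shell -i impala-host.example.com -q "SHOW DATABASES"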

Secure Hadoop connectivity

Connecting to secure Hadoop clusters requires additional configuration steps described in Configuring Hadoop security.
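
For reference, on a Kerberos-secured cluster the connectivity tests above only succeed once a valid ticket is held. A typical check, assuming a keytab has been provisioned for the DSS service account (the principal and keytab path below are placeholders), looks like:

# Obtain a ticket with the DSS keytab, then re-run the HDFS test
kinit -kt /path/to/dss.keytab dss@EXAMPLE.COM
hdfs dfs -ls /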