Connecting to Hadoop

Data Science Studio is able to connect to a Hadoop cluster and to:

  • Read and write HDFS datasets
  • Run Hive queries and scripts
  • Run Pig scripts

Data Science Studio supports the following Hadoop distributions:

  • Cloudera CDH versions 5.3 and 5.4
  • MapR Distribution including Apache Hadoop version 4.0
  • Hortonworks HDP version 2.1
  • Hortonworks HDP version 2.2.0

Warning

Despite its version name, HDP 2.2.4 is a major upgrade to HDP 2.2.0. At the present time, using DSS on HDP 2.2.4 is not officially supported, and there are known issues.

For more information about Hadoop and DSS, see DSS and Hadoop.

Running Data Preparation recipes on Hadoop is only supported if the cluster runs a compatible version of Java, which might not be the case for older installations of CDH 4.x.

Prerequisites

The host running DSS must have client access to the cluster (it may, but does not need to, host a cluster role such as a datanode). Getting client access to the cluster normally involves installing:

  • the Hadoop client libraries (Java jars) matching the Hadoop distribution running on the cluster
  • the Hadoop configuration files so that client processes (including DSS) can find and connect to the cluster

This should be done at least for the HDFS and YARN/MapReduce subsystems, and optionally for Hive, Pig and Impala if you plan to use them with DSS. See the documentation of your cluster distribution.

You may also need to set up a writable HDFS home directory for DSS (typically /user/dataiku) if you plan to store DSS datasets in HDFS.
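As a sketch, on an unsecured CDH- or HDP-style cluster this can be done with the HDFS superuser. The commands below assume DSS runs as a local user named "dataiku" (an assumption; substitute your actual service account), and that your cluster's superuser is "hdfs" (MapR clusters typically use "mapr" instead):

```shell
# Create the HDFS home directory for the DSS service account.
# Assumes: superuser "hdfs", DSS running as local user "dataiku".
sudo -u hdfs hadoop fs -mkdir -p /user/dataiku
# Hand ownership to the DSS user so it can write datasets there.
sudo -u hdfs hadoop fs -chown dataiku /user/dataiku
```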

For both Cloudera and MapR, a prerequisite is to have the “hadoop” binary in your PATH. To test it, simply run:

hadoop version

If it displays version information, your Hadoop installation is correctly set up for Data Science Studio to use it.
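Beyond `hadoop version`, a small shell helper can report at a glance which client binaries are visible in the PATH. This is only a convenience sketch (the `check_cmd` name is hypothetical); `hive` and `pig` matter only if you plan to use those subsystems:

```shell
# Report whether a client binary is reachable from the PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: OK ($(command -v "$1"))"
    else
        echo "$1: MISSING from PATH"
    fi
}

check_cmd hadoop   # required
check_cmd hive     # only if you plan to use Hive recipes
check_cmd pig      # only if you plan to use Pig recipes
```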

Setting up HDFS connectivity

If Hadoop was installed before Data Science Studio, the connection to HDFS is configured automatically during DSS installation.

If Hadoop was installed after Data Science Studio, you need to make DSS re-detect your cluster:

  • Go to the Data Science Studio data directory
  • Stop Data Science Studio:
./bin/dss stop
  • Run the post-install script
./bin/post-install
  • Restart Data Science Studio
./bin/dss start

Test it

To test HDFS connectivity, try to create an HDFS dataset:

../_images/new-hdfs-dataset.png

Note

If the Hadoop HDFS button does not appear, Data Science Studio has not properly detected your Hadoop installation.

You can then select the “hdfs_root” connection (which gives access to the whole HDFS hierarchy), click the Browse button, and verify that you can see your HDFS data.
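The same check can be done from the command line, independently of the DSS UI. This assumes the Hadoop client is configured as described in the prerequisites; if this command lists your HDFS root, the “hdfs_root” connection should be able to browse it too:

```shell
# List the HDFS root using the same client configuration DSS relies on.
hadoop fs -ls /
```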

Setting up Hive connectivity

To run Hive recipes, you need to have the Hive packages installed on the host running Data Science Studio. A Hive Server is not required.

If Hive was already installed when you installed Data Science Studio, connectivity to Hive is automatically set up.

If you installed Hive after installing Data Science Studio, you need to perform a new detection. See the HDFS connectivity section.

Main metastore

If enabled, Data Science Studio can create tables for the HDFS datasets in the “default” Hive metastore, i.e. the metastore that is used when the “hive” command is launched without arguments.

Note

It is strongly recommended that the default metastore use the “Shared metastore” deployment mode.
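To check that the plain `hive` command (the one DSS will use) can reach the default metastore, a quick sanity check is to run a simple query against it. This is a sketch and requires the Hive client packages and a reachable metastore:

```shell
# Query the default metastore through the plain "hive" command.
# A working setup lists at least the "default" database.
hive -e 'SHOW DATABASES;'
```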

Setting up Pig connectivity

To run Pig recipes, you need to have the Pig packages installed on the host running Data Science Studio. It is required to have the “pig” command in the PATH.
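A quick way to verify both requirements at once is to ask Pig for its version. This is a sanity-check sketch; both commands need the Pig packages installed locally:

```shell
command -v pig   # should print the path to the pig launcher
pig -version     # should print the Apache Pig version banner
```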

If Pig was already installed when you installed Data Science Studio, connectivity to Pig is automatically set up.

If you installed Pig after installing Data Science Studio, you need to perform a new detection. See the HDFS connectivity section.