Set up a new HDFS connection

“Hadoop connection” can mean two things

  • Make DSS aware that there is a Hadoop cluster to which it can connect and run jobs. For this, see Setting up Hadoop integration.
  • Once DSS is aware of Hadoop, configure one or more connections for reading and writing datasets. This is akin to the FTP, SSH or filesystem connections, and is what this page covers.

To set up a new connection to HDFS, go to Administration → Connections → New connection → HDFS.

We suggest creating two connections to HDFS:

  • A read-only connection to all data:

    • root: /           (This is an HDFS path, not a path on the local filesystem.)
    • Hive database name: dataiku
    • allow write, allow managed datasets: unchecked
    • max nb of activities: 0
    • name: hdfs_root
  • A read-write connection, to allow DSS to create and store managed datasets:

    • root: /user/dataiku/dss_managed_datasets
    • allow write, allow managed datasets: checked
    • Hive database name, max nb of activities: same as above
    • name: hdfs_managed
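Before the read-write connection can be used, the root directory should exist on HDFS and be writable by the account running DSS. A minimal sketch, assuming that account is named dataiku (an assumption; adjust user and paths to your installation):

```shell
# Create the managed-datasets root on HDFS
# (run as an HDFS superuser, e.g. the hdfs account)
hdfs dfs -mkdir -p /user/dataiku/dss_managed_datasets

# Hand ownership to the account that runs DSS
# ("dataiku" is an assumed user name)
hdfs dfs -chown -R dataiku /user/dataiku/dss_managed_datasets

# Verify ownership and permissions
hdfs dfs -ls /user/dataiku
```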

When “Hive database name” is configured, DSS declares its HDFS datasets in the Hive metastore, in this database namespace. This allows you to refer to DSS datasets in external Hive programs, or in Hive notebooks within DSS.