Hive datasets

Most of the time, to read and write data in the Hadoop ecosystem, DSS handles HDFS datasets, that is, file-oriented datasets pointing to files residing on one or several HDFS-like filesystems.

DSS can also handle Hive datasets. Hive datasets are pointers to Hive tables already defined in the Hive metastore.

  • Hive datasets can only be used for reading, not for writing

  • To read data from Hive datasets, DSS uses HiveServer2 (through a JDBC connection). In essence, a Hive dataset is a SQL-like dataset (see the sketch below)
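
For example, reading a Hive dataset from a Python recipe or notebook looks the same as reading any other DSS dataset; DSS issues the read through HiveServer2 behind the scenes. A minimal sketch, assuming a Hive dataset named “my_hive_dataset” exists in the Flow (the name is hypothetical):

    import dataiku

    # Read the Hive dataset (hypothetical name) into a pandas DataFrame.
    # DSS fetches the rows through HiveServer2 over JDBC, like a SQL dataset.
    dataset = dataiku.Dataset("my_hive_dataset")
    df = dataset.get_dataframe()
    print(df.head())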

Use cases

The HDFS dataset remains the “go-to” dataset for interacting with Hadoop-hosted data. It provides the most features and the greatest ability to parallelize work and execute it on the cluster.

However, in some cases the HDFS dataset cannot properly read existing source data. In those cases, using a Hive dataset as the source of your Flow allows you to read your data. Since Hive datasets are read-only, only the sources of the Flow use Hive datasets; subsequent parts of the Flow revert to regular HDFS datasets.

Hive views

If you have existing data which is available through a Hive view, there are no HDFS files materializing this particular data. In that case, you cannot use a HDFS dataset and should use a Hive dataset.

No read access on source files

Through the Hive security mechanisms (Sentry and Ranger), it is possible to have existing tables in the Hive metastore with read access to these tables through HiveServer2, but no read access to the underlying HDFS files.

In that case, you cannot use a HDFS dataset and should use a Hive dataset.

ACID tables (ORC)

You can create ACID tables in Hive (in the ORC format). These tables support UPDATE statements, which regular Hive tables do not. They are stored in a very specific format that only HiveServer2 can read; DSS cannot properly read the underlying files of these tables.

In that case, you cannot use a HDFS dataset and should use a Hive dataset.
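
If you need ad-hoc queries against such a table from Python code, the query also has to go through HiveServer2 rather than the files. A minimal sketch, assuming a Hive dataset named “my_hive_dataset” pointing to the ACID table (the name is hypothetical) and assuming the HiveExecutor helper of the dataiku package is available in your DSS version:

    import dataiku
    from dataiku import HiveExecutor

    # Query the ACID table through HiveServer2; reading the files directly
    # would fail because the transactional ORC layout is HiveServer2-only.
    dataset = dataiku.Dataset("my_hive_dataset")
    executor = HiveExecutor(dataset=dataset)
    df = executor.query_to_df("SELECT * FROM my_table LIMIT 1000")  # hypothetical table name
    print(df.head())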

DATE and DECIMAL data types

There are various difficulties in reading tables containing these kinds of columns. It is recommended to use Hive datasets when reading such tables.

Creating a Hive dataset

You do not need to set up a connection to create a Hive dataset. As soon as connectivity with Hadoop (and your HiveServer2) is established, you can create Hive datasets.

New dataset

  • Select New Dataset > Hive

  • Select the database and the table

  • Click on test to retrieve the schema

  • Your Hive dataset is ready to use
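
If you prefer to script this, the DSS public API Python client (dataikuapi) can create datasets programmatically with the project’s create_dataset method. The sketch below is only indicative: the URL, API key and project key are placeholders, and the Hive dataset type identifier and params keys are assumptions, so check the settings of a Hive dataset created through the UI for the exact values in your DSS version:

    import dataikuapi

    # Connect to DSS (placeholder URL and API key).
    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")  # hypothetical project key

    # The type string and params keys below are assumptions; verify them
    # against a Hive dataset created through the UI.
    dataset = project.create_dataset(
        "my_hive_dataset",
        type="hiveserver2",
        params={"database": "mydb", "table": "orders"}
    )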

Import

From either the catalog or the connections explorer, when selecting an existing Hive table, you will have the option to import it as a HDFS dataset or as a Hive dataset.

Using a Hive dataset

A Hive dataset can be used with most kinds of DSS recipes.

Hive recipes

You can create Hive recipes with Hive datasets as inputs.

Note

The recipe MUST be in “Hive CLI (global metastore)” or HiveServer2 mode for this to work. Please see Hive for more information.

Visual recipes with Hive as execution engine

You can create visual recipes and select the Hive execution engine (when available) with Hive datasets as inputs.

Note

The recipe MUST be in “Hive CLI (global metastore)” or HiveServer2 mode for this to work. Please see Hive for more information.

Spark recipes

You can create Spark (code) recipes with Hive datasets as inputs.

Note

The recipe MUST be in “Use global metastore” mode for this to work.

Note

You must have filesystem-level access to the underlying files of this Hive table for this to work.
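
For example, a PySpark recipe taking a Hive dataset as input can load it as a Spark DataFrame with the usual DSS Spark helpers. A minimal sketch, assuming an input Hive dataset named “my_hive_dataset” (hypothetical name) and, as noted above, filesystem access to the underlying files:

    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Load the Hive dataset (hypothetical name) as a Spark DataFrame.
    # The recipe must run in "Use global metastore" mode and the underlying
    # files must be readable by the user running the Spark job.
    input_dataset = dataiku.Dataset("my_hive_dataset")
    df = dkuspark.get_dataframe(sqlContext, input_dataset)
    df.printSchema()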

Visual recipes with Spark as execution engine

You can create visual recipes and select the Spark execution engine (when available) with Hive datasets as inputs.

Note

The recipe MUST be in “Use global metastore” mode for this to work.

Note

You must have filesystem-level access to the underlying files of this Hive table for this to work.

Limitations

  • SQL recipes cannot be used. Use a Hive recipe instead.

  • The Spark engine (and Spark recipes) cannot be used if you don’t have filesystem access to the files underlying the tables.