DSS and Hive

Hive is a tool of the Hadoop ecosystem that lets you run SQL queries on large amounts of HDFS data by leveraging the computation capabilities of Hadoop MapReduce. It can be used either as a semi-interactive SQL query interface to obtain query results, or as a batch tool to compute new datasets.

Hive maps datasets to virtual SQL tables.

Data Science Studio provides the following integration points with Hive:

  • The Hive Recipe allows you to compute HDFS datasets as the results of Hive scripts (illustrated below)
  • All HDFS datasets can be made available in the Hive environment, where they can be used by any Hive-capable tool, even if these datasets were not computed using a Hive recipe
  • The “Hive notebook” allows you to run Hive queries on any Hive database, whether or not it was created by DSS
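
As an illustration, the body of a Hive recipe is essentially a HiveQL statement. The following is a minimal sketch of the kind of transformation such a recipe expresses; the input and output tables "weblogs" and "daily_visits" are hypothetical names, not names created by DSS:

-- Minimal HiveQL sketch; "weblogs" and "daily_visits" are hypothetical
-- dataset/table names used only for illustration.
INSERT OVERWRITE TABLE daily_visits
SELECT
  to_date(visit_time) AS visit_day,
  count(*)            AS nb_visits
FROM weblogs
GROUP BY to_date(visit_time);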

Metastore synchronization

Each HDFS connection declared in DSS references a single Hive database. When a managed dataset is built, DSS automatically “synchronizes” its definition in the Hive metastore for this database.

This means that as soon as a compatible HDFS dataset has been built, you can query it with the Hive notebook or any Hive query tool (such as Cloudera Hue).
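
For example, assuming an HDFS connection whose Hive database is named "dss" and a dataset exposed as a table named "mydataset" (both names are hypothetical), a query like the following can be run from the Hive notebook or from Hue once the dataset has been built:

-- "dss" and "mydataset" are hypothetical names; the actual database comes
-- from the HDFS connection settings and the table name from the dataset.
USE dss;
SELECT COUNT(*) FROM mydataset;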

Note

Whenever possible, metastore synchronization also makes the dataset usable by Impala, i.e. you can use the Impala notebook or perform data visualization.

For more details, please see DSS and Impala.

If the schema of the DSS dataset has changed, DSS automatically updates it in the Hive metastore.

Metastore synchronization happens as part of the normal job run, but you can also force it manually by following the procedure outlined below.

For external datasets

Only managed datasets are automatically synchronized to the Hive metastore. However, you can also manually synchronize an external HDFS dataset:

  • Make sure that the HDFS connection referenced by the external HDFS dataset defines a target Hive database
  • Go to the settings of the dataset
  • Go to the Advanced tab
  • Click on the “Synchronize” button

Supported formats

Hive only recognizes some formats, so not all HDFS datasets can be synchronized to Hive or used in a Hive recipe.

The following formats are handled:

  • CSV, only in “Escaping only” or “No escaping nor quoting” modes
  • Parquet. If the dataset has been built by DSS, it should use the “Hive flavor” option of the Parquet parameters.
  • Hive Sequence File
  • Hive RC File
  • Hive ORC File
  • Avro
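
If you are unsure how a synchronized dataset is stored, any Hive client can show the storage format of the corresponding table. A sketch, using a hypothetical table name:

-- Prints, among other things, the InputFormat, OutputFormat and SerDe of the
-- table backing the dataset ("mydataset" is a hypothetical name).
DESCRIBE FORMATTED mydataset;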

Limitations

  • Hadoop does not support CSV files with newlines embedded within fields at all. Trying to parse such files with Hadoop or any Hadoop tool (like Hive) will fail and generate invalid data.

Version specific issues

Hive 0.10 (CDH4 Version)

  • Parquet is not supported

Hive 0.12 (CDH5 Version)

  • Dates are not supported in Parquet tables

Internal details

Data Science Studio creates all tables as EXTERNAL tables, in the Hive sense of the term: dropping such a table only removes its definition from the metastore, not the dataset’s files on HDFS.
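
Conceptually, the definitions pushed to the metastore are of the following form; the column names, storage format and location below are purely illustrative, not the exact statements DSS generates:

-- Illustrative EXTERNAL table definition; names, columns, format and
-- LOCATION are hypothetical. Dropping this table leaves the files in place.
CREATE EXTERNAL TABLE mydataset (
  id BIGINT,
  label STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/dss/datasets/MYPROJECT/mydataset';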

Importing a Hive database into DSS

In addition to the ability to declare all datasets in the Hive metastore, DSS can also read table definitions from the metastore to create the associated datasets in DSS.

This is done using the ./bin/dku hive-import tool (run from the DATA_DIR of your DSS instance).

The tool will report which of the Hive tables it managed to import. The following limitations apply:

  • it doesn’t detect the compression settings, notably on files in the Parquet format. As a result, the output compression is not preserved.
  • for partitioned tables, it tries to detect the partitioning scheme, and will import those tables whose partitioning scheme can be handled by DSS. This excludes notably tables where the partition locations can’t all be translated into a concatenation of the partitioning columns’ values.
  • the table definitions are imported ‘as is’ and the user’s HDFS rights on the table’s files are not checked, so an imported table cannot necessarily be read from or written to in DSS.

Usage examples:

# Import all tables from the "mysrcdb" Hive database into the "INPUTDATA" DSS project.
# Note that the DSS project must already exist
./bin/dku hive-import -a mysrcdb INPUTDATA
# List all tables from the "mysrcdb" Hive database
./bin/dku hive-import -l mysrcdb
# Import tables "t1" and "t2" from the "mysrcdb" Hive database into the "INPUTDATA" DSS project.
# Note that the DSS project must already exist
./bin/dku hive-import mysrcdb INPUTDATA t1 t2

Note

The names of the created datasets default to the Hive table names. In case of a conflict, DSS adds a distinctive suffix to the dataset name.