DSS and Hive
- Interaction with the Hive global metastore
- Synchronisation to the Hive metastore
- Importing from the Hive metastore
- Hive execution engines
- Support for Hive authentication modes
- Support for Hive authorization modes
- No Hive security (No DSS User Isolation)
- Sentry (No DSS User Isolation)
- Sentry (DSS User Isolation enabled)
- Ranger (No DSS User Isolation)
- Ranger (DSS User Isolation enabled)
- Storage-based security (No DSS User Isolation)
- Cloudera-specific note
- Supported file formats
- Internal details
Hive is a tool of the Hadoop environment that allows running SQL queries on top of large amounts of HDFS data by leveraging the computation capabilities of the cluster. It can be used either as a semi-interactive SQL query interface to obtain query results, or as a batch tool to compute new datasets.
Hive maps datasets to virtual SQL tables.
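As an example of the first use case, a semi-interactive query over a (hypothetical) orders table whose files live on HDFS could look like the sketch below; Hive compiles it into distributed jobs running on the cluster:

```sql
-- Illustrative HiveQL: aggregate a large HDFS-backed table;
-- the computation is distributed across the cluster by Hive.
SELECT country, COUNT(*) AS nb_orders, SUM(amount) AS total_amount
FROM orders
GROUP BY country
ORDER BY total_amount DESC;
```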
DSS provides the following integration points with Hive:
- The Hive Recipe allows you to compute HDFS datasets as the results of Hive scripts
- All HDFS datasets can be made available in the Hive environment, where they can be used by any Hive-capable tool, even if these datasets were not computed using a Hive recipe. This is called “Hive metastore synchronization”
- The “Hive notebook” allows you to run Hive queries on any Hive database, whether they have been created by DSS or not
- DSS can import table definitions from Hive, and convert them to DSS HDFS dataset definitions
HDFS datasets in DSS are always true “HDFS datasets”. They are primarily a path on HDFS and may have an associated Hive table. DSS does not have “Hive-only datasets”, and accessing Hive tables as SQL datasets using the “Other SQL databases” option is not supported.
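As an illustration of this relationship, the Hive table associated with an HDFS dataset is conceptually an external table pointing at the dataset’s folder. The names and path below are hypothetical, and DSS generates such definitions itself during metastore synchronization:

```sql
-- Conceptual sketch: the HDFS folder (LOCATION) is the primary object;
-- the Hive table is a definition layered on top of it by DSS.
CREATE EXTERNAL TABLE dss_db.my_dataset (
  id BIGINT,
  label STRING
)
STORED AS PARQUET
LOCATION '/user/dss/datasets/MYPROJECT/my_dataset';
```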
Interaction with the Hive global metastore
The global metastore is the metastore that is used when the “hive” command is launched without arguments. DSS interacts with this global metastore and can:
- Create tables for HDFS datasets in the global Hive metastore
- Import table definitions from the global Hive metastore as HDFS datasets
The tables that DSS creates are defined in the database namespace configured in the corresponding HDFS connection.
It is strongly recommended that your Hadoop cluster uses the “Shared metastore” mode for the global metastore.
This is the default behavior for Cloudera and Hortonworks Hadoop distributions.
Synchronisation to the Hive metastore
HDFS datasets in DSS are primarily what their name implies: HDFS datasets. In other words, an HDFS dataset in DSS is a reference to a folder on HDFS, not directly a reference to a Hive table.
However, each HDFS dataset in DSS can point to a Hive table. When a managed dataset is built, DSS automatically “pushes” its definition as the corresponding Hive table in the Hive metastore.
This means that as soon as a compatible HDFS dataset has been built, you can query it from the Hive notebook or from any Hive query tool (like Cloudera Hue).
Whenever possible, metastore synchronization also ensures that the dataset is usable by Impala, i.e. you can use the Impala notebook, perform data visualization, or query the dataset with any Impala query tool (like impala-shell).
For more details, please see DSS and Impala.
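For instance, once such a dataset has been built and synchronized, it can be queried identically from the Hive notebook, Hue, or impala-shell. The database and table names below are hypothetical; the real ones come from the HDFS connection and dataset settings:

```sql
-- Works from any Hive- or Impala-capable client once synchronization has run
SELECT COUNT(*) FROM dss_managed.myproject_mydataset;
```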
Metastore synchronization normally happens as part of the job run, after the dataset is built, but you can also force it manually by following the procedure outlined below.
If the schema of the DSS dataset has changed, DSS automatically updates it in the Hive metastore.
The Hive database and table associated with each dataset are configured in the settings of this dataset.
Only managed datasets are automatically synchronized to the Hive metastore. However, you can also manually synchronize an external HDFS dataset:
- Go to the settings of the HDFS dataset
- Fill in the Hive database and table information in the dataset
- Save the dataset settings
- Go to the Advanced tab
- Click on the “Synchronize” button
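To verify the result of a synchronization, you can inspect the metastore entry from any Hive client. The database and table names below are hypothetical:

```sql
-- List the tables present in the database DSS synchronizes to
SHOW TABLES IN dss_managed;

-- Show the schema, storage format and HDFS location recorded for one table
DESCRIBE FORMATTED dss_managed.my_external_dataset;
```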
Importing from the Hive metastore
In addition to the ability to “push” dataset definitions into the Hive metastore, DSS can also read preexisting table definitions from the metastore to create the corresponding HDFS datasets in DSS.
To import Hive tables as HDFS datasets:
- Go to the datasets list
- Click “New dataset”, then “Import from connection”
- In the list, select your Hive database
The import screen lists all tables in the Hive database. If a dataset already corresponds to a table, you get a link to the existing dataset instead.
Select the tables that you want to import. If needed, customize the resulting dataset name, then click “Create”.
The tool will report which of the Hive tables it managed to import.
The following limitations apply:
- Existing compression settings are not detected, notably on files in the Parquet format. As a result, the original compression is not preserved if you plan on using this dataset in write mode.
- For partitioned tables, DSS tries to detect the partitioning scheme and only imports tables whose partitioning scheme it can handle. This notably excludes tables whose partition locations cannot all be translated into a concatenation of the partitioning columns’ values.
- Table definitions are imported “as is”, and the user’s HDFS rights on the table’s files are not checked, so an imported table may not necessarily be readable or writable from DSS.
The names of the created datasets default to the Hive table names. In case of conflict, DSS adds a distinctive suffix to the dataset name.
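When a partitioned table is skipped because of the limitation above, inspecting its partition layout from Hive can help diagnose the problem. The names below are hypothetical; a layout where each partition location is the usual key=value directory concatenation is the kind DSS can typically map:

```sql
-- Partitions whose locations follow .../dt=2024-01-01/country=fr can typically
-- be mapped; partitions pointing at arbitrary, unrelated locations cannot.
SHOW PARTITIONS example_db.partitioned_table;
DESCRIBE FORMATTED example_db.partitioned_table PARTITION (dt='2024-01-01', country='fr');
```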
Hive execution engines
Hive notebooks and metrics computations are always executed using HiveServer2 (and therefore using the global metastore).
If you encounter “table not found” issues, check that the datasets you are trying to reach have properly been synchronized to the Hive metastore.
There are three ways to run Hive recipes in DSS:
- HiveServer2: in this mode, recipes use HiveServer2. DSS automatically synchronizes the recipe’s inputs and outputs to the global metastore when running such a recipe.
- Hive CLI (global metastore): in this mode, DSS uses the hive command-line tool, targeting the global metastore. DSS automatically synchronizes the recipe’s inputs and outputs to the global metastore when running such a recipe.
- Hive CLI (isolated metastore): in this mode, DSS uses the hive command-line tool, but creates a specific metastore for running each recipe. This mode ensures that your query only uses the proper input and output datasets, since only these are added to the isolated metastore.
When DSS User Isolation is enabled, only the HiveServer2 mode is supported.
In some setups, running the Hive CLI is not possible. For these setups, only the HiveServer2 mode is possible.
The “Hive CLI (isolated metastore)” mode has interesting safety advantages: because the isolated metastore only contains the requested datasets and partitions, you cannot accidentally access data which is not properly declared in your Flow, which improves reproducibility.
However, the isolated metastore does not contain dataset statistics. When Hive runs on Tez, these statistics are used to compute an optimal execution plan, and their absence can lead to worse performance. In that case, we recommend using the “Hive CLI (global metastore)” or HiveServer2 modes.
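If you rely on stats-based planning, statistics can be recomputed against the global metastore with standard HiveQL; this is plain Hive functionality, not a DSS feature, and the table name below is hypothetical:

```sql
-- Recompute the basic table statistics the Tez planner relies on
ANALYZE TABLE dss_managed.my_dataset COMPUTE STATISTICS;
-- Optionally compute column-level statistics as well
ANALYZE TABLE dss_managed.my_dataset COMPUTE STATISTICS FOR COLUMNS;
```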
In addition, depending on the Hive authorization mode, only some recipe modes might be possible. Check below for more information.
The execution mode can be configured in each Hive recipe (and also in visual recipes running with the Hive engine), in the “Advanced” tab.
In addition, in Administration > Settings > Hadoop, you can configure the “Default execution engine”, which selects the initial value for newly created recipes. This global setting has no impact on existing recipes.
Support for Hive authentication modes
DSS supports the following authentication modes for HiveServer2:
- PLAIN authentication for non-secure Hadoop clusters
- KERBEROS authentication for secure Hadoop clusters
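For reference, these two modes correspond to the standard HiveServer2 JDBC URL forms shown below. Hostname, port, and realm are placeholders; the actual values come from your cluster configuration:

```
# PLAIN authentication (non-secure cluster)
jdbc:hive2://hiveserver2.example.com:10000/default

# KERBEROS authentication (secure cluster): the URL carries the
# Kerberos principal of the HiveServer2 service
jdbc:hive2://hiveserver2.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM
```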
Supported file formats
Hive only recognizes some formats, so not all HDFS datasets can be synchronized to Hive or used in a Hive recipe.
The following formats are handled:
- CSV, only in “Escaping only” or “No escaping nor quoting” modes
- Parquet. If the dataset has been built by DSS, it should use the “Hive flavor” option of the Parquet parameters.
- Hive Sequence File
- Hive RC File
- Hive ORC File
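For reference, these map to standard Hive storage declarations. The sketch below is illustrative (names and paths are hypothetical, and DSS writes such definitions itself during synchronization); the correspondence of the CSV “Escaping only” mode to a delimited text table with an escape character is an assumption:

```sql
-- CSV in “Escaping only” mode plausibly maps to a delimited text table (assumption)
CREATE EXTERNAL TABLE example_db.csv_dataset (id BIGINT, label STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE
LOCATION '/user/dss/datasets/csv_dataset';

-- Sequence File, RC File and ORC File use the corresponding STORED AS clause
-- (SEQUENCEFILE, RCFILE, ORC); Parquet uses STORED AS PARQUET
CREATE EXTERNAL TABLE example_db.orc_dataset (id BIGINT, label STRING)
STORED AS ORC
LOCATION '/user/dss/datasets/orc_dataset';
```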