DSS and Hive¶
Hive is a tool of the Hadoop environment that allows running SQL queries on top of large amounts of HDFS data by leveraging the computation capabilities of Hadoop MapReduce. It can be used either as a semi-interactive SQL query interface to obtain query results, or as a batch tool to compute new datasets.
Hive maps datasets to virtual SQL tables.
Data Science Studio provides the following integration points with Hive :
- The Hive Recipe allows you to compute HDFS datasets as the results of Hive scripts
- All HDFS datasets can be made available in the Hive environment, where they can be used by any Hive-capable tool, even if these datasets were not computed using a Hive recipe
- The “Hive notebook” allows you to run Hive queries on any Hive database, whether they have been created by DSS or not
Each HDFS connection declared in DSS references a single Hive database. When a managed dataset is built, DSS automatically “synchronizes” its definition in the Hive metastore for this database.
This means that as soon as a compatible HDFS dataset has been built, you can use the Hive notebook or any Hive query tool (like Cloudera Hue)
Whenever possible, metastore synchronization also ensures that the dataset is used by Impala, ie. you can use the Impala Notebook, or perform data visualization.
For more details, please see DSS and Impala
If the schema of the DSS dataset has changed, DSS automatically updates it in the Hive metastore.
Metastore synchronization happens as part of the normal job run, but you can also force it manually by following the procedure outlined below.
Only managed datasets are automatically synchronized to the Hive metastore. However, you can also manually synchronize an external HDFS dataset.
- Make sure that the HDFS connection referenced by the external HDFS dataset defines a target Hive database
- Go to the settings of the dataset
- Go to the Advanced tab
- Click on the “Synchronize” button
Hive only recognizes some formats, so not all HDFS datasets can be synchronized to Hive or used in a Hive recipe.
The following formats are handled:
- CSV, only in “Escaping only” or “No escaping nor quoting” modes
- Parquet. If the dataset has been built by DSS, it should use the “Hive flavor” option of the Parquet parameters.
- Hive Sequence File
- Hive RC File
- Hive ORC File
Data Science Studio creates all tables as EXTERNAL tables in the Hive meaning of the term.
In addition to the ability to declare all datasets into the Hive Metastore, DSS can also read the table definitions from the metastore to create associated datasets in DSS.
This is done using the
./bin/dku hive-import tool (run from the DATA_DIR of your DSS instance)
# Import all tables from the "mysrcdb" Hive database into the "INPUTDATA" DSS project. # Note that the DSS project must already exist ./bin/dku hive-import -a mysrcdb INPUTDATA
# List all tables from the "mysrcdb" Hive database ./bin/dku hive-import -l mysrcdb
# Import tables "t1" and "t2" the "mysrcdb" Hive database into the "INPUTDATA" DSS project. # Note that the DSS project must already exist ./bin/dku hive-import mysrcdb INPUTDATA t1 t2
The name of the created datasets default to the Hive table name. In case of conflict, DSS adds a distinctive suffix to the dataset name.