SQL queries through DSS

The Python API makes it possible to reuse the setup work done in DSS to connect to various databases and execute queries on them. For databases supporting commit, the transaction in which the queries are executed is rolled back, so this API is meant for SELECT queries.

Running a query against DSS is a 3-step process:

  • create the query
  • run it and fetch the data
  • verify that the streaming of the results wasn’t interrupted

The verification makes DSS release the resources held for the query’s execution, so the verify() call has to be made once the results have been fully streamed.

An example of a SQL query on a connection configured in DSS is:

client = DSSClient(host, apiKey)
streamed_query = client.sql_query('select * from train_set', connection='local_postgres', type='sql')
row_count = 0
for row in streamed_query.iter_rows():
    row_count = row_count + 1
streamed_query.verify() # raises an exception in case something went wrong
print('the query returned %i rows' % row_count)
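Since verify() must be called even when iteration fails partway through, it can be convenient to wrap the streaming loop in try/finally. The sketch below illustrates that pattern with a hypothetical FakeStreamedQuery stand-in (not part of the DSS API) in place of the object returned by client.sql_query, which exposes the same iter_rows() and verify() methods:

```python
class FakeStreamedQuery:
    """Stand-in for the object returned by client.sql_query (illustration only)."""

    def __init__(self, rows, fail_after=None):
        self._rows = rows
        self._fail_after = fail_after  # simulate an interrupted stream
        self._streamed_ok = False

    def iter_rows(self):
        for i, row in enumerate(self._rows):
            if self._fail_after is not None and i >= self._fail_after:
                raise IOError("stream interrupted")
            yield row
        self._streamed_ok = True

    def verify(self):
        # The real verify() raises if the result streaming was interrupted
        if not self._streamed_ok:
            raise RuntimeError("query streaming was interrupted")


def count_rows(streamed_query):
    """Stream all rows, always calling verify() so DSS can release resources."""
    row_count = 0
    try:
        for _ in streamed_query.iter_rows():
            row_count += 1
    finally:
        streamed_query.verify()
    return row_count
```

With this pattern, verify() runs whether the stream completed or was cut short; in the latter case it raises, signaling that the row count cannot be trusted.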

Queries against Hive and Impala are also possible. In that case, the type must be set to ‘hive’ or ‘impala’ accordingly, and instead of a connection it is possible to pass a database name:

client = DSSClient(host, apiKey)
streamed_query = client.sql_query('select * from train_set', database='test_area', type='impala')
...

In order to run queries before or after the main query, but still in the same session (for example, to set session variables), the API provides two parameters, pre_queries and post_queries, each taking a list of queries:

client = DSSClient(host, apiKey)
streamed_query = client.sql_query('select * from train_set', database='test_area', type='hive', pre_queries=['set hive.execution.engine=tez'])
...