Google BigQuery

DSS can connect to Google BigQuery through a JDBC driver developed by Simba.

BigQuery uses wording that sometime clashes with DSS concepts. Here are a few hints:

  • a DSS Dataset corresponds to a BigQuery table

  • the concept of dataset in BigQuery is named schema in DSS

  • a DSS partition might not translate into a BigQuery table partition and vice-versa (see below for more explanations)

Supported

  • creating external datasets targeting BigQuery projects

  • reading data across all datasets and tables.

  • reading all BigQuery data types

Note

DATETIME and DATE are read and transformed into either TIMESTAMP or strings depending on the DSS dataset settings.

BYTES, TIME, GEOGRAPHY are read as strings.

NESTED and REPEATED fields are supported from DSS 9.0.4

  • SQL notebook for interactive querying: SELECT and DDL (CREATE, DROP, …)

  • SQL query datasets

  • SQL query recipes, with inputs and outputs in BigQuery

  • sync query recipes, with output in BigQuery and input in either Google Cloud Storage or BigQuery.

  • all visual recipes (Group, Join, VStack, Window, Filter executed in BigQuery), with inputs and outputs in BigQuery

  • python code recipes with inputs and outputs in BigQuery if you’re using SQLExecutor2 to generate the results.

  • charts with DSS and In-Database engine modes.

  • BigQuery native table partitioning by Date or by Range, and clustering.

Note

If your data are not stored in BigQuery, you first need to upload them into a Google Cloud Storage (GCS) bucket. Once they are into GCS, you can use a Sync recipe to transfer them into BigQuery.

Not supported

  • writing BigQuery BYTES, DATETIME, DATE, TIME, GEOGRAPHY data types

  • SQL script recipes

  • code recipes (except Python with SQLExecutor2)

  • creation of BigQuery tables partitioned by ingestion time

  • BigQuery wildcard tables

Installing the JDBC driver

Warning

We recommend that you use the latest version of the driver (1.2.16.1020 or above).

The JDBC driver is made of many JAR files, all of which are contained in the Zip file.

  • in your DSS data directory, create a subdirectory of lib/jdbc folder named bigquery. If you are updating the driver from an old version and the directory already exist, remove all files inside the bigquery folder

  • copy all the JAR files to the lib/jdbc/bigquery folder

  • if the file slf4j-api-1.7.30.jar (or a later version) is present in lib/jdbc/bigquery, remove the file

Note

If you are updating your BigQuery driver to a newer version and the old version had been copied into lib/jdbc, carrefully remove all the files of the old BigQuery driver in lib/jdbc, and then copy the files of the new driver into lib/jdbc/bigquery (except slf4j-api-1.7.30.jar).

Connecting to BigQuery

DSS supports connecting to BigQuery using a Service Account or OAuth2.

With service account credentials, DSS will be able to access all resources associated with this service account, independently of the user initiating the connection. This also means that in the GCP audit logs, you won’t have a tracing of the user behind this connection.

OAuth2 connection access means DSS will use the OAuth2 protocol to access the resources in GCP. DSS will be registered as an OAuth2 client, authorized to request and gain access on behalf of your DSS users.

Use a service account if:

  • your DSS users don’t have direct access to the resources in GCP

  • you don’t need resources access filtering per user

Use OAuth2 if:

  • your DSS users got access to your GCP project and particularly to BigQuery

  • you don’t want your users to access resources via DSS in BigQuery which they don’t have permission for

  • you want an audit in GCP of your users accesses

Using Service Account

  • you first need to create a Google Service Account

  • create a private key for this account, and download the corresponding JSON file

  • upload the JSON file somewhere on the DSS server

In the connection settings in DSS:

  • select the credential type Private key

  • in the Secret key field, enter the absolute path (on the DSS server) to the credentials JSON file

Alternatively, you can directly enter the content of the JSON file in the Secret key field to avoid storing the file on the server. Keep in mind that in this latter case, any DSS administrator will be able to see the content of this private file.

Using OAuth2

The OAuth2 connection is performed using per-user credentials. Each user must grant DSS permission to access BigQuery on their behalf. You will need to create an OAuth2 client in your GCP project and configure the credentials in your DSS BigQuery connection.

To create an OAuth 2.0 client ID in the console, please refer to the following documentation . When creating your OAuth2 client in google, you will need to:

  • Select the application type Web application

  • Add the following redirect URI DSS_BASE_URL/dip/api/oauth2-callback

Note

For example if DSS is accessed at https://dss.mycompany.corp/, the OAuth2 redirect URL is https://dss.mycompany.corp/dip/api/oauth2-callback

Once created, configure DSS to use this OAuth2 client. Do in DSS the following:

  • create a new BigQuery connection

  • fill in the basic params as usual

  • select “OAuth” as the “credentials”. Note that this will force you to use per-user credential

  • fill the “Client id”, “Client secret” (if there is one) with the information from your OAuth app

  • create the connection

Note

At this point, although the connection is operational, you can’t test it yet as your user hasn’t authorized DSS to access BigQuery on their behalf.

Each user, including you, will need to follow these steps to allow DSS to access GCP on their behalf:

  • go to user profile > credentials

  • the user will see that no authorization was given yet to DSS for this connection

  • click the “Edit” button next to the new connection name

  • follow the instructions that appear: Google will authenticate and get the user consent to authorize DSS to access BigQuery

  • the user will be redirected automatically to DSS and will notice that credentials have successfully been obtained for the connection

If you did these steps with a user allowed to modify the connection, like an admin user, you should now be able to test the connection:

  • go back to your connection settings

  • click on the Test button which should be successful

Setup JDBC

In the connection settings, you must also specify the location of the JDBC driver jar: Enter lib/jdbc/bigquery in the Driver jars directory field.

Finally we recommend that you add an Advanced JDBC property:

  • Key: Timeout

  • Value: 180

If you need to connect to BigQuery via a Proxy, you must add the following advanced JDBC properties (and also configure the global proxy for DSS in the “Settings” tab of the Administration page):

  • Key: ProxyHost

  • Value: The IP address or host name of your proxy server.

  • Key: ProxyPort

  • Value: The listening port of your proxy server.

  • Key: ProxyUser

  • Value: The user name, if needed, for proxy server settings.

  • Key: ProxyPassword

  • Value: The password, if needed, for proxy server settings.

BigQuery native partitioning and clustering

By default, DSS does not create BigQuery tables that are partitioned at the database level, even when creating a DSS partitioned dataset.

To create a BigQuery table that is natively partitioned:

  • go to the settings for your Dataset and open the Advanced tab

  • check the “Create BigQuery partitioned table” checkbox and indicate the column to use to partition the table. This column must be of type DATE, INT, TINYINT, SMALLINT or BIGINT (in the DSS semantic)

BigQuery partitioned tables can also be clustered by up to 4 clustering columns of STRING, DATE, BOOLEAN, INT, TINYINT, SMALLINT or BIGINT (in DSS semantic).

Warning

You must manually configure the BigQuery native partioning and clustering for each and every DSS dataset. DSS does not automatically propagate these settings when creating new datasets.

Note

You can create a BigQuery partitioned table even for a DSS dataset that is not partitioned (in DSS semantic).

BigQuery experimental built-in driver

Starting with DSS 9.0.4, you can connect to BigQuery using a built-in experimental driver rather than the JDBC driver provided by Google and developed by Simba.

The benefits from using the built-in experimental driver are the following:

  • dataset preview and explore (when there is no filter or selected partition) are done using tabledata.list which does not incur any billing from Google

  • in SQL notebooks, you can preview the cost of a query before executing it (using the Explain Plan button)

  • in SQL notebooks, you can view the cost of a query after its execution

  • in Job logs, you can view all statistics about the executed queries (including their costs and execution plans)

  • some complex NESTED fields are not read correctly using the Simba driver. The built-in experimental driver is able to read these correctly

The limitations of the built-in experimental driver are the following:

  • streaming large datasets out of BigQuery is slower than with the Simba driver

  • the built-in driver waits for queries completion without any configurable time out