Google Cloud Storage

DSS can interact with Google Cloud Storage to:

  • Read and write datasets
  • Read and write managed folders

GCS is an object storage service: you create “containers” (called buckets in GCS) that can store arbitrary binary content and textual metadata under a specific key, unique in the container.

While not technically a hierarchical file system with folders, sub-folders and files, that behavior can be emulated by using keys containing /. For instance, you can store your daily logs using keys like 2015/01/24/app.log.
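As an illustration of this emulation, the following sketch lists the immediate "folders" and "files" under a prefix, given a flat list of keys. It is pure Python and does not call GCS; the function name list_children and the sample keys are assumptions made for this example.

```python
def list_children(keys, prefix=""):
    """Return (folders, files) found directly under `prefix`.

    Keys are plain strings; "/" is treated as the folder separator,
    which is how a flat key space can be browsed like a file system.
    """
    folders, files = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue
        head, sep, _ = rest.partition("/")
        if sep:
            # Something follows the first "/": `head` acts as a sub-folder.
            folders.add(head + "/")
        else:
            # No further "/": `head` is a file directly under the prefix.
            files.add(head)
    return sorted(folders), sorted(files)


# Hypothetical keys, following the daily-log layout described above.
keys = ["2015/01/24/app.log", "2015/01/25/app.log", "2015/readme.txt"]
print(list_children(keys, "2015/"))
```

With these sample keys, browsing "2015/" shows one sub-folder ("01/") and one file ("readme.txt"), even though the store itself only holds three flat keys.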

DSS uses the same filesystem-like mechanism when accessing GCS: when you specify a container, you can browse it to quickly find your data, or you can set the prefix in which DSS may output datasets. Datasets on GCS thus must be in one of the supported filesystem formats.

Note

In addition to the GCS naming guidelines, using GCS as a filesystem-like storage comes with a few limitations:
  • keys must not start with a /
  • “files” with names containing / are not supported
  • “folders” (prefixes) . and .. are not supported
  • like on a filesystem, a file and a folder with the same name are not supported: if a file some/key exists, it takes precedence over a some/key/ prefix / folder
  • multiple successive / are not supported
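The limitations above can be checked before uploading data. The sketch below is one possible way to do so in plain Python; the function names check_key and file_folder_conflicts are hypothetical, and this is not the validation that DSS itself performs.

```python
def check_key(key):
    """Return a list of problems that make `key` unsuitable for filesystem-like access."""
    problems = []
    if key.startswith("/"):
        problems.append("starts with /")
    if "//" in key:
        problems.append("multiple successive /")
    if any(segment in (".", "..") for segment in key.split("/")):
        problems.append("contains . or .. segment")
    return problems


def file_folder_conflicts(keys):
    """Return the keys that are used both as a file and as a folder prefix."""
    keyset = set(keys)
    return sorted(
        key for key in keyset
        if any(other.startswith(key + "/") for other in keyset)
    )


print(check_key("2015/01/24/app.log"))
print(check_key("/logs//app.log"))
print(file_folder_conflicts(["some/key", "some/key/part-0", "other"]))
```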

Create a GCS connection

Before connecting to Google Cloud via DSS you will have to:

  • Make sure “Google Cloud Storage” and “Google Cloud Storage JSON API” are enabled in Google Cloud console’s API Manager
  • Create at least one Storage bucket in your Google Cloud account
  • Create a service account and export your private key in JSON format.

Note

In order to let DSS create new datasets, your service account must be granted the “project editor” role. The “project viewer” role should be sufficient for a read-only connection.

(See the official documentation for more details)

To configure your connection you must specify:

  • your project ID (you can find it in your Google Cloud project list, next to your project name)
  • the entire content of your service account private key in JSON format.
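Before pasting the key into the connection settings, it can help to confirm that the text is a complete service account key. The sketch below parses the JSON and checks a few fields that Google service account key files contain ("type", "project_id", "private_key", "client_email"). The function name check_service_account_key is hypothetical, the set of checked fields is a choice made for this example, and the check says nothing about whether the key is still valid or has the right roles.

```python
import json

# Fields checked here; a real key file contains more than these.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(text):
    """Return a list of problems found in the JSON key text (empty if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not data.get(name)
    ]
    if data.get("type") not in (None, "service_account"):
        problems.append("type is not 'service_account'")
    return problems


# Placeholder values only; this is not a usable key.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "dss@my-project.iam.gserviceaccount.com",
})
print(check_service_account_key(sample))
```

A truncated paste (for example, a key copied without its closing brace) is reported as invalid JSON rather than failing later when the connection is tested.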

Creating GCS datasets

After creating your GCS connection in Administration, you can create GCS datasets.

From either the Flow or the datasets list, click on New dataset > GCS.

  • Select the connection in which your files are located
  • If available, select the container (either by listing or entering it)
  • Click on “Browse” to locate your files.

Connections path handling

The GCS connection can be either in “free selection” mode, or in “path restriction mode”.

In “free selection” mode, users can select the container in which they want to read, and the path within the container. If the credentials have the permission to list containers, a container selector will be available for users.

In “path restriction mode”, you choose a container, and optionally a path within the container. Users will only be able to read and write data within that “base container + path”.

To enable “path restriction mode”, write a container name (and optionally a path within the container) in the “Path restrictions” section of the connection settings.
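The effect of a path restriction can be pictured as a containment test on "container + path". The sketch below is an illustration only: the function name is_within_restriction is hypothetical, and the assumption that the restriction path is matched on whole "/"-separated segments is ours, not a documented DSS behavior.

```python
def is_within_restriction(container, path, base_container, base_path=""):
    """Return True if container/path lies inside base_container/base_path."""
    if container != base_container:
        return False
    base = base_path.strip("/")
    target = path.strip("/")
    if not base:
        # No path restriction: the whole container is allowed.
        return True
    # Assumption: match on whole segments, so "projects/a" does not
    # accidentally allow "projects/ab".
    return target == base or target.startswith(base + "/")


print(is_within_restriction("team-bucket", "projects/a/input.csv", "team-bucket", "projects/a"))
print(is_within_restriction("team-bucket", "projects/ab/input.csv", "team-bucket", "projects/a"))
```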

Location of managed datasets and folders

For a “free selection” connection

When you create a managed dataset or folder in a GCS connection, DSS will automatically create it within the “Default container” and the “Default path”.

Below that root path, the “naming rule” applies. See Making relocatable managed datasets for more information.

For a “path restriction” connection

When you create a managed dataset or folder in a GCS connection, DSS will automatically create it within the container and path selected in the “Path restrictions” section, and will append the “Default path” from the “managed datasets & folders” section.

Below that root path, the “naming rule” applies. See Making relocatable managed datasets for more information.
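The two cases above can be summarized by how the root location is assembled. The following sketch is a simplified model in plain Python: the function and parameter names are hypothetical, the handling of leading and trailing "/" is an assumption, and the "naming rule" that applies below the root path is deliberately not modeled.

```python
def join_path(*parts):
    """Join path fragments with single "/" separators, skipping empty ones."""
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "/".join(cleaned)


def managed_root(default_path, default_container=None,
                 restriction_container=None, restriction_path=""):
    """Return (container, root_path) for managed datasets and folders.

    "Path restriction" mode: restricted container, restricted path + default path.
    "Free selection" mode: default container, default path.
    """
    if restriction_container:
        return restriction_container, join_path(restriction_path, default_path)
    return default_container, join_path(default_path)


# Free selection: default container and default path are used as-is.
print(managed_root("dataiku", default_container="my-bucket"))

# Path restriction: the default path is appended to the restricted path.
print(managed_root("managed", restriction_container="team-bucket",
                   restriction_path="/projects/a/"))
```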