Amazon S3

DSS can interact with Amazon Web Services’ Simple Storage Service (AWS S3) to:

  • Read and write datasets
  • Read and write managed folders

S3 is an object storage service: you create “buckets” that can store arbitrary binary content and textual metadata under a specific key, unique in the container.

While not technically a hierarchical file system with folders, sub-folders and files, that behavior can be emulated by using keys containing /. For instance, you can store your daily logs using keys like 2015/01/24/app.log. S3 lets you list all objects with a specific prefix, say 2015/01/ and many S3 clients (including the AWS Web Console) can “browse” your buckets this way.

DSS uses the same filesystem-like mechanism when accessing S3: when you specify a bucket, you can browse it to quickly find your dataset, or you can set the prefix in which DSS may output datasets. Datasets on S3 thus must be in one of the supported filesystem formats.

Note

Using S3 as a filesystem comes with a few limitations:
  • keys must not start with a /
  • “files” with names containing / are not supported
  • “folders” (prefixes) . and .. are not supported
  • like on a filesystem, a file and a folder with the same name are not supported: if a file some/key exists, it takes precedence over a some/key/ prefix / folder

Create a EC2/S3 connection

Note

In DSS, connections that can store S3 data are called “EC2 connections”

To create a S3 connection, you need to have a AWS keypair (access key and secret key). Note that if your DSS server is running on AWS itself, DSS can automatically retrieve credentials from the environment. In that case, you can check the “Use default credentials” box

Required EC2 permissions

The access you specify in an AWS connection must have the following permissions:

  • To read data from a bucket, it must at least have listing and reading permissions on that bucket:

    s3:ListBucket arn:aws:s3:::examplebucket
    s3:GetObject arn:aws:s3:::examplebucket/*
    
  • To write data to a bucket, it must also have writing, deletion and multipart-upload-aborting permissions on that bucket:

    s3:ListBucket arn:aws:s3:::examplebucket
    s3:GetObject arn:aws:s3:::examplebucket/*
    s3:PutObject arn:aws:s3:::examplebucket/*
    s3:DeleteObject arn:aws:s3:::examplebucket/*
    s3:AbortMultipartUpload arn:aws:s3:::examplebucket/*
    
  • Not mandatory, but bucket listing permission allows you to pick a bucket from a dropdown list instead of manually typing its name when creating a dataset:

    s3:ListAllMyBuckets arn:aws:s3:::*
    
  • Also not mandatory, bucket locating permission can improve performance by making DSS access a bucket via its preferred AWS endpoint:

    s3:GetBucketLocation arn:aws:s3:::*
    

Creating S3 datasets

After creating your EC2 connection in Administration, you can create S3 datasets.

From either the Flow or the datasets list, click on New dataset > S3.

  • Select the connection in which your files are located
  • If available, select the bucket (either by listing or entering it)
  • Click on “Browse” to locate your files.

Connections path handling

The EC2 connection can be either in “free selection” mode, or in “path restriction mode”.

In “free selection” mode, users can select the bucket in which they want to read, and the path within the bucket. If the AWS keypair has the permission to list buckets, a bucket selector will be available for users.

In “path restriction mode”, you choose a bucket, and optionally a path within the bucket. Users will only be able to read and write data within that “base bucket + path”.

To enable “path restriction mode”, simply write a bucket name (and optionally a path in bucket) in the “Path restrictions” section of the connection settings

Location of managed datasets and folders

For a “free selection” connection

When you create a managed dataset or folder in a EC2/S3 connection, DSS will automatically create it within the “Default bucket” and the “Default path”.

Below that root path, the “naming rule” applies. See Making relocatable managed datasets for more information.

For a “path restriction” connection

When you create a managed dataset or folder in a EC2/S3 connection, DSS will automatically create it within the Bucket and Path selected in the “Path restrictions” section, and will append the “Default path” from the “managed datasets & folders” section.

Below that root path, the “naming rule” applies. See Making relocatable managed datasets for more information.