Amazon S3¶
DSS can interact with Amazon Web Services’ Simple Storage Service (AWS S3) to:
- Read and write datasets
- Read and write managed folders
S3 is an object storage service: you create “buckets” that store arbitrary binary content and textual metadata under a specific key, which is unique within the bucket.
While not technically a hierarchical file system with folders, sub-folders and files, that behavior can be emulated by using keys containing /. For instance, you can store your daily logs under keys like 2015/01/24/app.log. S3 lets you list all objects with a given prefix, say 2015/01/, and many S3 clients (including the AWS Web Console) can “browse” your buckets this way.
DSS uses the same filesystem-like mechanism when accessing S3: when you specify a bucket, you can browse it to quickly find your dataset, or you can set the prefix in which DSS may output datasets. Datasets on S3 must therefore be stored in one of the supported filesystem formats.
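For illustration, this prefix-based listing can be reproduced outside of DSS with a few lines of boto3; the bucket name and prefix below are placeholders:

    import boto3

    # Assumption: credentials are available through the standard AWS credential
    # chain (environment variables, shared credentials file, or instance profile).
    s3 = boto3.client("s3")

    # List every object whose key starts with the "2015/01/" prefix; this is how
    # S3 clients emulate browsing a folder.
    response = s3.list_objects_v2(Bucket="examplebucket", Prefix="2015/01/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])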
Note
Using S3 as a filesystem comes with a few limitations:
- keys must not start with a /
- “files” with names containing / are not supported
- “folders” (prefixes) named . or .. are not supported
- like on a filesystem, a file and a folder with the same name are not supported: if a file some/key exists, it takes precedence over the some/key/ prefix (folder)
Create an S3 connection¶
To create an S3 connection, you need an AWS keypair (access key and secret key). Note that if your DSS server runs on AWS itself, DSS can automatically retrieve credentials from the environment; in that case, you can use “Environment” credentials.
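For reference, the “Environment” credential mode relies on the standard AWS credential resolution chain. The boto3 sketch below (placeholder values, not DSS code) contrasts an explicit keypair with credentials resolved from the environment or the instance profile:

    import boto3

    # Explicit keypair (placeholder values), equivalent to entering an access key
    # and secret key in the connection settings:
    explicit_client = boto3.client(
        "s3",
        aws_access_key_id="AKIA_PLACEHOLDER",
        aws_secret_access_key="SECRET_PLACEHOLDER",
    )

    # "Environment" mode: no credentials given, so they are resolved from
    # environment variables, the shared credentials file, or the instance profile.
    environment_client = boto3.client("s3")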
Required S3 permissions¶
The AWS identity whose credentials you specify in the connection must have the following permissions:
To read data from a bucket, it must at least have listing and reading permissions on that bucket:
- s3:ListBucket on arn:aws:s3:::examplebucket
- s3:GetObject on arn:aws:s3:::examplebucket/*
To write data to a bucket, it must also have writing, deletion and multipart-upload-aborting permissions on that bucket:
- s3:ListBucket on arn:aws:s3:::examplebucket
- s3:GetObject on arn:aws:s3:::examplebucket/*
- s3:PutObject on arn:aws:s3:::examplebucket/*
- s3:DeleteObject on arn:aws:s3:::examplebucket/*
- s3:AbortMultipartUpload on arn:aws:s3:::examplebucket/*
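Taken together, the read and write permissions listed above correspond to an IAM policy along these lines. This is only a sketch reusing the examplebucket name; adapt the resources and statement ids to your own setup:

    import json

    # Sketch of an IAM policy granting read and write access to "examplebucket".
    # The bucket name and statement ids are placeholders.
    read_write_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListExampleBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::examplebucket"],
            },
            {
                "Sid": "ReadWriteExampleObjects",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:AbortMultipartUpload",
                ],
                "Resource": ["arn:aws:s3:::examplebucket/*"],
            },
        ],
    }

    print(json.dumps(read_write_policy, indent=2))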
Not mandatory, but the bucket listing permission allows you to pick a bucket from a dropdown list instead of typing its name manually when creating a dataset:
- s3:ListAllMyBuckets on arn:aws:s3:::*
Also not mandatory, the bucket location permission can improve performance by letting DSS access a bucket through its preferred AWS endpoint:
- s3:GetBucketLocation on arn:aws:s3:::*
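For illustration, these two optional permissions are what allow API calls such as the following boto3 sketch to succeed (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")

    # Requires s3:ListAllMyBuckets: this is what enables a bucket dropdown.
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])

    # Requires s3:GetBucketLocation: returns the bucket's region so requests can
    # target the preferred endpoint ("LocationConstraint" is None for us-east-1).
    print(s3.get_bucket_location(Bucket="examplebucket")["LocationConstraint"])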
For more details about how to create policies and add these permissions to them, please see the AWS documentation here: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/set-permissions.html
Transfer ownership to the bucket owner¶
If the target bucket is owned by another account, and if ACLs are enabled and set to “Bucket owner preferred”, it is possible to transfer the ownership of uploaded data to the bucket owner by adding the bucket-owner-full-control canned ACL to the write operation.
The bucket-owner-full-control canned ACL can be set on every write operation in DSS by setting the dku.connection.s3.addBucketOwnerCannedACL custom property to true on the S3 connection.
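At the S3 API level, this amounts to adding the canned ACL to each upload. The boto3 sketch below (placeholder bucket, key and body, not DSS code) shows what such a write looks like:

    import boto3

    s3 = boto3.client("s3")

    # Upload an object with the bucket-owner-full-control canned ACL, so that the
    # bucket owner (another AWS account) gets full control of the uploaded object.
    # Bucket name, key and body are placeholders.
    s3.put_object(
        Bucket="examplebucket",
        Key="some/prefix/data.csv",
        Body=b"col1,col2\n1,2\n",
        ACL="bucket-owner-full-control",
    )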
In addition to the permissions required to write data to a bucket (listed above), the AWS account used by the S3 connection must have permission to read and write ACLs on the target bucket. This can be granted on the bucket by its owner.
For more details about configuring ACLs, please refer to the AWS documentation here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/managing-acls.html
Note
The recommendation for cross-account access is to use AssumeRole. For more details about AssumeRole, please see the AWS documentation here: https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html
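For reference, AssumeRole exchanges the caller’s credentials for temporary credentials scoped to a role in the bucket owner’s account. A minimal boto3 sketch, with a placeholder role ARN and session name:

    import boto3

    sts = boto3.client("sts")

    # Assume a role defined in the account that owns the target bucket.
    # The role ARN and session name are placeholders.
    assumed = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/cross-account-s3-access",
        RoleSessionName="s3-cross-account-session",
    )

    # Use the temporary credentials returned by STS for subsequent S3 calls.
    creds = assumed["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )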
Creating S3 datasets¶
After creating your S3 connection in Administration, you can create S3 datasets.
- From either the Flow or the datasets list, click on New dataset > S3.
- Select the connection in which your files are located.
- If available, select the bucket (either by picking it from the list or entering its name).
- Click on “Browse” to locate your files.
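Once created, the dataset behaves like any other DSS dataset. For example, it can be read into a pandas DataFrame from a Python recipe or notebook; the dataset name below is a placeholder:

    import dataiku

    # "my_s3_dataset" is a placeholder for the name of the dataset created above.
    dataset = dataiku.Dataset("my_s3_dataset")
    df = dataset.get_dataframe()
    print(df.head())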
Connections path handling¶
The S3 connection can be either in “free selection” mode, or in “path restriction mode”.
In “free selection” mode, users can select the bucket in which they want to read, and the path within the bucket. If the AWS keypair has the permission to list buckets, a bucket selector will be available for users.
In “path restriction mode”, you choose a bucket, and optionally a path within the bucket. Users will only be able to read and write data within that “base bucket + path”.
To enable “path restriction mode”, simply enter a bucket name (and optionally a path within the bucket) in the “Path restrictions” section of the connection settings.
Location of managed datasets and folders¶
For a “free selection” connection¶
When you create a managed dataset or folder in an S3 connection, DSS will automatically create it within the “Default bucket” and the “Default path”.
Below that root path, the “naming rule” applies. See Making relocatable managed datasets for more information.
For a “path restriction” connection¶
When you create a managed dataset or folder in an S3 connection, DSS will automatically create it within the Bucket and Path selected in the “Path restrictions” section, and will append the “Default path” from the “managed datasets & folders” section.
Below that root path, the “naming rule” applies. See Making relocatable managed datasets for more information.
Server-side encryption of files¶
Encryption Mode¶
Options for server-side encryption of files can be set to ‘None’, ‘SSE-S3’ or ‘SSE-KMS’. For more information on these options, please see the AWS documentation here: https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html
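To illustrate how the two encryption modes are expressed at the S3 API level, here is a boto3 sketch of a single upload for each mode (bucket, keys and KMS key id are placeholders, not DSS code):

    import boto3

    s3 = boto3.client("s3")

    # SSE-S3: objects are encrypted with keys managed by S3 itself.
    s3.put_object(
        Bucket="examplebucket",
        Key="encrypted/sse-s3-example.csv",
        Body=b"data",
        ServerSideEncryption="AES256",
    )

    # SSE-KMS: objects are encrypted with a KMS key; the key id is a placeholder.
    s3.put_object(
        Bucket="examplebucket",
        Key="encrypted/sse-kms-example.csv",
        Body=b"data",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="arn:aws:kms:us-east-1:123456789012:key/placeholder-key-id",
    )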