Managed folders

DSS comes with a large number of supported formats, machine learning engines, … But sometimes you want to do more. If you need to store and manipulate data in a format not supported natively by DSS, or to use training algorithms without a scikit-learn interface, then DSS offers unstructured storage handles in the form of “Managed Folders”. In there, you can read and write any kind of data, and you can do so on any filesystem-like connection (local filesystem of course, but also HDFS, S3, FTP…).

DSS does not try to read or write structured data from managed folders, but they appear in the Flow and can be used for dependency computation. Furthermore, you can upload and download files from a managed folder using the DSS public API.

Here are some example use cases:

  • You have some files that DSS cannot read, but you have a Python library which can read them: upload the files to a managed folder, and use your Python library in a Python recipe to create a regular dataset
  • You want to use Vowpal Wabbit to create a model. DSS does not have full-fledged integration with VW. Write a first Python recipe that has a managed folder as output and saves the VW model into it. Then write another recipe that reads the model from the same managed folder to make predictions
  • Anything else you might think of. Thanks to managed folders, DSS can help you even when it does not know about your data.

Creating a managed folder

Managed folder creation is available in the Flow from the New dataset menu (under the name ‘Folder’). To create a managed folder, you need to select an existing connection that supports folders:

  • either a FS-like connection (filesystem, HDFS, S3, Azure, GCS, FTP, SSH) that has the “Allow managed folders” option set
  • or a custom FS provider (see Writing DSS Filesystem providers)

The default connection to create managed folders on is named managed_folders and corresponds to the managed_folders directory in the DSS data directory, i.e. it resides on the filesystem of the DSS server.

Using a managed folder

Managed folders are primarily intended to be used as input or output for code recipes (Python, R, Scala), though some visual recipes dealing with unstructured data also use managed folders as output (Export, Download). As its name indicates, a managed folder is basically a folder, which means that you interact with it the same way you interact with a folder on your local filesystem.

Concepts

The managed folder follows the usual conventions pertaining to file-like objects:

  • objects inside the managed folder are identified by a “path” corresponding to their position w.r.t. the folder’s root.
  • objects have a size and can have a modification time (if the underlying storage system permits it)
  • objects can be “files” or “directories”. The latter have a size of 0 and contain no data themselves, but contain other files or directories

Note

Not all storage systems have a native concept of “directory”, in which case directories are merely a logical structure inferred from the actual objects’ paths (i.e. the files’ paths)

  • when reading from a file, DSS returns a non-seekable stream of data
  • when writing to a file, DSS doesn’t support positioning, so the entire file is written from the beginning
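Because the stream is forward-only, code that needs random access must first spool the data to a local file or buffer. A minimal sketch of chunked copying from such a stream (an in-memory, non-seekable wrapper stands in for the real download stream):

```python
import io
import shutil

class ForwardOnly(io.RawIOBase):
    """A forward-only byte stream, like a DSS download stream."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
    def readable(self):
        return True
    def seekable(self):
        return False          # no rewinding, no random access
    def readinto(self, b):
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

src = io.BufferedReader(ForwardOnly(b"x" * 100_000))
dst = io.BytesIO()            # spool target; could be a local temp file
shutil.copyfileobj(src, dst, length=16 * 1024)  # copy in 16 KiB chunks
```

Once spooled, `dst` (or the temp file) can be reread and seeked freely.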

Usage in Python

A managed folder can be used as input or output of a recipe (NB: this also applies to the PySpark recipe):

  • To use a managed folder as input, select it in the inputs selector
  • To use a managed folder as output, click on the “Add” button of outputs, and select “Create folder” at the bottom. Enter a label for the managed folder.

A managed folder can also be used without any restriction from a Python notebook.

To use a managed folder, you have to retrieve its handle from the Python Dataiku API:

import dataiku

handle = dataiku.Folder("folder_name")

You can then list the contents of the folder, as paths relative to its root.

import dataiku

handle = dataiku.Folder("folder_name")
# pass a partition identifier if the folder is partitioned
paths = handle.list_paths_in_partition()

If your user can access the details of the connection on which the folder’s contents are stored, then you can retrieve the path to the folder’s root, and then read and write data directly (with the regular Python API for a local filesystem, or the boto library for S3, etc…)

import dataiku, os.path

handle = dataiku.Folder("folder_name")
path = handle.get_path()

with open(os.path.join(path, "myinputfile.txt")) as f:
    data = f.read()

Otherwise, you can read and write data via calls to the API, in which case the data transits through the DSS backend:

import dataiku

handle = dataiku.Folder("folder_name")

with handle.get_download_stream("myinputfile.txt") as f:
    data = f.read()

with handle.get_writer("myoutputfile.txt") as w:
    w.write("some data")

The Dataiku API also provides some helpers to read and write JSON files in a single line of code. See Managed folders in Python API
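As a sketch of what these helpers save you from writing by hand, here is the equivalent logic with the standard json module; an in-memory buffer stands in for the managed-folder writer and reader, and the helper names in the comments (`write_json`, `read_json`) are taken from the Python API reference:

```python
import io
import json

record = {"rows": 1234, "status": "ok"}

# Roughly what handle.write_json("metrics.json", record) does:
# serialize the object and push it through a text writer.
buf = io.StringIO()
json.dump(record, buf)

# Roughly what handle.read_json("metrics.json") does:
# stream the file back and parse it.
buf.seek(0)
loaded = json.load(buf)
```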

Usage in R

A managed folder can be used as input or output of an R recipe (NB: this also applies to the SparkR recipe):

  • To use a managed folder as input, select it in the inputs selector
  • To use a managed folder as output, click on the “Add” button of outputs, and select “Create folder” at the bottom. Enter a label for the managed folder.

A managed folder can also be used without any restriction from an R notebook.

To use a managed folder, you have to retrieve its path from the R Dataiku API:

library(dataiku)

path <- dkuManagedFolderPath("folder_name")

Once you have obtained the path, you can simply read and write files with the regular R file APIs.

Usage of a folder as a dataset

The contents of a managed folder can be used to construct a filesystem dataset. This is done using the “Files in folder” dataset.

This enables advanced setups like:

  • Having a Python recipe download files from a files-oriented data store that DSS cannot read natively. The recipe downloads the files to a managed folder, so it does not have to deal with parsing them.
  • Having a files-in-folder dataset do the parsing and extraction from the intermediate folder.
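The core of such a download recipe is a plain copy of byte streams into files under the folder’s root. A sketch, assuming the folder’s path is accessible via get_path(); the folder name and the stream sources here are hypothetical:

```python
import os
import shutil

def spool_streams(root, named_streams):
    """Copy each (relative_path, stream) pair into a file under root,
    creating intermediate directories as needed."""
    for rel_path, stream in named_streams:
        target = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            shutil.copyfileobj(stream, f)

# In a recipe, root would come from dataiku.Folder("raw_files").get_path()
# and the streams from the external store's client library.
```

The files-in-folder dataset then handles format detection and parsing on top of what was spooled.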

Clearing

When a managed folder is built, DSS does not perform any cleaning of the folder’s contents. The code of the recipe having the folder as output is responsible for performing any necessary cleanup.
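If the recipe should start from an empty folder, it has to delete the previous contents itself before writing. A minimal sketch for the case where the folder’s path is accessible via get_path() (for folders on remote connections, the deletion calls of the Python API would be used instead; check the API reference):

```python
import os
import shutil

def clear_folder(root):
    """Remove every file and subdirectory under root, keeping root itself."""
    for name in os.listdir(root):
        entry = os.path.join(root, name)
        if os.path.isdir(entry):
            shutil.rmtree(entry)
        else:
            os.remove(entry)

# In a recipe: clear_folder(dataiku.Folder("output").get_path())
# before writing the new results.
```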