Using managed folders

DSS comes with a large number of supported formats, machine learning engines, ... But sometimes you want to do more.

In addition to “Datasets”, DSS code recipes (Python, R, PySpark, SparkR) can read and write from “Managed Folders”: handles on filesystem-hosted folders where you can store any kind of data.

DSS does not try to read or write structured data from managed folders, but they appear in the Flow and can be used for dependencies computation. Furthermore, you can upload and download files from the managed folder using the DSS public API.

Here are some example use cases:

  • You have some files that DSS cannot read, but you have a Python library which can read them: upload the files to a managed folder, and use your Python library in a Python recipe to create a regular dataset
  • You want to use Vowpal Wabbit to create a model. DSS does not have full-fledged integration with VW. Write a first Python recipe that has a managed folder as output, and save the trained VW model in it. Write another recipe that reads the model from the same managed folder to make predictions
  • Anything else you might think of. Thanks to managed folders, DSS can help you even when it does not know about your data.

Folder storage

As its name indicates, a managed folder is basically a folder, which means that you can read and write files from it.

A managed folder is stored on the filesystem of the DSS server, by default in the managed_folders subfolder.

Usage in Python

A managed folder can be used as input or as output of a recipe (NB: this also applies to the PySpark recipe):

  • To use a managed folder as input, select it in the inputs selector
  • To use a managed folder as output, click on the “Add” button of outputs, and select “Create folder” at the bottom. Enter a label for the managed folder.

A managed folder can also be used without any restriction from a Python notebook.

To use a managed folder, you have to retrieve its path from the Python Dataiku API:

import dataiku

handle = dataiku.Folder("folder_name")
path = handle.get_path()

Once you have obtained the path, you can simply read and write files with the regular Python file API:

import dataiku, os.path

handle = dataiku.Folder("folder_name")
path = handle.get_path()

with open(os.path.join(path, "myinputfile.txt")) as f:
    data = f.read()
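
Writing works the same way. Here is a minimal sketch, using a temporary directory to stand in for the path returned by `get_path()` (the file name is hypothetical):

```python
import os
import os.path
import tempfile

# Stand-in for the path returned by handle.get_path()
path = tempfile.mkdtemp()

# Write a file into the managed folder with the regular Python file API
with open(os.path.join(path, "myoutputfile.txt"), "w") as f:
    f.write("some output data")

# The new file is now visible in the folder
print(os.listdir(path))  # ['myoutputfile.txt']
```

In a real recipe, the only change is obtaining `path` from `dataiku.Folder("folder_name").get_path()` instead of a temporary directory.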

The Dataiku API also provides some helpers to read and write JSON files in a single line of code. See Managed folders in the Python API.
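
These helpers wrap the standard pattern, which can also be done directly with the `json` module. A sketch, again using a temporary directory in place of the managed folder path (file name and contents are hypothetical):

```python
import json
import os.path
import tempfile

# Stand-in for the path returned by handle.get_path()
path = tempfile.mkdtemp()

# Write a JSON file into the managed folder
with open(os.path.join(path, "params.json"), "w") as f:
    json.dump({"threshold": 0.5}, f)

# Read it back
with open(os.path.join(path, "params.json")) as f:
    params = json.load(f)
```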

Usage in R

A managed folder can be used as input or as output of an R recipe (NB: this also applies to the SparkR recipe):

  • To use a managed folder as input, select it in the inputs selector
  • To use a managed folder as output, click on the “Add” button of outputs, and select “Create folder” at the bottom. Enter a label for the managed folder.

A managed folder can also be used without any restriction from an R notebook.

To use a managed folder, you have to retrieve its path from the R Dataiku API:

library(dataiku)

path <- dkuManagedFolderPath("folder_name")

Once you have obtained the path, you can simply read and write files with the regular R file API.
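
For example, reading and writing with base R functions might look like this (a sketch; the file names are hypothetical):

```r
library(dataiku)

path <- dkuManagedFolderPath("folder_name")

# Read a text file from the managed folder with the regular R API
data <- readLines(file.path(path, "myinputfile.txt"))

# Write a file back into the folder
writeLines(c("some output"), file.path(path, "myoutputfile.txt"))
```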

Viewing and modifying the contents of a folder

You can also manually create an input folder in the “New dataset” menu. By double-clicking on a managed folder in the Flow, you can view the content of a folder, download files, or upload new files.

Usage of a folder as a dataset

The contents of a managed folder can be used to construct a filesystem dataset. This is done using the “Files in folder” dataset.

This enables advanced setups like:

  • Having a Python recipe download files from a files-oriented data store that DSS cannot read. The recipe simply writes the files to a managed folder, so it does not have to deal with parsing them.
  • Having a “Files in folder” dataset do the parsing and extraction from this intermediate folder.
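
The first step of such a setup might look like the sketch below. The fetch step and file names are hypothetical, and a temporary directory stands in for the output managed folder path; a real recipe would obtain it from `dataiku.Folder(...).get_path()` and call whatever library can read the external store:

```python
import os
import os.path
import tempfile

# Stand-in for the output managed folder path
path = tempfile.mkdtemp()

# Hypothetical fetch step: in a real recipe this would call a library
# that can read the external, files-oriented data store
fetched = {"2024-01.bin": b"\x00\x01", "2024-02.bin": b"\x02\x03"}

# Write each fetched file into the managed folder, unparsed;
# a downstream "Files in folder" dataset handles the parsing
for name, content in fetched.items():
    with open(os.path.join(path, name), "wb") as f:
        f.write(content)
```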