Uncached FTP datasets

Data Science Studio can both read and write datasets directly on FTP servers. If you only need to read an FTP file, consider using a locally cached Remote FTP dataset instead, for better performance and reliability.

Note

A locally-cached dataset will still check the FTP server for updates when it is rebuilt.

Define a live-from-FTP input Dataset

After setting up an FTP connection, add a new dataset to your project and choose the “Uncached FTP” type. Then select your FTP connection.

If necessary, specify a path (a subpath of the connection’s path, if that path is not empty), or click “Browse” and select a file or directory.

If the final path is a directory, the dataset is the union of the data in all the files in that directory (including sub-directories). The sample displayed only presents data from the first non-empty file.
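The directory behavior described above can be illustrated with a minimal sketch built on Python's standard ftplib. It is not how Data Science Studio is implemented. The helper names (iter_remote_files, union_of_files, first_non_empty_file) are hypothetical, the alphabetical file order is an assumption, and the server is assumed to support the MLSD command.

```python
from ftplib import FTP
from typing import Iterator, List, Optional


def iter_remote_files(ftp: FTP, path: str) -> Iterator[str]:
    """Yield every file under `path`, descending into sub-directories.

    Assumption: entries are visited in alphabetical order; the real
    product may order files differently.
    """
    for name, facts in sorted(ftp.mlsd(path), key=lambda entry: entry[0]):
        kind = facts.get("type")
        child = f"{path.rstrip('/')}/{name}"
        if kind == "dir":
            yield from iter_remote_files(ftp, child)
        elif kind == "file":
            yield child
        # "cdir" and "pdir" entries (the "." and ".." links) are skipped.


def read_lines(ftp: FTP, remote_path: str) -> List[str]:
    """Download one remote text file and return its lines."""
    lines: List[str] = []
    ftp.retrlines(f"RETR {remote_path}", lines.append)
    return lines


def union_of_files(ftp: FTP, path: str) -> List[str]:
    """Concatenate the lines of every file found under `path`."""
    rows: List[str] = []
    for remote_path in iter_remote_files(ftp, path):
        rows.extend(read_lines(ftp, remote_path))
    return rows


def first_non_empty_file(ftp: FTP, path: str) -> Optional[str]:
    """Return the first file under `path` that holds at least one line."""
    for remote_path in iter_remote_files(ftp, path):
        if read_lines(ftp, remote_path):
            return remote_path
    return None
```

With a connected ftplib.FTP object, union_of_files(ftp, "/data") would return the lines of every file below /data, while first_non_empty_file(ftp, "/data") mimics how a sample is taken from a single file.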

Define an output dataset

Two cases are supported:

  1. In a folder:
    • the data may be written to several files
    • the content of the folder is wiped before writing
    • writing a managed dataset requires a directory
  2. In a file:
    • you must create the file beforehand (it may be empty)
    • the file is emptied before writing
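The two write modes above can be sketched with Python's standard ftplib. This is only an illustration of the stated semantics, not the product's code. The function names are hypothetical, the server is assumed to support MLSD, and this sketch only removes files directly inside the folder (it leaves sub-directories alone).

```python
import io
from ftplib import FTP
from typing import Dict, List


def wipe_folder(ftp: FTP, folder: str) -> List[str]:
    """Delete every regular file directly under `folder`.

    Returns the paths that were deleted. Sub-directories are not
    touched in this simplified sketch.
    """
    removed: List[str] = []
    for name, facts in list(ftp.mlsd(folder)):
        if facts.get("type") == "file":
            target = f"{folder.rstrip('/')}/{name}"
            ftp.delete(target)
            removed.append(target)
    return removed


def write_to_folder(ftp: FTP, folder: str, parts: Dict[str, bytes]) -> None:
    """Folder mode: wipe the folder, then upload one file per part."""
    wipe_folder(ftp, folder)
    for name, payload in parts.items():
        ftp.storbinary(f"STOR {folder.rstrip('/')}/{name}", io.BytesIO(payload))


def write_to_file(ftp: FTP, remote_path: str, payload: bytes) -> None:
    """File mode: STOR replaces the file's content, so old data is lost."""
    ftp.storbinary(f"STOR {remote_path}", io.BytesIO(payload))
```

In both modes the previous content does not survive a write, so an output path should never point at data that must be kept.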

The default output format will be a single Unix-style CSV file, gzip-compressed; this may be changed.
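To give an idea of what such a file looks like, here is a minimal Python sketch that produces a gzip-compressed CSV in memory. The exact dialect written by the product is not specified here, so the reading of “Unix-style” as LF line endings with backslash escaping is an assumption, as are the header row and the function name to_gzipped_unix_csv.

```python
import csv
import gzip
import io
from typing import Iterable, Sequence


def to_gzipped_unix_csv(header: Sequence[str], rows: Iterable[Sequence]) -> bytes:
    """Serialize rows as CSV text and gzip the result.

    Assumptions: "\n" line endings and backslash escaping stand in for
    "Unix-style"; a header row is written first.
    """
    text = io.StringIO(newline="")
    writer = csv.writer(
        text,
        lineterminator="\n",  # LF instead of the csv module's default CRLF
        doublequote=False,    # escape quotes rather than doubling them
        escapechar="\\",
    )
    writer.writerow(header)
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


payload = to_gzipped_unix_csv(["id", "name"], [[1, "alice"], [2, "bob"]])
```

The resulting bytes could be uploaded as a single .csv.gz file, and gzip.decompress(payload) gives back the plain CSV text for inspection.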

There are two ways to create an output dataset:

  1. Create a managed dataset when needed: When creating a recipe, choose “Create a new dataset”, specify a name, and store it in the desired FTP connection. This creates a managed dataset in a directory (under the connection’s path) named after the dataset. That FTP connection must have both the Allow write and Allow managed datasets options checked.
  2. Create a dataset, then use it as output: Create a new “Uncached FTP” dataset (same as for input). When you need an output dataset, select “Use an existing dataset” and pick your dataset. The FTP connection must have the Allow write option checked.