FTP (cached)

Data Science Studio can access datasets stored on remote servers using FTP.

This “remote” dataset can only be used as input in DSS. Data is not streamed directly from the source server. Instead, DSS keeps a local copy (mirror) of the remote data. Whenever the dataset is used as input to a DSS recipe, the mirror is updated to track changes on the origin servers: new files are downloaded and obsolete files are deleted.

A remote dataset is defined by specifying:

  • one or several sets of files hosted on the remote FTP server
  • a local storage location (on the local server filesystem, or Hadoop HDFS) that hosts the mirror copy of these files.

It can then be browsed, analyzed and processed like any other locally-stored dataset.

Remote datasets can be partitioned.

Note

DSS also features a “regular” (non-cached) FTP dataset. The regular dataset should be preferred for large datasets where keeping a local copy of the data might not be acceptable.

This (cached) dataset will generally provide better performance for small datasets, especially if the FTP server is too slow to be accessed for each usage of the data.

Warning

When using these datasets, DSS makes a copy of the remote data and stores it in the local storage location.

Defining the remote FTP connection

Accessing remote files stored on FTP servers first requires the definition of a connection to the remote server, as follows:

  • Go to Administration > Connections
  • Click the “New connection” button and select the FTP connection
  • Enter a name for the new connection, and the required connection parameters
  • Save the new connection

FTP connection parameters

Name Description
Host Host name or IP address of the FTP server to access (Mandatory)
User FTP username to use, or empty for an anonymous FTP connection
Password FTP password to use, or empty for an anonymous FTP connection
Use passive mode Check to use FTP “passive” data transfer mode (default). Using FTP passive mode is often mandatory when there is a firewall between the Data Science Studio server and the FTP server.
Path Path to the remote folder to use once connected to the FTP server. Start with a / to specify an absolute path; without a leading /, the path is relative to the startup directory.
Writable Check to allow DSS to write datasets on this server. Those datasets will be written in a subfolder of the path (or of the startup directory if the path is empty) bearing the name of the dataset.
Allow managed Check to allow DSS to write managed datasets on this server. Requires Writable.
Use global proxy When checked, use the global proxy for this connection. Uncheck this if the FTP server is directly accessible. If you have an HTTP proxy, passive mode is mandatory.
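
To make the Host, User, Password, Use passive mode and Path parameters more concrete, here is a minimal sketch of how they could map onto a plain FTP session using Python's standard ftplib module. It is an illustration only, not how DSS itself connects: the function names resolve_remote_path and open_connection are hypothetical, and the global proxy setting is not covered.

```python
import posixpath
from ftplib import FTP


def resolve_remote_path(startup_dir, path):
    """Resolve the connection "Path" against the FTP startup directory.

    A path starting with "/" is absolute; any other path (including an
    empty one) is interpreted relative to the startup directory.
    """
    return posixpath.normpath(posixpath.join(startup_dir, path))


def open_connection(host, user="", password="", passive=True, path=""):
    """Open an FTP session mirroring the connection parameters (hypothetical helper)."""
    ftp = FTP(host)              # connects to the default FTP port
    ftp.login(user, password)    # ftplib logs in as "anonymous" when user is empty
    ftp.set_pasv(passive)        # passive mode is usually needed behind a firewall
    startup_dir = ftp.pwd()      # directory the server places us in after login
    ftp.cwd(resolve_remote_path(startup_dir, path))
    return ftp
```

For instance, with a startup directory of /home/dss, a Path of data/logs resolves to /home/dss/data/logs, while /data/logs is used unchanged.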

Defining remote datasets

  • From the “Datasets” screen of Data Science Studio, click the “New dataset” button and pick the “FTP / HTTP / SSH” menu entry.
  • The remote dataset definition screen opens. The leftmost “Input/Output” pane lets you define the remote sources to mirror, with the following configuration items.
Name Type Description
Store into Drop down menu Choose the local filesystem or Hadoop HDFS location where the remote dataset files should be mirrored
Download from Drop down menu Choose one of the FTP connections defined above
Path in connection Text box Defines the set of files to download from this connection (see below)
Check source Button Click this button to test the data source defined above. Displays the number and total size of files which would be downloaded from it.
Add another source Button Click this button to add another “Download from” remote source definition to this dataset. Click the blue cross at the right end of the remote source definition to remove it
Download Button Click this button to start the download of all the specified URLs.

Remote filesets definition and mirroring

The “Path in connection” text input defines the set of files to download from the remote host. Standard wildcard expansion patterns are supported, where a question mark character ? matches any character in a remote file or directory name, and a star character * matches any sequence of characters in a remote file or directory name. To match all files and subfolders in the connection’s startup directory, specify * as path.
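
The wildcard semantics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DSS code: the helper names segment_regex and matches are hypothetical, and the sketch assumes that ? and * never match across a / separator, since the text says they match characters within a file or directory name.

```python
import re


def segment_regex(pattern_segment):
    """Translate one path segment with ? and * wildcards into a regex."""
    parts = []
    for ch in pattern_segment:
        if ch == "*":
            parts.append("[^/]*")   # any sequence of characters within a name
        elif ch == "?":
            parts.append("[^/]")    # exactly one character within a name
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(pattern, remote_path):
    """Return True if remote_path matches pattern, segment by segment."""
    pat = pattern.strip("/").split("/")
    rem = remote_path.strip("/").split("/")
    if len(pat) != len(rem):
        return False
    return all(segment_regex(p).fullmatch(r) for p, r in zip(pat, rem))


print(matches("/stats/*.log", "/stats/20140102.log"))   # True
print(matches("/stats/?.log", "/stats/ab.log"))         # False
```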

Remote files whose names match the given pattern are downloaded directly to the top level of the locally mirrored dataset. Remote directories whose names match the given pattern are downloaded as sub-directories of the locally mirrored dataset, along with all their contents.

Only regular files and directories are considered when enumerating remote servers. In particular, symbolic links are ignored.

Note that in order to avoid name collisions between downloaded files, renaming rules may be applied during the mirroring process:

  • When multiple data sources are defined for a single remote dataset, individual file names are prefixed with the data source index and an underscore _ character.
  • When wildcard expansion patterns appear in remote directory specifications, downloaded file or directory names are prefixed with enclosing directory names and an underscore _ character.

Remote file mirroring is performed according to file size and last modification time. If a remote file is created or updated on the remote host, it will be downloaded by the next dataset synchronization operation. If a remote file disappears from the remote host, its local mirror will be deleted by the next dataset synchronization operation.
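
The synchronization rule above can be sketched as a comparison between two listings. The sketch below is a simplified illustration, not the DSS implementation: plan_sync is a hypothetical name, and it assumes each listing maps a file name to the (size, last modification time) pair recorded for the remote file.

```python
def plan_sync(remote, mirror):
    """Decide what a synchronization should download and delete.

    remote: {name: (size, mtime)} as currently listed on the remote host
    mirror: {name: (size, mtime)} as recorded at the previous synchronization
    """
    # New files, and files whose size or modification time changed
    to_download = sorted(
        name for name, meta in remote.items() if mirror.get(name) != meta
    )
    # Files that disappeared from the remote host
    to_delete = sorted(name for name in mirror if name not in remote)
    return to_download, to_delete


remote = {"a.log": (10, 100), "b.log": (20, 200), "c.log": (5, 300)}
mirror = {"a.log": (10, 100), "b.log": (20, 150), "old.log": (1, 50)}
print(plan_sync(remote, mirror))  # (['b.log', 'c.log'], ['old.log'])
```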

Patterns can be absolute (with a leading / character) or relative, in which case they are interpreted according to the default remote directory for this connection.

Examples

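Each example gives a “Path in connection” pattern, the remote files it matches, and the resulting downloaded file names with a single source and with multiple sources.

Path in connection: /stats/*.log (pattern matches files in a single directory)

  • Matched remote files: /stats/20140102.log, /stats/20140103.log
  • Downloaded files (single source): 20140102.log, 20140103.log
  • Downloaded files (multiple sources): 1_20140102.log, 1_20140103.log

Path in connection: /stats*/*.log (pattern matches files in multiple directories)

  • Matched remote files: /stats.2013/0101.log, /stats.2013/0102.log, /stats.2014/0101.log
  • Downloaded files (single source): stats.2013_0101.log, stats.2013_0102.log, stats.2014_0101.log
  • Downloaded files (multiple sources): 1_stats.2013_0101.log, 1_stats.2013_0102.log, 1_stats.2014_0101.log

Path in connection: /stats* (pattern matches directories)

  • Matched remote files: /stats.2013/0101.log, /stats.2013/0102.log, /stats.2014/0101.log
  • Downloaded files (single source): stats.2013/0101.log, stats.2013/0102.log, stats.2014/0101.log
  • Downloaded files (multiple sources): stats.2013/1_0101.log, stats.2013/1_0102.log, stats.2014/1_0101.log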
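
The naming behaviour in these examples can be reproduced with a short sketch. The logic below is inferred from the examples and the renaming rules in this section only; it is not DSS code, the name local_name is hypothetical, and cases not covered by the examples may behave differently in DSS.

```python
def has_wildcard(segment):
    return "*" in segment or "?" in segment


def local_name(pattern, remote_path, source_index=None):
    """Compute the mirrored name of a remote file (inferred from the examples)."""
    pat = pattern.strip("/").split("/")
    rem = remote_path.strip("/").split("/")
    # Segments covered by the pattern, then the contents of a matched directory
    matched, inside = rem[:len(pat)], rem[len(pat):]
    # Enclosing directories selected by a wildcard become a name prefix
    dir_prefix = "".join(
        name + "_" for p, name in zip(pat[:-1], matched[:-1]) if has_wildcard(p)
    )
    # With several sources, file names are prefixed with the source index
    idx_prefix = f"{source_index}_" if source_index is not None else ""
    if inside:
        # The pattern matched a directory: keep it as a sub-directory
        return "/".join(
            [dir_prefix + matched[-1]] + inside[:-1] + [idx_prefix + inside[-1]]
        )
    return idx_prefix + dir_prefix + matched[-1]


print(local_name("/stats*/*.log", "/stats.2013/0101.log"))      # stats.2013_0101.log
print(local_name("/stats*", "/stats.2013/0101.log", 1))         # stats.2013/1_0101.log
```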

Defining partitioned remote datasets

Remote datasets can be partitioned like any other Data Science Studio file-based dataset, using file-based partitioning. See Working with partitions for more information.

When partitioning is activated for a remote dataset:

  • Remote files are downloaded from origin servers one partition at a time, each one being mirrored in a corresponding partition of the local storage.
  • A set of expansion variables is available for use in URLs and remote path definitions, so that remote file names can be chosen from partition values.
  • The source definition screen contains an additional input field, “Partition to download for preview”. This value is used when a manual download is triggered with the “Download” button, or when individual data sources are checked with the “Check source” buttons.

The variable expansion patterns that can be used in URLs and remote paths to define the set of files to download for a given dataset partition are listed in Partitioning variables substitutions.

Example

The following defines a remote dataset from a directory of web server log files accessible through an FTP server (similar instructions apply for SSH):

  • Partitioning scheme: use one time dimension named “date”, period “DAY”, pattern %Y/%M/%D/.*. This defines the naming pattern for downloaded files (i.e., in the local storage).
  • Input/output definitions: download from URL ftp://MYWEBSERVER/var/log/apache2/$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log. This defines the naming pattern for remote source files.

Given the above definitions, whenever Data Science Studio needs to access the files for February 5, 2014, it will mirror all files named *.log in subdirectory /var/log/apache2/2014/02/05 of the FTP server, and store them in subdirectory 2014/02/05 of the dataset root directory of the local server.

Additionally, if the field “Partition to download for preview” of the remote dataset definition screen contains “YESTERDAY”, and the user clicks the “Download” button on Feb. 5, 2014, this mirror operation will be immediately triggered for the 2014/02/04 partition.
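
The expansion in this example can be mimicked with Python's string.Template, which happens to use the same $NAME syntax. This is a sketch for illustration, not the DSS substitution engine: the function names are hypothetical, only the three variables used in the example are handled, and accepting an ISO date such as 2014-02-04 as an explicit preview value is an assumption.

```python
from datetime import date, timedelta
from string import Template


def expand_remote_path(template, partition_day):
    """Substitute the partition's year, month and day into a remote path."""
    return Template(template).substitute(
        DKU_DST_YEAR=f"{partition_day.year:04d}",
        DKU_DST_MONTH=f"{partition_day.month:02d}",
        DKU_DST_DAY=f"{partition_day.day:02d}",
    )


def local_partition_dir(partition_day):
    """Local sub-directory for a DAY partition (the %Y/%M/%D part of the pattern)."""
    # Note: these are strftime codes, which differ from the DSS pattern letters
    return partition_day.strftime("%Y/%m/%d")


def resolve_preview_partition(value, today):
    """Turn the preview field into a date; only YESTERDAY is treated as a keyword."""
    if value == "YESTERDAY":
        return today - timedelta(days=1)
    return date.fromisoformat(value)  # assumed explicit format, e.g. 2014-02-04


path = "/var/log/apache2/$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log"
print(expand_remote_path(path, date(2014, 2, 5)))  # /var/log/apache2/2014/02/05/*.log
print(local_partition_dir(resolve_preview_partition("YESTERDAY", date(2014, 2, 5))))  # 2014/02/04
```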