Accessing remote datasets

Data Science Studio can access datasets stored on remote servers.

A remote dataset is defined by specifying:

  • one or several sets of files hosted on remote HTTP, HTTPS, FTP or SSH servers, and
  • a local storage location (on the local server filesystem, or on Hadoop HDFS) to download them to.

It can then be browsed, analyzed and processed like any other locally-stored dataset.

Remote datasets can only be used as input to Data Science Studio. Their local mirror copies are updated (new files are downloaded and obsolete files are deleted, tracking changes on the origin servers) whenever they are to be used as input to a Studio recipe.

Remote datasets can be partitioned.


When using these datasets, DSS makes a copy of the remote data and stores it in the local storage location.

Defining remote dataset connections

Accessing remote files stored on FTP or SSH servers first requires the definition of a connection to the remote server, as follows:

  • Go to Administration > Connections
  • Click the “New connection” button and pick your connection type (FTP or SSH)
  • Enter a name for the new connection, and the required connection parameters
  • Save the new connection

FTP connection parameters

| Name | Description |
| --- | --- |
| Host | Host name or IP address of the FTP server to access. Mandatory. |
| User | FTP username to use, or empty for an anonymous FTP connection. |
| Password | FTP password to use, or empty for an anonymous FTP connection. |
| Use passive mode | Check to use FTP “passive” data transfer mode (default). Passive mode is often mandatory when there is a firewall between the Data Science Studio server and the FTP server. |
| Path | Path to the remote folder to use once connected to the FTP server. Start with a / to specify an absolute path; without a leading /, the path is relative to the startup directory. |
| Writable | Check to allow DSS to write datasets on this server. Those datasets will be written in a subfolder of the path (or of the startup directory if the path is empty) bearing the name of the dataset. |
| Allow managed | Check to allow DSS to write managed datasets on this server. Requires Writable. |
| Use global proxy | When checked, use the global proxy for this connection. Uncheck this if the FTP server is directly accessible. If you have an HTTP proxy, passive mode is mandatory. |
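Before saving an FTP connection, it can help to check the same parameters from outside the Studio. The following is a minimal sketch using Python's standard `ftplib`; it is not how DSS itself connects, and the function name `check_ftp_connection` and the `ftp_factory` argument are hypothetical helpers introduced only for this illustration. Proxy handling is not covered.

```python
import ftplib


def check_ftp_connection(host, user="", password="", passive=True, path="",
                         ftp_factory=ftplib.FTP):
    """List the remote folder using parameters mirroring the table above.

    Hypothetical helper: an illustration only, not DSS's implementation.
    """
    ftp = ftp_factory()
    ftp.connect(host)
    # With an empty user name, ftplib logs in as "anonymous".
    ftp.login(user, password)
    # Passive mode is usually required when a firewall sits in between.
    ftp.set_pasv(passive)
    if path:
        # A leading "/" makes the path absolute; otherwise it is resolved
        # relative to the startup directory of the session.
        ftp.cwd(path)
    names = ftp.nlst()
    ftp.quit()
    return names
```

Calling `check_ftp_connection("HOST", path="/stats")` against a reachable server would return the names found in `/stats`; `HOST` is a placeholder.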

SSH connection parameters

| Name | Description |
| --- | --- |
| Host | Host name or IP address of the SSH server to access. Mandatory. |
| User | SSH username to use. Mandatory. |
| Use public key authentication | Check to use public key authentication; uncheck to use password authentication. |
| Password | SSH password to use. Mandatory if using password authentication. |
| Key passphrase | In public-key authentication mode, optional passphrase used to decrypt the SSH private key. |

When using public-key authentication mode, the connection to the remote server is attempted using either of the two standard SSH keys for the Studio Linux user, stored respectively in the files $HOME/.ssh/id_dsa and $HOME/.ssh/id_rsa, where $HOME is the home directory of the Studio user account.
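To see which of these key files are present for a given account, a short check along the following lines can be run as the Studio user. This is only a sketch: the function name `find_ssh_keys` is hypothetical, and it only tests that the files exist, not that the remote server accepts them.

```python
from pathlib import Path


def find_ssh_keys(home=None):
    """Return the standard private key files that exist under home/.ssh.

    Hypothetical helper for illustration; defaults to the current user's home.
    """
    home = Path(home) if home is not None else Path.home()
    candidates = [home / ".ssh" / "id_dsa", home / ".ssh" / "id_rsa"]
    return [p for p in candidates if p.is_file()]


if __name__ == "__main__":
    for key in find_ssh_keys():
        print(key)
```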

Defining remote datasets

  • From the “Datasets” screen of Data Science Studio, click the “New dataset” button and pick the “FTP / HTTP / SSH” menu entry.
  • The remote dataset definition screen opens. The leftmost “Input/Output” pane lets you define the remote sources to mirror, with the following configuration items.
| Name | Type | Description |
| --- | --- | --- |
| Store into | Drop down menu | Choose the local filesystem or Hadoop HDFS location where the remote dataset files should be mirrored. |
| Download from | Drop down menu | Pick “HTTP or FTP url” to enter a download URL directly; otherwise choose one of the connections defined above. |
| HTTP/HTTPS/FTP URL | Text box | When in URL mode, defines the remote URL or URL pattern to download (see below). |
| Use global proxy | Check box | When in URL mode and a global proxy is set, uncheck this if the specified URL does not require a proxy. |
| Path in connection | Text box | When in “remote connection” mode, defines the set of files to download from this connection (see below). |
| Check source | Button | Click this button to test the data source defined above. Displays the number and total size of the files which would be downloaded from it. |
| Add another source | Button | Click this button to add another “Download from” remote source definition to this dataset. Click the blue cross at the right end of a remote source definition to remove it. |
| Download | Button | Click this button to start the download of all the specified URLs. |

Remote filesets definition and mirroring

The “Path in connection” text input defines the set of files to download from the remote host. Standard wildcard expansion patterns are supported: a question mark character ? matches any single character in a remote file or directory name, and a star character * matches any sequence of characters in a remote file or directory name. To match all files and subfolders in the connection’s startup directory, specify * as the path.
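The matching rule described above can be approximated in a few lines of Python. This is a sketch, not the Studio's own matcher: the function name `matches` is hypothetical, each path segment is compared separately so that a wildcard never crosses a / boundary, and `fnmatchcase` also understands bracket expressions such as [abc], which this page does not mention.

```python
from fnmatch import fnmatchcase


def matches(pattern, remote_path):
    """Return True if remote_path matches pattern, segment by segment.

    Hypothetical helper: "?" matches one character and "*" any sequence,
    but only within a single file or directory name.
    """
    pattern_segments = pattern.strip("/").split("/")
    path_segments = remote_path.strip("/").split("/")
    if len(pattern_segments) != len(path_segments):
        return False
    return all(
        fnmatchcase(segment, pat)
        for pat, segment in zip(pattern_segments, path_segments)
    )


if __name__ == "__main__":
    print(matches("/stats*/*.log", "/stats.2013/0101.log"))
    print(matches("/stats/*.log", "/stats/2014/0101.log"))
```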

Remote file names matching the given pattern are directly downloaded at the top-level of the locally-mirrored dataset. Remote directory names matching the given pattern are downloaded as sub-directories of the locally-mirrored dataset, along with all their contents.

Only regular files and directories are considered when enumerating remote servers. In particular, symbolic links are ignored.

Note that in order to avoid name collisions between downloaded files, renaming rules may be applied during the mirroring process:

  • When multiple data sources are defined for a single remote dataset, individual file names are prefixed with the data source index and an underscore _ character.
  • When wildcard expansion patterns appear in remote directory specifications, downloaded file or directory names are prefixed with enclosing directory names and an underscore _ character.

Remote file mirroring is performed according to file size and last modification time. If a remote file is created or updated on the remote host, it will be downloaded by the next dataset synchronization operation. If a remote file disappears from the remote host, its local mirror will be deleted by the next dataset synchronization operation.
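The synchronization rule above amounts to comparing two listings. The sketch below assumes each listing is a dictionary mapping a file name to a (size, last modification time) pair; the function name `plan_sync` and this data layout are assumptions made for illustration, not the Studio's internal representation.

```python
def plan_sync(remote, local):
    """Compute which files to download and which local mirrors to delete.

    remote, local: dict mapping file name -> (size, mtime).
    Hypothetical helper illustrating the rule, not DSS's implementation.
    """
    # A file is downloaded if it is new, or if its size or mtime changed.
    to_download = sorted(
        name for name, meta in remote.items() if local.get(name) != meta
    )
    # A local mirror is deleted if its remote file has disappeared.
    to_delete = sorted(name for name in local if name not in remote)
    return to_download, to_delete


if __name__ == "__main__":
    remote = {"a.log": (10, 100), "b.log": (20, 200), "c.log": (5, 50)}
    local = {"a.log": (10, 100), "b.log": (20, 150), "d.log": (1, 1)}
    print(plan_sync(remote, local))
```

In this example `a.log` is unchanged, `b.log` has a newer modification time, `c.log` is new on the remote host, and `d.log` no longer exists there.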

Patterns can be absolute (with a leading / character) or relative, in which case they are interpreted according to the default remote directory for this FTP or SSH connection.
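The difference between absolute and relative patterns can be illustrated with `posixpath.join`, which keeps an absolute second argument unchanged. This is a sketch only; the function name `resolve_pattern` and the directory `/home/ftpuser` are hypothetical.

```python
import posixpath


def resolve_pattern(default_dir, pattern):
    """Resolve a pattern against the connection's default remote directory.

    Hypothetical helper: absolute patterns are returned unchanged.
    """
    return posixpath.join(default_dir, pattern)


if __name__ == "__main__":
    # "/home/ftpuser" stands in for the default remote directory.
    print(resolve_pattern("/home/ftpuser", "stats/*.log"))
    print(resolve_pattern("/home/ftpuser", "/stats/*.log"))
```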


| Path in connection | Matched remote files | Downloaded files (single source) | Downloaded files (multiple sources) | Notes |
| --- | --- | --- | --- | --- |
| /stats/*.log | /stats/20140102.log, /stats/20140103.log | 20140102.log, 20140103.log | 1_20140102.log, 1_20140103.log | Pattern matches files in a single directory |
| /stats*/*.log | /stats.2013/0101.log, /stats.2013/0102.log, /stats.2014/0101.log | stats.2013_0101.log, stats.2013_0102.log, stats.2014_0101.log | 1_stats.2013_0101.log, 1_stats.2013_0102.log, 1_stats.2014_0101.log | Pattern matches files in multiple directories |
| /stats* | /stats.2013/0101.log, /stats.2013/0102.log, /stats.2014/0101.log | stats.2013/0101.log, stats.2013/0102.log, stats.2014/0101.log | stats.2013/1_0101.log, stats.2013/1_0102.log, stats.2014/1_0101.log | Pattern matches directories |
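The renaming rules can be sketched for the first two rows of the table, where the pattern matches files. The function name `local_name` is hypothetical, and two points are assumptions rather than documented behavior: several wildcard directory names are joined with underscores, and the directory-match case (third row) is not handled.

```python
def local_name(pattern, remote_path, source_index=None):
    """Compute the mirrored file name for a file matched by pattern.

    Hypothetical helper covering file matches only. source_index is the
    data source index when several sources are defined, else None.
    """
    pattern_segments = pattern.strip("/").split("/")
    path_segments = remote_path.strip("/").split("/")
    # Directory names matched by a wildcard segment become a prefix.
    prefixes = [
        segment
        for pat, segment in zip(pattern_segments[:-1], path_segments[:-1])
        if "*" in pat or "?" in pat
    ]
    name = "_".join(prefixes + [path_segments[-1]])
    if source_index is not None:
        # With multiple sources, prefix with the source index.
        name = f"{source_index}_{name}"
    return name


if __name__ == "__main__":
    print(local_name("/stats/*.log", "/stats/20140102.log"))
    print(local_name("/stats*/*.log", "/stats.2013/0101.log"))
    print(local_name("/stats*/*.log", "/stats.2013/0101.log", source_index=1))
```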

Remote URL definitions

A remote source can be defined by an HTTP or HTTPS URL. HTTP/HTTPS URLs may only reference a single remote file, and wildcard expansion patterns are not recognized in them.

A remote source can be defined by an FTP URL. FTP URLs may contain wildcard expansion patterns, and reference sets of remote files using the same rules as above.

Remote URL definitions can contain optional inline authentication credentials and non-default network ports.

| URL | Downloaded files (single source) |
| --- | --- |
| http://HOST/stats/20140102.log | 20140102.log |
| http://USER:PASSWORD@HOST:8080/stats/20140102.log | 20140102.log |
| ftp://USER:PASSWORD@HOST/stats*/*.log | stats.2013_0101.log, stats.2013_0102.log, stats.2014_0101.log |
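To see how inline credentials and a non-default port are carried in such a URL, the standard `urllib.parse` module can split it into its parts. This is a sketch for inspection only; the function name `describe_url` is hypothetical, and note that `urlsplit` reports the host name in lower case.

```python
import posixpath
from urllib.parse import urlsplit


def describe_url(url):
    """Split a remote source URL into its components (illustration only)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,      # None when no credentials are given
        "password": parts.password,
        "host": parts.hostname,      # lower-cased by urlsplit
        "port": parts.port,          # None when the default port is used
        "file": posixpath.basename(parts.path),
    }


if __name__ == "__main__":
    print(describe_url("http://USER:PASSWORD@HOST:8080/stats/20140102.log"))
```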

Defining partitioned remote datasets

Remote datasets can be partitioned like any other Data Science Studio file-based dataset, using file-based partitioning. See Working with partitions for more information.

When partitioning is activated for a remote dataset:

  • Remote files are downloaded from origin servers one partition at a time, each one being mirrored in a corresponding partition of the local storage.
  • A set of expansion variables is available for use in URL and remote path definitions, so that remote file names can be chosen from partition values.
  • The source definition screen contains an additional input field, “Partition to download for preview”. This value is used when a manual download is triggered with the “Download” button, or when individual data sources are checked through the “Check source” buttons.

The variable expansion patterns that can be used in URLs and remote paths to define the set of files to download for a given dataset partition are listed in Partitioning variables substitutions.


The following defines a remote dataset from a directory of web server log files accessible through an FTP server:

  • Partitioning scheme: use one time dimension named “date”, period “DAY”, pattern %Y/%M/%D/.*. This defines the naming pattern for downloaded files (i.e., in the local storage).
  • Input/output definitions: download from URL ftp://MYWEBSERVER/var/log/apache2/$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log. This defines the naming pattern for remote source files.

Given the above definitions, whenever Data Science Studio needs to access the files for February 5, 2014, it will mirror all files named *.log in subdirectory /var/log/apache2/2014/02/05 of the FTP server, and store them in subdirectory 2014/02/05 of the dataset root directory on the local server.
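The expansion for February 5, 2014 can be reproduced with Python's `string.Template`, which happens to use the same $NAME syntax. This is a sketch of the substitution only, not the Studio's code; the function names `expand` and `local_partition_dir` are hypothetical, and only the three variables used in the example are handled.

```python
from datetime import date
from string import Template

URL_PATTERN = (
    "ftp://MYWEBSERVER/var/log/apache2/"
    "$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log"
)


def expand(pattern, day):
    """Substitute the partition variables for the given day."""
    return Template(pattern).substitute(
        DKU_DST_YEAR=f"{day.year:04d}",
        DKU_DST_MONTH=f"{day.month:02d}",
        DKU_DST_DAY=f"{day.day:02d}",
    )


def local_partition_dir(day):
    """Local subdirectory for the day, as in the example above.

    Python's strftime codes are used here (%m = month, %d = day); they
    differ from the %Y/%M/%D partitioning pattern of the Studio.
    """
    return day.strftime("%Y/%m/%d")


if __name__ == "__main__":
    day = date(2014, 2, 5)
    print(expand(URL_PATTERN, day))
    print(local_partition_dir(day))
```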

Additionally, if the field “Partition to download for preview” of the remote dataset definition screen contains “YESTERDAY”, and the user clicks the “Download” button on Feb. 5, 2014, this mirror operation will be immediately triggered for the 2014/02/04 partition.
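The date arithmetic behind “YESTERDAY” is simple to check. The sketch below handles only that one keyword, since it is the only one named on this page; the function name `resolve_preview_partition` is hypothetical, and any other value is returned unchanged.

```python
from datetime import date, timedelta


def resolve_preview_partition(value, today):
    """Resolve the preview partition field to a year/month/day string.

    Hypothetical helper: only the "YESTERDAY" keyword is interpreted.
    """
    if value == "YESTERDAY":
        return (today - timedelta(days=1)).strftime("%Y/%m/%d")
    return value


if __name__ == "__main__":
    print(resolve_preview_partition("YESTERDAY", date(2014, 2, 5)))
```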