Data Science Studio can access datasets stored on remote servers using FTP.
This “remote” dataset can only be used as input in DSS. Data is not directly streamed from the source server. Instead, a local copy (mirror) is made of the remote data. The local mirror copy is updated (new files being downloaded and obsolete files being deleted, tracking changes on the origin servers) whenever it is used as input to a DSS recipe.
A remote dataset is defined by specifying:
- one or several sets of files hosted on the remote FTP server
- a local storage location (on the local server filesystem, or Hadoop HDFS) where the mirror copy is hosted.
It can then be browsed, analyzed and processed like any other locally-stored dataset.
Remote datasets can be partitioned.
DSS also features a “regular” (non-cached) FTP dataset. The regular dataset should be preferred for large datasets where it might not be acceptable to keep a copy of the data.
This (cached) dataset will generally provide better performance for small datasets, especially if the FTP server is too slow to be accessed for each use of the data. When using these datasets, DSS makes a copy of the remote data in the local storage location.
Accessing remote files stored on FTP servers first requires the definition of a connection to the remote server, as follows:
- Go to Administration > Connections
- Click the “New connection” button and select the FTP connection
- Enter a name for the new connection, and the required connection parameters
- Save the new connection
| Parameter | Description |
|---|---|
| Host | Host name or IP address of the FTP server to access (mandatory) |
| User | FTP username to use, or empty for an anonymous FTP connection |
| Password | FTP password to use, or empty for an anonymous FTP connection |
| Use passive mode | Check to use FTP “passive” data transfer mode (default). Using FTP passive mode is often mandatory when there is a firewall between the Data Science Studio server and the FTP server. |
| Path | Path to the remote folder to use once connected to the FTP server. Start with a `/` character for an absolute path. |
| Writable | Check to allow DSS to write datasets on this server. Those datasets will be written in a subfolder of the connection’s path. |
| Allow managed | Check to allow DSS to write managed datasets on this server. |
| Use global proxy | When checked, use the global proxy for this connection. Uncheck this if the FTP server is directly accessible. If you have an HTTP proxy, passive mode is mandatory. |
- From the “Datasets” screen of Data Science Studio, click the “New dataset” button and pick the “FTP / HTTP / SSH” menu entry.
- The remote dataset definition screen opens. The leftmost “Input/Output” pane lets you define the remote sources to mirror, with the following configuration items.
| Field | Type | Description |
|---|---|---|
| Store into | Drop-down menu | Choose the local filesystem or Hadoop HDFS location where the remote dataset files should be mirrored |
| Download from | Drop-down menu | Select the connection from which the remote files should be downloaded |
| Path in connection | Text box | Defines the set of files to download from this connection (see below) |
| Check source | Button | Click this button to test the data source defined above. Displays the number and total size of files which would be downloaded from it. |
| Add another source | Button | Click this button to add another “Download from” remote source definition to this dataset. Click the blue cross at the right end of the remote source definition to remove it. |
| Download | Button | Click this button to start the download of all the specified URLs. |
The “Path in connection” text input defines the set of files to download from the remote host. Standard wildcard expansion patterns are supported: a question mark character `?` matches any single character in a remote file or directory name, and a star character `*` matches any sequence of characters in a remote file or directory name. To match all files and subfolders in the connection’s startup directory, specify `*` as the path.
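The `?` and `*` semantics described above follow the usual shell-style globbing rules; a quick way to experiment with them outside DSS (an illustration, not DSS internals) is Python’s `fnmatch` module:

```python
# Shell-style wildcard matching, as in the "Path in connection" field
# (illustrative only; DSS performs its own pattern expansion).
from fnmatch import fnmatch

# "?" matches exactly one character in a file or directory name
assert fnmatch("access1.log", "access?.log")
assert not fnmatch("access12.log", "access?.log")

# "*" matches any sequence of characters (including none)
assert fnmatch("20140102.log", "*.log")
assert fnmatch("stats.2014", "stats.*")
```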
Remote file names matching the given pattern are directly downloaded at the top-level of the locally-mirrored dataset. Remote directory names matching the given pattern are downloaded as sub-directories of the locally-mirrored dataset, along with all their contents.
Only regular files and directories are considered when enumerating remote servers. In particular, symbolic links are ignored.
Note that, in order to avoid name collisions between downloaded files, renaming rules may be applied during the mirroring process:
- When multiple data sources are defined for a single remote dataset, individual file names are prefixed with the data source index and an underscore.
- When wildcard expansion patterns appear in remote directory specifications, downloaded file or directory names are prefixed with the enclosing directory names and an underscore.
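These two renaming rules can be sketched as follows (the function and its arguments are hypothetical, written only to illustrate the naming scheme shown in the table below):

```python
# Sketch of the collision-avoidance renaming rules (illustrative,
# not DSS's actual implementation).

def mirrored_name(remote_path, source_index=None, pattern_in_directory=False):
    """Compute the local name of a downloaded remote file."""
    parts = remote_path.lstrip("/").split("/")
    name = parts[-1]
    # Rule 2: when a wildcard appears in the directory part of the
    # pattern, prefix the name with the enclosing directory name.
    if pattern_in_directory and len(parts) > 1:
        name = "_".join(parts[-2:])
    # Rule 1: with multiple data sources, prefix with the source index.
    if source_index is not None:
        name = f"{source_index}_{name}"
    return name

print(mirrored_name("/stats/20140102.log"))                              # 20140102.log
print(mirrored_name("/stats.2013/0101.log", pattern_in_directory=True))  # stats.2013_0101.log
print(mirrored_name("/stats.2013/0101.log", 1, True))                    # 1_stats.2013_0101.log
```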
Remote file mirroring is performed according to file size and last modification time. If a remote file is created or updated on the remote host, it will be downloaded by the next dataset synchronization operation. If a remote file disappears from the remote host, its local mirror will be deleted by the next dataset synchronization operation.
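The synchronization rule above amounts to a simple diff on file size and modification time. A minimal sketch, assuming file metadata is available as `(size, mtime)` pairs (the function and data shapes are hypothetical):

```python
# Sketch of the size/mtime mirror-synchronization rule described above
# (illustrative, not DSS's actual implementation).

def plan_sync(remote, local):
    """remote/local: dicts mapping file name -> (size, mtime)."""
    # Download files that are new, or whose size or mtime changed.
    to_download = [name for name, meta in remote.items()
                   if local.get(name) != meta]
    # Delete local mirrors of files that disappeared from the remote host.
    to_delete = [name for name in local if name not in remote]
    return to_download, to_delete

remote = {"a.log": (100, 1000), "b.log": (200, 2000)}
local  = {"a.log": (100, 1000), "c.log": (50, 500)}
print(plan_sync(remote, local))  # (['b.log'], ['c.log'])
```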
Patterns can be absolute (with a leading `/` character) or relative, in which case they are interpreted according to the default remote directory for this connection (its “Path” parameter).
| Path in connection | Matched remote files | Downloaded files (single source) | Downloaded files (multiple sources) | Notes |
|---|---|---|---|---|
|  | /stats/20140102.log, /stats/20140103.log | 20140102.log, 20140103.log | 1_20140102.log, 1_20140103.log | Pattern matches files in a single directory |
|  | /stats.2013/0101.log, /stats.2013/0102.log, /stats.2014/0101.log | stats.2013_0101.log, stats.2013_0102.log, stats.2014_0101.log | 1_stats.2013_0101.log, 1_stats.2013_0102.log, 1_stats.2014_0101.log | Pattern matches files in multiple directories |
|  | /stats.2013/0101.log, /stats.2013/0102.log, /stats.2014/0101.log | stats.2013/0101.log, stats.2013/0102.log, stats.2014/0101.log | stats.2013/1_0101.log, stats.2013/1_0102.log, stats.2014/1_0101.log | Pattern matches directories |
Remote datasets can be partitioned like any other Data Science Studio file-based dataset, using file-based partitioning. See Working with partitions for more information.
When partitioning is activated for a remote dataset:
- Remote files are downloaded from origin servers one partition at a time, each one being mirrored in a corresponding partition of the local storage.
- A set of expansion variables is available to include in URL and remote path definitions, to derive remote file names from partition values.
- The source definition screen contains an additional input field “Partition to download for preview” - this value is used when manual download is triggered with the “Download” button, or individual data sources are checked through the “Check source” buttons.
The variable expansion patterns that can be used in urls and remote paths to define the set of files to download for a given dataset partition can be found at Partitioning variables substitutions.
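The expansion itself is a straightforward textual substitution. A minimal sketch, using the `$DKU_DST_YEAR` / `$DKU_DST_MONTH` / `$DKU_DST_DAY` variables from the example below (the `expand` helper is hypothetical; DSS performs this substitution itself):

```python
# Sketch of partition-variable expansion for a "DAY"-period time
# dimension (illustrative, not DSS internals).
from datetime import date

def expand(url_template, day):
    return (url_template
            .replace("$DKU_DST_YEAR", f"{day.year:04d}")
            .replace("$DKU_DST_MONTH", f"{day.month:02d}")
            .replace("$DKU_DST_DAY", f"{day.day:02d}"))

template = "ftp://MYWEBSERVER/var/log/apache2/$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log"
print(expand(template, date(2014, 2, 5)))
# ftp://MYWEBSERVER/var/log/apache2/2014/02/05/*.log
```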
The following defines a remote dataset from a directory of web server log files accessible through an FTP server (similar instructions apply for SSH):
- Partitioning scheme: use one time dimension named “date”, with period “DAY” and the pattern `%Y/%M/%D/.*`. This defines the naming pattern for downloaded files (i.e., in the local storage).
- Input/output definitions: download from the URL `ftp://MYWEBSERVER/var/log/apache2/$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log`. This defines the naming pattern for the remote source files.
Given the above definitions, whenever Data Science Studio needs to access the files for February 5, 2014, it will mirror all files matching `/var/log/apache2/2014/02/05/*.log` on the FTP server, and store them in the `2014/02/05` subdirectory of the dataset root directory on the local server.
Additionally, if the field “Partition to download for preview” of the remote dataset definition screen contains “YESTERDAY”, and the user clicks the “Download” button on Feb. 5, 2014, this mirror operation will be immediately triggered for the February 4, 2014 partition.