Accessing remote datasets¶
Data Science Studio can access datasets stored on remote servers.
A remote dataset is defined by specifying:
- one or several sets of files hosted on remote HTTP, HTTPS, FTP or SSH servers, and
- a local storage location (on the local server filesystem, or Hadoop HDFS) into which they are downloaded.
It can then be browsed, analyzed and processed like any other locally-stored dataset.
Remote datasets can only be used as input to Data Science Studio. Their local mirror copies are updated (new files being downloaded and obsolete files being deleted, tracking changes on the origin servers) whenever they are to be used as input to a Studio recipe.
Remote datasets can be partitioned.
When using these datasets, DSS makes a copy of the foreign data, stored in the local storage location.
Defining remote dataset connections¶
Accessing remote files stored on FTP or SSH servers first requires the definition of a connection to the remote server, as follows:
- Go to Administration > Connections
- Click the “New connection” button and pick your connection type (FTP or SSH)
- Enter a name for the new connection, and the required connection parameters
- Save the new connection
FTP connection parameters¶
|Host||Host name or IP address of the FTP server to access (Mandatory)|
|User||FTP username to use, or empty for an anonymous FTP connection|
|Password||FTP password to use, or empty for an anonymous FTP connection|
|Use passive mode||Check to use FTP “passive” data transfer mode (default). Using FTP passive mode is often mandatory when there is a firewall between the Data Science Studio server and the FTP server.|
|Path||Path to the remote folder to use once connected to the FTP server.
Start with a
|Writable||Check to allow DSS to write datasets on this server.
Those datasets will be written in a subfolder or the
|Allow managed||Check to allow DSS to write managed datasets on this server.|
|Use global proxy||When checked, use the global proxy for this connection. Uncheck this if the FTP server is directly accessible. If you have an HTTP proxy, passive mode is mandatory.|
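To make the role of these parameters concrete, here is a minimal sketch using Python's standard `ftplib` module. It is not how Data Science Studio connects internally; the `FtpConnectionParams` class and the `open_ftp` function are hypothetical names introduced for illustration, and the proxy option is not covered because `ftplib` has no built-in HTTP proxy support.

```python
from dataclasses import dataclass
from ftplib import FTP


@dataclass
class FtpConnectionParams:
    """Hypothetical container mirroring some of the form fields above."""
    host: str                 # mandatory
    user: str = ""            # empty means anonymous connection
    password: str = ""        # empty means anonymous connection
    passive: bool = True      # "Use passive mode" (checked by default)
    path: str = ""            # remote folder to use once connected


def open_ftp(params: FtpConnectionParams, timeout: float = 30.0) -> FTP:
    """Open an FTP session configured from the given parameters."""
    ftp = FTP(timeout=timeout)      # no host passed here, so nothing connects yet
    ftp.connect(params.host)
    if params.user:
        ftp.login(params.user, params.password)
    else:
        ftp.login()                 # ftplib logs in as "anonymous" by default
    ftp.set_pasv(params.passive)    # passive mode is usually needed behind a firewall
    if params.path:
        ftp.cwd(params.path)
    return ftp
```

Calling `open_ftp(FtpConnectionParams(host="ftp.example.com"))` would attempt an anonymous, passive-mode connection to a placeholder host.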
SSH connection parameters¶
|Host||Host name or IP address of the SSH server to access, mandatory.|
|User||SSH username to use, mandatory.|
|Use public key authentication||
|Password||SSH password to use. Mandatory if using password authentication.|
|Key passphrase||In public-key authentication mode, optional passphrase to use to decrypt the SSH private key.|
When using public-key authentication mode, connection to the remote server will be
attempted using either of the two standard SSH keys for the Studio Linux user, stored
respectively in files
$HOME/.ssh/id_rsa, where $HOME
is the home directory of the Studio user account.
Defining remote datasets¶
- From the “Datasets” screen of Data Science Studio, click the “New dataset” button and pick the “FTP / HTTP / SSH” menu entry.
- The remote dataset definition screen opens. The leftmost “Input/Output” pane lets you define the remote sources to mirror, with the following configuration items.
|Store into||Drop down menu||Choose the local filesystem or Hadoop HDFS location where the remote dataset files should be mirrored|
|Download from||Drop down menu||
|HTTP/HTTPS/FTP URL||Text box||When in URL mode, defines the remote URL or URL pattern to download (see below)|
|Use global proxy||Check box||When in URL mode and a global proxy is set, uncheck this if the specified URL does not require a proxy.|
|Path in connection||Text box||When in “remote connection” mode, defines the set of files to download from this connection (see below)|
|Check source||Button||Click this button to test the data source defined above. Displays the number and total size of files which would be downloaded from it.|
|Add another source||Button||Click this button to add another “Download from” remote source definition to this dataset. Click the blue cross at the right end of the remote source definition to remove it.|
|Download||Button||Click this button to start the download of all the specified URLs.|
Remote filesets definition and mirroring¶
The “Path in connection” text input defines the set of files to download from the remote host. Standard wildcard expansion patterns are supported: a question mark character ? matches any single character in a remote file or directory name, and a star character * matches any sequence of characters in a remote file or directory name. To match all files and subfolders in the connection’s startup directory, specify * as path.
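The wildcard semantics described above can be approximated with Python's standard `fnmatch` module. This is only a sketch: the `matches` helper and the example paths are hypothetical, the match is applied one path component at a time so that a wildcard never crosses a `/`, and `fnmatch` additionally understands `[seq]` character classes, which this page does not mention.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, remote_path: str) -> bool:
    """Match a wildcard pattern against a path, one path component at a time."""
    pattern_parts = pattern.strip("/").split("/")
    path_parts = remote_path.strip("/").split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    # ? matches exactly one character, * matches any sequence of characters
    return all(fnmatchcase(name, pat) for pat, name in zip(pattern_parts, path_parts))


print(matches("/logs/2014010?.log", "/logs/20140102.log"))   # True
print(matches("/logs/?.log", "/logs/ab.log"))                # False
print(matches("/logs/*.log", "/logs/archive/old.log"))       # False
```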
Remote file names matching the given pattern are directly downloaded at the top level of the locally-mirrored dataset. Remote directory names matching the given pattern are downloaded as sub-directories of the locally-mirrored dataset, along with all their contents.
Only regular files and directories are considered when enumerating remote servers. In particular, symbolic links are ignored.
Note that in order to avoid name collisions between downloaded files, renaming rules may be applied during the mirroring process:
- When multiple data sources are defined for a single remote dataset,
individual file names are prefixed with the data source index and an underscore
- When wildcard expansion patterns appear in remote directory specifications,
downloaded file or directory names are prefixed with enclosing directory names
and an underscore
Remote file mirroring is performed according to file size and last modification time. If a remote file is created or updated on the remote host, it will be downloaded by the next dataset synchronization operation. If a remote file disappears from the remote host, its local mirror will be deleted by the next dataset synchronization operation.
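The synchronization rule above can be sketched as a comparison of two file listings. The `plan_sync` function is a hypothetical illustration, not the product's implementation, and it assumes that the local mirror records the size and modification time each file had on the remote host when it was downloaded.

```python
from typing import Dict, List, Tuple

FileInfo = Tuple[int, float]  # (size in bytes, last modification time)


def plan_sync(remote: Dict[str, FileInfo],
              local: Dict[str, FileInfo]) -> Tuple[List[str], List[str]]:
    """Return (files to download, local files to delete)."""
    # New files, or files whose size or modification time changed, are downloaded.
    to_download = sorted(name for name, info in remote.items()
                         if local.get(name) != info)
    # Files that disappeared from the remote host are removed from the mirror.
    to_delete = sorted(name for name in local if name not in remote)
    return to_download, to_delete


remote = {"a.log": (100, 1.0), "b.log": (250, 2.0)}
local = {"a.log": (100, 1.0), "b.log": (200, 1.5), "c.log": (10, 0.5)}
print(plan_sync(remote, local))  # (['b.log'], ['c.log'])
```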
Patterns can be absolute (with a leading
/ character) or relative, in which
case they are interpreted according to the default remote directory for this
FTP or SSH connection.
|Path in connection||Matched remote files||Downloaded files (single source)||Downloaded files (multi sources)||Notes|
||/stats/20140102.log /stats/20140103.log||20140102.log 20140103.log||1_20140102.log 1_20140103.log||Pattern matches files in single directory|
||/stats.2013/0101.log /stats.2013/0102.log /stats.2014/0101.log||stats.2013_0101.log stats.2013_0102.log stats.2014_0101.log||1_stats.2013_0101.log 1_stats.2013_0102.log 1_stats.2014_0101.log||Pattern matches files in multiple directories|
||/stats.2013/0101.log /stats.2013/0102.log /stats.2014/0101.log||stats.2013/0101.log stats.2013/0102.log stats.2014/0101.log||stats.2013/1_0101.log stats.2013/1_0102.log stats.2014/1_0101.log||Pattern matches directories|
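The renaming rules behind the first two rows of this table can be sketched as follows. The `mirrored_name` function and its parameters are hypothetical; the sketch covers only the case where the pattern matches files (not the last row, where matched directories are kept as sub-directories).

```python
from typing import List, Optional


def mirrored_name(filename: str,
                  wildcard_dirs: List[str],
                  source_index: Optional[int] = None) -> str:
    """Build the local file name for a remote file matched by a pattern.

    wildcard_dirs: enclosing directory names that were matched by a wildcard.
    source_index:  index of the data source, or None for a single-source dataset.
    """
    parts = wildcard_dirs + [filename]
    if source_index is not None:
        parts = [str(source_index)] + parts
    return "_".join(parts)


print(mirrored_name("20140102.log", []))                # 20140102.log
print(mirrored_name("20140102.log", [], 1))             # 1_20140102.log
print(mirrored_name("0101.log", ["stats.2013"]))        # stats.2013_0101.log
print(mirrored_name("0101.log", ["stats.2013"], 1))     # 1_stats.2013_0101.log
```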
Remote URL definitions¶
A remote source can be defined by an HTTP or HTTPS URL. HTTP/HTTPS URLs may only reference a single remote file, and wildcard expansion patterns are not recognized in them.
A remote source can be defined by an FTP URL. FTP URLs may contain wildcard expansion patterns, and reference sets of remote files using the same rules as above.
Remote URL definitions can contain optional inline authentication credentials and non-default network ports.
|URL||Downloaded files (single source)|
||stats.2013_0101.log stats.2013_0102.log stats.2014_0101.log|
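As an illustration of inline credentials and non-default ports, the sketch below splits a URL into its components with Python's standard `urllib.parse`. The `describe_source` function, the host name, the user name and the password are all made up for the example.

```python
from urllib.parse import urlsplit


def describe_source(url: str) -> dict:
    """Split a remote source URL into the parts discussed above."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https", "ftp"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,            # None means the scheme's default port
        "user": parts.username,        # None when no inline credentials are given
        "password": parts.password,
        "path": parts.path,
        # Only FTP URLs may carry wildcards. urlsplit treats a literal "?" as the
        # start of a query string, so this sketch only checks for "*".
        "wildcards": parts.scheme == "ftp" and "*" in parts.path,
    }


info = describe_source("ftp://alice:s3cret@ftp.example.com:2121/stats/*.log")
print(info["user"], info["host"], info["port"], info["path"])
# alice ftp.example.com 2121 /stats/*.log
```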
Defining partitioned remote datasets¶
Remote datasets can be partitioned like any other Data Science Studio file-based dataset, using file-based partitioning. See Working with partitions for more information.
When partitioning is activated for a remote dataset:
- Remote files are downloaded from origin servers one partition at a time, each one being mirrored in a corresponding partition of the local storage.
- A set of expansion variables is available to include in URL and remote path definitions, to choose remote file names from partition values.
- The source definition screen contains an additional input field “Partition to download for preview” - this value is used when manual download is triggered with the “Download” button, or individual data sources are checked through the “Check source” buttons.
The variable expansion patterns that can be used in URLs and remote paths to define the set of files to download for a given dataset partition can be found at Partitioning variables substitutions.
The following defines a remote dataset from a directory of web server log files accessible through an FTP server:
- Partitioning scheme: use one time dimension named “date”, period “DAY”, and pattern
%Y/%M/%D/.*. This defines the naming pattern for downloaded files (i.e., in the local storage).
- Input/output definitions: download from URL
ftp://MYWEBSERVER/var/log/apache2/$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log. This defines the naming pattern for remote source files.
Given the above definitions, whenever Data Science Studio needs to access the
files for February 5, 2014, it will mirror all files named
*.log in directory /var/log/apache2/2014/02/05 of the FTP server, and store them
in directory 2014/02/05 of the dataset root directory of the local server.
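The substitution performed for a given partition can be sketched with Python's standard `string.Template`, whose `$NAME` syntax happens to match the variables used in the URL above. The `expand_partition` helper is hypothetical, handles only the three variables that appear in this example, and assumes zero-padded month and day values as in the paths shown above.

```python
from datetime import date
from string import Template


def expand_partition(pattern: str, day: date) -> str:
    """Substitute the day-partition variables used in the example URL."""
    return Template(pattern).substitute(
        DKU_DST_YEAR=f"{day.year:04d}",
        DKU_DST_MONTH=f"{day.month:02d}",
        DKU_DST_DAY=f"{day.day:02d}",
    )


url = "ftp://MYWEBSERVER/var/log/apache2/$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log"
print(expand_partition(url, date(2014, 2, 5)))
# ftp://MYWEBSERVER/var/log/apache2/2014/02/05/*.log
```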
Additionally, if the field “Partition to download for preview” of the remote
dataset definition screen contains “YESTERDAY”, and the user clicks the “Download”
button on Feb. 5, 2014, this mirror operation will be immediately triggered for