HTTP (cached)

Data Science Studio can access data stored on HTTP or HTTPS servers.

This “remote” dataset can only be used as input in DSS. Data is not directly streamed from the source server. Instead, a local copy (mirror) is made of the remote data. The local mirror copy is updated (new files being downloaded and obsolete files being deleted, tracking changes on the origin servers) whenever it is used as input to a DSS recipe.
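The mirror update described above can be pictured as a comparison between a remote listing and the local copy. The following is only an illustrative sketch, not the DSS implementation: the function name `plan_mirror_sync` is hypothetical, and using size and modification time to detect changed files is an assumption, since this page does not say how DSS detects changes.

```python
def plan_mirror_sync(remote, local):
    """Return (files to download, files to delete) for a mirror refresh.

    remote / local: dict mapping file name -> (size, mtime).
    Comparing (size, mtime) is an assumption made for this sketch.
    """
    # Download files that are new or whose metadata differs from the local copy.
    to_download = sorted(
        name for name, meta in remote.items() if local.get(name) != meta
    )
    # Delete local files that no longer exist on the origin server.
    to_delete = sorted(name for name in local if name not in remote)
    return to_download, to_delete


remote = {"a.log": (100, 1), "b.log": (250, 2)}
local = {"a.log": (100, 1), "old.log": (10, 0)}
print(plan_mirror_sync(remote, local))  # (['b.log'], ['old.log'])
```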

A remote dataset is defined by specifying:

  • one or several sets of files hosted on the remote HTTP server
  • a local storage location (on the local server filesystem, or Hadoop HDFS) where the mirror copy is hosted.

It can then be browsed, analyzed and processed like any other locally-stored dataset.

Remote datasets can be partitioned.

Warning

When using these datasets, DSS makes a copy of the foreign data, stored in the local storage location.

Defining the HTTP dataset

  • From the “Datasets” screen of Data Science Studio, click the “New dataset” button and pick the “FTP / HTTP / SSH” menu entry.
  • The remote dataset definition screen opens. The leftmost “Input/Output” pane lets you define the remote sources to mirror, with the following configuration items.
Name | Type | Description
Store into | Drop down menu | Choose the local filesystem or Hadoop HDFS location where the remote dataset files should be mirrored.
Download from | Drop down menu | Pick “HTTP or FTP url” to directly enter a URL to download.
HTTP/HTTPS/FTP URL | Text box | Defines the remote URL or URL pattern to download (see below).
Use global proxy | Check box | When in URL mode and a global proxy is set, uncheck this if the specified URL does not require a proxy.
Check source | Button | Click this button to test the data source defined above. Displays the number and total size of files which would be downloaded from it.
Add another source | Button | Click this button to add another “Download from” remote source definition to this dataset. Click the blue cross at the right end of the remote source definition to remove it.
Download | Button | Click this button to start the download of all the specified URLs.

Remote URL definition

A remote source can be defined by an HTTP or HTTPS URL. An HTTP/HTTPS URL may only reference a single remote file, and wildcard expansion patterns are not recognized in it.

Remote URL definitions can contain optional inline authentication credentials and non-default network ports.

URL | Downloaded files (single source)
http://HOST/stats/20140102.log | 20140102.log
http://USER:PASSWORD@HOST:8080/stats/20140102.log | 20140102.log
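To see how such a URL breaks down into credentials, port and file name, the sketch below parses the two example URLs with Python's standard `urllib.parse`. It only illustrates the URL structure; the helper name `describe_source` is hypothetical and this is not how DSS itself handles the URL.

```python
import posixpath
from urllib.parse import urlsplit


def describe_source(url):
    """Split a remote source URL into its components (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,      # note: urllib lowercases the host name
        "port": parts.port,          # None when the default port is used
        "user": parts.username,      # None when no inline credentials are given
        "password": parts.password,
        "file": posixpath.basename(parts.path),
    }


plain = describe_source("http://HOST/stats/20140102.log")
with_auth = describe_source("http://USER:PASSWORD@HOST:8080/stats/20140102.log")
print(plain["file"], plain["port"], plain["user"])
print(with_auth["file"], with_auth["port"], with_auth["user"])
```

Both URLs resolve to the same downloaded file name, `20140102.log`; only the connection details differ.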

Defining partitioned remote datasets

Remote datasets can be partitioned like any other Data Science Studio file-based dataset; they use file-based partitioning. See Working with partitions for more information.

When partitioning is activated for a remote dataset:

  • Remote files are downloaded from origin servers one partition at a time, each one being mirrored in a corresponding partition of the local storage.
  • A set of expansion variables is available for use in URL and remote path definitions, so that remote file names can be derived from partition values.
  • The source definition screen contains an additional input field “Partition to download for preview” - this value is used when manual download is triggered with the “Download” button, or individual data sources are checked through the “Check source” buttons.

The variable expansion patterns that can be used in URLs and remote paths to define the set of files to download for a given dataset partition are listed in Partitioning variables substitutions.

Example

The following defines a remote dataset from a directory of web server log files accessible through an FTP server (similar instructions apply for SSH):

  • Partitioning scheme: use one time dimension named “date”, period “DAY”, pattern %Y/%M/%D/.*. This defines the naming pattern for downloaded files (i.e., in the local storage).
  • Input/output definitions: download from URL ftp://MYWEBSERVER/var/log/apache2/$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log. This defines the naming pattern for remote source files.

Given the above definitions, whenever Data Science Studio needs to access the files for February 5, 2014, it will mirror all files named *.log in subdirectory /var/log/apache2/2014/02/05 of the FTP server, and store them in subdirectory 2014/02/05 of the dataset root directory of the local server.
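The substitution in this example can be reproduced with a short sketch. It is an approximation written with Python's `string.Template`, not the DSS expansion engine; the function name `expand_for_partition` is hypothetical, and only the three variables used in the example are handled.

```python
from datetime import date
from string import Template


def expand_for_partition(pattern, day):
    """Replace the $DKU_DST_* variables in a URL pattern for one DAY partition."""
    values = {
        "DKU_DST_YEAR": f"{day.year:04d}",
        "DKU_DST_MONTH": f"{day.month:02d}",
        "DKU_DST_DAY": f"{day.day:02d}",
    }
    # safe_substitute leaves any unknown $VARIABLE untouched instead of failing.
    return Template(pattern).safe_substitute(values)


pattern = (
    "ftp://MYWEBSERVER/var/log/apache2/"
    "$DKU_DST_YEAR/$DKU_DST_MONTH/$DKU_DST_DAY/*.log"
)
remote_url = expand_for_partition(pattern, date(2014, 2, 5))
# Local subdirectory for the same partition. Python's strftime codes are used
# here; they differ from the %Y/%M/%D notation of the DSS partitioning pattern.
local_subdir = date(2014, 2, 5).strftime("%Y/%m/%d")
print(remote_url)    # ftp://MYWEBSERVER/var/log/apache2/2014/02/05/*.log
print(local_subdir)  # 2014/02/05
```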

Additionally, if the field “Partition to download for preview” of the remote dataset definition screen contains “YESTERDAY”, and the user clicks the “Download” button on Feb. 5, 2014, this mirror operation will be immediately triggered for the 2014/02/04 partition.
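The date arithmetic behind the “YESTERDAY” keyword is sketched below. This is a hypothetical illustration: the function name `resolve_preview_partition` is invented for the example, only the “YESTERDAY” keyword mentioned above is handled, and accepting an explicit `YYYY-MM-DD` value otherwise is an assumption of the sketch.

```python
from datetime import date, timedelta


def resolve_preview_partition(value, today):
    """Turn the preview field value into a concrete day (sketch only)."""
    if value == "YESTERDAY":
        return today - timedelta(days=1)
    # Assumption for this sketch: any other value is an ISO date, YYYY-MM-DD.
    return date.fromisoformat(value)


day = resolve_preview_partition("YESTERDAY", date(2014, 2, 5))
print(day.strftime("%Y/%m/%d"))  # 2014/02/04
```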