HDFS

DSS can connect to filesystems based on the “Hadoop Filesystem” API to

  • Read and write datasets
  • Read and write managed folders

Compatible filesystems

DSS can read from and write to any kind of Hadoop Filesystem and has been tested with the following URL schemes (see the sketch after this list):

  • hdfs://
  • maprfs://
  • s3a://
  • wasb://
  • adl://
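
As a rough illustration of why all of these schemes can back the same kind of dataset, the sketch below uses the Hadoop FileSystem API directly: the scheme of the URI selects the filesystem implementation at runtime. This is a minimal sketch, not DSS code; the bucket name and paths are hypothetical, and the matching connector (e.g. hadoop-aws for s3a://) must be on the classpath.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemeDemo {
        public static void main(String[] args) throws Exception {
            // Picks up the Hadoop configuration (e.g. core-site.xml) from the classpath
            Configuration conf = new Configuration();

            // The URI scheme selects the implementation: hdfs:// resolves to
            // DistributedFileSystem, s3a:// to S3AFileSystem, and so on.
            // (Bucket and path are hypothetical.)
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

            for (FileStatus status : fs.listStatus(new Path("s3a://my-bucket/datasets/"))) {
                System.out.println(status.getPath());
            }
        }
    }

The same code works unchanged for any of the schemes listed above, which is why they can all be addressed through a single dataset type.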

Note

DSS collectively refers to all “Hadoop Filesystem” URIs as “HDFS” datasets, even though it supports more than hdfs:// URIs

Using multiple Hadoop filesystems

All Hadoop clusters define a ‘default’ filesystem, which is traditionally an HDFS instance on the cluster, but access to HDFS filesystems on other clusters, or even to different filesystem types like cloud storage (S3, Azure Blob Storage, Google Cloud Storage), is also possible. The main benefit of exposing other filesystems through the Hadoop Filesystem API is that it enables the use of the Hadoop I/O layers and, as a corollary, of the important Hadoop file formats: Parquet and ORC.
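
To make the ‘default filesystem’ notion concrete, here is a hedged sketch using the Hadoop FileSystem API: FileSystem.get(conf) resolves whatever fs.defaultFS points to, while passing an explicit URI opens a different filesystem in the same program. The bucket, paths and port are hypothetical, and the s3a connector is assumed to be on the classpath.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MultiFsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The cluster's default filesystem, i.e. whatever fs.defaultFS
            // points to (traditionally hdfs://<namenode>:8020)
            FileSystem defaultFs = FileSystem.get(conf);

            // A second, non-default filesystem opened by explicit URI
            // (hypothetical bucket)
            FileSystem cloudFs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

            // Both filesystems are usable through the same API, e.g. to copy
            // a Parquet file from the default HDFS to cloud storage
            FileUtil.copy(defaultFs, new Path("/data/input.parquet"),
                          cloudFs, new Path("s3a://my-bucket/data/input.parquet"),
                          false /* deleteSource */, conf);
        }
    }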

For more information about connecting to multiple Hadoop filesystems, including connection details, see hadoop/multi-hdfs.