Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS)¶
DSS can connect to multiple “Hadoop Filesystems”. A Hadoop filesystem is defined by a URL. Implementations of Hadoop filesystems exist that provide connectivity to:
Azure Data Lake Storage
Azure Blob Storage
Google Cloud Storage
The “main” Hadoop filesystem is traditionally a HDFS running on the cluster, but through Hadoop filesystems, you can also access to HDFS filesystems on other clusters, or even to different filesystem types like cloud storage.
The prime benefit of framing other filesystems as Hadoop filesystems is that it enables the use of the Hadoop I/O layers, and as a corrolary, of important Hadoop file formats: Parquet and ORC.
To access data on a filesystem using Hadoop, 3 things are needed:
libraries to handle the filesystem have to be installed on the cluster and on the node hosting DSS. Hadoop distributions normally come with at least HDFS and S3A
a fully-qualified URI to the file, of the form
scheme://host[:port]/path-to-file. Depending on the scheme, the
host[port]part can have different meanings; for example, for cloud storage filesystems, it is the bucket.
Hadoop configuration parameters that get passed to the relevant tools (Spark, Hive, MapReduce, HDFS libraries) - This is generally used to pass credentials and tuning options
In DSS, all Hadoop filesystem connections are called “HDFS”. This wording is not very precise since there can be “Hadoop filesystem” connections that precisely do not use “HDFS” which in theory only refers to the distributed implementation using NameNode/DataNode.
To setup a new Hadoop filesystem connection, go to Administration → Connections → New connection → HDFS.
A HDFS connection in DSS consists of :
a root path, under which all the data accessible through that connection resides. The root path can be fully-qualified, starting with a
scheme://, or starting with
/and relative to what is defined in fs.defaultFS
Hadoop configuration parameters that get passed to the relevant tools (Spark, Hive, MapReduce, HDFS libraries)
We suggest to have at least two connections:
A read-only connection to all data:
root: /(This is a path on HDFS, the Hadoop file system.)
Allow write, allow managed datasets: unchecked
max nb of activites: 0
A read-write connection, to allow DSS to create and store managed datasets:
allow write, allow managed datasets: checked
When “Hive database name” is configured, DSS declares its HDFS datasets in the Hive metastore, in this database namespace. This allows you to refer to DSS datasets in external Hive programs, or in Hive notebooks within DSS.
All Hadoop clusters define a ‘default’ filesystem, which is traditionally a HDFS on the cluster.
When missing, the
scheme://host[:port] is taken from the
fs.defaultFS Hadoop property in
core-site.xml, so that a URI like ‘/user/john/data/file’ is generally interpreted as a path on the local HDFS filesystem of the cluster. However, if the
fs.defaultFS of your cluster points to S3, an unqualified URI will similarly point to S3.
A HDFS located on a different cluster can be accessed with a HDFS connection that specified the host (and port) of the namenode of that other filesystem, like
hdfs://namenode_host:8020/user/johndoe/ . DSS will access the files on all HDFS filesystems with the same user name (even if User Isolation Framework is being used for HDFS access).
When the local cluster is using Kerberos, it is possible to access a non-kerberized cluster, but a HDFS configuration property is needed :
There are several options to access S3 as a Hadoop filesystem (see the Apache doc).
The S3 dataset in DSS has native support for using Hadoop software layers whenever needed, including for fast read/write from Spark and Parquet support. Using a Hadoop dataset for accessing S3 is not usually required.
Using S3 as a Hadoop filesystem is not supported on MapR
“S3A” is the primary mean of connecting to S3 as a Hadoop filesystem.
Access using the S3A filesystem involves using a URI like
s3a://bucket_name/path/inside/bucket/ , and ensuring the credentials are available. The credentials consist of the access key and the secret key. They can be passed :
either globally, using the
or for the bucket only, using the
EMRFS is an alternative mean of connecting to S3 as a Hadoop filesystem, which is only available on EMR
Access using the EMRFS filesystem involves using a URI like
s3://bucket_name/path/inside/bucket/ , and ensuring the credentials are available. The configuration keys for the access and secret keys are named
fs.s3.awsSecretAccessKey. Note that this is only possible from an EMR-aware machine.
The Azure Blob dataset in DSS has native support for using Hadoop software layers whenever needed, including for fast read/write from Spark and Parquet support. Using a Hadoop dataset for accessing Azure Blob Storage is not usually required.
The URI to access blobs on Azure is
wasb://container_name@your_account.blob.core.windows.net/path/inside/container/ (see the Hadoop Azure support).
The credentials being already partly in the URI (the account name), the only property needed to allow access is
fs.azure.account.key.your_account.blob.core.windows.net to pass the access key.
The Google Cloud Storage dataset in DSS has native support for using Hadoop software layers whenever needed, including for fast read/write from Spark and Parquet support. Using a Hadoop dataset for accessing Google Cloud Storage is not usually required.
The URI to access blobs on Google Cloud Storage is
gs://bucket_name/path/inside/bucket/ (see the GCS connect)
Access to ADLS gen 1 is possible with Oauth tokens provided by Azure
Make sure that your service principal is owner of the ADLS account and has read/write/execute access to the ADLS gen 1 root container recursively
Retrieve your App Id, Token endpoint and Secret for the registered application in Azure portal
The URI to access ADLS is
adl://<datalake_storage_name>.azuredatalakestore.net/<optional_path> (see the Hadoop ADLS support).
Add the following key values as Extra Hadoop Conf of the connection:
The Azure Blob Storage dataset in DSS has native support ADLS gen2 and for using Hadoop software layers whenever needed, including for fast read/write from Spark and Parquet support. Using a Hadoop dataset for accessing ADLS gen2 is not usually required.
Cloud storage filesystems require credentials to give access to the data they hold. Most of them allow these credentials to be passed by environment variable or by Hadoop configuration key (the preferred way). The mechanisms to make the credentials available to DSS are:
adding them as configuration properties for the entire cluster (ie. in
adding them as environment variables in DSS’s
$DATADIR/bin/env-site.sh(when it’s possible to pass them as environement variables), and DSS and any of its subprocess can access them
adding them as extra configuration keys on DSS’s HDFS connections, and then only usages of the HDFS connection will receive the credentials
Proper usage of cloud storage filesystems implies that the credentials are passed to the processes needing them. In particular, the Hive metastore and Sentry will need to be given the credentials in their configurations.
Since DSS uses the standard Hadoop libraries, before attempting to access files on different filesystems, command-line access to these filesystems should be checked. The simplest test is to run:
> hadoop fs -ls uri_to_file
If Kerberos authentication is active, logging in with
kinit first is required. If credentials need to be passed as Hadoop configuration properties, they can be added using the
-D flag, like
> hadoop fs -D fs.s3a.access.key=ABABABABA -D fs.s3a.secret.key=ABCDABCDABCDCDA -ls uri_to_file
To check that Hive is functional and gets the credentials it needs, creating a dummy table will uncover potential problems :
> beeline -u 'jdbc:hive2://localhost:10000/default' beeline> CREATE EXTERNAL TABLE dummy (a string) STORED AS PARQUET LOCATION 'fully_qualified_uri_to_some_folder'
Hadoop clusters most often have Hive installed, and with Hive comes a Hive Metastore to hold the definitions and locations of the tables Hive can access. The location of a Hive table does not need to be on the local cluster, but can be any location provided it’s defined as a fully-qualified URI. But a given Hive installation, and in particular a given Hiveserver2, only knows one Hive Metastore. If Hive tables are defined in a different Hive Metastore, on a different cluster, Hive doesn’t access them.