Using multiple Hadoop filesystems

All Hadoop clusters define a ‘default’ filesystem, which is traditionally a HDFS on the cluster, but access to HDFS filesystems on other clusters, or even to different filesystem types like cloud storage (S3, Azure Blob Storage, Google Cloud Storage), is also possible. The prime benefit of framing other filesystems as Hadoop filesystems is that it enables the use of the Hadoop I/O layers and, as a corollary, of the important Hadoop file formats: Parquet and ORC.

Hadoop filesystems

To access data on a filesystem using Hadoop, two things are needed:

  • libraries to handle the filesystem have to be installed on the cluster and on the node hosting DSS. Hadoop distributions normally come with at least HDFS and S3A support
  • a fully-qualified URI to the file, of the form scheme://host[:port]/path-to-file. Depending on the scheme, the host[:port] part can have different meanings; for example, for cloud storage filesystems, it is the bucket.

When missing, the scheme://host[:port] part is taken from the fs.defaultFS Hadoop property in core-site.xml, so that a URI like ‘/user/john/data/file’ is interpreted as a path on the local HDFS filesystem of the cluster.
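
For example, assuming a cluster whose fs.defaultFS is hdfs://namenode.example.com:8020 (a hypothetical host and port), the following two commands designate the same file:

> hadoop fs -ls hdfs://namenode.example.com:8020/user/john/data/file
> hadoop fs -ls /user/john/data/file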

Checking access to a Hadoop filesystem

Since DSS uses the standard Hadoop libraries, command-line access to these filesystems should be checked before attempting to access files on them through DSS. The simplest test is to run:

> hadoop fs -ls uri_to_file

If Kerberos authentication is active, logging in with kinit first is required. If credentials need to be passed as Hadoop configuration properties, they can be added using the -D flag, for example:

> hadoop fs -D fs.s3a.access.key=ABABABABA -D fs.s3a.secret.key=ABCDABCDABCDCDA -ls uri_to_file
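
On a Kerberized cluster, the same check is simply preceded by obtaining a ticket; the principal below is a hypothetical example:

> kinit johndoe@EXAMPLE.COM
> hadoop fs -ls uri_to_file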

To check that Hive is functional and gets the credentials it needs, creating a dummy table will uncover potential problems:

> beeline -u 'jdbc:hive2://localhost:10000/default'
beeline> CREATE EXTERNAL TABLE dummy (a string) STORED AS PARQUET LOCATION 'fully_qualified_uri_to_some_folder';
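
The dummy table can then be removed; since it is an external table, dropping it only removes the metastore entry, not the underlying files:

beeline> DROP TABLE dummy;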

Relation to the Hive metastore

Hadoop clusters most often have Hive installed, and with Hive comes a Hive Metastore to hold the definitions and locations of the tables Hive can access. The location of a Hive table does not need to be on the local cluster: it can be any location, provided it is defined as a fully-qualified URI. But a given Hive installation, and in particular a given Hiveserver2, only knows one Hive Metastore; if Hive tables are defined in a different Hive Metastore on a different cluster, this Hive cannot access them.
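
For example, a table registered in the local Hive Metastore can point at data stored elsewhere, as long as its location is a fully-qualified URI (the bucket and column below are hypothetical):

beeline> CREATE EXTERNAL TABLE remote_logs (line string) STORED AS PARQUET LOCATION 's3a://some_bucket/logs/';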

DSS setup for Hadoop filesystems

A HDFS connection in DSS consists of:

  • a root path, under which all the data accessible through that connection resides. The root path can be fully-qualified, starting with a scheme://, or starting with / and relative to what is defined in fs.defaultFS
  • Hadoop configuration parameters that get passed to the relevant tools (Spark, Hive, MapReduce, HDFS libraries)

As long as the required libraries are installed, DSS can access different Hadoop filesystems by adding HDFS connections with the appropriate fully-qualified root URIs.
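
For example (the paths are hypothetical), both of the following are valid root paths for a HDFS connection; the first is fully-qualified, the second is resolved against fs.defaultFS:

hdfs://namenode_host:8020/user/dataiku/managed_datasets
/user/dataiku/managed_datasets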

HDFS filesystem on other clusters

A HDFS located on a different cluster can be accessed with a HDFS connection that specifies the host (and port) of the namenode of that other filesystem, like hdfs://namenode_host:8020/user/johndoe/. DSS will access the files on all HDFS filesystems with the same user name (even in multi-user security mode).

Warning

When the local cluster is using Kerberos, it is possible to access a non-kerberized cluster, but a HDFS configuration property is needed: ipc.client.fallback-to-simple-auth-allowed=true
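
To check that the fallback works before configuring a connection, the property can be passed on the command line (the namenode host is a placeholder):

> hadoop fs -D ipc.client.fallback-to-simple-auth-allowed=true -ls hdfs://other_namenode_host:8020/user/johndoe/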

Cloud storage filesystems

Cloud storage filesystems require credentials to give access to the data they hold. Most of them allow these credentials to be passed by environment variable or by Hadoop configuration key (the preferred way). The mechanisms to make the credentials available to DSS are:

  • adding them as configuration properties for the entire cluster (i.e. in core-site.xml)
  • adding them as environment variables in DSS’s $DATADIR/bin/env-site.sh (when passing them as environment variables is possible), so that DSS and any of its subprocesses can access them (see the sketch after this list)
  • adding them as extra configuration keys on DSS’s HDFS connections, in which case only usages of those HDFS connections receive the credentials
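
For example, for S3A the standard AWS environment variables can be exported from env-site.sh; this is a sketch assuming the S3A credential provider chain in use reads environment variables (key values are placeholders):

# in $DATADIR/bin/env-site.sh
export AWS_ACCESS_KEY_ID=ABABABABA
export AWS_SECRET_ACCESS_KEY=ABCDABCDABCDCDA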

Warning

Proper usage of cloud storage filesystems implies that the credentials are passed to the processes needing them. In particular, the Hive metastore and Sentry will need to be given the credentials in their configurations.

S3 as a Hadoop filesystem

There are several options to access S3 as a Hadoop filesystem (see the Apache doc).

Warning

Using S3 as a Hadoop filesystem is not supported on MapR

Access using the S3A filesystem involves using a URI like s3a://bucket_name/path/inside/bucket/, and ensuring the credentials are available. The credentials consist of the access key and the secret key. They can be passed in one of two ways (an example follows this list):

  • either globally, using the fs.s3a.access.key and fs.s3a.secret.key Hadoop properties
  • or for a single bucket only, using the fs.s3a.bucket.bucket_name.access.key and fs.s3a.bucket.bucket_name.secret.key Hadoop properties
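
As a quick command-line sketch, the per-bucket keys can be passed for a hypothetical bucket named my_bucket (key values are placeholders):

> hadoop fs -D fs.s3a.bucket.my_bucket.access.key=ABABABABA -D fs.s3a.bucket.my_bucket.secret.key=ABCDABCDABCDCDA -ls s3a://my_bucket/path/inside/bucket/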

Access using the EMRFS filesystem involves using a URI like s3://bucket_name/path/inside/bucket/, and ensuring the credentials are available. The configuration keys for the access and secret keys are named fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Note that this is only possible from an EMR machine.
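
As a sketch, assuming EMRFS honors these keys when passed as command-line properties (bucket and key values are placeholders):

> hadoop fs -D fs.s3.awsAccessKeyId=ABABABABA -D fs.s3.awsSecretAccessKey=ABCDABCDABCDCDA -ls s3://bucket_name/path/inside/bucket/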

Azure Blob Storage as a Hadoop filesystem

The URI to access blobs on Azure is wasb://container_name@your_account.blob.core.windows.net/path/inside/container/ (see the Hadoop Azure support). Since part of the credentials (the account name) is already in the URI, the only property needed to allow access is fs.azure.account.key.your_account.blob.core.windows.net, which passes the access key.
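
A command-line check could look like the following (account, container and key are placeholders):

> hadoop fs -D fs.azure.account.key.your_account.blob.core.windows.net=ACCOUNT_KEY -ls wasb://container_name@your_account.blob.core.windows.net/path/inside/container/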

Google Cloud Storage as a Hadoop filesystem

The URI to access blobs on Google Cloud Storage is gs://bucket_name/path/inside/bucket/ (see the Google Cloud Storage connector).
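
Assuming the GCS connector is installed and configured with a service account (typically through its google.cloud.auth.service.account.json.keyfile property; the property name is given as an indication only), a quick check is:

> hadoop fs -ls gs://bucket_name/path/inside/bucket/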