HDFS datasets data structure¶
Note
This only applies for HDFS datasets for which ACL synchronization is used.
When user isolation for Hadoop is disabled, datasets location is specified by a path in a connection.
When user isolation for Hadoop is enabled, DSS uses a different files pattern for managed datasets: if the dataset’s configured location is /user/dataiku/datasets/MYPROJECT/mydataset
, then the actual data is written in /user/dataiku/datasets/MYPROJECT/mydataset/data
.
The “data” folder belongs to the last user who wrote the dataset (this might be “hive” or “impala”). The “mydataset” folder always belongs to the dssuser
user.
ACLs preventing access are on the mydataset
folder. Within that folder, it is normal for data files to have world-readable permissions. The restrictive “gateway” ACLs on mydataset
prevent unauthorized users from accessing them.
This behavior is configured in the settings of the HDFS connection, in the “Write ACL synchronization mode” setting.