Dataiku Documentation
Dataiku DSS
You are viewing the documentation for version 11 of DSS.

Distribution-specific notes

Each supported Hadoop distribution makes its own choices about packaging, the versions of the components in the Hadoop stack, and the ecosystems it supports.

Each distribution also bundles its own libraries and backports specific bug fixes, which can change the behavior of Hadoop ecosystem components.

As a result, DSS support for each Hadoop distribution has its own specifics, covered below:

  • Cloudera CDP
    • Spark support
    • Security
  • Cloudera CDH
    • Security
      • DSS regular security and Sentry
    • Scala notebook
    • S3 datasets and Spark 2
    • Impala
  • Hortonworks HDP
    • HDP 3.1 support
    • Limitations
    • Security
      • DSS regular security and Ranger
      • DSS User Isolation Framework and Ranger
    • Migrating to HDP 3.X
  • Amazon Elastic MapReduce
    • Supported versions
    • Security
    • Deployment scenarios
      • Let DSS dynamically manage one or several EMR clusters
      • Connect DSS to an existing EMR cluster
        • DSS running on one of the cluster nodes
        • DSS outside of the cluster
      • Connect DSS to multiple existing EMR clusters
    • Using EMRFS
      • EMRFS credentials
  • Google Cloud Dataproc
    • Security
    • Known limitations
    • Connecting DSS to Cloud Dataproc
      • DSS running on one of the cluster nodes
      • DSS outside of the cluster

© Copyright 2022, Dataiku
