Distribution-specific notes

Each supported Hadoop distribution makes different choices in terms of packaging, versions of the components of the Hadoop stack, and supported ecosystem tools.

Each distribution also bundles its own libraries and backports specific fixes, which can modify the behavior of the Hadoop ecosystem components.

Therefore, there are some specificities in how DSS supports each Hadoop distribution, covered in the pages below (a quick way to check which distribution a node is running is sketched after the list):

  • Cloudera CDH
    • Security
      • DSS regular security and Sentry
    • Scala notebook
    • S3 datasets and Spark 2
  • Hortonworks HDP
    • Security
      • DSS regular security and Ranger
      • DSS multi-user-security and Ranger
  • MapR
    • MEP support
    • Security
    • Others
  • Amazon Elastic MapReduce
    • Supported versions
    • Security
    • Connecting DSS to EMR
      • DSS running on one of the cluster nodes
      • DSS outside of the cluster
    • EMRFS support
    • HDFS vs S3
    • Deployment scenarios
  • Microsoft Azure HDInsight
    • Tested versions
    • Security
    • Connecting Dataiku DSS to Azure HDInsight
      • DSS running on an edge node managed by HDInsight
        • One-click deployment
        • Using an Azure Resource Manager (ARM) template
        • Manual DSS installation
      • DSS running outside of HDInsight
    • Using Dataiku DSS on Azure HDInsight
      • Accessing Dataiku DSS on managed HDInsight edge nodes
      • Operating Dataiku DSS when using managed edge nodes
      • Interacting with cluster storage
  • Google Cloud Dataproc
    • Security
    • Known limitations
    • Connecting DSS to Cloud Dataproc
      • DSS running on one of the cluster nodes
      • DSS outside of the cluster
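
Since the relevant set of notes depends on which Hadoop distribution and version your cluster runs, the sketch below shows one possible way to check this from a node where the hadoop client is installed, by parsing the output of hadoop version. This is not a DSS API; the vendor-suffix patterns (-cdh, -mapr, -amzn, HDP-style build numbers) are heuristic assumptions and may need adjusting for your environment.

    # Minimal sketch: guess the Hadoop distribution from `hadoop version` output.
    # The suffix patterns below are heuristics (assumptions), not a DSS feature.
    import re
    import subprocess

    def detect_hadoop_distribution():
        out = subprocess.check_output(["hadoop", "version"]).decode("utf-8")
        first_line = out.splitlines()[0]      # e.g. "Hadoop 2.6.0-cdh5.15.1"
        version = first_line.split()[-1]
        if "-cdh" in version:
            return "Cloudera CDH", version
        if "-mapr" in version:
            return "MapR", version
        if "-amzn" in version:
            return "Amazon EMR", version
        if re.search(r"\d+\.\d+\.\d+\.\d+-\d+$", version):
            return "Hortonworks HDP (or HDP-based HDInsight)", version
        return "Unknown or vanilla Apache Hadoop", version

    if __name__ == "__main__":
        distro, version = detect_hadoop_distribution()
        print("Detected %s (Hadoop %s)" % (distro, version))

Note that Azure HDInsight clusters are typically HDP-based, so they usually fall into the HDP bucket of this heuristic.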