API for custom datasets

class dataiku.connector.Connector(config, plugin_config=None)

The base interface for a Custom Python connector

generate_rows(dataset_schema=None, dataset_partitioning=None, partition_id=None, records_limit=-1)

The main reading method.

Returns a generator over the rows of the dataset (or partition). Each yielded row must be a dictionary, indexed by column name.

The dataset schema and partitioning are given for information purposes only.
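As a minimal sketch of the reading method described above, the following hypothetical connector yields one dictionary per row. The `Connector` class defined here is only a stand-in for `dataiku.connector.Connector` so that the example runs outside DSS; the `count` configuration key and the treatment of a negative `records_limit` as "no limit" are assumptions made for illustration.

```python
class Connector(object):
    """Stand-in for dataiku.connector.Connector so the sketch runs outside DSS."""

    def __init__(self, config, plugin_config=None):
        self.config = config
        self.plugin_config = plugin_config


class CountingConnector(Connector):
    """Hypothetical connector that yields the integers 0..count-1 and their squares."""

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                      partition_id=None, records_limit=-1):
        # "count" is a made-up configuration parameter for this example.
        count = int(self.config.get("count", 5))
        for i in range(count):
            # Assumption: a negative records_limit means "no limit".
            if records_limit >= 0 and i >= records_limit:
                return
            # Each row is a dictionary indexed by column name.
            yield {"n": i, "square": i * i}
```

Because the method is a generator, rows are produced lazily: a consumer that stops iterating early never pays for the remaining rows.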


You may create a folder DATA_DIR/plugins/dev/<plugin id>/resource/ to hold resources useful for your plugin, e.g. data files; this method returns the path of this folder.

This resource folder is meant to be read-only, and is included in the .zip release of your plugin. Do not put resources next to connector.py or recipe.py.


Return the partitioning schema that the connector defines.


Returns the schema that this connector generates when returning rows.

The returned schema may be None if the schema is not known in advance. In that case, the dataset schema will be inferred from the first rows.

Additional columns returned by generate_rows are discarded if and only if connector.json contains "strictSchema": true.

The schema must be a dict with a single key, "columns", containing an array of {"name": name, "type": type}.

return {"columns": [{"name": "col1", "type": "string"}, {"name": "col2", "type": "float"}]}

Supported types are: string, int, bigint, float, double, date, boolean
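The schema format above could be produced by a small helper such as the one sketched below. `make_schema` and `SUPPORTED_TYPES` are hypothetical names, not part of the Dataiku API; the helper only checks column types against the list of supported types given above.

```python
# Types listed as supported in this section.
SUPPORTED_TYPES = {"string", "int", "bigint", "float", "double", "date", "boolean"}


def make_schema(columns):
    """Build a schema dict from (name, type) pairs, rejecting unsupported types."""
    for name, col_type in columns:
        if col_type not in SUPPORTED_TYPES:
            raise ValueError("Unsupported type %r for column %r" % (col_type, name))
    # A dict with a single "columns" key holding one {"name", "type"} entry per column.
    return {"columns": [{"name": name, "type": col_type} for name, col_type in columns]}


# Reproduces the example schema shown above.
schema = make_schema([("col1", "string"), ("col2", "float")])
```

A connector could return the result of such a helper from its schema method, or return None to let the schema be inferred.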

get_records_count(partitioning=None, partition_id=None)

Returns the count of records for the dataset (or a partition).

Implementation is only required if the corresponding flag is set to True in the connector definition

get_writer(dataset_schema=None, dataset_partitioning=None, partition_id=None)

Returns a writer object to write in the dataset (or in a partition).

The dataset_schema given here will match the rows passed in to the writer.

Note: the writer is responsible for clearing the partition, if relevant.
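To illustrate the note about clearing the partition, here is a rough sketch of a writer backed by an in-memory dictionary. Everything in it is hypothetical: the class names, the `store` dictionary, and in particular the `write_row` and `close` method names, which are assumed here because this section does not list the writer's methods. The classes do not inherit from the Dataiku base classes so that the example runs on its own.

```python
class InMemoryWriter(object):
    """Hypothetical writer that stores rows in a dict keyed by partition id."""

    def __init__(self, store, partition_id):
        self.store = store
        self.partition_id = partition_id
        # The writer itself clears the target partition before new rows arrive.
        self.store[partition_id] = []

    def write_row(self, row):  # assumed method name
        self.store[self.partition_id].append(row)

    def close(self):  # assumed method name
        pass  # nothing to flush for an in-memory store


class InMemoryConnector(object):
    """Hypothetical connector exposing get_writer over an in-memory store."""

    def __init__(self, config, plugin_config=None):
        self.store = {}

    def get_writer(self, dataset_schema=None, dataset_partitioning=None,
                   partition_id=None):
        return InMemoryWriter(self.store, partition_id)
```

With this design, requesting a second writer for the same partition discards the rows written by the first one, while other partitions are left untouched.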


Return the list of partitions for the partitioning scheme passed as parameter

partition_exists(partitioning, partition_id)

Return whether the partition passed as parameter exists

Implementation is only required if the corresponding flag is set to True in the connector definition
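The two optional methods, get_records_count and partition_exists, might look like the sketch below for a connector whose data already sits in memory. The class name and the `partitions` configuration key are made up for this example, and the class does not inherit from the Dataiku base class so that it runs standalone.

```python
class PartitionedConnector(object):
    """Hypothetical connector whose config maps partition ids to lists of rows."""

    def __init__(self, config, plugin_config=None):
        # "partitions" is a made-up config key: {partition_id: [row, ...]}
        self.partitions = config.get("partitions", {})

    def partition_exists(self, partitioning, partition_id):
        # A partition exists if the backing store has an entry for its id.
        return partition_id in self.partitions

    def get_records_count(self, partitioning=None, partition_id=None):
        if partition_id is None:
            # No partition requested: count rows across the whole dataset.
            return sum(len(rows) for rows in self.partitions.values())
        return len(self.partitions.get(partition_id, []))
```

For example, with partitions "A" holding two rows and "B" holding one, get_records_count() returns 3 and get_records_count(partition_id="A") returns 2.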

class dataiku.connector.CustomDatasetWriter

Each row is a tuple with N + 1 elements, where the first N match the schema passed to get_writer. The last element is a dict of columns not found in the schema.
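The row layout described above can be unpacked as in the following sketch. `split_row` is a hypothetical helper, not part of the Dataiku API; it assumes the schema uses the {"columns": [...]} format and tolerates a missing (None) dict of extra columns.

```python
def split_row(row, dataset_schema):
    """Split an (N + 1)-tuple into a dict of schema columns and a dict of extras."""
    names = [col["name"] for col in dataset_schema["columns"]]
    if len(row) != len(names) + 1:
        raise ValueError("Expected %d elements, got %d" % (len(names) + 1, len(row)))
    # The first N values line up with the schema columns, in order.
    record = dict(zip(names, row[:-1]))
    # The last element holds the columns that are not in the schema.
    extras = row[-1] or {}
    return record, extras


schema = {"columns": [{"name": "col1", "type": "string"},
                      {"name": "col2", "type": "float"}]}
record, extras = split_row(("a", 1.5, {"comment": "unexpected"}), schema)
```

Keeping the extra columns separate lets the writer decide whether to store, log, or drop them.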