Data Science Studio can both read and write datasets on ElasticSearch versions 1.1 to 2.2.
Append mode (appending to an ElasticSearch dataset instead of replacing it) is not yet supported.
Define an ElasticSearch connection
- Go to Administration > Connections
- Click the “New connection” button and pick ElasticSearch
- Enter a name for the new connection, and the required connection parameters, then save the new connection
The port parameter should be ElasticSearch’s HTTP API port (9200 by default), not the Java API port.
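As a quick sanity check that the configured port is indeed the HTTP API port and not the Java API port, you can hit the cluster's root endpoint. The helper names and the localhost example below are illustrative, not part of DSS:

```python
from urllib.request import urlopen
from urllib.error import URLError

def es_base_url(host, port=9200):
    """Base URL of ElasticSearch's HTTP API (9200 by default, not the Java API port)."""
    return "http://%s:%d" % (host, port)

def check_http_api(host, port=9200, timeout=5):
    """Return True if an ElasticSearch HTTP API answers at host:port."""
    try:
        # The root endpoint returns cluster information as JSON
        return urlopen(es_base_url(host, port) + "/", timeout=timeout).getcode() == 200
    except URLError:
        return False

if __name__ == "__main__":
    # True only if an ElasticSearch node is actually running locally
    print(check_http_api("localhost"))
```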
Managed ElasticSearch datasets
If you allow DSS to write managed datasets into the ElasticSearch connection, you can use this connection to create output datasets for recipes.
Creating such a dataset creates a new index on your ElasticSearch server, named
after the dataset by default, with its data stored under a type that also
defaults to the dataset name. For example, if your ElasticSearch server is
hosted on localhost:9200, a managed dataset named
Articles stores its data into
localhost:9200/articles/articles. These names do not change if you rename
the dataset, in case you are relying on their presence; so if you rename the
dataset and want the index and type names to match, you should edit them
after renaming the dataset, then rebuild it and manually delete the old index.
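The default naming scheme can be sketched as follows. The lowercasing follows the Articles example above; the helper names are illustrative:

```python
def default_index_and_type(dataset_name):
    """Default ElasticSearch location for a managed dataset: both the index
    and the type are named after the dataset (lowercased, as in the
    Articles -> articles example)."""
    name = dataset_name.lower()
    return name, name

def dataset_url(host, port, dataset_name):
    """URL under which a managed dataset's data lives by default."""
    index, doc_type = default_index_and_type(dataset_name)
    return "http://%s:%d/%s/%s" % (host, port, index, doc_type)

print(dataset_url("localhost", 9200, "Articles"))
# http://localhost:9200/articles/articles
```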
You should not create other types in indices that are managed by DSS, as they might be deleted or altered.
By default, fields get the default ElasticSearch mapping, e.g. strings are
analyzed and indexed. If you want access to a non-analyzed version of some or
all of your columns, you can list those columns (comma-separated), or request
it for all string columns, in the dataset settings. You can also specify your
own complete type mapping.
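For ElasticSearch 1.x/2.x, a non-analyzed string column is one mapped with `"index": "not_analyzed"`. A custom type mapping of that kind could look like the sketch below; the type and column names are illustrative:

```python
import json

def not_analyzed_mapping(doc_type, columns):
    """Build an ElasticSearch 1.x/2.x type mapping in which the given string
    columns are stored verbatim ("index": "not_analyzed") instead of being
    run through an analyzer."""
    return {
        doc_type: {
            "properties": {
                col: {"type": "string", "index": "not_analyzed"}
                for col in columns
            }
        }
    }

# Example: keep "author" and "category" searchable by exact value
print(json.dumps(not_analyzed_mapping("articles", ["author", "category"]), indent=2))
```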
If your dataset is partitioned, then one index per partition is created (prefixed by the index name) and the index name is actually an ElasticSearch alias that points to all the partition’s indices. You can still search or delete from the alias normally.
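The partitioned layout can be sketched as one index per partition plus an alias gathering them. The exact naming convention (index name used as a prefix) and the helper names are assumptions for illustration:

```python
def partition_layout(index_name, partitions):
    """Sketch of a partitioned managed dataset: one index per partition,
    prefixed by the dataset's index name, with an alias of that name
    pointing at all of them so searches and deletes can target the alias."""
    indices = ["%s_%s" % (index_name, p) for p in partitions]
    alias_actions = {
        "actions": [{"add": {"index": i, "alias": index_name}} for i in indices]
    }
    return indices, alias_actions

indices, actions = partition_layout("articles", ["2015", "2016"])
print(indices)   # ['articles_2015', 'articles_2016']
```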
External ElasticSearch datasets
You can also import existing data from ElasticSearch into DSS. Simply create an ElasticSearch dataset and specify the index and type name of the data. If the connection is writable, DSS can also overwrite that data, but the type mapping will not be modified by DSS, and the index and type will not be created if they do not already exist.
Your index may be an alias if it is only used for reading; it may also be used for writing, but only if it points to a single index (otherwise ElasticSearch refuses the write operation).
You can partition your external dataset in DSS: simply specify the partitioning column and the type of partitioning (value or time-based). You can only partition on one column for external datasets. Partition deletion/cleaning uses delete by query, which can be slow depending on the volume of the indexed data.
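Clearing a partition with delete by query amounts to deleting all documents whose partitioning column has the partition's value. The query body that kind of operation uses might look like the following; the column name and value are illustrative:

```python
import json

def partition_delete_query(partition_column, partition_value):
    """Body of a delete-by-query that clears one partition of an external
    dataset: every document whose partitioning column equals the partition
    value. On large indices this can be slow."""
    return {"query": {"term": {partition_column: partition_value}}}

print(json.dumps(partition_delete_query("published_date", "2016-01")))
```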