Elasticsearch

Data Science Studio (DSS) can both read and write datasets on Elasticsearch versions 1.1 to 7.4.

Append mode (appending to an Elasticsearch dataset instead of replacing its contents) is not supported.

Define an Elasticsearch connection

  • Go to Administration > Connections
  • Click the “New connection” button and pick Elasticsearch
  • Enter a name for the new connection, and the required connection parameters, then test and save the new connection

Note

The port parameter should be Elasticsearch’s HTTP API port (9200 by default), not the Java API port.

Managed Elasticsearch datasets

If you allow DSS to write managed datasets into the Elasticsearch connection, you can use this connection to create output datasets for recipes.

Creating such a dataset creates a new index on your Elasticsearch server, named after the dataset by default. For Elasticsearch 6 and below, a mapping type is also created, likewise named after the dataset by default. For example, if your Elasticsearch server is hosted on localhost:9200, a managed dataset named Articles stores its data in localhost:9200/articles/articles. For Elasticsearch 7, the data is stored in localhost:9200/articles.

These names do not change when you rename the dataset, in case you are relying on their presence. If you rename the dataset and want the names to stay in sync, edit the index and type names after renaming the dataset, then rebuild it and manually delete the previous index.
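The default naming described above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and the lowercasing rule is inferred from the Articles example rather than taken from a documented DSS algorithm.

```python
def managed_dataset_path(dataset_name: str, es_major_version: int) -> str:
    """Return the default storage path (relative to the server root)."""
    # Assumption: the default index name is the lowercased dataset name,
    # as suggested by the "Articles" -> "articles" example.
    index = dataset_name.lower()
    if es_major_version >= 7:
        # Elasticsearch 7: data is stored directly under the index.
        return index
    # Elasticsearch 6 and below: a mapping type with the same default name.
    return f"{index}/{index}"


print(managed_dataset_path("Articles", 6))  # articles/articles
print(managed_dataset_path("Articles", 7))  # articles
```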

Warning

For Elasticsearch 6 and below, do not create other types in an index that is managed by DSS: they might be deleted or altered.

By default, fields get the default Elasticsearch mapping: for example, strings are analyzed and indexed (mapped to text in Elasticsearch 5+). If you want access to a non-analyzed version (mapped to keyword in Elasticsearch 5+) of some or all of your columns, you can list those columns (comma-separated, or * for all string columns) in the dataset settings. You can also specify your own complete type mapping.
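As a rough illustration of what a custom mapping with non-analyzed columns could look like on Elasticsearch 5+, the sketch below builds a mapping in which the listed columns get a keyword sub-field next to the analyzed text field. The helper names and the `raw` sub-field name are hypothetical; the mapping that DSS generates from the dataset settings may be structured differently.

```python
def parse_column_list(setting, string_columns):
    """Expand a comma-separated column list, or "*" for all string columns."""
    if setting.strip() == "*":
        return set(string_columns)
    return {name.strip() for name in setting.split(",") if name.strip()}


def build_mapping(string_columns, keyword_setting):
    """Build an Elasticsearch 5+ style mapping body for string columns."""
    keyword_columns = parse_column_list(keyword_setting, string_columns)
    properties = {}
    for column in string_columns:
        # Default behaviour: analyzed and indexed, i.e. a "text" field.
        field = {"type": "text"}
        if column in keyword_columns:
            # Hypothetical "raw" sub-field holding the non-analyzed value.
            field["fields"] = {"raw": {"type": "keyword"}}
        properties[column] = field
    return {"properties": properties}


mapping = build_mapping(["title", "body"], "title")
print(mapping)
```

Here only `title` receives the keyword sub-field; passing `"*"` as the second argument would add it to every string column.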

If your dataset is partitioned, one index is created per partition (its name prefixed by the index name), and the index name is actually an Elasticsearch alias that points to all the partitions’ indices. You can still search or delete from the alias normally.
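To make the alias layout concrete, the sketch below builds the kind of request body that the Elasticsearch `_aliases` endpoint accepts for pointing one alias at several per-partition indices. The function name and the `_` separator between the index name and the partition identifier are assumptions for illustration; the exact names DSS gives to partition indices are not specified here.

```python
def alias_actions(alias, partition_ids, separator="_"):
    """Build an _aliases request body adding each partition index to the alias."""
    actions = []
    for partition_id in partition_ids:
        # Assumption: partition index = index name + separator + partition id.
        index = f"{alias}{separator}{partition_id}"
        actions.append({"add": {"index": index, "alias": alias}})
    return {"actions": actions}


body = alias_actions("articles", ["2020-01", "2020-02"])
print(body)
```

Searching `articles` then reaches both `articles_2020-01` and `articles_2020-02` through the alias.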

If you want the index to have non-default settings, you can use an index template before building the managed dataset for the first time.
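As an example of such a template, the sketch below builds a body for the legacy Elasticsearch template endpoint (`PUT _template/<name>`) that sets shard and replica counts for every index whose name starts with the dataset's index name. The function name and the values are illustrative. The `index_patterns` key applies to Elasticsearch 6 and later; earlier versions use a single `template` pattern string instead.

```python
def index_template(index_name, shards, replicas):
    """Build a legacy index template body (Elasticsearch 6+ syntax)."""
    return {
        # The trailing wildcard also matches per-partition indices,
        # which are prefixed by the index name.
        "index_patterns": [f"{index_name}*"],
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
    }


template = index_template("articles", shards=3, replicas=1)
print(template)
```

The template must exist on the server before the first build, since index templates only apply at index creation time.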

External Elasticsearch datasets

You can also import existing data from Elasticsearch into DSS. Simply create an Elasticsearch dataset and specify the index of the data (and the type name for Elasticsearch 6 and below). If the connection is writable, DSS can also overwrite that data. However, DSS does not modify the type mapping, and it does not create the index or type if they don’t already exist.

Your index may be an alias. Reading works with any alias, but writing works only if the alias points to a single index (otherwise Elasticsearch refuses the write operation).

You can partition your external dataset in DSS: simply specify the partitioning column and the type of partitioning (value or time-based). You can only partition on one column for external datasets.

Note

The partitioning column must support aggregations (through fielddata or doc values). This is the case by default for keyword fields in Elasticsearch 5+, but not for text fields, where fielddata is disabled by default.
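A quick way to check a candidate column is to look at its entry in the index mapping. The sketch below assumes an Elasticsearch 5+ field mapping dict, as found under `properties` in a get-mapping response. The function name is hypothetical, and only the text and keyword types discussed in the note are handled.

```python
def string_field_is_partitionable(field_mapping: dict) -> bool:
    """Tell whether a text/keyword field can serve as a partitioning column."""
    field_type = field_mapping.get("type")
    if field_type == "keyword":
        # Keyword fields are usable by default in Elasticsearch 5+.
        return True
    if field_type == "text":
        # Text fields qualify only if fielddata was explicitly enabled.
        return field_mapping.get("fielddata", False) is True
    raise ValueError(f"unhandled field type: {field_type!r}")


print(string_field_is_partitionable({"type": "keyword"}))
print(string_field_is_partitionable({"type": "text"}))
print(string_field_is_partitionable({"type": "text", "fielddata": True}))
```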