Sampling datasets

The “sample/filter” recipe serves the dual purpose of sampling and/or filtering a dataset.


Filtering in DSS

Four types of filtering are available and can be selected from the top dropdown menu:

  • Rules based

  • Formula based

  • SQL expression based

  • ElasticSearch query string (only available when the input dataset is on ElasticSearch v7 and above)

Rules based filters

A filter is defined by a list of possibly grouped conditions and the boolean operators that bind them.

Conditions

A condition is defined by an input column, an operator, and a value.

  • Input column: choose any column from the dataset.

  • Operator: choose an operator from the dropdown menu. The available operators depend on the storage type of the column: a string column offers string operators such as contains, while a numerical column offers numerical operators such as <.

  • Value: enter a value, or choose an existing column, to apply the operator to.

Conditions can be added, deleted, duplicated, and turned into a group to create advanced conditions.

../_images/filters_example1.png
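
To make the column / operator / value structure concrete, here is a minimal Python sketch of how such a condition could be modelled and evaluated against rows. It is not how DSS implements filters: the Condition class, the OPERATORS table, the value_is_column flag, and the sample rows are all hypothetical names chosen for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical operator table; the keys are illustrative, not DSS identifiers.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "contains": lambda cell, value: value in cell,
    "==": operator.eq,
    "<": operator.lt,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Condition:
    column: str                    # input column
    op: str                        # key into OPERATORS
    value: Any                     # literal value, or a column name
    value_is_column: bool = False  # True: compare against another column

    def matches(self, row: Dict[str, Any]) -> bool:
        # Resolve the right-hand side, then apply the chosen operator.
        right = row[self.value] if self.value_is_column else self.value
        return OPERATORS[self.op](row[self.column], right)


rows = [
    {"name": "alice smith", "age": 34, "min_age": 18},
    {"name": "bob jones", "age": 12, "min_age": 18},
]

# A string operator applied to a string column, with a literal value.
name_filter = Condition("name", "contains", "smith")
# A numerical operator applied to a number column, compared to another column.
age_filter = Condition("age", ">=", "min_age", value_is_column=True)

kept_by_name = [r["name"] for r in rows if name_filter.matches(r)]
kept_by_age = [r["name"] for r in rows if age_filter.matches(r)]
```

In this sketch both filters keep only the first row: its name contains "smith", and its age is at least the value stored in min_age.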

Groups

Groups can be used to create advanced logic for conditional statements. Groups can be nested to create sub-conditions, as in (y AND z AND (t OR u)), or defined at the same level, as in (y OR z) AND (t OR u). Groups can be added using the +ADD > Add group button, and can also be deleted, duplicated, and ungrouped.

../_images/filters_example2.png

Boolean operators

Conditions and groups are bound together by boolean operators, which can be either And or Or.
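
The difference between nested groups and groups defined at the same level can be illustrated with a short Python sketch. This only models the boolean semantics described above and makes no claim about the DSS internals: the Group class and the predicates y, z, t, and u (with their column names) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Union

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


@dataclass
class Group:
    operator: str                            # "AND" or "OR"
    members: List[Union["Group", Predicate]]  # conditions or nested groups

    def matches(self, row: Row) -> bool:
        if self.operator not in ("AND", "OR"):
            raise ValueError(f"unknown boolean operator: {self.operator}")
        # Evaluate nested groups recursively and plain conditions directly.
        results = (
            m.matches(row) if isinstance(m, Group) else m(row)
            for m in self.members
        )
        return all(results) if self.operator == "AND" else any(results)


# Hypothetical conditions standing in for y, z, t and u.
y = lambda row: row["country"] == "FR"
z = lambda row: row["age"] >= 18
t = lambda row: row["plan"] == "pro"
u = lambda row: row["trial"] is True

# Nested: y AND z AND (t OR u)
nested = Group("AND", [y, z, Group("OR", [t, u])])
# Same level: (y OR z) AND (t OR u)
same_level = Group("AND", [Group("OR", [y, z]), Group("OR", [t, u])])

row = {"country": "US", "age": 30, "plan": "free", "trial": True}
nested_result = nested.matches(row)          # y fails, so the AND fails
same_level_result = same_level.matches(row)  # z and u each satisfy their group
```

The same row is rejected by the nested filter and kept by the same-level one, which is why the grouping matters as much as the individual conditions.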

Formula based filters

Formulas are written by hand using functions of the formula language, dataset column names, and project variables. They are well suited to more complex filtering, or to specific functions that do not appear in the rules based filter view. See the formula language documentation for details.

../_images/filters_example3.png

SQL expression based filters

When using an SQL-based recipe engine, you can directly provide an SQL expression to filter the dataset. The expression can reference dataset columns and project variables.

../_images/filters_example4.png
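
As a rough illustration of what an SQL filter expression does, the following Python sketch applies one as the predicate of a WHERE clause on an in-memory SQLite table. That the expression behaves like a WHERE predicate is an assumption made for this example, not a description of the query DSS generates. The orders table, its columns, and the expression itself are hypothetical.

```python
import sqlite3

# In-memory database standing in for the SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "FR", 120.0), (2, "US", 40.0), (3, "FR", 15.0)],
)

# Hypothetical filter expression, written in terms of the dataset columns.
filter_expression = "country = 'FR' AND amount > 50"

# The expression is interpolated only because it is a trusted, hand-written
# constant in this sketch; never build SQL this way from untrusted input.
query = f"SELECT id FROM orders WHERE {filter_expression} ORDER BY id"
kept_ids = [row[0] for row in conn.execute(query)]
conn.close()
```

Only the order with id 1 satisfies both parts of the expression, so it is the only row kept.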

ElasticSearch query string

When using an input dataset on ElasticSearch v7 and above, you can use the query_string syntax to filter the dataset.

Note

When using an ElasticSearch query string, sampling is disabled and filtering is performed on the whole dataset.
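
For readers unfamiliar with query_string, the sketch below shows the standard shape of an Elasticsearch search request body that carries such a query. How DSS actually passes the string to the cluster is not described here, so treat this only as background on the syntax. The helper build_query_string_body and the field names status and age are hypothetical.

```python
import json


def build_query_string_body(query: str) -> dict:
    """Wrap a query_string expression in an Elasticsearch search body."""
    return {"query": {"query_string": {"query": query}}}


# Hypothetical filter: documents whose status is "active" and age is >= 18.
body = build_query_string_body("status:active AND age:>=18")

# JSON payload as it would appear in the body of a search request.
payload = json.dumps(body)
```

The string passed to the helper is the part you would type as the filter; the surrounding JSON is only the envelope Elasticsearch expects for a query_string query.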