Sampling methods

Many parts of DSS support sampling data to extract subsets and/or reduce the size of the data to process.

Sampling can be configured in the following locations in DSS:

  • Exploration

  • Visual data preparation

  • Charts

  • The sampling recipe

  • Machine learning

  • Various APIs for fetching dataset data

DSS provides a variety of sampling methods.

Generic sampling methods

DSS provides the following methods that are available in most cases where sampling is requested.

No sampling

All data is taken; no sampling is performed.

First records

This method takes the first N rows of the dataset. It is very fast, as it only reads N rows, but may result in a very biased view of the dataset.
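The speed of this method comes from stopping as soon as N rows have been read. The following is a minimal sketch of that idea in plain Python, not the DSS implementation; `first_records` and `counting_reader` are hypothetical names introduced only for illustration.

```python
from itertools import islice


def first_records(rows, n):
    """Return the first n rows, reading no more than n items from `rows`."""
    return list(islice(rows, n))


def counting_reader(total, counter):
    """Hypothetical row source that records how many rows were actually read."""
    for i in range(total):
        counter["read"] += 1
        yield {"id": i}


counter = {"read": 0}
sample = first_records(counting_reader(1_000_000, counter), 5)
# Only 5 of the 1,000,000 available rows were read, and they are the
# first 5 in storage order, which is why the sample can be biased.
```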

Random sampling (approximate ratio)

This method randomly selects approximately X% of the rows. The target count of records is approximate, and will be more precise with large input datasets.

This method requires a full pass reading the data.
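One common way to obtain this behavior in a single pass is to keep each row independently with probability X/100. The sketch below assumes that approach for illustration (it is not necessarily how DSS implements it); `sample_by_ratio` is a hypothetical helper.

```python
import random


def sample_by_ratio(rows, ratio, seed=None):
    """Keep each row independently with probability `ratio` (single pass)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < ratio]


rows = range(100_000)
sample = sample_by_ratio(rows, 0.10, seed=42)
# len(sample) is close to, but generally not exactly, 10,000.
```

With this approach the relative deviation from the target shrinks as the input grows, which matches the note above that the count is more precise with large input datasets.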

Random sampling (approximate count)

This method randomly selects approximately N records. The target count of records is approximate, and will be more precise with large input datasets.

This method requires 2 full passes reading the data.
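The two passes can be understood as follows: a first pass counts the rows, and a second pass keeps each row with probability N divided by that count. This is a sketch under that assumption, not the DSS implementation; `sample_by_count` and the `open_rows` callable (which returns a fresh iterator over the data each time) are hypothetical.

```python
import random


def sample_by_count(open_rows, target, seed=None):
    """Approximate-count sampling: count rows, then keep each with p = target / total."""
    total = sum(1 for _ in open_rows())  # pass 1: count the rows
    if total == 0:
        return []
    ratio = min(1.0, target / total)
    rng = random.Random(seed)
    return [row for row in open_rows() if rng.random() < ratio]  # pass 2: select


sample = sample_by_count(lambda: iter(range(50_000)), target=1_000, seed=7)
# len(sample) is approximately 1,000.
```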

Column values subset

This method randomly selects a subset of values and chooses all rows with these values, in order to obtain approximately N rows. This is useful for selecting a subset of customers, for example.

This sampling method requires 2 full passes reading the data. The time taken by this method is thus linear with the size of the dataset.

This method is useful if you want to have all records for some values of the column, for your analysis. For example, if your dataset is a log of user actions, it is more interesting to have “all actions for a sample of the users” rather than “a sample of all actions”, as it allows you to really study the sequences of actions of these users.

“Column values subset” sampling will only provide interesting results if the selected column has a sufficiently large number of distinct values. A user id would generally be a good choice for the sampling column.
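To make the “all actions for a sample of the users” idea concrete, here is a minimal sketch. It assumes a first pass that counts rows per value, a random choice of values until roughly N rows are covered, and a second pass that keeps every row for the chosen values. The function `sample_by_column_values`, the `open_rows` callable and the column names are hypothetical, and DSS may select values differently.

```python
import random
from collections import Counter


def sample_by_column_values(open_rows, column, target, seed=None):
    """Keep all rows for a random subset of values, aiming for about `target` rows."""
    counts = Counter(row[column] for row in open_rows())  # pass 1: rows per value
    values = list(counts)
    random.Random(seed).shuffle(values)
    chosen, covered = set(), 0
    for value in values:
        if covered >= target:
            break
        chosen.add(value)
        covered += counts[value]
    return [row for row in open_rows() if row[column] in chosen]  # pass 2: filter


def open_log():
    # Hypothetical log: 1,000 users with 10 actions each.
    return ({"user_id": u, "action": a} for u in range(1_000) for a in range(10))


sample = sample_by_column_values(open_log, "user_id", target=500, seed=1)
actions_per_user = Counter(row["user_id"] for row in sample)
# Every sampled user keeps their complete sequence of 10 actions.
```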

Class rebalancing (approximate number of records)

This method randomly selects approximately N rows, trying to rebalance equally all modalities of a column. This method does not oversample, only undersample (so some rare modalities may remain under-represented). In all cases, rebalancing is approximate.

This sampling method requires 2 full passes reading the data. The time taken by this method is thus linear with the size of the dataset.
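The undersampling-only behavior can be illustrated with a short sketch. It assumes each modality is given an equal share of the N target rows and that rows are kept with a per-modality probability capped at 1, so rare modalities are never duplicated. This is a simplification for illustration (unused quota from rare modalities is not redistributed), not the DSS implementation; `rebalance_by_count` and `open_rows` are hypothetical names.

```python
import random
from collections import Counter


def rebalance_by_count(open_rows, column, target, seed=None):
    """Undersample frequent modalities so each one approaches target / n_modalities rows."""
    counts = Counter(row[column] for row in open_rows())  # pass 1: rows per modality
    if not counts:
        return []
    quota = target / len(counts)
    # Probability is capped at 1.0: rare modalities are kept whole, never oversampled.
    keep_prob = {value: min(1.0, quota / n) for value, n in counts.items()}
    rng = random.Random(seed)
    return [row for row in open_rows() if rng.random() < keep_prob[row[column]]]  # pass 2


def open_events():
    # Hypothetical, heavily imbalanced column: 9,000 / 900 / 100 rows.
    levels = ["ok"] * 9_000 + ["warn"] * 900 + ["error"] * 100
    return ({"level": level} for level in levels)


sample = rebalance_by_count(open_events, "level", target=600, seed=3)
per_class = Counter(row["level"] for row in sample)
# "ok" and "warn" end up near 200 rows each; "error" keeps all of its 100 rows
# and therefore remains under-represented.
```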

Class rebalancing (approximate ratio)

This method randomly selects approximately X% of the rows, trying to rebalance equally all modalities of a column.

This method does not oversample, only undersample (so some rare modalities may remain under-represented). In all cases, rebalancing is approximate.

This sampling method requires 2 full passes reading the data. The time taken by this method is thus linear with the size of the dataset.

Exploration / Visual data preparation

For exploration and visual data preparation, additional sampling methods are available, thanks to the “in-memory” nature of these features.

See Sampling in explore for more information.