Sampling

When exploring and enriching data in Data Science Studio, you always get immediate visual feedback, however large the dataset you are manipulating. To achieve this, Data Science Studio works on a sample of your dataset.

In the « Explore » module, all transformation steps that you define are executed on the sample and the results are presented to you right away. When you export the processed data or create a recipe to insert your shaker script in your data flow, the whole input dataset is processed, in parallel.

Sampling in Explore

By default, the first 30,000 records of your dataset are selected for the sample. While this sampling method does not provide the best sample quality, it allows you to get your sample very quickly, whatever the size of your dataset.

Sampling can be configured in the “Sampling” tab.

../_images/sampling-1.png

Warning

To keep step feedback and type detection immediate, the exploration sample is always loaded in RAM. It is therefore crucial that you do not configure a sample so large that it would not fit in the memory of the Data Science Studio backend.

For more information about raising the backend memory limit, see Java runtime environment.

For best performance, it is recommended that you do not use samples above 200,000 records.

Sampling methods

Data Science Studio provides four sampling methods:

  • First records

    This sampling method simply takes the first N rows of the dataset. If the dataset is made of several files, the files will be taken one by one, until the defined number of records is reached for the sample.

    This method is by far the fastest sampling method, as only the first records need to be read from the dataset. However, depending on how your data is organized in the dataset, it can provide a very biased view of the dataset.

  • Fixed number of records (random sampling)

    With this method, Data Science Studio will randomly select approximately the requested number of records from the whole dataset. This sampling method is the slowest one, as Data Science Studio needs to iterate twice over the whole dataset (once to count the number of records and once to actually select the records that will be part of the sample), but it generally provides a good sample.

  • Fixed ratio (random sampling)

    With this method, Data Science Studio will randomly select records, only keeping the requested ratio of the dataset. This method is faster than the previous one, as only one pass over the dataset is required, while providing the same quality of sampling.

    However, you need to be careful: entering too high a ratio here could select a sample that is too large and exceeds the memory limits.

  • Column-based sampling

    With this method, Data Science Studio will select approximately the requested number of records from the whole dataset. Instead of randomly picking records, Data Science Studio will use the values of a column. It will:

    • Compute the total ratio of the dataset required to fulfill the requested number of records.
    • Randomly select this ratio of the values of the column.
    • Keep all records that have the selected values of the column.

    Column-based sampling is useful if you want to have all records for some values of the column, for your analysis. For example, if your dataset is a log of user actions, it is more interesting to have “all actions for a sample of the users” rather than “a sample of all actions”, as it allows you to really study the sequences of actions of these users.

    Column-based sampling will only provide interesting results if the selected column has a sufficiently large number of values. A user id would generally be a good choice for the sampling column.

    This sampling method is as slow as Fixed number of records (random sampling), as Data Science Studio needs to iterate twice over the whole dataset (once to count the number of records and once to actually select the records that will be part of the sample).
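The four methods above can be sketched in a few lines of Python. This is only an illustration of the behavior described here, not Data Science Studio's actual implementation: the function names, the `read_rows` callable (which returns a fresh iterator over the dataset for each pass) and the `demo_rows` dataset are all hypothetical, and the column-based variant assumes the distinct column values fit in memory.

```python
import random
from itertools import islice


def first_records(rows, n):
    """Take the first n rows; only the start of the dataset is read."""
    return list(islice(rows, n))


def fixed_ratio(rows, ratio, rng):
    """Single pass: keep each row independently with probability `ratio`."""
    return [row for row in rows if rng.random() < ratio]


def fixed_number(read_rows, target, rng):
    """Two passes: count the rows, then keep roughly `target` of them."""
    total = sum(1 for _ in read_rows())            # pass 1: count records
    ratio = min(1.0, target / total) if total else 0.0
    return fixed_ratio(read_rows(), ratio, rng)    # pass 2: select records


def column_based(read_rows, column, target, rng):
    """Two passes: keep ALL rows for a random subset of the column values."""
    total = 0
    values = set()
    for row in read_rows():                        # pass 1: count and collect values
        total += 1
        values.add(row[column])
    ratio = min(1.0, target / total) if total else 0.0
    # Assumption: values are sortable; sorting only makes the draw reproducible.
    selected = {v for v in sorted(values) if rng.random() < ratio}
    return [row for row in read_rows() if row[column] in selected]  # pass 2


def demo_rows():
    """Hypothetical log of user actions: 100 users with 100 actions each."""
    return ({"user_id": i % 100, "action": i} for i in range(10_000))


if __name__ == "__main__":
    rng = random.Random(42)
    print(len(first_records(demo_rows(), 50)))
    print(len(fixed_ratio(demo_rows(), 0.1, rng)))
    print(len(fixed_number(demo_rows, 1_000, rng)))
    sample = column_based(demo_rows, "user_id", 1_000, rng)
    print(len({row["user_id"] for row in sample}), "users,", len(sample), "rows")
```

With a fixed ratio, the expected sample size is the ratio multiplied by the number of records, so the ratio has to be chosen with the dataset size in mind. In the column-based sketch, every selected user keeps all of their rows, which is what makes it possible to study complete sequences of actions.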

Sampling and partitioning

If the dataset is partitioned, DSS will by default use all partitions to compute the sample. You can also explicitly select some of the partitions.

Selected partitions can be entered manually.

../_images/sampling-2.png

Or you can click « List partitions » to select amongst the partitions currently detected in the dataset:

../_images/sampling-3.png

Note

Listing all partitions in the dataset can be slow, especially for SQL datasets.

Refresh of the sample

The first time you open a dataset in Explore, the sample will be computed according to the default sampling parameters. Once a sample has been computed, Data Science Studio reuses it instead of recomputing it each time you open the dataset.

The sample is recomputed to take into account new data in the following cases:

  • If the dataset is a managed dataset and has been rebuilt since the sample was computed. For more information about managed datasets and building datasets, see DSS concepts.
  • If the configuration of the dataset has been changed in the « Configure dataset » screen.
  • If the sampling configuration has been changed.

At any time, you can also open the Sampling configuration box and click the “Save and Refresh Sample” button to recompute the sample.

In addition, for some kinds of datasets, you can ask DSS to automatically recompute the sample each time the content of the dataset changes. This is NOT possible for SQL-based datasets. Note that checking whether the dataset content has changed can be slow for very large file-based datasets (especially S3 datasets), as Data Science Studio needs to enumerate all files.
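The refresh rules in this section can be summarized as a small decision function. The sketch below is purely illustrative: `SampleState`, `needs_refresh` and all of their fields are hypothetical names, not part of any Data Science Studio API, and the configuration "fingerprints" stand in for whatever DSS uses internally to detect a changed configuration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SampleState:
    """Hypothetical record of the conditions under which a sample was computed."""
    computed_at: datetime     # when the cached sample was built
    dataset_config: str       # fingerprint of the dataset configuration at that time
    sampling_config: str      # fingerprint of the sampling configuration at that time


def needs_refresh(
    sample: Optional[SampleState],
    *,
    is_managed: bool,
    last_build: Optional[datetime],
    dataset_config: str,
    sampling_config: str,
) -> bool:
    """Return True when the cached sample should be recomputed."""
    if sample is None:
        return True  # first time the dataset is opened in Explore
    if is_managed and last_build is not None and last_build > sample.computed_at:
        return True  # managed dataset rebuilt since the sample was computed
    if dataset_config != sample.dataset_config:
        return True  # dataset configuration changed
    return sampling_config != sample.sampling_config  # sampling configuration changed


if __name__ == "__main__":
    cached = SampleState(datetime(2024, 1, 1), "cfg-a", "first-30000")
    print(needs_refresh(cached, is_managed=True, last_build=datetime(2024, 1, 2),
                        dataset_config="cfg-a", sampling_config="first-30000"))
```

A manual click on “Save and Refresh Sample” bypasses this kind of check entirely and always recomputes the sample.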