Metrics¶
The metrics system in DSS allows you to automatically compute measurements on Flow items (datasets, managed folders, and saved models).
Note
Metrics are often used in conjunction with scenarios but are not strictly dependent on scenarios.
Examples of metrics on a dataset include:
The size (in MB) of the dataset
The number of rows in the dataset
The average of a given column
Metrics are configured in the Status tabs of datasets, managed folders and saved models.
Metrics are automatically historized, which is very useful for tracking the evolution of the status of a dataset. For example, how did the average of the basket column evolve over the last month?
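This history can also be retrieved programmatically. Below is a minimal sketch using the Python API client; the instance URL, API key, project key, dataset name, and metric ID are placeholders, and the exact shape of the returned history may vary across DSS versions.

```python
import dataikuapi

# Connect to the DSS instance (URL and API key are placeholders)
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("orders")

# Fetch the recorded history of a metric, e.g. the record count
history = dataset.get_metric_history("records:COUNT_RECORDS")
for point in history["values"]:
    print(point["time"], point["value"])
```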
Probes and metrics¶
The whole system is built around the concept of a Probe. A probe is a component which computes one or more metrics on an item.
As much as possible, the probes execute all of their computations in a single pass over the data. Furthermore, the DSS metrics system automatically “merges” together probes when there is an efficient execution path which combines several probes.
Each probe has a configuration which indicates what should be computed by this probe.
Furthermore, each probe can be configured to run automatically (or not) after each build of the dataset.
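Besides running automatically after builds, probes can also be triggered on demand, for example through the Python API client. A minimal sketch (connection details and names are placeholders):

```python
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("orders")

# Run the enabled probes on the whole dataset and inspect the raw result
result = dataset.compute_metrics()
print(result)
```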
Dataset probes¶
The following probes are available on a dataset.
Basic info¶
This probe computes the size and the number of files of the dataset (when relevant for the dataset type).
Records¶
This probe computes the number of records in the dataset.
Note
On non-Hadoop non-SQL datasets, this probe requires enumerating the whole dataset, which can be costly. See the execution engines section.
Partitioning¶
This probe computes the number of partitions and the list of partitions. It only makes sense for a partitioned dataset.
Basic column statistics¶
This probe computes descriptive statistics on dataset columns (MIN, MAX, AVG, …). You can enable multiple metrics on multiple columns.
Advanced column statistics¶
This probe computes more descriptive statistics on dataset columns: most frequent value and top N values.
This probe is separate from the “Basic column statistics” because its computation costs are much higher.
Column data validity statistics¶
This probe computes the number and ratio of invalid data in a column. Invalid is defined here with regard to the Meaning of the column. For more information about meanings and storage types, see Schemas, storage types and meanings.
Note that you can only enable this probe on columns for which there is a forced meaning, i.e. it is not possible to check validity against a meaning which is only automatically inferred by DSS.
SQL¶
On SQL datasets, probes can be written using a SQL query. Each value in the first row of the query’s result is stored as a metric, using the column name of the value as the name for the metric.
Note
If the option “is a single aggregate” is selected, your aggregate metric will automatically be wrapped with the corresponding SELECT, FROM, and WHERE clauses.
For example, if this option is selected, the following would be a valid probe statement: SUM(cost) / COUNT(customers).
If you were to run this probe on the dataset orders with the partition order_date set to ‘2018-01-01’, your aggregation would be translated into the following SQL statement: SELECT SUM(cost) / COUNT(customers) as "col_0" FROM orders WHERE order_date='2018-01-01'.
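To make this translation concrete, here is a small illustrative Python sketch that mimics the wrapping described above. It is not DSS’s actual implementation, and its naive string formatting ignores quoting and escaping:

```python
def wrap_single_aggregate(expr, table, partition_col=None, partition_val=None):
    """Wrap a 'single aggregate' expression into a full SQL query.

    Illustrative only: real SQL generation must handle quoting/escaping.
    """
    query = 'SELECT {} as "col_0" FROM {}'.format(expr, table)
    if partition_col is not None:
        query += " WHERE {}='{}'".format(partition_col, partition_val)
    return query

print(wrap_single_aggregate("SUM(cost) / COUNT(customers)",
                            "orders", "order_date", "2018-01-01"))
# SELECT SUM(cost) / COUNT(customers) as "col_0" FROM orders WHERE order_date='2018-01-01'
```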
Python¶
You can also write a custom probe in Python.
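As a minimal sketch, a custom Python probe defines a process function that receives a dataset handle and the partition being computed, and returns a dictionary mapping metric names to values (the basket column below is illustrative):

```python
# Sketch of a custom Python probe. DSS calls process(); `dataset` is a
# dataiku.Dataset handle and `partition_id` identifies the partition
# being computed (or the whole dataset).
def process(dataset, partition_id):
    # Load the data; for very large datasets, prefer chunked reads
    df = dataset.get_dataframe()

    # Each key/value pair in the returned dict is stored as one metric
    return {
        "rows_with_missing_basket": int(df["basket"].isna().sum()),
        "basket_p95": float(df["basket"].dropna().quantile(0.95)),
    }
```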
Metrics display UI¶
The value of metrics can be viewed in the “Status” tab of a dataset, managed folder or saved model.
Since there can be a lot of metrics on an item, you must select which metrics to display, by clicking on the X/Y metrics button.
Datasets and managed folders¶
There are two main metric views:
A “tile” view which displays the latest value of each selected metric. Clicking on the value of a metric will bring up a modal box with the history of this value.
A “ribbon” view which displays the history of all selected metrics.
Note that not all metric values are numerical. For example, the “most frequent value” for a given column is not always numerical. Therefore, the history view sometimes shows history as tables rather than charts.
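The last values shown in these views can also be read programmatically. A minimal sketch using the Python API client, assuming the usual records:COUNT_RECORDS metric ID (connection details are placeholders):

```python
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("orders")

# Last computed values for all metrics on this dataset
metrics = dataset.get_last_metric_values()
print(metrics.get_all_ids())  # list the available metric IDs
print(metrics.get_global_value("records:COUNT_RECORDS"))
```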
Probe execution engines¶
Depending on the type of dataset and selected probe configurations, dataset probes can use the following execution engines for their computations:
Hive
If selected for a dataset, the Hive engine will be used on datasets that are larger than 10MB in size. The Streaming data engine will be used on smaller datasets.
Impala
SQL database
This engine is used for SQL datasets, for probes that can perform their computation as a SQL query. The query is fully executed in the database, with no data movement.
No specific engine
For example, the “files size” probe does not require any engine; it simply reads the size of the files.
Streaming data engine
Data is streamed into DSS for computation.
This engine is generally much slower since it needs to move all of the data.
This engine acts as a fallback when no other engine is available.
Metrics on partitioned datasets¶
On partitioned datasets, the metrics can be computed either on a per-partition basis or on the whole dataset.
Note
Metrics must actually be computed independently on each partition and on the whole dataset, since for many metrics, the value on the whole dataset is not the “sum” of the values on each partition.
For example, the median of a column on the whole dataset cannot be derived from the per-partition medians.
For these datasets, there are 4 views into the metrics:
The regular “tile” view, showing the last value of selected metrics, either for a given partition or the whole dataset.
The regular “ribbon” view, showing the history of the values of selected metrics, either for a given partition or the whole dataset.
A “partitions table” view, showing the last values of several metrics on all partitions, as a data table.
A “partitions chart” view, which tries to display the last values of each metric on all partitions as a chart. This view particularly makes sense for time-based partitions, where the chart will actually be a timeline chart.
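Per-partition values can likewise be computed and read through the Python API client. A minimal sketch, assuming compute_metrics accepts a partition argument and get_partition_value returns the per-partition value (connection details and the partition value are placeholders):

```python
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("orders")

# Compute the enabled probes on a single partition...
dataset.compute_metrics(partition="2018-01-01")

# ...then read the per-partition value of a metric
metrics = dataset.get_last_metric_values()
print(metrics.get_partition_value("records:COUNT_RECORDS", "2018-01-01"))
```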