Pivot recipe¶

The “pivot” recipe lets you build pivot tables, with more control over the rows, columns and aggregations than what the pivot processor offers. It also lets you run the pivoting natively on external systems, like SQL databases or Hive.

Defining the pivot table rows ¶

The rows in the output dataset are defined by the values of a tuple of columns, the row identifiers. This tuple can be specified explicitly, or implicitely as “all other columns”, in which case any column that is not used to define modalities nor is used in an aggregate will be used as row identifier.

A	B	C	D	E
x	1	a	1	6
y	1	b	2	5
x	1	a	3	4
y	2	b	1	3
x	2	b	2	2

For the input columns {A, B, C, D, E}, giving {A, B} as explicit list of row identifiers will produce a pivot table where the rows are indexed by the pairs of values for A and B found in the input data.

A	B	…
x	1	…
y	1	…
x	2	…
y	2	…

On the other hand, using “all other columns” while having the modalities defined by column B and aggregates on columns D and E will produce a pivot table where the rows are indexed by the tuples of values for A and C found in the input data.

A	C	1_D_sum	1_E_sum	2_D_sum	2_E_sum
x	a	4	10
x	b			2	2
y	b	2	5	1	3

Modality handling ¶

The columns in the output are defined as the list aggregates times the list of modalities.

Computation of the list of modalities ¶

The modalities themselves are the combinations of a non-empty list of columns. Since the list of combinations can be huge, there are several options to bring it back to something more manageable:

most frequent : keep only the N combinations appearing the most in the input data
min occurrence cont : keep only the combination appearing at least N times in the input data
explicit : specify the combinations explicitly

The effective list of modalities used to build the output is only known after the entire input dataset is scanned, so it’s not readily available at design-time, but computed when the recipe is run. By default, the list of modalities for a given set of settings is computed only once, and kept for ulterior runs of the recipe. The option to “recompute schema at each run” on the “Output” section of the recipe lets you force a recompute of the list of modalities at each run. Note that in this case, the changes in the list of output columns are not automatically propagated to downstream datasets and recipes.

Cleaning of the modalities’ name ¶

Since modalities are made up of a concatenation of the values of columns from the input data, their name is usually not directly usable as column name in SQL databases or Hive. The “Output” section of the recipe therefore offers options to simplify the names so that they become compatible with these systems:

soft slugification : swaps out whitespace and punctuation with ‘_’. This is sufficient for most SQL databases (PostgreSQL, Oracle…)
hard slugification : only keep alphanumeric characters, ‘_’ and ‘-’. This is typically for Hive (i.e. when the output dataset is HDFS)
numbering : completely ignores the original name of the modality and uses numbers instead. This is the safest of all schemes, and produces the shortest names.
truncation : after the above simplifications have been applied, truncate the names. SQL databases natural limitations are natively taken into account (for example the 32 char limit on Oracle’s column names), but some limitations are not implicit in the nature of the output dataset; typically, if the output is HDFS and is to be used with Impala, a 128 char limit needs to be enforced.

Aggregates ¶

The recipe offers 2 levels of aggregates :

aggregates per row and modality (i.e. per pivot table cell)
aggregates per row (i.e. marginals)

Per row and modality ¶

These are defined in the “Pivot” section of the recipe. “Add new” creates a new simple aggregate on a selected column, and the aggregate can be further setup by changing its aggregation, and if relevant, the aggregation settings.

For each aggregate defined in this section, and each modality, one column will be created in the output. The column name is made of the modality name concatenated with the aggregate’s column and aggregation type.

Per row ¶

The “Other columns” section of the recipe adds aggregates per row. There are 2 typical uses:

to keep columns that are neither row identifiers nor aggregates in the pivot table. In this case the aggregates “First”, “Last” or “Concat” should be preferred.
to compute marginals to compare the aggregates per row and modality to. For example, one can aggregate the average of column A for each row of X and modality of Y, and at the same time aggregate the average of column A for each row X (across modalities of Y).

Comparison to pivot processor ¶

The pivot processor is a stream-oriented processor that pivots one row at a time and is available in the preparation scripts, and consequently in Prepare recipes.

	Pivot recipe	Pivot processor
Modalities	computed by inspecting entire dataset. Not available at design-time until the recipe has run once	computed by using the design-time sample. A small sample or very imbalanced modalities implies that some modalities can be missed
Dynamic output schema	the list of modalities can optionally be computed at each run of the recipe	schema is fixed at design-time
Aggregations	aggregates can be defined for each value	no aggregation
Output row definition	combinations of columns can be used to define a row. The data doesn’t need to be pre-sorted	rows are defined by the value of one column. The data needs to be sorted on that column to have all rows with the same key squashed together

Pre-filtering ¶

Pre-filters can be applied. The filters documentation is available here.

Examples ¶

Pivoting country net revenue by product ¶

For the input:

Product	Country	net	Year
Toothpaste	FR	40	2015
Toothpaste	GB	80	2015
Toothpaste	US	60	2015
Toothpaste	GB	75	2017
Toothpaste	US	55	2017
Chocolate	FR	110	2015
Chocolate	FR	120	2017
Chocolate	GB	70	2017
Peanut butter	US	200	2017
Peanut butter	GB	30	2017

A pivot recipe using Product as row identifier, Country to create columns with, and with an aggregate of sum of Net will yield

Product	FR_Net_sum	GB_Net_sum	US_Net_sum
Toothpaste	40	155	115
Chocolate	230	70
Peanut butter		30	200

Adding an aggregate of sum of Net in the ‘Other columns’ section will yield

Product	FR_Net_sum	GB_Net_sum	US_Net_sum	Net_sum
Toothpaste	40	155	115	310
Chocolate	230	70		300
Peanut butter		30	200	230

Dummifying ¶

The use of the Count of records aggregate allows for an easy and controlled way of dummifying columns. On the input:

Country	Product	Year
FR	Chocolate	2017
FR	Sugar	2016
FR	Apples	2017
GB	Chocolate	2017
GB	Sugar	2015
GB	Apples	2017
GB	Toffee	2017
US	Sugar	2016
US	Corn syrup	2017
US	Toffee	2017
US	Peanut butter	2017

A pivot recipe using Country as row identifier, Product to create columns with, and with an aggregate of count of records will yield:

Country	Chocolate	Sugar	Apples	Toffee	Corn syrup	Peanut butter
FR	1	1	1	0	0	0
GB	1	1	1	1	0	0
US	0	1	0	1	1	1

By additionaly specifying that only the top 4 modalities should be used, the output becomes:

Country	Chocolate	Sugar	Apples	Toffee
FR	1	1	1	0
GB	1	1	1	1
US	0	1	0	1