Pivot recipe¶
The “pivot” recipe lets you build pivot tables, with more control over the rows, columns and aggregations than what the pivot processor offers. It also lets you run the pivoting natively on external systems, like SQL databases or Hive.
Defining the pivot table rows¶
The rows in the output dataset are defined by the values of a tuple of columns, the row identifiers. This tuple can be specified explicitly, or implicitely as “all other columns”, in which case any column that is not used to define modalities nor is used in an aggregate will be used as row identifier.
A 
B 
C 
D 
E 

x 
1 
a 
1 
6 
y 
1 
b 
2 
5 
x 
1 
a 
3 
4 
y 
2 
b 
1 
3 
x 
2 
b 
2 
2 
For the input columns {A, B, C, D, E}, giving {A, B} as explicit list of row identifiers will produce a pivot table where the rows are indexed by the pairs of values for A and B found in the input data.
A 
B 
… 

x 
1 
… 
y 
1 
… 
x 
2 
… 
y 
2 
… 
On the other hand, using “all other columns” while having the modalities defined by column B and aggregates on columns D and E will produce a pivot table where the rows are indexed by the tuples of values for A and C found in the input data.
A 
C 
1_D_sum 
1_E_sum 
2_D_sum 
2_E_sum 

x 
a 
4 
10 

x 
b 
2 
2 

y 
b 
2 
5 
1 
3 
Modality handling¶
The columns in the output are defined as the list aggregates times the list of modalities.
Computation of the list of modalities¶
The modalities themselves are the combinations of a nonempty list of columns. Since the list of combinations can be huge, there are several options to bring it back to something more manageable:
most frequent : keep only the N combinations appearing the most in the input data
min occurrence cont : keep only the combination appearing at least N times in the input data
explicit : specify the combinations explicitly
The effective list of modalities used to build the output is only known after the entire input dataset is scanned, so it’s not readily available at designtime, but computed when the recipe is run. By default, the list of modalities for a given set of settings is computed only once, and kept for ulterior runs of the recipe. The option to “recompute schema at each run” on the “Output” section of the recipe lets you force a recompute of the list of modalities at each run. Note that in this case, the changes in the list of output columns are not automatically propagated to downstream datasets and recipes.
Cleaning of the modalities’ name¶
Since modalities are made up of a concatenation of the values of columns from the input data, their name is usually not directly usable as column name in SQL databases or Hive. The “Output” section of the recipe therefore offers options to simplify the names so that they become compatible with these systems:
soft slugification : swaps out whitespace and punctuation with ‘_’. This is sufficient for most SQL databases (PostgreSQL, Oracle…)
hard slugification : only keep alphanumeric characters, ‘_’ and ‘‘. This is typically for Hive (i.e. when the output dataset is HDFS)
numbering : completely ignores the original name of the modality and uses numbers instead. This is the safest of all schemes, and produces the shortest names.
truncation : after the above simplifications have been applied, truncate the names. SQL databases natural limitations are natively taken into account (for example the 32 char limit on Oracle’s column names), but some limitations are not implicit in the nature of the output dataset; typically, if the output is HDFS and is to be used with Impala, a 128 char limit needs to be enforced.
Aggregates¶
The recipe offers 2 levels of aggregates :
aggregates per row and modality (i.e. per pivot table cell)
aggregates per row (i.e. marginals)
Per row and modality¶
These are defined in the “Pivot” section of the recipe. “Add new” creates a new simple aggregate on a selected column, and the aggregate can be further setup by changing its aggregation, and if relevant, the aggregation settings.
For each aggregate defined in this section, and each modality, one column will be created in the output. The column name is made of the modality name concatenated with the aggregate’s column and aggregation type.
Per row¶
The “Other columns” section of the recipe adds aggregates per row. There are 2 typical uses:
to keep columns that are neither row identifiers nor aggregates in the pivot table. In this case the aggregates “First”, “Last” or “Concat” should be preferred.
to compute marginals to compare the aggregates per row and modality to. For example, one can aggregate the average of column A for each row of X and modality of Y, and at the same time aggregate the average of column A for each row X (across modalities of Y).
Comparison to pivot processor¶
The pivot processor is a streamoriented processor that pivots one row at a time and is available in the preparation scripts, and consequently in Prepare recipes.
Pivot recipe 
Pivot processor 

Modalities 
computed by inspecting entire dataset. Not available at designtime until the recipe has run once 
computed by using the designtime sample. A small sample or very imbalanced modalities implies that some modalities can be missed 
Dynamic output schema 
the list of modalities can optionally be computed at each run of the recipe 
schema is fixed at designtime 
Aggregations 
aggregates can be defined for each value 
no aggregation 
Output row definition 
combinations of columns can be used to define a row. The data doesn’t need to be presorted 
rows are defined by the value of one column. The data needs to be sorted on that column to have all rows with the same key squashed together 
Examples¶
Pivoting country net revenue by product¶
For the input:
Product 
Country 
net 
Year 

Toothpaste 
FR 
40 
2015 
Toothpaste 
GB 
80 
2015 
Toothpaste 
US 
60 
2015 
Toothpaste 
GB 
75 
2017 
Toothpaste 
US 
55 
2017 
Chocolate 
FR 
110 
2015 
Chocolate 
FR 
120 
2017 
Chocolate 
GB 
70 
2017 
Peanut butter 
US 
200 
2017 
Peanut butter 
GB 
30 
2017 
A pivot recipe using Product as row identifier, Country to create columns with, and with an aggregate of sum of Net will yield
Product 
FR_Net_sum 
GB_Net_sum 
US_Net_sum 

Toothpaste 
40 
155 
115 
Chocolate 
230 
70 

Peanut butter 
30 
200 
Adding an aggregate of sum of Net in the ‘Other columns’ section will yield
Product 
FR_Net_sum 
GB_Net_sum 
US_Net_sum 
Net_sum 

Toothpaste 
40 
155 
115 
310 
Chocolate 
230 
70 
300 

Peanut butter 
30 
200 
230 
Dummifying¶
The use of the Count of records aggregate allows for an easy and controlled way of dummifying columns. On the input:
Country 
Product 
Year 

FR 
Chocolate 
2017 
FR 
Sugar 
2016 
FR 
Apples 
2017 
GB 
Chocolate 
2017 
GB 
Sugar 
2015 
GB 
Apples 
2017 
GB 
Toffee 
2017 
US 
Sugar 
2016 
US 
Corn syrup 
2017 
US 
Toffee 
2017 
US 
Peanut butter 
2017 
A pivot recipe using Country as row identifier, Product to create columns with, and with an aggregate of count of records will yield:
Country 
Chocolate 
Sugar 
Apples 
Toffee 
Corn syrup 
Peanut butter 

FR 
1 
1 
1 
0 
0 
0 
GB 
1 
1 
1 
1 
0 
0 
US 
0 
1 
0 
1 
1 
1 
By additionaly specifying that only the top 4 modalities should be used, the output becomes:
Country 
Chocolate 
Sugar 
Apples 
Toffee 

FR 
1 
1 
1 
0 
GB 
1 
1 
1 
1 
US 
0 
1 
0 
1 