Pivot recipe

The “pivot” recipe lets you build pivot tables, with more control over the rows, columns and aggregations than what the pivot processor offers. It also lets you run the pivoting natively on external systems, like SQL databases or Hive.

Defining the pivot table rows

The rows in the output dataset are defined by the values of a tuple of columns, the row identifiers. This tuple can be specified explicitly, or implicitly as “all other columns”, in which case any column that is neither used to define modalities nor used in an aggregate becomes a row identifier.

| A | B | C | D | E |
|---|---|---|---|---|
| x | 1 | a | 1 | 6 |
| y | 1 | b | 2 | 5 |
| x | 1 | a | 3 | 4 |
| y | 2 | b | 1 | 3 |
| x | 2 | b | 2 | 2 |

For the input columns {A, B, C, D, E}, giving {A, B} as the explicit list of row identifiers will produce a pivot table where the rows are indexed by the pairs of values for A and B found in the input data.

| A | B |
|---|---|
| x | 1 |
| y | 1 |
| x | 2 |
| y | 2 |

On the other hand, using “all other columns” while having the modalities defined by column B and aggregates on columns D and E will produce a pivot table where the rows are indexed by the tuples of values for A and C found in the input data.

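| A | C | 1_D_sum | 1_E_sum | 2_D_sum | 2_E_sum |
|---|---|---------|---------|---------|---------|
| x | a | 4       | 10      |         |         |
| x | b |         |         | 2       | 2       |
| y | b | 2       | 5       | 1       | 3       |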
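The “all other columns” rule and the resulting table can be reproduced with a minimal Python sketch. The function names (`implicit_row_identifiers`, `pivot_sum`) are hypothetical and only illustrate the logic described above; they are not part of the recipe.

```python
from collections import defaultdict

# Input rows from the example table above (columns A, B, C, D, E).
COLUMNS = ["A", "B", "C", "D", "E"]
ROWS = [
    ("x", 1, "a", 1, 6),
    ("y", 1, "b", 2, 5),
    ("x", 1, "a", 3, 4),
    ("y", 2, "b", 1, 3),
    ("x", 2, "b", 2, 2),
]


def implicit_row_identifiers(columns, modality_columns, aggregated_columns):
    """'All other columns': keep columns that neither define modalities nor are aggregated."""
    used = set(modality_columns) | set(aggregated_columns)
    return [c for c in columns if c not in used]


def pivot_sum(columns, rows, row_ids, modality_column, aggregated_columns):
    """Sum each aggregated column per (row identifier tuple, modality)."""
    table = defaultdict(dict)
    for row in rows:
        record = dict(zip(columns, row))
        key = tuple(record[c] for c in row_ids)
        for col in aggregated_columns:
            # Output column name: modality, column and aggregation joined by '_'.
            name = f"{record[modality_column]}_{col}_sum"
            table[key][name] = table[key].get(name, 0) + record[col]
    return dict(table)


row_ids = implicit_row_identifiers(COLUMNS, ["B"], ["D", "E"])  # ['A', 'C']
result = pivot_sum(COLUMNS, ROWS, row_ids, "B", ["D", "E"])
```

Cells with no matching input row, such as `2_D_sum` for the row (x, a), are simply absent from the dictionary, which corresponds to the empty cells in the table.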

Modality handling

The columns in the output are defined as the list of aggregates times the list of modalities: one output column is created for each (modality, aggregate) pair.

Computation of the list of modalities

The modalities themselves are the combinations of values of a non-empty list of columns. Since the list of combinations can be huge, there are several options to bring it back to something more manageable:

  • most frequent : keep only the N combinations appearing the most in the input data

  • min occurrence count : keep only the combinations appearing at least N times in the input data

  • explicit : specify the combinations explicitly
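As an illustration of the first two options, here is a minimal Python sketch. The sample rows and the function names are hypothetical, and the way the recipe breaks ties between equally frequent combinations is not specified here, so the tie behaviour below is only that of `Counter.most_common`.

```python
from collections import Counter

# Hypothetical input rows; the modalities are built from the "Country" column.
rows = [
    {"Country": "FR"},
    {"Country": "GB"},
    {"Country": "GB"},
    {"Country": "US"},
    {"Country": "US"},
    {"Country": "US"},
]


def count_modalities(rows, modality_columns):
    """Count how often each combination of values appears in the input."""
    return Counter(tuple(row[c] for c in modality_columns) for row in rows)


def most_frequent(counts, n):
    """Keep only the n combinations appearing the most."""
    return [modality for modality, _ in counts.most_common(n)]


def min_occurrence_count(counts, n):
    """Keep only the combinations appearing at least n times."""
    return [modality for modality, count in counts.items() if count >= n]


counts = count_modalities(rows, ["Country"])
top_two = most_frequent(counts, 2)              # [('US',), ('GB',)]
at_least_two = min_occurrence_count(counts, 2)  # [('GB',), ('US',)]
```

Both filters need the counts over the whole input, which is why the list of modalities cannot be known before the data has been scanned.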

The effective list of modalities used to build the output is only known after the entire input dataset is scanned, so it’s not readily available at design-time, but computed when the recipe is run. By default, the list of modalities for a given set of settings is computed only once, and kept for subsequent runs of the recipe. The option to “recompute schema at each run” in the “Output” section of the recipe lets you force a recompute of the list of modalities at each run. Note that in this case, the changes in the list of output columns are not automatically propagated to downstream datasets and recipes.

Cleaning of the modalities’ name

Since modalities are made up of a concatenation of the values of columns from the input data, their name is usually not directly usable as column name in SQL databases or Hive. The “Output” section of the recipe therefore offers options to simplify the names so that they become compatible with these systems:

  • soft slugification : swaps out whitespace and punctuation with ‘_’. This is sufficient for most SQL databases (PostgreSQL, Oracle…)

  • hard slugification : keeps only alphanumeric characters, ‘_’ and ‘-’. This is typically for Hive (i.e. when the output dataset is HDFS)

  • numbering : completely ignores the original name of the modality and uses numbers instead. This is the safest of all schemes, and produces the shortest names.

  • truncation : after the above simplifications have been applied, truncates the names. The natural limitations of SQL databases are taken into account automatically (for example the 32 char limit on Oracle’s column names), but some limitations are not implied by the nature of the output dataset; typically, if the output is HDFS and is to be used with Impala, a 128 char limit needs to be enforced.
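The four schemes can be approximated with the following Python sketch. It is an assumption-laden illustration, not the recipe's actual implementation: the function names are hypothetical, “alphanumeric” is taken to mean ASCII letters and digits, and numbering is assumed to start at 1 in the order the modalities are given.

```python
import re


def soft_slugify(name):
    """Replace each whitespace or punctuation character with '_'."""
    return re.sub(r"[^\w]", "_", name)


def hard_slugify(name):
    """Keep only ASCII alphanumeric characters, '_' and '-' (assumption: ASCII only)."""
    return re.sub(r"[^A-Za-z0-9_-]", "", name)


def number_modalities(names):
    """Ignore the original names and use numbers instead (assumption: 1-based, input order)."""
    return {name: str(i) for i, name in enumerate(names, start=1)}


def truncate(name, max_length):
    """Cut the name down to at most max_length characters."""
    return name[:max_length]


soft = soft_slugify("Peanut butter (US)")   # 'Peanut_butter__US_'
hard = hard_slugify("Peanut butter (US)")   # 'PeanutbutterUS'
numbered = number_modalities(["FR", "GB"])  # {'FR': '1', 'GB': '2'}
short = truncate(soft, 10)                  # 'Peanut_but'
```

Truncation is applied last because two long names can become identical once shortened; a real implementation would need to handle such collisions, which this sketch does not.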

Aggregates

The recipe offers two levels of aggregates:

  • aggregates per row and modality (i.e. per pivot table cell)

  • aggregates per row (i.e. marginals)

Per row and modality

These are defined in the “Pivot” section of the recipe. “Add new” creates a new simple aggregate on a selected column, and the aggregate can be further configured by changing its aggregation and, if relevant, the aggregation settings.

For each aggregate defined in this section, and each modality, one column will be created in the output. The column name is made of the modality name concatenated with the aggregate’s column and aggregation type.
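The naming rule can be sketched in Python as a product of modalities and aggregates. The `_` separator is inferred from column names such as `1_D_sum` elsewhere on this page, and `output_column_names` is a hypothetical helper, not a recipe API.

```python
from itertools import product


def output_column_names(modalities, aggregates):
    """One column per (modality, aggregate) pair, named modality_column_aggregation."""
    return [
        f"{modality}_{column}_{aggregation}"
        for modality, (column, aggregation) in product(modalities, aggregates)
    ]


names = output_column_names(["1", "2"], [("D", "sum"), ("E", "sum")])
# ['1_D_sum', '1_E_sum', '2_D_sum', '2_E_sum']
```

With M modalities and K aggregates the output therefore gains M × K columns, which is why limiting the number of modalities matters.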

Per row

The “Other columns” section of the recipe adds aggregates per row. There are 2 typical uses:

  • to keep columns that are neither row identifiers nor aggregates in the pivot table. In this case the aggregates “First”, “Last” or “Concat” should be preferred.

  • to compute marginals to compare the aggregates per row and modality to. For example, one can aggregate the average of column A for each row of X and modality of Y, and at the same time aggregate the average of column A for each row X (across modalities of Y).

Comparison to pivot processor

The pivot processor is a stream-oriented processor that pivots one row at a time and is available in the preparation scripts, and consequently in Prepare recipes.

| | Pivot recipe | Pivot processor |
|---|---|---|
| Modalities | computed by inspecting entire dataset. Not available at design-time until the recipe has run once | computed by using the design-time sample. A small sample or very imbalanced modalities implies that some modalities can be missed |
| Dynamic output schema | the list of modalities can optionally be computed at each run of the recipe | schema is fixed at design-time |
| Aggregations | aggregates can be defined for each value | no aggregation |
| Output row definition | combinations of columns can be used to define a row. The data doesn’t need to be pre-sorted | rows are defined by the value of one column. The data needs to be sorted on that column to have all rows with the same key squashed together |

Pre-filtering

Pre-filters can be applied. See the filters documentation for details.

Examples

Pivoting country net revenue by product

For the input:

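| Product       | Country | Net | Year |
|---------------|---------|-----|------|
| Toothpaste    | FR      | 40  | 2015 |
| Toothpaste    | GB      | 80  | 2015 |
| Toothpaste    | US      | 60  | 2015 |
| Toothpaste    | GB      | 75  | 2017 |
| Toothpaste    | US      | 55  | 2017 |
| Chocolate     | FR      | 110 | 2015 |
| Chocolate     | FR      | 120 | 2017 |
| Chocolate     | GB      | 70  | 2017 |
| Peanut butter | US      | 200 | 2017 |
| Peanut butter | GB      | 30  | 2017 |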

A pivot recipe using Product as row identifier, Country to create columns with, and with an aggregate of sum of Net will yield:

| Product       | FR_Net_sum | GB_Net_sum | US_Net_sum |
|---------------|------------|------------|------------|
| Toothpaste    | 40         | 155        | 115        |
| Chocolate     | 230        | 70         |            |
| Peanut butter |            | 30         | 200        |

Adding an aggregate of sum of Net in the “Other columns” section will yield:

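| Product       | FR_Net_sum | GB_Net_sum | US_Net_sum | Net_sum |
|---------------|------------|------------|------------|---------|
| Toothpaste    | 40         | 155        | 115        | 310     |
| Chocolate     | 230        | 70         |            | 300     |
| Peanut butter |            | 30         | 200        | 230     |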
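The two levels of aggregation in this example can be reproduced with a short Python sketch. It only mirrors the arithmetic of the tables above; `pivot_with_marginal` is a hypothetical name and does not reflect how the recipe is implemented.

```python
from collections import defaultdict

# Input rows from the example: (Product, Country, Net, Year).
SALES = [
    ("Toothpaste", "FR", 40, 2015),
    ("Toothpaste", "GB", 80, 2015),
    ("Toothpaste", "US", 60, 2015),
    ("Toothpaste", "GB", 75, 2017),
    ("Toothpaste", "US", 55, 2017),
    ("Chocolate", "FR", 110, 2015),
    ("Chocolate", "FR", 120, 2017),
    ("Chocolate", "GB", 70, 2017),
    ("Peanut butter", "US", 200, 2017),
    ("Peanut butter", "GB", 30, 2017),
]


def pivot_with_marginal(rows):
    """Sum Net per (Product, Country) cell and per Product row."""
    cells = defaultdict(lambda: defaultdict(int))
    for product, country, net, _year in rows:
        cells[product][f"{country}_Net_sum"] += net  # per row and modality
        cells[product]["Net_sum"] += net             # per row (marginal)
    return {product: dict(columns) for product, columns in cells.items()}


pivoted = pivot_with_marginal(SALES)
```

For each product, the marginal `Net_sum` equals the sum of its per-country cells, for example 40 + 155 + 115 = 310 for Toothpaste.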

Dummifying

The use of the Count of records aggregate allows for an easy and controlled way of dummifying columns. On the input:

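| Country | Product       | Year |
|---------|---------------|------|
| FR      | Chocolate     | 2017 |
| FR      | Sugar         | 2016 |
| FR      | Apples        | 2017 |
| GB      | Chocolate     | 2017 |
| GB      | Sugar         | 2015 |
| GB      | Apples        | 2017 |
| GB      | Toffee        | 2017 |
| US      | Sugar         | 2016 |
| US      | Corn syrup    | 2017 |
| US      | Toffee        | 2017 |
| US      | Peanut butter | 2017 |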

A pivot recipe using Country as row identifier, Product to create columns with, and with an aggregate of count of records will yield:

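| Country | Chocolate | Sugar | Apples | Toffee | Corn syrup | Peanut butter |
|---------|-----------|-------|--------|--------|------------|---------------|
| FR      | 1         | 1     | 1      | 0      | 0          | 0             |
| GB      | 1         | 1     | 1      | 1      | 0          | 0             |
| US      | 0         | 1     | 0      | 1      | 1          | 1             |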

By additionally specifying that only the top 4 modalities should be used, the output becomes:

| Country | Chocolate | Sugar | Apples | Toffee |
|---------|-----------|-------|--------|--------|
| FR      | 1         | 1     | 1      | 0      |
| GB      | 1         | 1     | 1      | 1      |
| US      | 0         | 1     | 0      | 1      |
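The dummification above, including the restriction to the most frequent modalities, can be sketched in Python as follows. The `dummify` function is hypothetical, the Year column is left out because it plays no role here, and the column order (first appearance in the input) is an assumption chosen to match the tables above.

```python
from collections import Counter

# (Country, Product) pairs from the example input; Year is not needed here.
PURCHASES = [
    ("FR", "Chocolate"), ("FR", "Sugar"), ("FR", "Apples"),
    ("GB", "Chocolate"), ("GB", "Sugar"), ("GB", "Apples"), ("GB", "Toffee"),
    ("US", "Sugar"), ("US", "Corn syrup"), ("US", "Toffee"), ("US", "Peanut butter"),
]


def dummify(rows, top_n=None):
    """Count records per (country, product); optionally keep only the top_n products."""
    product_counts = Counter(product for _country, product in rows)
    if top_n is None:
        kept = set(product_counts)
    else:
        kept = {product for product, _ in product_counts.most_common(top_n)}
    # Keep columns in order of first appearance in the input.
    columns = [product for product in product_counts if product in kept]
    cell_counts = Counter(rows)
    countries = list(dict.fromkeys(country for country, _ in rows))
    # A missing (country, product) pair counts as 0.
    table = {c: [cell_counts[(c, p)] for p in columns] for c in countries}
    return columns, table


all_columns, full_table = dummify(PURCHASES)
top_columns, top_table = dummify(PURCHASES, top_n=4)
```

In this data Sugar appears three times and Chocolate, Apples and Toffee twice each, so the top 4 is unambiguous; with ties at the cut-off, which modalities are kept would depend on the tie-breaking rule.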