Reshaping

Reshaping processors are used to change the “shape” (rows/columns) of the data.

DSS provides the following reshaping processors:

Split and Fold

This processor creates new lines by splitting the values of a column. For example, with the following dataset:

customer_id events browser
1 login,product,buy Mozilla
2 login,product,logout Chrome

Applying “Split and Fold” on the “events” column with “,” as the separator will generate the following result:

customer_id events browser
1 login Mozilla
1 product Mozilla
1 buy Mozilla
2 login Chrome
2 product Chrome
2 logout Chrome

All columns except the folded column are copied in each new line.
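
Outside DSS, the same transformation can be sketched in a few lines of pandas (a minimal illustration reusing the column names from the example above, not the processor’s implementation):

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2],
        "events": ["login,product,buy", "login,product,logout"],
        "browser": ["Mozilla", "Chrome"],
    })

    # Split each cell on "," into a list, then create one row per list element.
    # All other columns are copied to each new row, as in "Split and Fold".
    folded = df.assign(events=df["events"].str.split(",")).explode("events")
    print(folded)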

Fold multiple columns

This processor takes values from multiple columns and transforms them into one line per column.

For example, with the following dataset representing monthly scores:

person   age   01/2014   02/2014   03/2014
John     24    3         4         3
Sidney   31              6         9
Bill     33    1                   4

We would like to get one line per (month, person) couple with the score.

Applying the “Fold multiple columns” processor with:

  • 3 columns in the “columns list”: 01/2014, 02/2014, 03/2014
  • “month” as the “fold name column”
  • “score” as the “fold value column”

will generate the following result:

person   age   month     score
John     24    01/2014   3
John     24    02/2014   4
John     24    03/2014   3
Sidney   31    01/2014
Sidney   31    02/2014   6
Sidney   31    03/2014   9
Bill     33    01/2014   1
Bill     33    02/2014
Bill     33    03/2014   4

  • The names of the folded columns are used as values of the “fold name column”
  • The values of the folded columns are used as values of the “fold value column”
  • The folded columns are removed
  • All other columns are copied
  • Empty values are preserved in the folded result
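
Outside DSS, this is the classic “wide to long” reshaping; a minimal pandas sketch of the same fold (using the column names from the example above) could look like this:

    import pandas as pd

    df = pd.DataFrame({
        "person": ["John", "Sidney", "Bill"],
        "age": [24, 31, 33],
        "01/2014": [3, None, 1],
        "02/2014": [4, 6, None],
        "03/2014": [3, 9, 4],
    })

    # melt() folds the listed columns: their names go to "month", their
    # values to "score". Other columns are copied, and empty values are
    # kept as NaN, matching the behaviour described above.
    folded = df.melt(
        id_vars=["person", "age"],
        value_vars=["01/2014", "02/2014", "03/2014"],
        var_name="month",
        value_name="score",
    )
    print(folded.sort_values(["person", "month"]))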

Fold multiple columns by pattern

This processor is a variant of Fold multiple columns, where the columns to fold are specified by a pattern instead of an explicit list. The processor only creates output lines for non-empty values.

For example, with the following dataset representing quarterly scores:

person   age   Q1_score   Q2_score   Q3_score
John     24    3          4          6
Sidney   31               6          9
Bill     33    1                     4

Applying the “Fold multiple columns by pattern” processor with the pattern “.*_score” will generate the following result:

person age quarter score
John 24 Q1_score 3
John 24 Q2_score 4
John 24 Q3_score 6
Sidney 31 Q2_score 6
Sidney 31 Q3_score 9
Bill 33 Q1_score 1
Bill 33 Q3_score 4

When the pattern contains a capture group, the captured portion of the folded column’s name is used. On the same dataset, using the pattern “(.*)_score” would produce:

person age quarter score
John 24 Q1 3
John 24 Q2 4
John 24 Q3 6
Sidney 31 Q2 6
Sidney 31 Q3 9
Bill 33 Q1 1
Bill 33 Q3 4
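
A pandas sketch of the same pattern-based fold (illustrative only; the regular expression and column names come from the example above):

    import pandas as pd
    import re

    df = pd.DataFrame({
        "person": ["John", "Sidney", "Bill"],
        "age": [24, 31, 33],
        "Q1_score": [3, None, 1],
        "Q2_score": [4, 6, None],
        "Q3_score": [6, 9, 4],
    })

    # Select the columns to fold with the pattern, then melt them.
    pattern = r"(.*)_score"
    fold_cols = [c for c in df.columns if re.fullmatch(pattern, c)]
    folded = df.melt(id_vars=["person", "age"], value_vars=fold_cols,
                     var_name="quarter", value_name="score")

    # Keep only the captured part of the column name, and drop empty values
    # (the pattern-based fold only creates lines for non-empty values).
    folded["quarter"] = folded["quarter"].str.extract(pattern, expand=False)
    folded = folded.dropna(subset=["score"])
    print(folded)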

Unfold

This processor transforms cell values into binary columns.

For example, with the following dataset:

id type
0 A
1 A
2 C
3 B

Applying the “Unfold” processor on the “type” column will generate the following result:

id   type_A   type_C   type_B
0    1
1    1
2             1
3                      1

Each value of the unfolded column will create a new column. This new column:

  • contains the value “1” if the original column contained this value
  • remains empty otherwise.

Unfolding is often used to find correlations with a particular value, or to create graphs.
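
For comparison, a minimal pandas sketch of the same unfold (note that pandas orders the new columns alphabetically and fills non-matching cells with 0 rather than leaving them empty):

    import pandas as pd

    df = pd.DataFrame({"id": [0, 1, 2, 3], "type": ["A", "A", "C", "B"]})

    # One binary column per distinct value of "type", prefixed with the
    # original column name; the original column is then dropped.
    dummies = pd.get_dummies(df["type"], prefix="type").astype(int)
    unfolded = pd.concat([df.drop(columns="type"), dummies], axis=1)
    print(unfolded)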

Warning

Limitations

The Unfold processor dynamically creates new columns based on the actual data within the cells.

Due to the way the schema is handled when you create a Shaker recipe, only the values that were found at least once in the sample will create columns in the output dataset.

Unfolding a column with a large number of values will create a large number of columns. This can cause performance issues. It is highly recommended not to unfold columns with more than 100 values.

Split and Unfold

This processor splits multiple values in a cell and transforms them into columns.

For example, with the following dataset:

customer_id events
1 login, product, buy
2 login, product, logout

Applying the “Split and Unfold” processor on the “events” column, we get:

customer_id   events_login   events_product   events_buy   events_logout
1             1              1                1
2             1              1                             1

The unfolded column is deleted.
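
A minimal pandas sketch of the same operation (illustrative; “, ” is assumed here to be the separator, as in the sample values above, and pandas orders the new columns alphabetically):

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2],
        "events": ["login, product, buy", "login, product, logout"],
    })

    # Split each cell on the separator and build one 0/1 column per value,
    # then drop the original "events" column.
    dummies = df["events"].str.get_dummies(sep=", ").add_prefix("events_")
    unfolded = pd.concat([df.drop(columns="events"), dummies], axis=1)
    print(unfolded)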

Warning

Limitations

The limitations that apply to the “Unfold” processor also apply to the “Split and Unfold” processor.

Triggered Unfold

This processor is used to reassemble several rows when a specific value is encountered. It is useful for analysis of “interaction sessions” (a series of events with a specific event marking the beginning of a new interaction session). For example, while analyzing the logs of a web game, the “start game” event would be the beginning event.

Warning

Limitations

Triggered Unfold offers a basic session analysis that is very simple to use, but it comes with many limitations.

Triggered Unfold assumes that the input data is sorted by time. It only works on non-split datasets (for example, a single file or a SQL table).

Non-terminated sessions are kept in memory. It is recommended that you do not use Triggered Unfold if you have more than a few thousand sessions.

For more advanced session analysis, if you have split data or a large number of sessions, you should use dedicated recipes (for example, using SQL or Pig).

For example, let’s imagine this dataset:

user_id event_type timestamp
user_id1 login_event t1
user_id2 login_event t2
user_id1 event_type2 t3
user_id2 event_type2 t4
user_id1 login_event t5
user_id2 event_type3 t6
user_id2 login_event t7

We know that “login_event” marks the beginning of a new session / new interaction, and we want to track the timestamps of other event types in each session.

We apply a “Triggered unfold” with the following parameters:

  • Column acting as event key: user_id
  • Fold column: event_type
  • Trigger value: login_event
  • Column with data: timestamp

This generates the following result:

user_id    login_event   event_type2   event_type3   login_event_prev
user_id1   t1            t3
user_id2   t2            t4            t6
user_id1   t5                                        t1
user_id2   t7                                        t2

We get:

  • One column for each type of event
  • One line for each occurrence of “login_event” in the stream
  • The user_id associated with each login_event is kept, and the timestamps of the other events are restored
  • The “_prev” column tracks the data associated with the previous occurrence of “login_event” for each user_id.
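
Conceptually, the reassembly can be sketched in plain Python (an illustration of the logic only, not the DSS implementation; the trigger value and column roles mirror the parameters above):

    import pandas as pd

    rows = [
        ("user_id1", "login_event", "t1"),
        ("user_id2", "login_event", "t2"),
        ("user_id1", "event_type2", "t3"),
        ("user_id2", "event_type2", "t4"),
        ("user_id1", "login_event", "t5"),
        ("user_id2", "event_type3", "t6"),
        ("user_id2", "login_event", "t7"),
    ]

    TRIGGER = "login_event"
    sessions = []        # one dict per session, in order of the session's first event
    open_session = {}    # user_id -> session dict currently being filled

    for user_id, event_type, timestamp in rows:   # assumes rows are sorted by time
        if event_type == TRIGGER:
            previous = open_session.get(user_id)
            session = {"user_id": user_id, TRIGGER: timestamp}
            if previous is not None:
                # keep the data of the previous trigger for this user_id
                session[TRIGGER + "_prev"] = previous[TRIGGER]
            open_session[user_id] = session
            sessions.append(session)
        elif user_id in open_session:
            # attach the event's data to the current session of this user_id
            open_session[user_id][event_type] = timestamp

    print(pd.DataFrame(sessions))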