Join with other dataset (memory-based)
This processor performs a left join with another (small) dataset.
# Example use case
You are processing a dataset of events. The events contain a reference to a product id. You have another dataset which contains details about the products, and you want to retrieve the product details for each event.
# Requirements and limitations
The ‘other’ dataset must fit in RAM. As a rule of thumb, it should contain no more than ~500 000 rows. If it is larger, use a recipe to join the datasets instead (for example, a Pig, Hive, Python or SQL recipe).
Both the dataset being processed and the ‘other’ dataset must have a column holding the join key.
The processor performs a deduplicated left join:
- If no rows in the ‘other’ dataset match, joined columns are left empty
- If multiple rows match in the ‘other’ dataset, only one is kept: the ‘last’ one. Because ordering is not guaranteed, which row ends up being selected is effectively arbitrary
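The deduplicated behaviour described above can be illustrated with a small sketch in plain Python. This is not the processor's implementation; the function name `build_lookup`, the sample data and the use of an empty string for "empty" are assumptions made for illustration. Here "last" simply means last in iteration order, whereas the processor itself gives no ordering guarantee.

```python
def build_lookup(other_rows, key):
    """Index the 'other' dataset by join key.

    A later row with the same key overwrites an earlier one,
    so only one row per key survives (the deduplication step).
    """
    lookup = {}
    for row in other_rows:
        lookup[row[key]] = row
    return lookup


# Hypothetical 'other' dataset with a duplicated key (p1).
products = [
    {"product_id": "p1", "name": "Kettle"},
    {"product_id": "p2", "name": "Toaster"},
    {"product_id": "p1", "name": "Kettle v2"},
]
lookup = build_lookup(products, "product_id")

# Hypothetical dataset being processed: p1 matches, p3 does not.
events = [{"product_id": "p1"}, {"product_id": "p3"}]

joined = []
for event in events:
    match = lookup.get(event["product_id"])
    # Unmatched rows are kept, with the joined column left empty.
    name = match["name"] if match is not None else ""
    joined.append({**event, "name": name})
```

Every input row appears exactly once in `joined`, whether or not it matched, which is what distinguishes this from a regular left join that would emit one output row per matching pair.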
The processor needs the following parameters:
- Column containing the join key in the current dataset (which may have been generated by a previous step).
- Name of the dataset to join with. This dataset must be in the same project.
- Column containing the join key in the joined dataset.
- Columns from the joined dataset that should be copied to the local dataset, for the matched row.
The processor outputs selected columns from the joined dataset. For each row of the current dataset, the columns will contain the data from the matching row in the joined dataset.
If no row matched in the joined dataset, the output columns will be left empty.
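To make the role of each parameter concrete, here is a hedged sketch of a function taking the same four inputs. The function `memory_join`, its argument names and the sample rows are hypothetical and only mimic the documented behaviour; in particular, representing an empty output as an empty string is an assumption.

```python
def memory_join(rows, local_key, other_rows, other_key, columns_to_copy):
    """Copy selected columns from the matching 'other' row into each row.

    rows            -- the dataset being processed (list of dicts)
    local_key       -- join key column in the current dataset
    other_rows      -- the small dataset to join with, held in memory
    other_key       -- join key column in the joined dataset
    columns_to_copy -- columns of the joined dataset to output
    """
    # One entry per key; a later duplicate overwrites an earlier one.
    lookup = {other[other_key]: other for other in other_rows}

    output = []
    for row in rows:
        match = lookup.get(row.get(local_key))
        new_row = dict(row)
        for column in columns_to_copy:
            # Output columns stay empty when no row matched.
            new_row[column] = match.get(column, "") if match is not None else ""
        output.append(new_row)
    return output


events = [
    {"event": "view", "product": "p1"},
    {"event": "buy", "product": "p9"},
]
products = [{"id": "p1", "name": "Kettle", "price": "25", "stock": "3"}]

result = memory_join(events, "product", products, "id", ["name", "price"])
```

Note that the join key columns may have different names in the two datasets (`product` and `id` here), and that only the requested columns are copied: `stock` does not appear in `result`.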