Upsert: Consolidate data¶
Upsert is a term coined by blending “update” and “insert”, and the operation is often handled in SQL databases by a MERGE INTO statement. The goal of the operation is to take incoming rows and perform two different operations at the same time:
If the row already exists in the output, update the values of the row in the output with the values of the incoming row (UPDATE mode).
If the row does not yet exist in the output, add the incoming row to the output (INSERT mode).
Since no row deletion occurs in the output, the net effect of an upsert is that incoming rows are consolidated into a single output dataset.
This capability is provided by the Upsert recipe plugin, which you need to install. Please see Installing plugins.
Upsert keys¶
The upsert operation needs to map each row in the input and output datasets to a unique entity. This is done by specifying a subset of columns to act as a unique key.
Example¶
Let’s say you regularly receive customer data files with one row per customer, but only for customers who were added or whose information was modified since the last run. Each customer is identified by a customer_id key. Starting from an empty output dataset, an upsert recipe behaves as follows:
First run (add customers 1 and 2), with input:
customer_id |
name |
rating |
|---|---|---|
1 |
Bob |
3 |
2 |
Alice |
1 |
The output after this run is:
customer_id |
name |
rating |
|---|---|---|
1 |
Bob |
3 |
2 |
Alice |
1 |
Second run (modify customer 2 and add customer 3), with input:
customer_id |
name |
rating |
|---|---|---|
2 |
Alicia |
1 |
3 |
Daphne |
2 |
The output after this run is:
customer_id |
name |
rating |
|---|---|---|
1 |
Bob |
3 |
2 |
Alicia |
1 |
3 |
Daphne |
2 |
Third run (modify customer 1), with input:
customer_id |
name |
rating |
|---|---|---|
1 |
Bob |
999 |
The output after this run is:
customer_id |
name |
rating |
|---|---|---|
1 |
Bob |
999 |
2 |
Alicia |
1 |
3 |
Daphne |
2 |
Engines¶
Depending on the input dataset types, DSS adjusts the engine it uses to execute the recipe and chooses between Hive, Impala, SparkSQL, plain SQL, and internal DSS. The available engines can be viewed and selected by clicking the cog below the “Run” button.
When the engine is SQL, DSS can offer several modes of operation depending on the type of underlying database:
direct upsert statement: many SQL databases can handle upsert recipes natively, usually with aMERGE INTOstatement, sometimes with special handling ofINSERT INTOrejections due to a unique constraint.update then insert: issue two SQL statements to the database, one to update rows already present in the output dataset, then one to add new rows. This mode does not use a temporary table.prepare upserted then replace output: prepare a temporary table with existing rows and new rows, then clear the output dataset and replace it with this temporary table. This is the most generic mode, but also the slowest, and it requires the ability to create a temporary table.
Notes¶
At least one upsert key is required to identify rows.
DSS does not enforce or control the uniqueness of rows in the input or output dataset. If several rows have the same combination of values for the upsert keys, the behavior is undefined.