Join: joining datasets¶
The “join” recipe is dedicated to joins between two or more datasets. DSS handles inner joins, left outer joins and full outer joins.
Building a simple join¶
Adding join is a process involving several configuration steps. In the “Join” section of the recipe (in the left pane):
use the “Add input” button to add one join.
select the 2 datasets involved in the join.
select the join type, between “Inner”, “Outer” and “Left”
fill in the join conditions. Conditions can be added with the “+” button, and removed with the “Remove” button (after selecting one)
Once the join definition is ready, go to the “Selected columns” section of the recipe and select the columns of each dataset whose values you want to get.
Finally, review the execution specs in the “Output” section, for example the generated SQL query and execution plan.
Columns in the output¶
Since datasets routinely have columns with identical names, it is possible to disambiguate column names in the “Selected columns” section, either by giving an alias for a given column (using the “pencil” button next to the given column), or by assigning a prefix to apply to all columns of the table (by clicking on the “No prefix” button).
Additional columns can be created by computations in the “Computed columns” section. There, the name for the column is explicitly specified, alongside the expression defining the column.
Engines¶
Depending on the input dataset types, DSS will adjust the engine it uses to execute the recipe, and choose between Hive, Impala, SparkSQL, plain SQL, and internal DSS. The available engines can be seen and selected by clicking on the cog below the “Run” button.
Database-specific notes¶
Vertica¶
Due to the way Vertica handles the lowercasing and string normalization operations, if you want to use the Join recipe with these options enabled, each join column must be below 8192 chars. You can set the width of string columns in the schema of the input datasets.
If you use lowercasing only, the width must be below 32K.