Fuzzy join: joining two datasets¶
The “fuzzy join” recipe is dedicated to joins between two datasets when join keys don’t match exactly.
It works by calculating a distance chosen by user and then comparing it to a threshold. DSS handles inner, left, right or outer joins.
Fuzzy join prepare recipe processor¶
DSS has another way to fuzzy join datasets, by using a dedicated prepare recipe processor. However a separate fuzzy join recipe is preferred as it has less technical limitations and works with larger datasets.
Building a simple join¶
When the recipe is first created it will try to automatically find matching columns based on their name and type. One to five initial conditions will be provided, but this list can be changed by user.
Adding join is a process involving several configuration steps. In the “Join” section of the recipe (in the left pane):
Click on an existing join conditions list or on a message “No join condition” to add a new condition.
Select the join type, between “Inner”, “Outer”, “Left” or “Right”
Fill in the join conditions. Conditions can be added with the “+” button, and removed with the “Remove” button (after selecting one)
Once the join definition is ready, go to the “Selected columns” section of the recipe and select the columns of each dataset whose values you want to get.
Finally, review the execution specs in the “Output” section
Join conditions¶
Each join condition describes a matching rule for two columns. Depending on column types different options will be available.
Note
If all of the join conditions are set to strict equality then a fuzzy join recipe will be equivalent to a regular join recipe. In this case a regular join is preferred as it’s more performant.
Available distances¶
Text columns¶
Damerau–Levenshtein - an edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
Hamming - a distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
Jaccard - a distance, which measures dissimilarity between sample sets of characters from joined strings. Calculated as a size of a set containing common characters divided by a size of a set containing all characters from both strings.
Cosine - a distance is measured by converting strings into vectors by counting characters appearing in both strings and then calculating a dot product of two vectors.
Also text values can be normalized before joining, a list of possible operations includes:
Name |
Description |
Example before |
Example after |
Case insensitive |
Ignores case when matching characters |
Hello, the Mister Lefèvre |
hello, the mister lefèvre |
Remove punctuation and extra spaces |
Removes punctuation and extra spaces |
Hello, the Mister Lefèvre |
Hello the Mister Lefèvre |
Clear salutations |
Removes English salutations, e.g. Miss, Sir, Dr |
Hello, the Mister Lefèvre |
Hello, the Lefèvre |
Clear stop words |
Removes common stop words depending on the language |
Hello, the Mister Lefèvre |
Hello Mister Lefèvre |
Transform to stem |
Transforms words to base form (Snowball stemmer) |
Monkeys eat bananas |
Monkey eat banana |
Alphabetic sorting of words |
Alphabetic sorting of words |
Hello, the Mister Lefèvre |
Hello Lefèvre Mister the |
Numeric columns¶
Euclidean distance
Geopoint columns¶
Geospatial distance
In case of other types or when column types don’t match the only join condition available is a strict equality.
For string and numeric columns it’s also possible to set a relative threshold. In this case a threshold will be in percents and the calculated distance will be divided by the length of a corresponding join key (or its value in case of numbers)
For example if there are two join keys “propre” and “propeller”, the distance is set to Damerau–Levenshtein and a threshold is relative and set to 50%.
An absolute Damerau–Levenshtein distance between these words is 4
If the distance is calculated relatively to the first dataset, then a relative distance is 4/6 = 66%, (6 is a length of “propre”) so with a 50% it’s not a match
If the distance is calculated relatively to the second dataset, then a relative distance is 4/9 = 44%, ( is a length of “propeller”) and it’s a match
Additional settings¶
There are two additional options of the recipe
Output matching details¶
Adds an additional “meta” column that contains a JSON object with details about joined keys that includes:
distance type
threshold
calculated distance
a result showing if two values matched
a pair of join values
Debug mode¶
Activates a cross join and also enabled meta column generation. Useful when trying to understand why certain rows didn’t match.
Warning
Since debug mode forces a cross join the recipe can be slow and can generate very large output. Consider filtering the inputs to only the rows that you’re interested in while debugging.
Columns in the output¶
Since datasets routinely have columns with identical names, it is possible to disambiguate column names in the “Selected columns” section, either by giving an alias for a given column (using the “pencil” button next to the given column), or by assigning a prefix to apply to all columns of the table (by clicking on the “No prefix” button).