Joins two tables based on text approximation.

Parameters:
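Key column in the main table (e.g., Name).
Key column in the reference table (e.g., Name).
Columns to bring back from the reference table (e.g., Region).
Comparison method: Jaro Winkler • Damerau LevenStein (distance) • similarity of Dice coefficient.
Optional partition columns (e.g., Country).
Optional prefix for the columns brought back (e.g., Region_).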
fuzzyJoin matches records between two datasets when the keys are close but not exactly equal.
Instead of a strict = join, it computes a similarity (or distance) between the key in the main table and the key in the reference table, then returns the k nearest candidates together with their scores and any columns you choose to bring back from the reference.
It supports three proven string-comparison methods: Jaro–Winkler, Damerau–Levenshtein (a distance), and the Dice coefficient (a similarity).
Why you need it. In real datasets, people type “Jonson” for “Johnson”, “Acme Inc.” for “ACME”, or swap first/last names. Regular joins drop those rows.
fuzzyJoin lets you recover them, score them, and decide how strict you want to be.
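The idea of returning the k nearest candidates with their scores can be sketched in a few lines of Python. This is only an illustration, not fuzzyJoin's implementation: the function `fuzzy_join` and its arguments are hypothetical, and the score comes from Python's `difflib.SequenceMatcher`, which is a stand-in and not one of the three methods listed above. The output column names follow the ClosestKey_n, Similarity_n, and prefix conventions described on this page.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in score in [0, 1]; NOT Jaro-Winkler, Damerau-Levenshtein, or Dice.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_join(main_rows, ref_rows, key="Name", bring_back=("Region",),
               prefix="Region_", k=1):
    """Hypothetical helper: attach the k most similar reference rows to each main row."""
    joined_rows = []
    for row in main_rows:
        # Score every reference row against the current main-table key.
        scored = [(similarity(row[key], ref[key]), ref) for ref in ref_rows]
        # Stable sort: highest score first, earlier reference rows win ties.
        best = sorted(scored, key=lambda pair: -pair[0])[:k]
        joined = dict(row)  # keep the original main-table columns
        for n, (score, ref) in enumerate(best, start=1):
            joined[f"ClosestKey_{n}"] = ref[key]
            joined[f"Similarity_{n}"] = score
            for col in bring_back:
                joined[f"{prefix}{col}_{n}"] = ref[col]
        joined_rows.append(joined)
    return joined_rows


main_table = [{"Name": "Jonson", "Company": "Acme Inc."}]
reference_table = [
    {"Name": "Johnson", "Region": "North"},
    {"Name": "Smith", "Region": "South"},
]
result = fuzzy_join(main_table, reference_table, k=2)
```

With k = 2, each output row keeps Name and Company and gains two candidate groups (suffix _1 and _2), so alternates can be inspected before a threshold is chosen.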
Let’s imagine we have three tables in which a key has not been defined, and we want to use names to join them:
Humans sometimes input data incorrectly:

Jaro–Winkler and Damerau–Levenshtein will recover many of these, but depending on the particular typo or word order, some matches will still be missed. In practice, Dice coefficient tends to work well across all such cases because it compares common sets of word tokens rather than character-by-character order alone.
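To see why a token-based score tolerates swapped word order while a character-based distance does not, consider the minimal sketch below. It makes two assumptions that are not confirmed by this page: the Dice coefficient is computed on whitespace-separated, lowercased word tokens, and the edit distance is the optimal string alignment variant of Damerau–Levenshtein. The function names are illustrative.

```python
def dice_tokens(a: str, b: str) -> float:
    # Dice coefficient on sets of word tokens: 2 * |A ∩ B| / (|A| + |B|).
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


def osa_distance(a: str, b: str) -> int:
    # Optimal string alignment: insertions, deletions, substitutions,
    # and transpositions of two adjacent characters each cost 1.
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


swapped_dice = dice_tokens("John Smith", "Smith John")     # word order ignored
swapped_edits = osa_distance("John Smith", "Smith John")   # many character edits
typo_edits = osa_distance("Jonson", "Johnson")             # one missing letter
```

The swapped name scores a perfect Dice similarity because both strings contain the same word tokens, whereas the edit distance treats it as many character changes. A single-letter typo inside a word is the opposite case: the edit distance is small, but a word-token Dice score drops because the misspelled token no longer matches.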
For upstream cleanup (lower/trim/remove accents), see Correct Spelling.
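The cleanup steps named above (lowercase, trim, remove accents) could look like this in Python. This is a sketch using the standard `unicodedata` module, not the behavior of any specific action in the tool, and `normalize_key` is a hypothetical helper name.

```python
import unicodedata


def normalize_key(text: str) -> str:
    # Decompose accented characters into base character + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, which removes the accents.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase, trim, and collapse internal runs of whitespace.
    return " ".join(without_accents.lower().split())


cleaned = normalize_key("  José   GARCÍA ")
```

Applying the same normalization to both the main and the reference key before the join keeps the comparison from being penalized by case, spacing, or accent differences.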
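Select the key column in the main table (e.g., Name) and the key column in the reference table (also Name).
Choose the reference columns to bring back (e.g., Region) and an optional prefix (e.g., Region_).
The output contains ClosestKey_1…k and Similarity_1…k (or Distance_1…k) plus any reference columns for each candidate.

For each row in Pin 0, you get: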
Original columns from the main table.
For each n = 1…k:
ClosestKey_n.
Similarity_n (or Distance_n).
The selected reference columns, with the prefix and an _n suffix, e.g.: Region_Region_1, Region_Region_2, … (if idPrefix = Region_ and you selected Region).

similarity of Dice coefficient, k = 3.
Jaro Winkler, k = 3.
distance of Damerau LevenStein, k = 1–3.
Use partitions (e.g., Country, Source, Brand) to compare only within the same bucket.
Set k = 1 and follow with a Filter (Similarity_1 ≥ 0.85 or Distance_1 ≤ 1) to accept only strong matches.

Pipeline layout

Read the main table (Pin 0)
readCSV → file with columns Name, Company.
Read the reference table (Pin 1)
readCSV → file with columns Name, Region.
Select columns in fuzzyJoin
Main table key: Name
Reference table key: Name
Columns to bring back: Region
k: 1
Metric: similarity of Dice coefficient
Prefix (idPrefix): Region_
Preview results (Dice)
The output should include ClosestKey_1, Similarity_1, and Region_Region_1.
Compare metrics


Start with k = 3 to inspect alternates; move to k = 1 in production.
Keep k small for very large tables.

“The column ‘X’ is missing on input pin 1 / initialization of ‘idRefId’ failed.”
The chosen column to join from the reference does not exist (often a mis-selected pin). Open the selector and pick a column from Pin 1.
Low similarities across the board.
Normalize text, try Dice, and lower the threshold temporarily to inspect what the model is comparing.
Too many look-alike matches.
Reduce k, raise your acceptance threshold, and/or add partitions to limit comparisons to the right cohort.
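The combination of partitions and an acceptance threshold can be illustrated with the sketch below. It is an assumption-laden example rather than the tool's logic: `best_match_within_partition` is a hypothetical function, the score again uses `difflib.SequenceMatcher` as a stand-in for the real metrics, and 0.85 is simply the example threshold quoted earlier on this page.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1]; not one of fuzzyJoin's methods.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match_within_partition(main_rows, ref_rows, key="Name",
                                partition="Country", threshold=0.85):
    # Group reference rows by partition value so each main row is only
    # compared with candidates from its own bucket.
    buckets = defaultdict(list)
    for ref in ref_rows:
        buckets[ref[partition]].append(ref)

    accepted = []
    for row in main_rows:
        candidates = buckets.get(row[partition], [])
        if not candidates:
            continue  # nothing to compare against in this bucket
        best_score, best_ref = max(
            ((score(row[key], ref[key]), ref) for ref in candidates),
            key=lambda pair: pair[0],
        )
        # Equivalent of k = 1 followed by a Filter on Similarity_1.
        if best_score >= threshold:
            accepted.append(
                {**row, "ClosestKey_1": best_ref[key], "Similarity_1": best_score}
            )
    return accepted


main_table = [
    {"Name": "Jonson", "Country": "US"},
    {"Name": "Jonson", "Country": "FR"},
]
reference_table = [
    {"Name": "Johnson", "Country": "US"},
    {"Name": "Dupont", "Country": "FR"},
]
strong_matches = best_match_within_partition(main_table, reference_table)
```

Here the US row is matched to Johnson, while the FR row is dropped: its only candidate lives in a different cohort's spelling space and scores below the threshold. Partitioning also reduces the number of comparisons, since each row is scored only against its own bucket.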
