Correct Spelling Mistakes in text fields.

ETL includes an operator that checks and corrects the spelling mistakes in any text field. For example, let’s assume that your database contains a field named “City of Birth”. This field will usually contain many different spellings (i.e. misspellings) of the same city. The CorrectSpelling action will detect and correct these errors automatically. It’s typically used to “clean” the database to get better reports, better predictive models, etc.
NOTE:
For example, the city "RIO DE JANEIRO" can be misspelled in a number of different ways (this is a real-world example): RIO DXE JANEIRO, RIO DE JAEIRO, RIOP DE JANEIRO, RIO NDE JANEIRO, RIO DEJANEIRO, RIO DE JANEIRO, RIO DE JANIRO, RIO DE JANEI RO, RIO DE JANEIRIO, RI0 DE JANEIRO, RIO DE JNEIRO, RIO DE JANEEIRO, RIO DE JANEIROO, RIO DE JANAEIRO, RIO DE JANEIROR, RIO DE JANEIRO RJ
This action can operate in two different modes: either on its own (the corrections are derived from the table itself) or with the help of a “reference table” that contains the correct spellings (see below).
Here are the parameters of this action:

Let’s go back to the “city name” example: i.e. we have a table with different city names; some of them contain spelling mistakes and we want to correct these mistakes.
We’ll typically have a pipeline like this one:

Some modalities practically never appear inside the text corpus. These very “light” strings (as opposed to the “heavy” strings) could easily be spelling errors. The threshold between a “light” and a “heavy” modality is given by the parameter P3.
A given string is a spelling error (that must be corrected) if it is a “light” string that is very similar to another, better-represented string. The parameters P1 and P2 define whether two strings must be declared “very similar”.
What happens to the cities that have a small population in real life? All these cities will only have a small representation inside the database (i.e. they will never be classified as “heavy” modalities). To still be able to correct the spelling mistakes of these “small cities”, you must set the parameter P4 to the value “regroup with any other modality” (this is actually the best choice, but it’s much slower).
All the strings that are spelling errors are “regrouped” with the string that is the most likely to have the correct spelling (i.e. with the most represented string). So, we can have the following regroupings:

Above, we see that many people make the same spelling mistake: they replace the letter “M” with the letter “N” inside the word “CAMPINA”. Let’s now assume that, after the corrections/regroupings illustrated above, we find the following two new “groups”:

We see that, amongst the many people who erroneously write “CANPINA” (instead of the correct spelling “CAMPINA”), some do not manage to write the word “GRANDE” correctly either (i.e. they erroneously write “GRANED” or “GRANEDE” instead). The two groups illustrated above are quite similar because CAMPINA is written with exactly the same spelling mistake (“CANPINA”) in both groups. Thus, it would be nice if these 2 groups were now merged together in a second iteration (because they most likely represent the same city, “CAMPINA GRANDE”):

So that we finally obtain one single group:

Note that these 5 strings could not be “regrouped” together at the first iteration (for example, because “CANPINA GRANED” is “too far away” from “CAMPINA GRANDE”)… But the group containing “CANPINA GRANED” is similar enough to the group containing “CAMPINA GRANDE”, so that it’s still possible to regroup these 2 groups in a later iteration.
We can now give more details about the spelling-correction algorithm used in ETL. In outline, the algorithm is:
1. Compute the number of occurrences (the “weight”) of each distinct string.
2. Classify each string as “heavy” or “light” using the threshold P3.
3. Regroup each “light” string with the better-represented string that it is very similar to (parameters P1 and P2), or with any other modality, depending on P4.
4. Compute the similarity between the resulting groups and merge the groups that are similar enough. The parameter P5 decides how the similarity between two groups is computed:

- P5 is unchecked: the similarity between two groups is computed using all the strings contained in the two groups. This is the best option, although it’s very slow.
- P5 is checked: the similarity between two groups is the similarity between the most represented string in each group.
This means that, in the above example, the similarity between our two groups is the similarity between the following 2 strings: “CAMPINA GRANDE” and “CANPINA GRANED”.

The parameters P6 and P7 are used to decide how many iterations the algorithm performs. Most of the time, one iteration is enough. Sometimes, two iterations can be useful (but the second iteration usually takes a huge amount of time).
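To make the above description concrete, here is a minimal sketch of such an iterative regrouping. It assumes the parameter semantics described above: the names sim_threshold, heavy_threshold and max_iterations are hypothetical stand-ins for P1/P2, P3 and P6/P7, and similarity() can be any of the measures described below.

```python
from collections import Counter

def correct_spelling(values, similarity, sim_threshold=0.85,
                     heavy_threshold=50, max_iterations=2):
    # Hedged sketch of the light/heavy regrouping described above.
    counts = Counter(values)
    # One group per distinct string; the representative of a group is its
    # most frequent member, i.e. the most likely correct spelling.
    groups = {s: [s] for s in counts}
    for _ in range(max_iterations):
        heavy = [s for s in groups if counts[s] >= heavy_threshold]
        light = [s for s in groups if counts[s] < heavy_threshold]
        merged = False
        for s in light:
            # Regroup a "light" string with the most similar "heavy" one,
            # provided that the two strings are similar enough.
            best = max(heavy, key=lambda h: similarity(s, h), default=None)
            if best is not None and similarity(s, best) >= sim_threshold:
                groups[best].extend(groups.pop(s))
                counts[best] += counts.pop(s)
                merged = True
        if not merged:
            break  # nothing changed: further iterations are useless
    # The correction table: each string -> the representative of its group.
    return {v: rep for rep, members in groups.items() for v in members}
```

Setting P4 to “regroup with any other modality” would correspond to comparing each “light” string with every other modality instead of only the “heavy” ones (better, but much slower).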
Let’s go back to the “city name” example: i.e. we now have two tables that contain city names: the table to correct and a “reference table” that contains the correct spelling of each city.
We’ll typically have a pipeline like this one:

As input to the CorrectSpelling action, we have a table like this one:

All the cities from the reference table have a “heavy” weight (greater than the parameter P3), so that no attempt to “correct” them will be made.
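To picture how such a weighted input table can be built, here is a hedged sketch (the table and column names are hypothetical, and this only illustrates the weighting scheme, not the actual internals of the action):

```python
import pandas as pd

# Hypothetical tables: `data` holds the city names to correct and
# `reference` holds the correct spellings.
data = pd.DataFrame({"city": ["RIO DE JANEIRO", "RIO DE JANIRO",
                              "RI0 DE JANEIRO"]})
reference = pd.DataFrame({"city": ["RIO DE JANEIRO"]})

HEAVY = 10**6  # any weight safely greater than the P3 threshold

# Observed frequencies for the table to correct...
weights = (data["city"].value_counts()
           .rename_axis("city").reset_index(name="weight"))
# ...and an artificially "heavy" weight for every reference city, so
# that the reference spellings are never "corrected" themselves.
ref_weights = reference.assign(weight=HEAVY)

input_table = pd.concat([weights, ref_weights], ignore_index=True)
print(input_table)
```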
When using a reference table, the parameters of the CorrectSpelling action are, typically:
There are 3 options for the “Similarity Measure”:
The “Damerau-Levenshtein” similarity measure is derived from the “Damerau-Levenshtein” distance between two strings. This distance is defined in the following way: the “Damerau-Levenshtein” distance between 2 strings is the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.
We derive the “Damerau-Levenshtein” similarity from the “Damerau-Levenshtein” distance:
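For illustration, here is a straightforward (unoptimized) implementation of this distance, together with one common way of turning it into a similarity. The normalization by the length of the longest string is an assumption; the exact formula used by ETL may differ.

```python
def damerau_levenshtein(a, b):
    # Damerau-Levenshtein distance (optimal string alignment variant):
    # minimum number of insertions, deletions, substitutions and
    # transpositions of adjacent characters needed to turn `a` into `b`.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def dl_similarity(a, b):
    # Assumed normalization: 1 - distance / length of the longest string.
    if not a and not b:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / max(len(a), len(b))
```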

Here are some examples:
Let’s look at the above two examples: the “Damerau-Levenshtein” distance finds that “BRUGELS” is closer to “BRUSELS” than to “BRUSSSEL”. There is something very wrong about that. The example illustrates that the “Damerau-Levenshtein” distance can easily produce counter-intuitive rankings.
The Jaro-Winkler similarity is an attempt to improve on the idea of the “Damerau-Levenshtein” similarity. The Jaro-Winkler similarity tries to stay meaningful in the cases where the “Damerau-Levenshtein” similarity fails (i.e. when the normalized distance between the strings goes above 1). Extensive tests demonstrate that, in practically all real-world studies, the Jaro-Winkler similarity outperforms the “Damerau-Levenshtein” similarity. The exact definition of the Jaro-Winkler similarity is well documented on the internet and will not be reproduced here.
NOTE:
The documentation about the “Jaro-Winkler similarity” is available on Wikipedia, on a page named “Jaro-Winkler distance”. This is somewhat disturbing because it’s actually a similarity measure (the higher the value, the closer the two strings are) and not a distance measure.
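Nevertheless, for readers who want to experiment, here is a compact implementation following the usual definition (a standard sketch with the customary prefix scaling factor p = 0.1 and at most 4 rewarded prefix characters; not the exact ETL implementation):

```python
def jaro(s1, s2):
    # Jaro similarity between two strings (value between 0 and 1).
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if identical and within half the longest length.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions between the matched characters.
    half_transpositions = 0
    k = 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    # Boost the Jaro similarity for strings sharing a common prefix.
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)
```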
Although the Jaro-Winkler similarity is better than the “Damerau-Levenshtein” similarity, it’s still limited to “small” strings (for example: city names, surnames, etc.) where the number of differences between the compared strings stays relatively small because of the small size of the strings (after all, the Jaro-Winkler similarity is just an improvement on the “Damerau-Levenshtein” similarity and ultimately suffers from the same defects). Thus, if you want to compute similarities between very long strings (such as “street names”, long “product names”, long “SKU names”, long “book titles”, etc.), you’d better use the “Dice Coefficient” (the “Dice Coefficient” is also sometimes named the “pair letters similarity”).
The “Dice Coefficient” is a similarity metric that rewards both common substrings and a common ordering of those substrings. The definition is:
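In the usual formulation (which we assume here), let nx be the number of adjacent character pairs in the string x, ny the number of character pairs in the string y, and nc the number of character pairs common to both strings. Then:

    similarity(x, y) = 2 × nc / (nx + ny)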

The intention is that, by considering adjacent characters (i.e. character pairs), we take into account not only the common characters, but also the character ordering in the two strings, since each character pair contains a little information about the ordering. Here is an example:
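Consider, for instance, the two strings FRANCE and FRENCH (an illustrative pair):

    FRANCE -> {FR, RA, AN, NC, CE}   (5 pairs)
    FRENCH -> {FR, RE, EN, NC, CH}   (5 pairs)
    pairs common to both strings: {FR, NC}   (2 pairs)

    similarity = (2 × 2) / (5 + 5) = 40%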

Inside the ETL implementation of the Dice Coefficient, all the letter pairs that contain a space character are discarded before any computation. This means that the similarity between the strings "vitamin B" and "vitamin C" is 100%. In this case (i.e. when you are interested in correcting such small differences), you should rather use the “Damerau-Levenshtein” or “Jaro-Winkler” similarity.
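A minimal sketch of this variant (a hedged illustration, not the actual ETL code; it discards every pair that contains a space, as described above):

```python
def letter_pairs(s):
    # Adjacent character pairs; pairs containing a space are discarded.
    return [s[i:i + 2] for i in range(len(s) - 1) if " " not in s[i:i + 2]]

def dice_coefficient(x, y):
    # Dice coefficient on letter pairs (multiset intersection).
    pairs_x = letter_pairs(x)
    pairs_y = letter_pairs(y)
    if not pairs_x and not pairs_y:
        return 1.0
    common, remaining = 0, list(pairs_y)
    for p in pairs_x:
        if p in remaining:
            remaining.remove(p)
            common += 1
    return 2.0 * common / (len(pairs_x) + len(pairs_y))

print(dice_coefficient("vitamin B", "vitamin C"))  # -> 1.0, i.e. 100%
```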
The output of the CorrectSpelling action is a table that contains all the corrections to apply to the data to “clean” it completely. The easiest way to use this “correction table” is the following pipeline:
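In code form, applying such a correction table typically amounts to a left join, keeping the corrected value when one exists and the original value otherwise. Here is a minimal sketch, assuming pandas and hypothetical column names ("city" in the data, "original"/"corrected" in the correction table):

```python
import pandas as pd

# Hypothetical data and correction table produced by CorrectSpelling.
data = pd.DataFrame({"city": ["RIO DE JANIRO", "SAO PAULO",
                              "RI0 DE JANEIRO"]})
corrections = pd.DataFrame({
    "original":  ["RIO DE JANIRO", "RI0 DE JANEIRO"],
    "corrected": ["RIO DE JANEIRO", "RIO DE JANEIRO"],
})

# Left-join the corrections onto the data...
cleaned = data.merge(corrections, how="left",
                     left_on="city", right_on="original")
# ...then keep the corrected spelling when one exists.
cleaned["city"] = cleaned["corrected"].fillna(cleaned["city"])
print(cleaned[["city"]])
```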

