Perform various simplifications on a text column.
Normalize text: Transform to lowercase, remove punctuation and accents, and perform Unicode NFD normalization.
Stem words: Transform each word into its “stem”, i.e. its grammatical root. For example, "grammatical" is transformed to "grammat". This transformation is language-specific.
Clear stop words: Remove so-called “stop words” (of, …). This transformation is language-specific.
Sort words alphabetically: Sort all words of the text in alphabetical order. For example, "the small dog" is transformed to "dog small the", allowing strings containing the same words in a different order to be matched.
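To make the options above concrete, here is a minimal Python sketch of three of them (normalization, stop-word removal, and word sorting) using only the standard library. It illustrates the general techniques and is not the processor's actual implementation. The function names and the tiny `STOP_WORDS` set are hypothetical, and the sketch assumes punctuation is deleted rather than replaced by a space. Stemming is left out because it depends on a language-specific stemmer.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stop-word list, for illustration only.
# A real list is language-specific and much longer.
STOP_WORDS = {"the", "a", "of"}


def normalize(text: str) -> str:
    """Lowercase, strip accents via NFD decomposition, drop ASCII punctuation."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # After NFD, each accent is a separate combining character we can filter out.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: punctuation is deleted, not replaced by a space.
    return no_accents.translate(str.maketrans("", "", string.punctuation))


def clear_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Remove whitespace-separated words that appear in the stop-word set."""
    return " ".join(word for word in text.split() if word not in stop_words)


def sort_words(text: str) -> str:
    """Sort whitespace-separated words alphabetically."""
    return " ".join(sorted(text.split()))


if __name__ == "__main__":
    print(normalize("Café, Déjà vu!"))
    print(clear_stop_words("the king of spain"))
    print(sort_words("the small dog"))
```

Chaining the functions shows why sorting helps matching: `sort_words(normalize("Dog, the small!"))` and `sort_words(normalize("the small dog"))` produce the same string.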
Other processors that operate on text (tokenization, n-gram extraction, fuzzy join) have built-in text simplification options, so you do not need to simplify text separately before using them.