Simplify text

Perform various simplifications on a text column.

Options

  • Normalize text: Transform to lowercase, remove punctuation and accents, and perform Unicode NFD normalization (Café -> cafe).

  • Stem words: Transform each word into its “stem”, i.e. its grammatical root. For example, grammatical is transformed to grammat. This transformation is language-specific.

  • Clear stop words: Remove so-called “stop words” (the, I, a, of, …). This transformation is language-specific.

  • Sort words alphabetically: Sort all words of the text in alphabetical order. For example, the small dog is transformed to dog small the, so that strings containing the same words in a different order can be matched.
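
To make the effect of these options more concrete, here is a minimal Python sketch of what normalization, stop-word removal and word sorting might look like. It is not the processor's actual implementation: the function names `normalize` and `simplify`, their parameters, and the tiny `STOP_WORDS` set are all hypothetical and chosen for illustration. Stemming is left out because it requires a language-specific stemmer.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny English stop-word list (illustration only).
STOP_WORDS = {"the", "i", "a", "of", "an", "and"}


def normalize(text):
    """Lowercase, strip accents via NFD decomposition, remove ASCII punctuation."""
    # NFD splits an accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks removes the accents.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Only ASCII punctuation is handled here; this is a simplifying assumption.
    no_punct = no_accents.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()


def simplify(text, clear_stop_words=False, sort_words=False):
    """Normalize the text, then optionally drop stop words and sort the words."""
    words = normalize(text).split()
    if clear_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if sort_words:
        words = sorted(words)
    return " ".join(words)
```

With this sketch, `simplify("the small dog", sort_words=True)` and `simplify("dog the small", sort_words=True)` return the same string, which is why sorting helps match texts that contain the same words in a different order.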

Note

Other processors that perform text operations (tokenization, n-gram extraction, fuzzy join) have built-in text simplification options. You do not need to simplify the text separately before using them.