This processor performs various simplifications on a text column. It removes all non-alphanumeric characters and offers several additional processing options.
Example use case
You want to perform statistics on a search engine query log, but there is too much variance among the user queries. Simplifying the query first allows you to have more relevant statistics by merging queries that should be considered as the same.
The following processing options are available:
- Normalize text: transforms to lowercase, removes accents and performs Unicode normalization (Café -> cafe).
- Clear stop words: removes so-called ‘stop words’ (the, I, a, of, …). This transformation is language-specific and requires you to enter the language of your column.
- Stem words: transforms each word into its ‘stem’, i.e. its grammatical root. For example, ‘grammatical’ is transformed to ‘grammat’. This transformation is language-specific and requires you to enter the language of your column.
- Sort words alphabetically: sorts all words of the text. For example, ‘the small dog’ is transformed to ‘dog small the’. This allows you to match strings that are written with the same words in a different order.
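The effect of these options can be approximated outside the processor with a short Python sketch. This is only an illustration and not the processor's actual implementation: the function names, the tiny `STOP_WORDS` set and the order in which the steps are applied are all assumptions made for the example. Stemming is left out because it relies on a language-specific algorithm.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny English stop-word list (illustration only).
STOP_WORDS = {"the", "i", "a", "of"}


def normalize(text):
    """Lowercase the text and strip accents via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.lower()


def simplify(text, clear_stop_words=False, sort_words=False):
    """Approximate the simplification options described above (assumed order)."""
    # Keep only runs of letters and digits; everything else acts as a separator.
    words = re.findall(r"[^\W_]+", normalize(text))
    if clear_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if sort_words:
        words = sorted(words)
    return " ".join(words)


# Made-up search queries that differ only in case, accents, punctuation or order.
queries = ["Café Paris", "paris cafe", "PARIS, café!", "the small dog"]
counts = Counter(simplify(q, sort_words=True) for q in queries)
print(counts)
```

In this sketch the first three queries collapse to the same simplified string, so they are counted together, which is the kind of merging described in the example use case.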
Note about simplification in text processors
Many text operations like tokenization, n-gram extraction or fuzzy join benefit from text simplification.
Therefore, the simplification options are already available in all of these processors, and you do not need to perform simplification before using them.