Extract with regular expression

This processor extracts parts of a column's values using a regular expression. The chunks to extract are delimited using regular expression capture groups.

Unnamed captures

With simple (unnamed) captures, each match is put in a numbered column whose name is the output column prefix followed by the group number. Unnamed capture groups use the (pattern) syntax.

Example:

  • Cell value: id-37-X234

  • Pattern: id-([0-9]*)-([0-9A-Z]*)

  • Output column prefix: extracted_

  • Result: extracted_1=37  extracted_2=X234
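
The example above can be reproduced outside the processor with a minimal Python sketch. The function name extract_unnamed is hypothetical and this is not the processor's actual implementation; it only illustrates how group numbers map to column names.

```python
import re

def extract_unnamed(value, pattern, prefix):
    """Map each unnamed capture group to a column named prefix + group number."""
    # re.search looks for the pattern anywhere in the value (it is not anchored).
    match = re.search(pattern, value)
    if match is None:
        return {}
    # Groups are numbered from 1, in the order of their opening parenthesis.
    return {f"{prefix}{i}": text for i, text in enumerate(match.groups(), start=1)}

result = extract_unnamed("id-37-X234", r"id-([0-9]*)-([0-9A-Z]*)", "extracted_")
print(result)  # {'extracted_1': '37', 'extracted_2': 'X234'}
```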

Named captures

With named captures, each match is put in a column whose name is the output column prefix followed by the group name. Named capture groups use the (?<groupname>pattern) syntax.

Example:

  • Cell value: id-37-X234

  • Pattern: id-(?<numidentifier>[0-9]*)-(?<identifier2>[0-9A-Z]*)

  • Output column prefix: extracted_

  • Result: extracted_numidentifier=37  extracted_identifier2=X234
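
A similar sketch for named captures, again in Python and again only an illustration. Python's re module writes named groups as (?P<name>pattern), so the hypothetical helper below naively rewrites the (?<name>pattern) form first; that rewrite is an assumption of this sketch and does not handle every edge case (for instance an escaped parenthesis followed by ?<).

```python
import re

def extract_named(value, pattern, prefix):
    """Map each named capture group to a column named prefix + group name."""
    # Translate (?<name>...) to Python's (?P<name>...). Lookbehinds such as
    # (?<=...) and (?<!...) are left alone because a name must start with a letter.
    py_pattern = re.sub(r"\(\?<([A-Za-z][A-Za-z0-9]*)>", r"(?P<\1>", pattern)
    match = re.search(py_pattern, value)
    if match is None:
        return {}
    return {prefix + name: text for name, text in match.groupdict().items()}

named = extract_named(
    "id-37-X234",
    r"id-(?<numidentifier>[0-9]*)-(?<identifier2>[0-9A-Z]*)",
    "extracted_",
)
print(named)  # {'extracted_numidentifier': '37', 'extracted_identifier2': 'X234'}
```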

Found column

If you enable this option, a column named ‘prefix found’ will contain a boolean indicating whether the pattern matched.
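
The idea of a found flag can be sketched as follows. The function extract_with_found and the column name extracted_found are hypothetical and chosen only for illustration; the name the processor actually generates may differ.

```python
import re

def extract_with_found(value, pattern, prefix, found_column):
    """Return the extracted columns plus a boolean column telling whether the pattern matched."""
    match = re.search(pattern, value)
    # found_column is supplied by the caller; its name is an assumption of this sketch.
    row = {found_column: match is not None}
    if match is not None:
        for i, text in enumerate(match.groups(), start=1):
            row[f"{prefix}{i}"] = text
    return row

pattern = r"id-([0-9]*)-([0-9A-Z]*)"
matched = extract_with_found("id-37-X234", pattern, "extracted_", "extracted_found")
missed = extract_with_found("unrelated text", pattern, "extracted_", "extracted_found")
print(matched)  # {'extracted_found': True, 'extracted_1': '37', 'extracted_2': 'X234'}
print(missed)   # {'extracted_found': False}
```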

Notes

  • Regular expressions are not anchored, so the pattern can match anywhere in the cell value: ([0-9]+) will capture 232 in val-232.
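
The difference between an unanchored and an anchored match can be seen in a short Python sketch. It uses a digit group with the + quantifier and Python's re.search, which scans the whole string; this is an illustration of the general behavior, not of the processor's internals.

```python
import re

value = "val-232"

# Unanchored: the pattern may match anywhere in the value.
unanchored = re.search(r"([0-9]+)", value)

# Anchored with ^ and $: the whole value must consist of digits, so this fails.
anchored = re.search(r"^([0-9]+)$", value)

print(unanchored.group(1))  # 232
print(anchored)             # None
```

To force a match on the whole cell value, add ^ and $ to the pattern explicitly.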

Smart Pattern

Smart Pattern can help you write your regular expression. Fill in ‘Input column’ and click on ‘Find with Smart Pattern’.

In the ‘Smart Pattern’ window, you can highlight examples of the substrings you want to extract. To use a pattern in the processor, select it and click on ‘OK’.

See How-To: Extract Patterns With the Smart Pattern Builder for a detailed example.