Extracting document content

Dataiku can natively process unstructured documents using the “Extract content” recipe. This recipe takes a managed folder of documents as input and lets you choose between two recipe types:

  • Extract full document, which extracts the content of each document into a dataset at different granularities, such as one row per page, detected section, or document.

  • Extract fields, which extracts specific structured fields from documents into a dataset.

To get started with document extraction, see Initial document extraction setup below.

Note

The “Embed documents” recipe lets you extract documents in a similar way into a vector store instead of a dataset (see Embedding and searching documents).

Recipe types

The “Extract content” recipe includes two recipe types.

Extract full document

Use Extract full document to extract the content of documents into a dataset. Depending on the extraction method and output settings, the output can contain one row per page, detected section, or document.

Extract fields

Use Extract fields to extract specific values from documents into a structured dataset. This is useful for use cases such as extracting invoice numbers, supplier names, dates, totals, or repeated line items from business documents.

Supported document types

The “Extract content” recipe includes two recipe types with the following supported file types:

  • Extract full document: PDF, PPTX/PPT, DOCX/DOC, ODT/ODP, TXT, MD (Markdown), PNG, JPG/JPEG, HTML

  • Extract fields: PDF, PPTX/PPT, DOCX/DOC, ODT/ODP, PNG, JPG/JPEG

Extract full document

The Extract full document recipe supports two ways of handling documents.

Text extraction

Text extraction extracts text from documents and organizes it into meaningful units. It supports PDF, DOCX, PPTX, HTML, TXT, and MD. Image formats (PNG, JPEG, JPG) are supported if Optical Character Recognition (OCR) is enabled.

Two engines are available for text extraction and can be configured using custom rules.

Raw text extraction

This engine focuses on the physical layout of the document. It extracts text in a single row per document while keeping the page or slide division of the document in the structured content column whenever possible, for example for PPTX and PDF files.

If OCR is enabled, PDF files are first converted into images. The engine then extracts text from those images (useful for scanned documents).

Because this engine does not try to infer the semantic structure of the document, it is very fast.

Structured text extraction

Structured text extraction runs as follows:

  • The text content is extracted from the document.

  • If headers are available, they are used to divide the content into meaningful units.

  • The extracted text is aggregated into one row per section or per document while keeping the structure of detected sections in a structured column.
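The header-based step above can be sketched in Python. This is a minimal illustration, not Dataiku's actual implementation; it assumes detected headers are already marked Markdown-style with a leading "#".

```python
# Minimal sketch of header-based sectioning (illustrative only):
# split extracted text into one (title, content) unit per detected section.
def split_into_sections(text):
    sections = []
    current_title, current_lines = None, []
    for line in text.splitlines():
        if line.startswith("#"):  # a detected header starts a new section
            if current_title is not None or current_lines:
                sections.append((current_title, "\n".join(current_lines).strip()))
            current_title, current_lines = line.lstrip("# ").strip(), []
        else:
            current_lines.append(line)
    sections.append((current_title, "\n".join(current_lines).strip()))
    return sections
```

Each resulting section would then become one output row (or be re-aggregated into one row per document).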

Text can also be extracted from images detected in the documents:

  • with the ‘Optical Character Recognition’ (OCR) image handling mode. You can choose either EasyOCR or Tesseract as the OCR engine. EasyOCR does not require any configuration but is slow when running on CPU. Tesseract requires some configuration, see OCR setup. Enabling OCR is recommended for scanned documents.

  • with the ‘VLM description’ image handling mode. A visual LLM is used to generate descriptions for each image in the document. This is available for PDF, DOCX, and PPTX files.

By default, this engine uses a lightweight classification model to identify and filter out non-informative images, such as barcodes, signatures, icons, logos, or QR codes, from text processing. While these images are skipped during extraction, all images are still saved to the output folder if Store images is enabled in Output > Storage settings. To process all images regardless of content, disable Skip non-informative images in the recipe advanced settings.

Note

Structured text extraction requires internet access for PDF document extraction: the layout models it uses must be downloaded from Hugging Face. The runtime environment needs internet access at least initially so that those models can be downloaded and placed in the Hugging Face cache.

If your instance does not have internet access then you can download those models manually. Here are the steps to follow:

14.3.0 and later
  • Go to the model repository and clone it (at the v2.3.0 revision)

  • Create a “ds4sd--docling-models” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-models

  • Create a “ds4sd--docling-layout-egret-medium” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-layout-egret-medium

  • (Optional) You can also download the “docling-layout-heron” model in addition to, or instead of, “docling-layout-egret-medium” if you need more accurate layout detection at the cost of speed.

    • Create a “ds4sd--docling-layout-heron” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-layout-heron

    • The folder “ds4sd--docling-layout-heron” should contain the same files as https://huggingface.co/docling-project/docling-layout-heron/tree/main/

Prior to 14.3.0
  • Go to the model repository and clone it (at the v2.2.0 revision)

  • Create a “ds4sd--docling-models” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-models

If the models are not in this resources folder, the Hugging Face cache is checked; if they are not in the cache either, they are downloaded and placed in the Hugging Face cache.
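The documented lookup order can be sketched as follows. The function name and return convention are illustrative, not Dataiku's code; only the order (resources folder, then Hugging Face cache, then download) comes from the documentation.

```python
from pathlib import Path

# Sketch of the documented model lookup order (illustrative, not DSS code):
# 1. the code env resources folder, 2. the Hugging Face cache,
# 3. a fresh download from Hugging Face.
def resolve_model_dir(resources_root, hf_cache_dir, folder_name):
    candidate = Path(resources_root) / "document_extraction_models" / folder_name
    if candidate.is_dir():
        return candidate, "resources"
    cached = Path(hf_cache_dir) / folder_name
    if cached.is_dir():
        return cached, "cache"
    return cached, "download"  # the models would be fetched and cached here
```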

Note

You can edit the run configuration of the text extraction engine under Administration > Settings > Other > LLM Mesh > Configuration > Document extraction recipes.

VLM extraction

For complex documents, Dataiku implements another strategy based on Vision Language Models (VLM), i.e. LLMs that can take images as input. If your LLM Mesh is connected to one of these (see Multimodal capabilities for a list), you can use the VLM strategy.

  • Instead of extracting the text, the recipe transforms each page of the document into an image.

  • Ranges of images are sent to the VLM to extract the document content, including descriptions of visual elements such as graphics and tables.

  • The extracted content is then saved as dataset rows or aggregated to one row per document.

  • The images themselves can be stored in a managed folder, and their paths are referenced in the corresponding dataset rows.
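The batching of page images into ranges can be sketched like this; the function name and batch size are illustrative, not the recipe's actual parameters.

```python
# Sketch: group page images into contiguous ranges before sending each
# range to the VLM (batch size is a hypothetical parameter).
def page_ranges(n_pages, batch_size=4):
    return [(start, min(start + batch_size, n_pages) - 1)
            for start in range(0, n_pages, batch_size)]
```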

The advanced image understanding capabilities of the VLM allow for more relevant results than relying only on extracted text.

The Extract full document recipe supports the VLM strategy for DOCX/DOC, PPTX/PPT, PDF, ODT/ODP, JPG/JPEG, and PNG files.

Extract fields

The Extract fields recipe extracts specific structured values from documents into a dataset.

This recipe is useful when you want to define a target schema and automatically extract fields such as identifiers, dates, amounts, names, or repeated items from a set of documents.

The Extract fields recipe relies on a Vision Language Model (VLM) available through the LLM Mesh and supporting structured output.

Recipe configuration

To configure the recipe, first select the VLM used for extraction.

Optionally, you can customize the general instructions to provide additional context to the model before defining the extraction schema.

Note

Extract fields requires a VLM connection in the LLM Mesh that supports structured output.

Extraction schema

To define the extraction schema, select a sample document in the preview and define the fields to extract.

For each field, you can configure:

  • a field name

  • a description

  • a data type

Supported data types are:

  • string

  • integer

  • number

  • date

  • boolean

  • array

Use array fields when you want to extract repeated items, such as invoice line items. Array fields can contain subfields for each repeated item.

For example, an invoice_items array can contain subfields such as:

  • description

  • quantity

  • unit price

You can add, rename, remove, or modify fields and subfields after creating the recipe.
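As an illustration, the invoice example above could be described by a schema along these lines. The dict layout and field names are hypothetical; the actual schema is defined in the recipe UI.

```python
# Hypothetical representation of an extraction schema with an array field
# (layout illustrative only; configure the real schema in the recipe UI).
schema = {
    "fields": [
        {"name": "invoice_number", "type": "string",
         "description": "The unique identifier of the invoice"},
        {"name": "invoice_date", "type": "date",
         "description": "The date the invoice was issued"},
        {"name": "invoice_items", "type": "array", "subfields": [
            {"name": "description", "type": "string"},
            {"name": "quantity", "type": "integer"},
            {"name": "unit_price", "type": "number"},
        ]},
    ]
}
```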

Warning

Changing the extraction schema changes the structure of the output dataset. After modifying the schema, re-run the recipe to regenerate the output dataset.

Test extraction on a document

To evaluate the extraction results, you can run the extraction on the document currently selected in the preview.

If the extraction quality is not satisfactory, try one of the following:

  • refine the field descriptions to make them clearer and more specific

  • customize the general instructions to describe the document type, language, or extraction rules more precisely

  • use a different VLM available in the LLM Mesh

Output settings

You can choose an update method to control how the output dataset is updated. For more details, see Output update methods.

You can also configure how values are handled when they do not match the expected data type defined in the extraction schema.

If your schema contains arrays with subfields, you can choose how arrays are written to the output dataset:

  • With array expansion disabled, the output contains one column per field and a single row for each document. Array values are stored in a nested structure.

  • With array expansion enabled, the output contains one column per subfield and one row per extracted item.
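The two modes can be illustrated on a single extracted record (field names hypothetical; this is not Dataiku's implementation, only a sketch of the resulting row shapes):

```python
# One extracted record with an array field (illustrative names).
record = {
    "document": "invoice_001.pdf",
    "invoice_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 9.5},
        {"description": "Gadget", "quantity": 1, "unit_price": 24.0},
    ],
}

# Expansion disabled: one row per document, array kept as a nested value.
nested_rows = [record]

# Expansion enabled: one row per extracted item, one column per subfield.
expanded_rows = [
    {"document": record["document"], **item}
    for item in record["invoice_items"]
]
```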

Initial document extraction setup

  • Document Extraction is automatically preinstalled when using Dataiku Cloud Stacks or Dataiku Cloud. If you are using Dataiku Custom, before using the VLM extraction, you need a server administrator with elevated (sudoers) privileges to run:

sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-deps.sh" -with-libreoffice
  • Text extraction on “Embed documents” and “Extract full document” recipes on DOCX/PDF/PPTX (with both engines) requires installing and enabling a dedicated code environment (see Code environments):

    In Administration > Code envs > Internal envs setup, in the Document extraction code environment section, select a Python version from the list and click Create code environment.

  • Extract fields requires access to a VLM in the LLM Mesh that supports structured output.

OCR setup

When using the OCR mode of the text extraction, you can choose between EasyOCR and Tesseract. The AUTO mode uses Tesseract if it is installed, and falls back to EasyOCR otherwise.
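The AUTO behavior amounts to the following check (a sketch of the documented rule, not DSS internals):

```python
import shutil

# Sketch of the documented AUTO mode: prefer Tesseract when the binary
# is available on the system, otherwise fall back to EasyOCR.
def pick_ocr_engine():
    return "tesseract" if shutil.which("tesseract") else "easyocr"
```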

Tesseract

Tesseract is preinstalled on Dataiku Cloud and Dataiku Cloud Stacks. If you are using Dataiku Custom, Tesseract needs to be installed on the system. Dataiku uses the tesserocr Python package as a wrapper around the tesseract-ocr API. It requires libtesseract (>= 3.04) and libleptonica (>= 1.71).

The English language and the OSD files must be installed. Additional languages can be downloaded and added to the tessdata directory. Here is the list of supported languages.

For example on Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config

On AlmaLinux:

sudo dnf install tesseract
curl -L -o /usr/share/tesseract/tessdata/osd.traineddata https://github.com/tesseract-ocr/tessdata/raw/4.1.0/osd.traineddata
chmod 0644 /usr/share/tesseract/tessdata/osd.traineddata

At runtime, Tesseract relies on the TESSDATA_PREFIX environment variable to locate the tessdata folder. This folder should contain the language files and config. You can either:

  • Set the TESSDATA_PREFIX environment variable (must end with a slash /). It should point to the tessdata folder of the instance.

  • Leave it unset. During the Document Extraction internal code env resources initialization, DSS will look for possible locations of the folder, copy it to the resources folder of the code env, then set the TESSDATA_PREFIX accordingly.
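If you set the variable yourself, the trailing-slash requirement can be checked as in this sketch (the path shown is the common Ubuntu location and is only an example; use your instance's actual tessdata folder):

```python
import os

# Hypothetical path: point this at your instance's tessdata folder.
os.environ["TESSDATA_PREFIX"] = "/usr/share/tesseract/tessdata/"

# The documented constraint: the value must end with a slash.
assert os.environ["TESSDATA_PREFIX"].endswith("/")
```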

Note

If run in a container execution configuration, DSS handles the installation of Tesseract during the build of the image.

EasyOCR

EasyOCR does not require any additional configuration, but it is very slow when running on CPU. We recommend using an execution environment with a GPU.

Note

By default, EasyOCR tries to download missing language files. Any of the supported languages can be added in the UI of the recipe. If your instance does not have internet access, all requested language models need to be directly accessible. DSS expects to find the language files in the resources folder of the code environment: /code_env_resources_folder/document_extraction_models/EasyOCR/model. You can retrieve the language files (*.pth) from here.

Output update methods

There are four different methods that you can choose for updating your recipe’s output and its associated folder (used for assets storage).

You can select the update method in the recipe output step.

  • Smart sync: synchronizes the recipe’s output to match the input folder documents, smartly deciding which documents to add, update, or remove.

  • Upsert: adds and updates the documents from the input folder into the recipe’s output. Smartly avoids adding duplicate documents. Does not delete existing documents that are no longer present in the input folder.

  • Overwrite: deletes the existing output and recreates it from scratch, using the input folder documents.

  • Append: adds the documents from the input folder into the recipe’s output, without deleting or updating any existing records. Can result in duplicates.

Documents are identified by their path in the input folder. Renaming or moving documents prevents the smart modes from matching them with pre-existing documents and can result in outdated or duplicated versions of documents in the output of the recipe.

The update method also manages the output folder of the recipe to keep its contents synchronized with the recipe’s output. Unmanaged deletions in the output folder are not recommended and can cause the recipe’s output to point to a missing source.

Tip

If your folder changes frequently, and you need to frequently re-run your recipe, choosing one of the smart update methods, Smart sync or Upsert, will be much more efficient than Overwrite or Append.

The smart update methods minimize the number of documents to be re-extracted, thus lowering the cost of running the recipe repeatedly.
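Conceptually, the smart methods compare input and output document paths to decide what to re-extract. A sketch under that assumption (function name and return shape are illustrative, not Dataiku's code):

```python
# Sketch of how the smart update methods could plan their work, matching
# documents by path (illustrative only; not Dataiku's implementation).
def plan_updates(input_paths, output_paths, method):
    to_add = input_paths - output_paths
    to_update = input_paths & output_paths
    # Only Smart sync removes documents missing from the input folder;
    # Upsert leaves them in place.
    to_remove = output_paths - input_paths if method == "smart_sync" else set()
    return to_add, to_update, to_remove
```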

Warning

When using one of the smart update methods, Smart sync or Upsert, all write operations on the recipe output must be performed through DSS. This also means that you cannot provide an output node that already contains data, when using one of the smart update methods.

Limitations

  • Output dataset partitioning is not supported.

  • The Extract fields recipe relies on a VLM to process documents. The processing limit depends on the number of document pages sent as images to the VLM. If a document exceeds the model limit, a warning is raised, that document is skipped, and the other documents continue to be processed.

  • Some DSS SQL connectors, notably Teradata, only support a limited number of characters in each column of data. The output of the Extract full document recipe is likely to exceed those limits, resulting in an error message similar to: Data too long for column 'extracted_content' at row 1 or can bind a LONG value only for insert into a LONG column.

    To work around this limitation, you can manually redefine the structure of the output dataset.

    Warning

    This will delete all previously extracted data.

    • Go to the recipe’s output dataset Settings > Advanced tab.

    • Change Table creation mode to Manually define.

    • Modify Table creation SQL to allow longer content in the section_outline, extracted_content, and structured_content columns. For example, you can increase the character limit or switch to a different type (e.g., LONGTEXT or NCLOB).

    • Click SAVE, then click DROP TABLE and CREATE TABLE NOW to confirm the new structure is in place.

    • Re-run the recipe to take the new limits into account.