Extracting document content

Dataiku can natively process unstructured documents using the “Extract content” recipe. This recipe takes a managed folder of documents as input and lets you choose between two recipe types:

  • Extract full document, which extracts the content of each document into a dataset at different granularities, such as one row per page, detected section, or document.

  • Extract fields, which extracts specific structured fields from documents into a dataset.

To get started with document extraction, see Initial document extraction setup below.

Note

The “Embed documents” recipe lets you extract documents in a similar way into a vector store instead of a dataset (see Embedding and searching documents).

Recipe types

The “Extract content” recipe includes two recipe types.

Extract full document

Use Extract full document to extract the content of documents into a dataset. Depending on the extraction method and output settings, the output can contain one row per page, detected section, or document.

Extract fields

Use Extract fields to extract specific values from documents into a structured dataset. This is useful for use cases such as extracting invoice numbers, supplier names, dates, totals, or repeated line items from business documents.

Supported document types

The “Extract content” recipe includes two recipe types with the following supported file types:

  • Extract full document: PDF, PPTX/PPT, DOCX/DOC, ODT/ODP, TXT, MD (Markdown), PNG, JPG/JPEG, HTML

  • Extract fields: PDF, PPTX/PPT, DOCX/DOC, ODT/ODP, PNG, JPG/JPEG

Extract full document

The Extract full document recipe supports two ways of handling documents.

Text extraction

Text extraction extracts text from documents and organizes it into meaningful units. It supports PDF, DOCX, PPTX, HTML, TXT, and MD. Image formats (PNG, JPEG, JPG) are supported if Optical Character Recognition (OCR) is enabled.

Two engines are available for text extraction and can be configured using custom rules.

Raw text extraction

This engine focuses on the physical layout of the document. It extracts text in a single row per document while keeping the page or slide division of the document in the structured content column whenever possible, for example for PPTX and PDF files.

If OCR is enabled, PDF files are first converted into images. The engine then extracts text from those images (useful for scanned documents).

Because this engine does not try to infer the semantic structure of the document, it is very fast.

Structured text extraction

Structured text extraction runs as follows:

  • The text content is extracted from the document.

  • If headers are available, they are used to divide the content into meaningful units.

  • The extracted text is aggregated into one row per section or per document while keeping the structure of detected sections in a structured column.
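The header-based step above can be sketched in Python. This is a minimal illustration, not Dataiku's actual implementation; it assumes detected headers are already marked Markdown-style with a leading "#".

```python
# Minimal sketch of header-based sectioning (illustrative only):
# split extracted text into one (title, content) unit per detected section.
def split_into_sections(text):
    sections = []
    current_title, current_lines = None, []
    for line in text.splitlines():
        if line.startswith("#"):  # a detected header starts a new section
            if current_title is not None or current_lines:
                sections.append((current_title, "\n".join(current_lines).strip()))
            current_title, current_lines = line.lstrip("# ").strip(), []
        else:
            current_lines.append(line)
    sections.append((current_title, "\n".join(current_lines).strip()))
    return sections
```

Each resulting section would then become one output row (or be re-aggregated into one row per document).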

Text can also be extracted from images detected in the documents:

  • with the ‘Optical Character Recognition’ (OCR) image handling mode. You can choose either EasyOCR or Tesseract as the OCR engine. EasyOCR does not require any configuration but is slow when running on CPU. Tesseract requires some configuration, see OCR setup. Enabling OCR is recommended for scanned documents.

  • with the ‘VLM description’ image handling mode. A visual LLM is used to generate descriptions for each image in the document. This is available for PDF, DOCX, and PPTX files.

By default, this engine uses a lightweight classification model to identify and filter out non-informative images, such as barcodes, signatures, icons, logos, or QR codes, from text processing. While these images are skipped during extraction, all images are still saved to the output folder if Store images is enabled in Output > Storage settings. To process all images regardless of content, disable Skip non-informative images in the recipe advanced settings.

Note

Structured text extraction requires internet access for PDF document extraction: the layout models it uses must be downloaded from Hugging Face. The runtime environment needs internet access at least initially so that those models can be downloaded and placed in the Hugging Face cache.

If your instance does not have internet access then you can download those models manually. Here are the steps to follow:

14.3.0 and later
  • Go to the model repository and clone it (at the v2.3.0 revision)

  • Create a “ds4sd--docling-models” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-models

  • Create a “ds4sd--docling-layout-egret-medium” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-layout-egret-medium

  • (Optional) You can also download the “docling-layout-heron” model in addition to, or instead of, “docling-layout-egret-medium” if you need more accurate layout detection at the cost of speed.

    • Create a “ds4sd--docling-layout-heron” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-layout-heron

    • The folder “ds4sd--docling-layout-heron” should contain the same files as https://huggingface.co/docling-project/docling-layout-heron/tree/main/

Prior to 14.3.0
  • Go to the model repository and clone it (at the v2.2.0 revision)

  • Create a “ds4sd--docling-models” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-models

If the models are not in this resources folder, the Hugging Face cache is checked; if they are not in the cache either, they are downloaded and placed in the Hugging Face cache.
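The documented lookup order can be sketched as follows. The function name and return convention are illustrative, not Dataiku's code; only the order (resources folder, then Hugging Face cache, then download) comes from the documentation.

```python
from pathlib import Path

# Sketch of the documented model lookup order (illustrative, not DSS code):
# 1. the code env resources folder, 2. the Hugging Face cache,
# 3. a fresh download from Hugging Face.
def resolve_model_dir(resources_root, hf_cache_dir, folder_name):
    candidate = Path(resources_root) / "document_extraction_models" / folder_name
    if candidate.is_dir():
        return candidate, "resources"
    cached = Path(hf_cache_dir) / folder_name
    if cached.is_dir():
        return cached, "cache"
    return cached, "download"  # the models would be fetched and cached here
```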

Note

You can edit the run configuration of the text extraction engine under Administration > Settings > Other > LLM Mesh > Configuration > Document extraction recipes.

VLM extraction

For complex documents, Dataiku implements another strategy based on Vision Language Models (VLM), i.e. LLMs that can take images as input. If your LLM Mesh is connected to one of these (see Multimodal capabilities for a list), you can use the VLM strategy.

  • Instead of extracting the text, the recipe transforms each page of the document into an image.

  • Ranges of images are sent to the VLM to extract the document content, including descriptions of visual elements such as graphics and tables.

  • The extracted content is then saved as dataset rows or aggregated to one row per document.

  • The images themselves can be stored in a managed folder, and their paths are referenced in the corresponding dataset rows.
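The batching of page images into ranges can be sketched like this; the function name and batch size are illustrative, not the recipe's actual parameters.

```python
# Sketch: group page images into contiguous ranges before sending each
# range to the VLM (batch size is a hypothetical parameter).
def page_ranges(n_pages, batch_size=4):
    return [(start, min(start + batch_size, n_pages) - 1)
            for start in range(0, n_pages, batch_size)]
```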

The advanced image understanding capabilities of the VLM allow for more relevant results than relying only on extracted text.

The Extract full document recipe supports the VLM strategy for DOCX/DOC, PPTX/PPT, PDF, ODT/ODP, JPG/JPEG, and PNG files.

Extract fields

The Extract fields recipe extracts specific structured values from documents into a dataset.

This recipe is useful when you want to define a target schema and automatically extract fields such as identifiers, dates, amounts, names, or repeated items from a set of documents.

The Extract fields recipe relies on a Vision Language Model (VLM) available through the LLM Mesh and supporting structured output.

Recipe configuration

To configure the recipe, first select the VLM used for extraction.

Optionally, you can customize the general instructions to provide additional context to the model before defining the extraction schema.

Note

Extract fields requires a VLM connection in the LLM Mesh that supports structured output.

Extraction schema

To define the extraction schema, select a sample document in the preview and define the fields to extract.

For each field, you can configure:

  • a field name

  • a description

  • a data type

Supported data types are:

  • string

  • integer

  • number

  • date

  • boolean

  • array

Use array fields when you want to extract repeated items, such as invoice line items. Array fields can contain subfields for each repeated item.

For example, an invoice_items array can contain subfields such as:

  • description

  • quantity

  • unit price

You can add, rename, remove, or modify fields and subfields after creating the recipe.
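As an illustration, the invoice example above could be described by a schema along these lines. The dict layout and field names are hypothetical; the actual schema is defined in the recipe UI.

```python
# Hypothetical representation of an extraction schema with an array field
# (layout illustrative only; configure the real schema in the recipe UI).
schema = {
    "fields": [
        {"name": "invoice_number", "type": "string",
         "description": "The unique identifier of the invoice"},
        {"name": "invoice_date", "type": "date",
         "description": "The date the invoice was issued"},
        {"name": "invoice_items", "type": "array", "subfields": [
            {"name": "description", "type": "string"},
            {"name": "quantity", "type": "integer"},
            {"name": "unit_price", "type": "number"},
        ]},
    ]
}
```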

Warning

Changing the extraction schema changes the structure of the output dataset. After modifying the schema, re-run the recipe to regenerate the output dataset.

Test extraction on a document

To evaluate the extraction results, you can run the extraction on the document currently selected in the preview.

If the extraction quality is not satisfactory, try one of the following:

  • refine the field descriptions to make them clearer and more specific

  • customize the general instructions to describe the document type, language, or extraction rules more precisely

  • use a different VLM available in the LLM Mesh

Output settings

You can choose an update method to control how the output dataset is updated. For more details, see Output update methods.

You can also configure how values are handled when they do not match the expected data type defined in the extraction schema.

If your schema contains arrays with subfields, you can choose how arrays are written to the output dataset:

  • With array expansion disabled, the output contains one column per field and a single row for each document. Array values are stored in a nested structure.

  • With array expansion enabled, the output contains one column per subfield and one row per extracted item.
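The two modes can be illustrated on a single extracted record (field names hypothetical; this is not Dataiku's implementation, only a sketch of the resulting row shapes):

```python
# One extracted record with an array field (illustrative names).
record = {
    "document": "invoice_001.pdf",
    "invoice_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 9.5},
        {"description": "Gadget", "quantity": 1, "unit_price": 24.0},
    ],
}

# Expansion disabled: one row per document, array kept as a nested value.
nested_rows = [record]

# Expansion enabled: one row per extracted item, one column per subfield.
expanded_rows = [
    {"document": record["document"], **item}
    for item in record["invoice_items"]
]
```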

Initial document extraction setup

  • Document Extraction is automatically preinstalled when using Dataiku Cloud Stacks or Dataiku Cloud. If you are using Dataiku Custom, before using the VLM extraction, you need a server administrator with elevated (sudoers) privileges to run:

sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-deps.sh" -with-libreoffice
  • Text extraction on “Embed documents” and “Extract full document” recipes on DOCX/PDF/PPTX (with both engines) requires installing and enabling a dedicated code environment (see Code environments):

    In Administration > Code envs > Internal envs setup, in the Document extraction code environment section, select a Python version from the list and click Create code environment.

  • Extract fields requires access to a VLM in the LLM Mesh that supports structured output.

OCR setup

When using the OCR mode of the text extraction, you can choose between EasyOCR and Tesseract. The AUTO mode uses Tesseract if it is installed, and falls back to EasyOCR otherwise.
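The AUTO behavior amounts to the following check (a sketch of the documented rule, not DSS internals):

```python
import shutil

# Sketch of the documented AUTO mode: prefer Tesseract when the binary
# is available on the system, otherwise fall back to EasyOCR.
def pick_ocr_engine():
    return "tesseract" if shutil.which("tesseract") else "easyocr"
```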

Tesseract

Tesseract is preinstalled on Dataiku Cloud and Dataiku Cloud Stacks. If you are using Dataiku Custom, Tesseract needs to be installed on the system. Dataiku uses the tesserocr Python package as a wrapper around the tesseract-ocr API. It requires libtesseract (>= 3.04) and libleptonica (>= 1.71).

The English language and the OSD files must be installed. Additional languages can be downloaded and added to the tessdata directory. Here is the list of supported languages.

For example on Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config

On AlmaLinux:

sudo dnf install tesseract
curl -L -o /usr/share/tesseract/tessdata/osd.traineddata https://github.com/tesseract-ocr/tessdata/raw/4.1.0/osd.traineddata
chmod 0644 /usr/share/tesseract/tessdata/osd.traineddata

At runtime, Tesseract relies on the TESSDATA_PREFIX environment variable to locate the tessdata folder. This folder should contain the language files and config. You can either:

  • Set the TESSDATA_PREFIX environment variable (must end with a slash /). It should point to the tessdata folder of the instance.

  • Leave it unset. During the Document Extraction internal code env resources initialization, DSS will look for possible locations of the folder, copy it to the resources folder of the code env, then set the TESSDATA_PREFIX accordingly.
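If you set the variable yourself, the trailing-slash requirement can be checked as in this sketch (the path shown is the common Ubuntu location and is only an example; use your instance's actual tessdata folder):

```python
import os

# Hypothetical path: point this at your instance's tessdata folder.
os.environ["TESSDATA_PREFIX"] = "/usr/share/tesseract/tessdata/"

# The documented constraint: the value must end with a slash.
assert os.environ["TESSDATA_PREFIX"].endswith("/")
```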

Note

If run in a container execution configuration, DSS handles the installation of Tesseract during the build of the image.

EasyOCR

EasyOCR does not require any additional configuration, but it is very slow when running on CPU. We recommend using an execution environment with a GPU.

Note

By default, EasyOCR tries to download missing language files. Any of the supported languages can be added in the UI of the recipe. If your instance does not have internet access, all requested language models need to be directly accessible. DSS expects to find the language files in the resources folder of the code environment: /code_env_resources_folder/document_extraction_models/EasyOCR/model. You can retrieve the language files (*.pth) from here.

Output update methods

There are four different methods that you can choose for updating your recipe’s output and its associated folder (used for assets storage).

You can select the update method in the recipe output step.

  • Smart sync: synchronizes the recipe’s output to match the input folder documents, smartly deciding which documents to add, update, or remove.

  • Upsert: adds and updates the documents from the input folder into the recipe’s output. Smartly avoids adding duplicate documents. Does not delete existing documents that are no longer present in the input folder.

  • Overwrite: deletes the existing output and recreates it from scratch, using the input folder documents.

  • Append: adds the documents from the input folder into the recipe’s output, without deleting or updating any existing records. Can result in duplicates.

Documents are identified by their path in the input folder. Renaming or moving documents prevents the smart modes from matching them with pre-existing documents and can result in outdated or duplicated versions of documents in the output of the recipe.

The update method also manages the output folder of the recipe to keep its contents synchronized with the recipe’s output. Unmanaged deletions in the output folder are not recommended and can cause the recipe’s output to point to a missing source.

Tip

If your folder changes frequently, and you need to frequently re-run your recipe, choosing one of the smart update methods, Smart sync or Upsert, will be much more efficient than Overwrite or Append.

The smart update methods minimize the number of documents to be re-extracted, thus lowering the cost of running the recipe repeatedly.
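Conceptually, the smart methods compare input and output document paths to decide what to re-extract. A sketch under that assumption (function name and return shape are illustrative, not Dataiku's code):

```python
# Sketch of how the smart update methods could plan their work, matching
# documents by path (illustrative only; not Dataiku's implementation).
def plan_updates(input_paths, output_paths, method):
    to_add = input_paths - output_paths
    to_update = input_paths & output_paths
    # Only Smart sync removes documents missing from the input folder;
    # Upsert leaves them in place.
    to_remove = output_paths - input_paths if method == "smart_sync" else set()
    return to_add, to_update, to_remove
```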

Warning

When using one of the smart update methods, Smart sync or Upsert, all write operations on the recipe output must be performed through DSS. This also means that you cannot provide an output node that already contains data, when using one of the smart update methods.

Limitations

  • Output dataset partitioning is not supported.

  • The Extract fields recipe relies on a VLM to process documents. The processing limit depends on the number of document pages sent as images to the VLM. If a document exceeds the model limit, a warning is raised, that document is skipped, and the other documents continue to be processed.

  • Some DSS SQL connectors, notably Teradata, only support a limited number of characters in each column of data. The output of the Extract full document recipe is likely to exceed those limits, resulting in an error message similar to: Data too long for column 'extracted_content' at row 1 or can bind a LONG value only for insert into a LONG column.

    To work around this limitation, you can manually redefine the structure of the output dataset.

    Warning

    This will delete all previously extracted data.

    • Go to the recipe’s output dataset Settings > Advanced tab.

    • Change Table creation mode to Manually define.

    • Modify Table creation SQL to allow longer content in the section_outline, extracted_content, and structured_content columns. For example, you can increase the character limit or switch to a different type (e.g., LONGTEXT or NCLOB).

    • Click SAVE, then click DROP TABLE and CREATE TABLE NOW to confirm the new structure is in place.

    • Re-run the recipe to take the new limits into account.