Extracting document content

Dataiku can natively process unstructured documents and extract their content using the “Extract content” recipe. It takes as input a managed folder of documents and outputs a dataset with extracted content at different granularities: one row per page, one row per detected section, or a single row per document.

To get started with document extraction, see our How to: Extract unstructured content into a dataset.

Note

The “Embed documents” recipe allows you to extract documents in a similar way into a vector store instead of a dataset (see Embedding and searching documents).

Supported document types

The “Extract content” recipe supports the following file types:

  • PDF

  • PPTX/PPT

  • DOCX/DOC

  • ODT/ODP

  • TXT

  • MD (Markdown)

  • PNG

  • JPG/JPEG

  • HTML

Text extraction vs. VLM extraction

The “Extract content” recipe supports two ways of handling documents.

Text extraction

The simplest is text extraction: it extracts text from documents and uses headers, when available, to divide the content into meaningful extraction units.

Supported file formats are PDF, DOCX, PPTX, HTML, TXT, MD (and PNG, JPEG, JPG with OCR enabled).

The extraction runs as follows:

  • The text content is extracted from the document.

  • If headers are available, they are used to divide the content into meaningful units.

  • The extracted text is aggregated into one row per section or per document while keeping the structure of detected sections in a structured column.
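The steps above can be pictured with a small, purely illustrative sketch (not Dataiku's actual implementation), using Markdown-style headers as section boundaries:

```python
import re

def split_into_sections(text):
    """Split text into (header, body) units using Markdown-style headers.

    Illustrative only: content appearing before the first header is grouped
    under a None header, mirroring "one row per detected section".
    """
    sections = []
    header, body = None, []
    for line in text.splitlines():
        m = re.match(r"#+\s+(.*)", line)
        if m:
            # A new header closes the previous section, if any.
            if body or header is not None:
                sections.append((header, "\n".join(body).strip()))
            header, body = m.group(1), []
        else:
            body.append(line)
    sections.append((header, "\n".join(body).strip()))
    return sections

doc = "# Intro\nHello.\n# Methods\nDetails here."
print(split_into_sections(doc))  # → [('Intro', 'Hello.'), ('Methods', 'Details here.')]
```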

Text can also be extracted from images detected in the documents:

  • with the Optical Character Recognition (OCR) image handling mode. You can choose either EasyOCR or Tesseract as the OCR engine. EasyOCR does not require any configuration but is slow when running on CPU. Tesseract requires some configuration, see OCR setup below. Enabling OCR is recommended for scanned documents.

  • with the ‘VLM description’ image handling mode. A vision LLM is used to generate a description for each image in the document. Available for PDF, DOCX and PPTX files.

Note

Text extraction requires internet access for PDF document extraction: the layout models it uses must be downloaded from Hugging Face. The runtime environment needs internet access at least initially, so that those models can be downloaded and placed in the Hugging Face cache.

If your instance does not have internet access then you can download those models manually. Here are the steps to follow:

  • Go to the model repository and clone it at the v2.2.0 revision.

  • Create a “ds4sd--docling-models” folder in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-models

  • The “ds4sd--docling-models” folder should contain the same tree structure as https://huggingface.co/ds4sd/docling-models/tree/v2.2.0/

If the models are not in this resources folder, the Hugging Face cache is checked; if the cache is empty, the models are downloaded and placed there.
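For scripted setups, the download step can be sketched with the huggingface_hub package (an assumption: Dataiku does not require this exact approach, and the resources path below is illustrative):

```python
import os

def docling_models_path(resources_folder):
    """Target folder for the layout models inside the code env resources
    folder, matching the layout described above."""
    return os.path.join(
        resources_folder, "document_extraction_models", "ds4sd--docling-models"
    )

def download_docling_models(resources_folder):
    """Fetch the v2.2.0 revision of ds4sd/docling-models into place.

    Requires network access and the huggingface_hub package.
    """
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id="ds4sd/docling-models",
        revision="v2.2.0",
        local_dir=docling_models_path(resources_folder),
    )
```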

Note

You can edit the run configuration of the text extraction engine in Administration > Settings > Other > LLM Mesh > Configuration > Document extraction recipes.

VLM extraction

For complex documents, Dataiku implements another strategy based on Vision LLMs (VLM), i.e. LLMs that can take images as input. If your LLM Mesh is connected to one of these (see Multimodal capabilities for a list), you can instead use the VLM strategy.

  • Instead of extracting the text, the recipe transforms each page of the document into an image.

  • Ranges of images are sent to the VLM, which is asked to summarize them.

  • The summaries are then saved as dataset rows or aggregated into one row per document.

  • The images themselves can be stored in a managed folder, and their paths are referenced in the corresponding dataset rows.
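The page-batching step can be sketched as follows; the batch size and the function name are illustrative, not actual recipe parameters:

```python
def page_ranges(n_pages, batch_size):
    """Group page indices into consecutive ranges, one VLM call per range."""
    return [
        (start, min(start + batch_size, n_pages) - 1)
        for start in range(0, n_pages, batch_size)
    ]

# A 10-page document summarized 4 pages at a time:
print(page_ranges(10, 4))  # → [(0, 3), (4, 7), (8, 9)]
```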

The advanced image understanding capabilities of the VLM allow for much more relevant answers than just using extracted text.

The “Extract content” recipe supports VLM strategy for DOCX/DOC, PPTX/PPT, PDF, ODT/ODP, JPG/JPEG and PNG files.

Initial setup

  • Document Extraction is automatically preinstalled when using Dataiku Cloud Stacks or Dataiku Cloud. If you are using Dataiku Custom, before using the VLM extraction, you need a server administrator with elevated (sudoers) privileges to run:

sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-deps.sh" -with-libreoffice
  • Text extraction on DOCX/PDF/PPTX requires installing and enabling a dedicated code environment (see Code environments):

In Administration > Code envs > Internal envs setup, in the Document extraction code environment section, select a Python version from the list and click Create code environment.

OCR setup

When using the OCR mode of text extraction, you can choose between EasyOCR and Tesseract. In AUTO mode, Tesseract is used if it is installed; otherwise EasyOCR is used.
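The AUTO fallback amounts to a simple preference rule, sketched below with a hypothetical helper (the engine names match the UI; the function itself is not part of Dataiku):

```python
def resolve_ocr_engine(mode, tesseract_installed):
    """Mirror the AUTO behavior: prefer Tesseract when installed."""
    if mode == "AUTO":
        return "TESSERACT" if tesseract_installed else "EASYOCR"
    return mode

print(resolve_ocr_engine("AUTO", tesseract_installed=True))   # → TESSERACT
print(resolve_ocr_engine("AUTO", tesseract_installed=False))  # → EASYOCR
```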

Tesseract

Tesseract is preinstalled on Dataiku Cloud and Dataiku Cloud Stacks. If you are using Dataiku Custom, Tesseract needs to be installed on the system. Dataiku uses the tesserocr Python package as a wrapper around the tesseract-ocr API. It requires libtesseract (>= 3.04) and libleptonica (>= 1.71).

The English language file and the OSD file must be installed. Additional languages can be downloaded and added to the tessdata directory. Here is the list of supported languages.

For example on Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config

On AlmaLinux:

sudo dnf install tesseract
curl -L -o /usr/share/tesseract/tessdata/osd.traineddata https://github.com/tesseract-ocr/tessdata/raw/4.1.0/osd.traineddata
chmod 0644 /usr/share/tesseract/tessdata/osd.traineddata

At runtime, Tesseract relies on the TESSDATA_PREFIX environment variable to locate the tessdata folder, which contains the language files and configuration. You can either:

  • Set the TESSDATA_PREFIX environment variable to the instance’s tessdata folder (the value must end with a slash /).

  • Leave it unset. During the Document Extraction internal code env resources initialization, DSS will look for possible locations of the folder, copy it to the resources folder of the code env, then set TESSDATA_PREFIX accordingly.
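The resolution logic can be sketched as follows (a hypothetical helper; the candidate locations DSS actually probes are not documented here):

```python
import os

def resolve_tessdata_prefix(env, candidates):
    """Return TESSDATA_PREFIX if set, else the first existing candidate folder.

    Mirrors the behavior described above: an explicit value wins; otherwise
    likely locations are probed. The returned value always ends with '/'.
    """
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        prefix = next((c for c in candidates if os.path.isdir(c)), None)
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return prefix

print(resolve_tessdata_prefix(
    {"TESSDATA_PREFIX": "/usr/share/tesseract/tessdata"},
    ["/usr/local/share/tessdata"],
))  # → /usr/share/tesseract/tessdata/
```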

Note

If run in a container execution configuration, DSS handles the installation of Tesseract during the build of the image.

EasyOCR

EasyOCR does not require any additional configuration, but it is very slow when run on CPU. We recommend using an execution environment with a GPU.

Note

By default, EasyOCR will try to download missing language files. Any of the supported languages can be added in the UI of the recipe. If your instance does not have internet access, all requested language models need to be directly accessible: DSS expects to find the language files in the resources folder of the code environment, under /code_env_resources_folder/document_extraction_models/EasyOCR/model. You can retrieve the language files (*.pth) from here.

Limitations

  • The “Extract content” recipe only supports “Overwrite” as an update method. This means that all input documents will be re-extracted upon rebuild.

  • Output dataset partitioning is not supported.

  • Some DSS SQL connectors, notably MySQL, Oracle and Teradata, only support a limited number of characters in each column of data. The output of this recipe is likely to exceed those limits, resulting in an error message similar to: Data too long for column 'extracted_content' at row 1 or can bind a LONG value only for insert into a LONG column.

    To work around this limitation, you can manually redefine the structure of the output dataset.

    Warning

    This will delete all previously extracted data.

    • Go to the recipe’s output dataset Settings > Advanced tab.

    • Change Table creation mode to Manually define.

    • Modify Table creation SQL to allow for longer content in the section_outline, extracted_content and structured_content columns. For example, you can increase the character limit or switch to a different type (e.g. LONGTEXT or NCLOB).

    • SAVE, then click DROP TABLE and CREATE TABLE NOW to confirm the new structure is in place.

    • Re-run the recipe to take the new limits into account.
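As an illustration, a manually defined table for MySQL might look like the following (the table name and the file column are hypothetical; check your output dataset’s actual schema before editing the SQL):

```sql
-- Illustrative only: widen the content columns so extracted text is not
-- truncated. On Oracle, use NCLOB instead of LONGTEXT.
CREATE TABLE extracted_documents (
  file VARCHAR(500),
  section_outline LONGTEXT,
  extracted_content LONGTEXT,
  structured_content LONGTEXT
);
```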