Extracting document content¶
Dataiku can natively process unstructured documents and extract their content using the “Extract content” recipe. It takes as input a managed folder of documents and outputs a dataset with the extracted content at different granularities: one row per page, one row per detected section, or a single row per document.
To get started with document extraction, see our How to: Extract unstructured content into a dataset.
Note
The “Embed documents” recipe allows you to extract documents in a similar way into a vector store instead of a dataset (see Embedding and searching documents).
Supported document types¶
The “Extract content” recipe supports the following file types:
PDF
PPTX/PPT
DOCX/DOC
ODT/ODP
TXT
MD (Markdown)
PNG
JPG/JPEG
HTML
Text extraction vs. VLM extraction¶
The “Extract content” recipe supports two ways of handling documents.
Text extraction¶
Text extraction extracts text from documents and organizes it into meaningful units. It supports PDF, DOCX, PPTX, HTML, TXT, and MD. Image formats (PNG, JPEG, JPG) are supported if Optical Character Recognition (OCR) is enabled.
Two engines are available for text extraction and can be configured using custom rules.
Raw text extraction¶
This engine focuses on the physical layout of the document. It extracts text into a single row per document, but keeps the page/slide division of the document in the structured content column whenever possible (PPTX and PDF files).
If OCR is enabled, PDF files are first converted into images. The engine then extracts text from those images (useful for scanned documents).
Because this engine does not try to infer the semantic structure of the document, it is very fast.
Structured text extraction¶
Structured text extraction runs as follows:
The text content is extracted from the document.
If headers are available, they are used to divide the content into meaningful units.
The extracted text is aggregated into one row per section or per document while keeping the structure of detected sections in a structured column.
Text can also be extracted from images detected in the documents:
with the ‘Optical Character Recognition’ (OCR) image handling mode. You can choose either EasyOCR or Tesseract as the OCR engine. EasyOCR does not require any configuration but is slow when running on CPU; Tesseract requires some configuration (see OCR setup). Enabling OCR is recommended for scanned documents.
with the ‘VLM description’ image handling mode. A visual LLM is used to generate a description for each image in the document. Available for PDF, DOCX, and PPTX files.
By default, this engine uses a lightweight classification model to identify and filter out non-informative images (such as barcodes, signatures, icons, logos, or QR codes) from text processing. While these images are skipped during extraction, all images are still saved to the output folder if Store Images is enabled in Output > Storage settings. To process all images regardless of content, disable Skip non-informative images in the recipe’s rules advanced settings.
Note
Structured text extraction requires internet access for PDF document extraction. The models that need to be downloaded are layout models available from Hugging Face. The runtime environment needs internet access at least initially, so that those models can be downloaded and placed in the Hugging Face cache.
If your instance does not have internet access, you can download those models manually. Here are the steps to follow:
Go to the model repository and clone it at the v2.3.0 revision.
Create a “ds4sd--docling-models” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-models
The “ds4sd--docling-models” folder should contain the same files as https://huggingface.co/ds4sd/docling-models/tree/v2.3.0/
Create a “ds4sd--docling-layout-egret-medium” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-layout-egret-medium
The “ds4sd--docling-layout-egret-medium” folder should contain the same files as https://huggingface.co/docling-project/docling-layout-egret-medium/tree/main/
(Optional) You can also download the “docling-layout-heron” model, in addition to or instead of “docling-layout-egret-medium”, if you need more accurate layout detection and can accept slower extraction:
Create a “ds4sd--docling-layout-heron” directory in the resources folder of the document extraction code environment (or the code env you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-layout-heron
The “ds4sd--docling-layout-heron” folder should contain the same files as https://huggingface.co/docling-project/docling-layout-heron/tree/main/
If the models are not in this resources folder, the Hugging Face cache is checked; if the cache is empty, the models are downloaded and placed in the Hugging Face cache.
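The manual steps above can be sketched as a shell script. The paths here are illustrative: CODE_ENV_RESOURCES is an assumed variable that must point to the actual resources folder of your code environment, and the clone step (guarded behind an assumed DO_CLONE flag) requires git and git-lfs plus internet access, so it would typically be run on a connected machine and the resulting folders copied over to the air-gapped instance:

```shell
# Sketch: lay out the docling layout models for offline use.
# CODE_ENV_RESOURCES is an assumed variable; set it to the resources folder
# of the code environment used by the "Extract content" recipe.
CODE_ENV_RESOURCES="${CODE_ENV_RESOURCES:-/tmp/code_env_resources_folder}"
MODELS_DIR="$CODE_ENV_RESOURCES/document_extraction_models"
mkdir -p "$MODELS_DIR"

# Run with DO_CLONE=1 on a machine with internet access (requires git-lfs),
# then copy the resulting folders to the air-gapped instance.
if [ "${DO_CLONE:-0}" = "1" ]; then
    git clone --branch v2.3.0 https://huggingface.co/ds4sd/docling-models \
        "$MODELS_DIR/ds4sd--docling-models"
    git clone https://huggingface.co/docling-project/docling-layout-egret-medium \
        "$MODELS_DIR/ds4sd--docling-layout-egret-medium"
fi
```

The branch name matches the v2.3.0 revision pinned above; the second clone uses the default branch, matching the tree/main URL.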
Note
You can edit the run configuration of the text extraction engine in Administration > Settings > Other > LLM Mesh > Configuration > Document extraction recipes.
VLM extraction¶
For complex documents, Dataiku implements another strategy based on Vision LLMs (VLM), i.e. LLMs that can take images as input. If your LLM Mesh is connected to one of these (see Multimodal capabilities for a list), you can instead use the VLM strategy.
Instead of extracting the text, the recipe transforms each page of the document into an image.
Ranges of images are sent to the VLM, asking for a summary.
Ranges of images are sent to the VLM, asking for an extraction of the content (including description of visual parts like graphics, tables).
The extracts are then saved as dataset rows or aggregated to one row per document.
The images themselves can be stored to a managed folder and their paths will be referenced in corresponding dataset rows.
The advanced image understanding capabilities of the VLM allow for much more relevant answers than just using extracted text.
The “Extract content” recipe supports the VLM strategy for DOCX/DOC, PPTX/PPT, PDF, ODT/ODP, JPG/JPEG, and PNG files.
Initial document extraction setup¶
Document Extraction is automatically preinstalled when using Dataiku Cloud Stacks or Dataiku Cloud. If you are using Dataiku Custom, before using the VLM extraction, you need a server administrator with elevated (sudoers) privileges to run:
sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-deps.sh" -with-libreoffice
Text extraction on DOCX/PDF/PPTX (with both engines) requires installing and enabling a dedicated code environment (see Code environments):
In Administration > Code envs > Internal envs setup, in the Document extraction code environment section, select a Python version from the list and click Create code environment.
OCR setup¶
When using the OCR mode of text extraction, you can choose between EasyOCR and Tesseract. The AUTO mode uses Tesseract if it is installed, and falls back to EasyOCR otherwise.
Tesseract¶
Tesseract is preinstalled on Dataiku Cloud and Dataiku Cloud Stacks. If you are using Dataiku Custom, Tesseract needs to be installed on the system. Dataiku uses the tesserocr Python package as a wrapper around the tesseract-ocr API. It requires libtesseract (>=3.04) and libleptonica (>=1.71).
The English language file and the OSD file must be installed. Additional languages can be downloaded and added to the tessdata directory. Here is the list of supported languages.
For example on Ubuntu/Debian:
sudo apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
On AlmaLinux:
sudo dnf install tesseract
curl -L -o /usr/share/tesseract/tessdata/osd.traineddata https://github.com/tesseract-ocr/tessdata/raw/4.1.0/osd.traineddata
chmod 0644 /usr/share/tesseract/tessdata/osd.traineddata
At runtime, Tesseract relies on the TESSDATA_PREFIX environment variable to locate the tessdata folder. This folder should contain the language files and config. You can either:
Set the TESSDATA_PREFIX environment variable (must end with a slash /). It should point to the tessdata folder of the instance.
Leave it unset. During the initialization of the Document Extraction internal code env resources, DSS looks for possible locations of the folder, copies it to the resources folder of the code env, then sets TESSDATA_PREFIX accordingly.
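If you set the variable yourself, the trailing-slash requirement and the presence of the mandatory files can be checked with a short shell sketch. The tessdata path below is a common Linux default and is given as an illustration only; adjust it for your system:

```shell
# Sketch: export TESSDATA_PREFIX pointing at the instance's tessdata folder.
# TESSDATA_DIR is an assumed helper variable; the default path is illustrative.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/share/tesseract/tessdata}"
export TESSDATA_PREFIX="${TESSDATA_DIR%/}/"   # Tesseract requires the trailing slash

# Sanity check: the English model and the OSD data file must be present.
for f in eng.traineddata osd.traineddata; do
    [ -f "${TESSDATA_PREFIX}${f}" ] || echo "missing: ${TESSDATA_PREFIX}${f}"
done
```

The `${TESSDATA_DIR%/}/` expansion strips any existing trailing slash before adding exactly one, so the variable ends with a slash whether or not the input path did.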
Note
If run in a container execution configuration, DSS handles the installation of Tesseract during the build of the image.
EasyOCR¶
EasyOCR does not require any additional configuration, but it is very slow when run on CPU. We recommend using an execution environment with a GPU.
Note
By default, EasyOCR will try to download missing language files. Any of the supported languages can be added in the UI of the recipe. If your instance does not have access to the internet, all requested language models need to be directly accessible: DSS expects to find the language files in the resources folder of the code environment, under /code_env_resources_folder/document_extraction_models/EasyOCR/model. You can retrieve the language files (*.pth) from here.
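The expected offline layout can be sketched as follows. The directory path comes from the note above; CODE_ENV_RESOURCES is an assumed variable, and the file names in the comments (e.g. english_g2.pth, craft_mlt_25k.pth) are typical EasyOCR model files given as examples only:

```shell
# Sketch: expected offline layout for EasyOCR model files.
# CODE_ENV_RESOURCES is an assumed variable; set it to the resources folder
# of the code environment used by the recipe.
CODE_ENV_RESOURCES="${CODE_ENV_RESOURCES:-/tmp/code_env_resources_folder}"
EASYOCR_MODEL_DIR="$CODE_ENV_RESOURCES/document_extraction_models/EasyOCR/model"
mkdir -p "$EASYOCR_MODEL_DIR"

# Copy the pre-downloaded *.pth files into this folder, for example:
#   english_g2.pth      (English recognition model)
#   craft_mlt_25k.pth   (text detection model)
ls -l "$EASYOCR_MODEL_DIR"
```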
Output update methods¶
There are four methods you can choose from for updating your recipe’s output and its associated folder (used for asset storage).
You can select the update method in the recipe output step.
| Method | Description |
|---|---|
| Smart sync | Synchronizes the recipe’s output to match the input folder documents, smartly deciding which documents to add, update, or remove. |
| Upsert | Adds and updates the documents from the input folder into the recipe’s output. Smartly avoids adding duplicate documents. Does not delete any existing document that is no longer present in the input folder. |
| Overwrite | Deletes the existing output and recreates it from scratch, using the input folder documents. |
| Append | Adds the documents from the input folder into the recipe’s output, without deleting or updating any existing records. Can result in duplicates. |
Documents are identified by their path in the input folder. Renaming or moving documents will prevent the smart modes from matching them with pre-existing documents, and can result in outdated or duplicated versions of documents in the recipe’s output.
The update method also manages the output folder of the recipe to keep its content synchronized with the recipe’s output. Manually deleting files in the output folder is not recommended, as it can cause the recipe’s output to point to a missing source.
Tip
If your folder changes frequently, and you need to frequently re-run your recipe, choosing one of the smart update methods, Smart sync or Upsert, will be much more efficient than Overwrite or Append.
The smart update methods minimize the number of documents to be re-extracted, thus lowering the cost of running the recipe repeatedly.
Warning
When using one of the smart update methods (Smart sync or Upsert), all write operations on the recipe output must be performed through DSS. This also means that you cannot provide an output node that already contains data when using one of these methods.
Limitations¶
Output dataset partitioning is not supported.
Some DSS SQL connectors, notably MySQL, Oracle, and Teradata, only support a limited number of characters in each column of data. The output of this recipe is likely to exceed those limits, resulting in an error message similar to:
Data too long for column 'extracted_content' at row 1
or:
can bind a LONG value only for insert into a LONG column
To work around this limitation, you can manually redefine the structure of the output dataset.
Warning
This will delete all previously extracted data.
Go to the recipe’s output dataset Settings > Advanced tab.
Change Table creation mode to Manually define.
Modify Table creation SQL to allow for longer content in the section_outline, extracted_content, and structured_content columns. For example, you can increase the character limit or switch to a different type (e.g. LONGTEXT or NCLOB).
Save, then click on Drop table and Create table now to confirm the new structure is in place.
Re-run the recipe to take the new limits into account.