Extracting document content¶
Dataiku can natively process unstructured documents and extract their content using the “Extract content” recipe. It takes as input a managed folder of documents and outputs a dataset with content extracted at different granularities: one row per page, one row per detected section, or a single row per document.
To get started with document extraction, see our How to: Extract unstructured content into a dataset.
Note
The “Embed documents” recipe allows you to extract documents in a similar way into a vector store instead of a dataset (see Embedding and searching documents).
Supported document types¶
The “Extract content” recipe supports the following file types:
PDF
PPTX/PPT
DOCX/DOC
ODT/ODP
TXT
MD (Markdown)
PNG
JPG/JPEG
HTML
Text Extraction vs. VLM extraction¶
The “Extract content” recipe supports two ways of handling documents.
Text extraction¶
The simpler of the two is text extraction: it extracts text from documents and uses headers, when available, to divide the content into meaningful extraction units.
Supported file formats are PDF, DOCX, PPTX, HTML, TXT, MD (and PNG, JPEG, JPG with OCR enabled).
The extraction runs as follows:
The text content is extracted from the document.
If headers are available, they are used to divide the content into meaningful units.
The extracted text is aggregated into one row per section or per document while keeping the structure of detected sections in a structured column.
Text can also be extracted from images detected in the documents:
with the Optical Character Recognition (OCR) image handling mode. You can choose either EasyOCR or Tesseract as the OCR engine. EasyOCR does not require any configuration but is slow when running on CPU. Tesseract requires some configuration, see OCR setup below. Enabling OCR is recommended for scanned documents.
with the ‘VLM description’ image handling mode. A visual LLM is used to generate descriptions for each image in the document. Available for PDF, DOCX and PPTX files.
Note
Text extraction of PDF documents requires internet access: the layout models it relies on must be downloaded from Hugging Face. The runtime environment needs internet access at least initially, so that those models can be downloaded and placed in the Hugging Face cache.
If your instance does not have internet access, you can download those models manually. Here are the steps to follow:
Go to the model repository and clone it at the v2.2.0 revision.
Create a “ds4sd--docling-models” folder in the resources folder of the document extraction code environment (or the code environment you chose for the recipe), under: /code_env_resources_folder/document_extraction_models/ds4sd--docling-models
The “ds4sd--docling-models” folder should contain the same tree structure as https://huggingface.co/ds4sd/docling-models/tree/v2.2.0/
If the models are not in this resources folder, the Hugging Face cache is checked; if the cache is empty, the models are downloaded and placed in the cache.
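Assuming git (with git-lfs) is available on the machine, the manual download described above can be sketched as follows; the resources root path is a placeholder to adapt to your instance:

```shell
# Placeholder for your code environment's resources folder (adapt to your instance).
RESOURCES=/tmp/code_env_resources_folder
TARGET="$RESOURCES/document_extraction_models/ds4sd--docling-models"

# Create the folder layout DSS expects.
mkdir -p "$(dirname "$TARGET")"

# Clone the layout models at the v2.2.0 revision.
git clone --branch v2.2.0 https://huggingface.co/ds4sd/docling-models "$TARGET" \
  || echo "Clone failed: check network access and git-lfs installation" >&2
```

After the clone, verify that the folder tree under the target matches the repository tree at the v2.2.0 revision.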
Note
You can edit the run configuration of the text extraction engine under Administration > Settings > Other > LLM Mesh > Configuration > Document extraction recipes.
VLM extraction¶
For complex documents, Dataiku implements another strategy based on Vision LLMs (VLM), i.e. LLMs that can take images as input. If your LLM Mesh is connected to one of these (see Multimodal capabilities for a list), you can instead use the VLM strategy.
Instead of extracting the text, the recipe transforms each page of the document into an image.
Ranges of images are sent to the VLM, which is asked for a summary.
The summaries are then saved as dataset rows or aggregated into one row per document.
The images themselves can be stored in a managed folder, and their paths are referenced in the corresponding dataset rows.
The advanced image understanding capabilities of the VLM allow for much more relevant answers than just using extracted text.
The “Extract content” recipe supports VLM strategy for DOCX/DOC, PPTX/PPT, PDF, ODT/ODP, JPG/JPEG and PNG files.
Initial setup¶
Document Extraction is automatically preinstalled when using Dataiku Cloud Stacks or Dataiku Cloud. If you are using Dataiku Custom, before using the VLM extraction, you need a server administrator with elevated (sudoers) privileges to run:
sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-deps.sh" -with-libreoffice
Text extraction on DOCX/PDF/PPTX requires installing and enabling a dedicated code environment (see Code environments):
In Administration > Code envs > Internal envs setup, in the Document extraction code environment section, select a Python version from the list and click Create code environment.
OCR setup¶
When using the OCR mode of the text extraction, you can choose between EasyOCR and Tesseract. The AUTO mode uses Tesseract if it is installed, and falls back to EasyOCR otherwise.
Tesseract¶
Tesseract is preinstalled on Dataiku Cloud and Dataiku Cloud Stacks. If you are using Dataiku Custom, Tesseract needs to be installed on the system. Dataiku uses the tesserocr Python package as a wrapper around the tesseract-ocr API. It requires libtesseract (>= 3.04) and libleptonica (>= 1.71).
The English language and the OSD files must be installed. Additional languages can be downloaded and added to the tessdata folder. Here is the list of supported languages.
For example on Ubuntu/Debian:
sudo apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
On AlmaLinux:
sudo dnf install tesseract
curl -L -o /usr/share/tesseract/tessdata/osd.traineddata https://github.com/tesseract-ocr/tessdata/raw/4.1.0/osd.traineddata
chmod 0644 /usr/share/tesseract/tessdata/osd.traineddata
At runtime, Tesseract relies on the TESSDATA_PREFIX environment variable to locate the tessdata folder, which should contain the language files and config. You can either:
Set the TESSDATA_PREFIX environment variable to the tessdata folder of the instance (the value must end with a slash /).
Leave it unset. During the Document Extraction internal code env resources initialization, DSS looks for possible locations of the folder, copies it to the resources folder of the code env, and sets TESSDATA_PREFIX accordingly.
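For example, using the system location from the AlmaLinux commands above (the path may differ on your distribution), the variable can be set like this:

```shell
# Point Tesseract at the folder containing the *.traineddata files.
# The value must end with a trailing slash.
export TESSDATA_PREFIX=/usr/share/tesseract/tessdata/

# Sanity check: warn if the trailing slash is missing.
case "$TESSDATA_PREFIX" in
  */) echo "TESSDATA_PREFIX OK" ;;
  *)  echo "TESSDATA_PREFIX must end with a slash" >&2 ;;
esac
```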
Note
If run in a container execution configuration, DSS handles the installation of Tesseract during the build of the image.
EasyOCR¶
EasyOCR does not require any additional configuration, but it is very slow when run on CPU. We recommend using an execution environment with a GPU.
Note
By default, EasyOCR tries to download missing language files. Any of the supported languages can be added in the UI of the recipe. If your instance does not have access to the internet, then all requested language models need to be directly accessible. DSS expects to find the language files in the resources folder of the code environment: /code_env_resources_folder/document_extraction_models/EasyOCR/model. You can retrieve the language files (*.pth) from here.
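As a sketch, placing pre-downloaded language files into the expected location could look like the following; the resources root and the english_g2.pth file name are illustrative, so use the files matching your selected languages:

```shell
# Placeholder resources root (adapt to your code environment's resources folder).
RESOURCES=/tmp/code_env_resources_folder
MODEL_DIR="$RESOURCES/document_extraction_models/EasyOCR/model"

# Create the folder DSS checks for offline language files.
mkdir -p "$MODEL_DIR"

# Copy pre-downloaded weights (file name is illustrative).
cp /path/to/english_g2.pth "$MODEL_DIR/" 2>/dev/null \
  || echo "Place your *.pth language files in $MODEL_DIR" >&2
```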
Limitations¶
The “Extract content” recipe only supports “Overwrite” as an update method. This means that all input documents will be re-extracted upon rebuild.
Output dataset partitioning is not supported.
Some DSS SQL connectors, notably MySQL, Oracle and Teradata, only support a limited number of characters in each column of data. The output of this recipe is likely to exceed those limits, resulting in an error message similar to:
Data too long for column 'extracted_content' at row 1
or
can bind a LONG value only for insert into a LONG column.
To work around this limitation, you can manually redefine the structure of the output dataset.
Warning
This will delete all previously extracted data.
Go to the recipe’s output dataset Settings > Advanced tab.
Change Table creation mode to Manually define.
Modify the Table creation SQL to allow for longer content in the section_outline, extracted_content and structured_content columns. For example, you can increase the character limit or switch to a different type (e.g. LONGTEXT or NCLOB).
SAVE, then click on DROP TABLE and CREATE TABLE NOW to confirm the new structure is in place.
Re-run the recipe to take the new limits into account.
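For instance, on MySQL the manually defined table creation SQL might look like the following sketch; the table name and the non-content column are hypothetical, so keep the columns of your actual output schema:

```sql
-- Hypothetical redefinition for a MySQL output dataset.
-- LONGTEXT lifts the length limit that shorter TEXT/VARCHAR columns hit.
CREATE TABLE `extract_content_output` (
  `file` VARCHAR(500),
  `section_outline` LONGTEXT,
  `extracted_content` LONGTEXT,
  `structured_content` LONGTEXT
);
```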