Embedding and searching documents

In addition to the traditional embedding of text and storage in Vector Stores, Dataiku can also work directly with unstructured documents.

The “Embed documents” recipe takes a managed folder of documents as input and outputs a Knowledge Bank that can be used directly to query the content of these documents.

To get started with document embedding, see our Tutorial: Build a multimodal knowledge bank for a RAG project.
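
Once built, the Knowledge Bank can also be queried from Python code. The sketch below is a minimal illustration rather than a definitive reference: it assumes that your DSS version exposes the dataiku.KnowledgeBank class and its as_langchain_retriever() helper, and the Knowledge Bank ID "documents_kb" and the question are placeholders to replace with your own.

import dataiku

# Assumption: "documents_kb" is the ID of the Knowledge Bank produced by the
# "Embed documents" recipe.
kb = dataiku.KnowledgeBank("documents_kb")

# Expose the Knowledge Bank as a LangChain retriever and fetch the document
# chunks most similar to a text question.
retriever = kb.as_langchain_retriever()
for doc in retriever.invoke("What does the onboarding guide say about security training?"):
    print(doc.page_content[:200])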

Supported document types

The “Embed documents” recipe supports the following file types:

  • PDF

  • PPTX

  • DOCX

  • TXT

  • MD (Markdown)

Text Extraction vs. Vision LLM

The “Embed documents” recipe supports two ways of handling documents.

Text extraction

The simplest one is Text Extraction:

  • The text content of the documents is extracted

  • The extracted text is split into chunks to fit the embedding model. When parsing Markdown files, the splitting also respects the Markdown sections

  • The chunks are embedded

At query time, when asking a text question, the embedding of the question is used to retrieve the relevant parts of the documents, which are then passed in the context of the LLM.
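
Conceptually, the Text Extraction strategy resembles the standalone LangChain sketch below. This is an illustration only, not Dataiku's internal implementation; the file name, chunk size, overlap, embedding model, and question are all illustrative assumptions.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Text previously extracted from a document ("contract.txt" is just an example).
with open("contract.txt", encoding="utf-8") as f:
    extracted_text = f.read()

# 1. Split the extracted text into chunks sized for the embedding model.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(extracted_text)

# 2. Embed the chunks and store them in a vector store.
vector_store = FAISS.from_texts(chunks, OpenAIEmbeddings())

# 3. At query time, embed the question and retrieve the closest chunks;
#    these chunks are then passed in the context of the LLM.
relevant_chunks = vector_store.similarity_search("What is the termination clause?", k=4)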

While this method works well on simple text documents, it does not work with complex documents, especially those that contain images, figures, or complex tables.

VLM

For complex documents, Dataiku implements another strategy based on Vision LLMs (VLMs), i.e. LLMs that can take images as input. If your LLM Mesh is connected to such a model (see Multimodal capabilities for a list), you can use the VLM strategy instead.

  • Instead of extracting the text, the recipe converts each page of the document into an image

  • Ranges of images are sent to the VLM, which is asked to produce a summary

  • The summary of each range of images is embedded

At query time, when asking a text question:

  • Using the embedding of the question, the relevant ranges are retrieved

  • The matching images are passed directly in the context of the VLM

  • The VLM then uses the images to answer

The advanced image understanding capabilities of the VLM allow for much more relevant answers than just using extracted text.

The “Embed documents” recipe supports the VLM strategy for PDF, DOCX, and PPTX files.

If you use the VLM strategy when creating the “Embed documents” recipe, the recipe outputs, in addition to the Knowledge Bank, a managed folder containing the images extracted from your documents.
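
As an illustration, the sketch below reads one of the extracted images from that managed folder and sends it, together with a text question, to a vision-capable model through the LLM Mesh Python API. The folder name "document_images" and the LLM ID are placeholders, and the multipart message helpers (new_multipart_message, with_inline_image) should be verified against the LLM Mesh API documentation of your DSS version.

import dataiku

LLM_ID = "openai:my-connection:gpt-4o"  # placeholder: use a vision-capable model from your LLM Mesh

# Read one of the page images produced by the "Embed documents" recipe.
folder = dataiku.Folder("document_images")
image_path = folder.list_paths_in_partition()[0]
with folder.get_download_stream(image_path) as stream:
    image_bytes = stream.read()

# Send the image and a text question to the VLM.
llm = dataiku.api_client().get_default_project().get_llm(LLM_ID)
completion = llm.new_completion()
completion.new_multipart_message() \
    .with_text("What does the table on this page show?") \
    .with_inline_image(image_bytes) \
    .add()
response = completion.execute()
print(response.text)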

Initial setup

Document Extraction is automatically preinstalled when using Dataiku Cloud Stacks or Dataiku Cloud.

If you are using Dataiku Custom, before you can use the VLM strategy, you need a server administrator with elevated (sudoers) privileges to run:

sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-deps.sh" -with-libreoffice

Limitations

Text-based files (like TXT or MD) must use UTF-8 encoding for Text Extraction.
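
If a file uses another encoding, it can be converted before being added to the input folder, for example with the small Python sketch below (the file name and the latin-1 source encoding are illustrative assumptions).

# Re-encode a text file to UTF-8 before adding it to the input folder.
with open("notes.txt", encoding="latin-1") as src:
    content = src.read()
with open("notes_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(content)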