Embedding and searching documents

In addition to the traditional embedding of text and storage in Vector Stores, Dataiku can also work directly with unstructured documents.

The “Embed documents” recipe takes a managed folder of documents as input and outputs a Knowledge Bank that can directly be used to query the content of these documents.

To get started with document embedding, see our Tutorial: Build a multimodal knowledge bank for a RAG project.

Supported document types

The “Embed documents” recipe supports the following file types:

  • PDF

  • PPTX/PPT

  • DOCX/DOC

  • ODT/ODP

  • TXT

  • MD (Markdown)

  • PNG

  • JPG/JPEG

  • HTML

Text extraction vs. VLM extraction

The “Embed documents” recipe supports two ways of handling documents.

Text extraction

The simpler of the two is text extraction: it extracts text from documents and, when headers are available, uses them to divide the content into meaningful extraction units. Supported file formats are PDF, DOC/DOCX, HTML, TXT, and MD. The process works as follows:

  • The text content is extracted from the document.

  • If headers are available, they are used to divide the content into meaningful units.

  • The extracted text is split into chunks if necessary to fit the embedding model.

  • The chunks are embedded.

While this method works well to extract text from documents, it does not extract non-text elements such as images, figures, or complex tables.
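The text extraction steps above can be illustrated with a minimal sketch. This is not Dataiku's implementation: the function names (`split_by_headers`, `chunk_unit`, `extract_chunks`), the Markdown-style header detection, and the character-based size limit are all assumptions made for illustration (real embedding models usually limit tokens, not characters). The embedding step itself is left out.

```python
import re


def split_by_headers(text):
    """Split Markdown-style text into (header, body) extraction units."""
    units = []
    header, lines = None, []
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            # A new header closes the unit collected so far, if any.
            if header is not None or any(l.strip() for l in lines):
                units.append((header, "\n".join(lines).strip()))
            header, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if header is not None or any(l.strip() for l in lines):
        units.append((header, "\n".join(lines).strip()))
    return units


def chunk_unit(body, max_chars):
    """Split a unit into chunks of at most max_chars, breaking on whitespace.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in body.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def extract_chunks(text, max_chars=200):
    """Return one record per chunk, keeping the header it came from."""
    return [
        {"header": header, "text": chunk}
        for header, body in split_by_headers(text)
        for chunk in chunk_unit(body, max_chars)
    ]


# Hypothetical sample document: one short section and one long section.
sample = "# Intro\nDataiku embeds documents.\n\n# Details\n" + "word " * 100
chunks = extract_chunks(sample, max_chars=50)
```

Each resulting record would then be passed to the embedding model. Keeping the header next to each chunk is one way to preserve context when a long section is split.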

VLM extraction

For complex documents, Dataiku implements another strategy based on Vision LLMs (VLM), i.e. LLMs that can take images as input. If your LLM Mesh is connected to one of these (see Multimodal capabilities for a list), you can instead use the VLM strategy.

  • Instead of extracting the text, the recipe transforms each page of the document into an image.

  • Ranges of images are sent to the VLM, asking for a summary.

  • The summary of each range of images is embedded.

At query time, when asking a text question:

  • Using the embedding of the question, the relevant ranges are retrieved.

  • The matching images are passed directly in the context of the VLM.

  • The VLM then uses the images directly to answer.

The advanced image understanding capabilities of the VLM allow for much more relevant answers than just using extracted text.
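As a rough illustration of the indexing side of the VLM strategy, the sketch below groups page indices into consecutive ranges and stores one summary per range. It is not Dataiku's implementation: the range size, the function names, and the `fake_summarize` stand-in for the VLM call are assumptions, since this page does not specify how ranges are built.

```python
def page_ranges(num_pages, range_size):
    """Group page indices 0..num_pages-1 into consecutive ranges."""
    if range_size < 1:
        raise ValueError("range_size must be at least 1")
    return [
        list(range(start, min(start + range_size, num_pages)))
        for start in range(0, num_pages, range_size)
    ]


def build_index(num_pages, range_size, summarize):
    """Return one record per range: the pages it covers and their summary.

    `summarize` stands in for the VLM call; only the summary would be embedded.
    """
    return [
        {"pages": pages, "summary": summarize(pages)}
        for pages in page_ranges(num_pages, range_size)
    ]


def fake_summarize(pages):
    # Hypothetical placeholder: a real system would send the page images to a VLM.
    return f"summary of pages {pages[0] + 1}-{pages[-1] + 1}"


index = build_index(num_pages=7, range_size=3, summarize=fake_summarize)
```

At query time, the record whose summary embedding best matches the question identifies which page images to pass to the VLM.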

The “Embed documents” recipe supports the VLM strategy for DOCX/DOC, PPTX/PPT, PDF, ODT/ODP, JPG/JPEG, and PNG files.

When you create the “Embed documents” recipe, a managed folder containing the images extracted from your documents is created as an output of the recipe, in addition to the Knowledge Bank.
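The format support described in this section can be summarized as a small lookup, which may help when checking a folder before running the recipe. This is only a sketch based on the lists on this page; the set names and the `supported_strategies` helper are hypothetical, and the recipe itself remains the authority on what it accepts.

```python
from pathlib import PurePosixPath

# Extensions taken from the lists on this page.
TEXT_EXTRACTION = {".pdf", ".doc", ".docx", ".html", ".txt", ".md"}
VLM_EXTRACTION = {
    ".docx", ".doc", ".pptx", ".ppt", ".pdf",
    ".odt", ".odp", ".jpg", ".jpeg", ".png",
}


def supported_strategies(path):
    """Return the extraction strategies listed for this file's extension."""
    ext = PurePosixPath(path).suffix.lower()
    strategies = []
    if ext in TEXT_EXTRACTION:
        strategies.append("text")
    if ext in VLM_EXTRACTION:
        strategies.append("vlm")
    return strategies
```

For example, `supported_strategies("slides.pptx")` returns `["vlm"]`, while a PDF is listed under both strategies.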

Initial setup

  • Document Extraction is automatically preinstalled when using Dataiku Cloud Stacks or Dataiku Cloud. If you are using Dataiku Custom, before using VLM extraction, a server administrator with elevated (sudoers) privileges must run:

sudo -i "/home/dataiku/dataiku-dss-VERSION/scripts/install/install-deps.sh" -with-libreoffice

  • Text extraction requires installing and enabling a dedicated code environment (see Code environments):

    In Administration > Code envs > Internal envs setup, in the Document extraction code environment section, select a Python version from the list and click Create code environment.

“Embed documents” Update methods

You can choose among four methods for updating your vector store and its associated folder (used for VLM extraction).

You can select the update method in the output settings of the “Embed documents” recipe.

| Method | Description |
|---|---|
| Smart sync | Synchronizes the vector store to match the input folder documents, smartly deciding which documents to add, update, or remove. |
| Upsert | Adds and updates the documents from the input folder into the vector store. Smartly avoids adding duplicate documents. Does not delete any existing document that is no longer present in the input folder. |
| Overwrite | Deletes the existing vector store and recreates it from scratch, using the input folder documents. |
| Append | Adds the documents from the input folder into the vector store, without deleting or updating any existing records. Can result in duplicated records in the vector store. |

Documents are identified by their path in the input folder. Renaming or moving documents prevents the smart methods from matching them with pre-existing documents and can result in outdated or duplicated versions of documents in the vector store.

The update method also manages the output folder of the recipe to keep its content synchronized with the vector store. Deleting files from the output folder outside of the recipe is not recommended and can cause the vector store to point to a missing source.
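To make the difference between the four update methods concrete, here is a toy model in which the vector store is a list of `(path, content)` records and the input folder is a dictionary keyed by path. This is a sketch of the described behavior only, not Dataiku's implementation; the `apply_update` function and the method labels used as strings are hypothetical.

```python
def apply_update(store, folder, method):
    """Return the store contents after one recipe run.

    store:  list of (path, content) records already in the vector store
    folder: dict mapping path -> content for the input folder
    """
    if method == "overwrite":
        # Drop everything and rebuild from the folder.
        return sorted(folder.items())
    if method == "append":
        # Add every folder document again, even if already present.
        return store + sorted(folder.items())
    if method in ("upsert", "smart_sync"):
        # Documents are matched by path.
        existing = dict(store)
        if method == "smart_sync":
            # Also remove documents that left the folder.
            existing = {p: c for p, c in existing.items() if p in folder}
        existing.update(folder)
        return sorted(existing.items())
    raise ValueError(f"unknown method: {method}")


# Hypothetical state: a.pdf was modified, b.pdf was deleted, c.pdf is new.
store = [("a.pdf", "v1"), ("b.pdf", "v1")]
folder = {"a.pdf": "v2", "c.pdf": "v1"}

synced = apply_update(store, folder, "smart_sync")
upserted = apply_update(store, folder, "upsert")
overwritten = apply_update(store, folder, "overwrite")
appended = apply_update(store, folder, "append")
```

In this model Smart sync and Overwrite end in the same state; the difference lies in the work done, since a smart method only needs to re-extract and re-embed the documents that changed. Append leaves both versions of `a.pdf` in the store, which is the duplication the table warns about.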

Tip

If your folder changes frequently and you need to re-run your “Embed documents” recipe often, choosing one of the smart update methods (Smart sync or Upsert) will be much more efficient than Overwrite or Append.

The smart update methods minimize the number of documents to be re-extracted and the calls to the embedding model, thus lowering the cost of running the recipe repeatedly.

Warning

When using one of the smart update methods (Smart sync or Upsert), all write operations on the vector store must be performed through DSS. This also means that you cannot provide a vector store that already contains data when using one of the smart update methods.