Introduction to Knowledge and RAG

Adding Knowledge and performing RAG (Retrieval-Augmented Generation) in Dataiku is done through Knowledge Banks.

RAG and Knowledge Banks primarily rely on Embeddings, i.e., vector representations of chunks of text. These embeddings are generated by a specialized kind of LLM called an Embedding LLM.
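
To make the idea of an embedding concrete, here is a minimal sketch in plain Python. It does not use Dataiku or a real Embedding LLM: `toy_embed`, `VOCAB`, and `cosine_similarity` are hypothetical names introduced for illustration, and the word-count vectors are only a stand-in for the dense vectors a real model would produce. It shows that texts on a similar topic end up with vectors that are closer together.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary. A real Embedding LLM learns dense vectors
# and does not rely on a fixed word list like this one.
VOCAB = ["invoice", "payment", "refund", "shipping", "delivery"]


def toy_embed(text: str) -> list[float]:
    """Map a text to a vector of word counts over VOCAB (illustrative only)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


chunk_a = toy_embed("refund payment invoice")
chunk_b = toy_embed("shipping delivery")
query = toy_embed("payment refund")

# The query shares words with chunk_a and none with chunk_b,
# so its similarity to chunk_a is higher.
print(cosine_similarity(query, chunk_a) > cosine_similarity(query, chunk_b))
```

Cosine similarity is used here as one common way to compare embedded vectors; the similarity measure actually used depends on the Vector Store and its settings.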

To perform RAG in Dataiku, you must first create a Knowledge Bank, which contains an embedded version of your corpus stored in a Vector Store.

There are two ways to create a Knowledge Bank in Dataiku:

  • Embed Dataset recipe: to build a Knowledge Bank when your text corpus is already present in a structured dataset (e.g. a column with free text).

  • Embed Documents recipe: to build a Knowledge Bank directly from unstructured document files (DOCX/DOC, PPTX/PPT, PDF, HTML, ODT/ODP, JPG/JPEG, PNG, MD and TXT). This recipe lets you:

    • Create a Knowledge Bank without first importing or structuring the content into a dataset.

    • Leverage the advanced image understanding capabilities of Visual LLMs to extract information from documents that may include both text and visual elements such as charts, graphics, and tables.

Once your Knowledge Bank is ready, you can define Retrieval-Augmented LLMs. Each one combines a standard LLM with a Knowledge Bank and some search settings.

When you submit a query to the Retrieval-Augmented LLM (via Prompt Studio, Prompt Recipes, Agent Connect, or the LLM Mesh API), Dataiku automatically:

  • Uses the Embedding LLM to convert the query into an embedded vector.

  • Searches the vector store in the Knowledge Bank to find the closest embedded chunks (documents or text segments) to your query.

  • Adds the most relevant content as context to the query.

  • Sends the enriched query to the LLM.

  • Returns a context-aware response, along with references to the documents that were used (the “sources” of the augmented query).
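
The steps above can be sketched in plain Python. This is a conceptual illustration only, not Dataiku's implementation or API: `toy_embed`, `build_store`, `retrieve`, `augment`, `fake_llm`, and `answer` are all hypothetical names, the "embedding" is a simple word count, and the "LLM" is a placeholder function. The sample documents are made up for the example.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str      # document the chunk came from
    text: str        # original text of the chunk
    vector: Counter  # toy embedding (sparse word counts)


def toy_embed(text: str) -> Counter:
    """Stand-in for the Embedding LLM: lowercase word counts (illustrative only)."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(documents: dict[str, str]) -> list[Chunk]:
    """Toy vector store: one embedded chunk per document."""
    return [Chunk(src, text, toy_embed(text)) for src, text in documents.items()]


def retrieve(store: list[Chunk], query: str, top_k: int = 2) -> list[Chunk]:
    """Embed the query, then return the top_k closest chunks."""
    query_vector = toy_embed(query)
    ranked = sorted(store, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:top_k]


def augment(query: str, chunks: list[Chunk]) -> str:
    """Add the retrieved content as context to the query."""
    context = "\n".join(f"[{c.source}] {c.text}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


def fake_llm(prompt: str) -> str:
    """Placeholder for the LLM call; a real system would send the prompt to a model."""
    return f"(answer generated from a prompt of {len(prompt)} characters)"


def answer(store: list[Chunk], query: str, top_k: int = 2) -> dict:
    """Run the full flow and return the response together with its sources."""
    chunks = retrieve(store, query, top_k)
    prompt = augment(query, chunks)
    return {"text": fake_llm(prompt), "sources": [c.source for c in chunks]}


documents = {
    "refunds.txt": "Refunds are issued within 14 days of the return request.",
    "shipping.txt": "Standard shipping takes 3 to 5 business days.",
    "billing.txt": "Invoices are sent by email at the start of each month.",
}
store = build_store(documents)
result = answer(store, "How long do refunds take?", top_k=1)
print(result["sources"])
```

In this sketch, `top_k` plays the role of a search setting: it controls how many chunks are added to the query before it is sent to the LLM.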