Retrieval-Augmented Generation¶
Retrieval-Augmented Generation, or RAG, is a standard technique used with LLMs in order to give a standard LLM knowledge of your particular business problem.
RAG assumes that you already have a corpus of text knowledge. When you query a Retrieval-Augmented LLM, the most relevant elements of your corpus are automatically selected and added to the query that is sent to the LLM, so that the LLM can synthesize an answer using that contextual knowledge.
Concepts¶
RAG works using embeddings, i.e. vector representations of chunks of text. The embedding is performed by a specialized kind of LLM, the Embedding LLM.
In order to perform RAG in Dataiku, you must first create an Embedding recipe. The Embedding recipe takes your text corpus as input, and outputs a Knowledge Bank.
The Knowledge Bank contains the embedded version of your corpus, stored in a Vector Store. A vector store is a specialized kind of database that can quickly search for the “closest vectors” to a given vector.
You then define Retrieval-augmented LLMs. A retrieval-augmented LLM is the combination of a standard LLM and a Knowledge Bank, with some search settings.
When you submit a query to the Retrieval-augmented LLM (either with the Prompt Studio, a Prompt Recipe, or using the LLM Mesh API), Dataiku will automatically:
Use the Embedding LLM in order to obtain the embedded representation of the query
Use the vector store in the Knowledge Bank in order to search for the embedded vectors (i.e. the documents of your corpus) that are the closest to your query
Add the relevant documents to the query
Query the actual LLM with the contextual documents
Respond with the context-aware answer, as well as with information about which documents were used (the “sources” of the augmented query)
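The steps above can be illustrated with a minimal, purely conceptual sketch. This is not the Dataiku implementation: the `embed()` function below is a hypothetical stand-in for the Embedding LLM, the embedding dimension is assumed, and the brute-force cosine-similarity search stands in for the vector store lookup.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for the Embedding LLM: maps a chunk of text to a vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)  # assumed embedding dimension

# What the Embedding recipe does: embed each document of the corpus into the Knowledge Bank
corpus = [
    "Invoices are due within 30 days of receipt.",
    "Support tickets are answered within 24 hours.",
]
corpus_vectors = np.array([embed(doc) for doc in corpus])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Brute-force nearest-vector search; a real vector store does this efficiently."""
    q = embed(query)
    scores = corpus_vectors @ q / (np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

# What happens at query time: retrieve context, then send an augmented query to the LLM
query = "When do invoices have to be paid?"
context = retrieve(query)
augmented_query = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
# augmented_query is what gets sent to the underlying LLM
```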
Initial setup¶
Install and enable the RAG code env¶
In order to perform RAG, you need a dedicated code environment (see Code environments) with the appropriate packages.
On self-managed DSS¶
In Administration > Code envs > Internal envs setup, in the Retrieval augmented generation code environment section, select a Python interpreter in the list and click Create code environment
In Administration > Settings > LLM Mesh, in the Retrieval augmented generation section, select Use internal code env
On Dataiku Cloud¶
Create a new Python 3.9 code env
In Packages to install, click Add sets of packages, and select Retrieval Augmented Generation models
Click Save and update
In the launchpad, go to the code env tab and set the code env you just created as default for Retrieval augmentation
(Legacy) If you are using this setup but are not on Dataiku Cloud, do the following instead: in Administration > Settings > LLM Mesh, in the Retrieval augmented generation section, select the code env you just created
Embedding LLMs¶
In order to use RAG, you must have at least one LLM connection that supports embedding LLMs. At the moment, embedding is supported on the following connection types:
OpenAI
Azure OpenAI
AWS Bedrock
Databricks Mosaic AI
Snowflake Cortex
Local Hugging Face
Mistral AI
Vertex Generative AI
Amazon Sagemaker LLM
Custom LLM Plugins
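Once one of these connections is configured, the embedding LLM can also be called directly through the LLM Mesh Python API. The sketch below assumes you are running inside DSS; the LLM id string is a placeholder to replace with the id of an embedding model from your own connection.

```python
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Placeholder id: use the id of an embedding LLM exposed by one of your connections
emb_llm = project.get_llm("openai:my-openai-connection:text-embedding-3-small")

emb_query = emb_llm.new_embeddings()
emb_query.add_text("Invoices are due within 30 days of receipt.")
emb_resp = emb_query.execute()

vector = emb_resp.get_embeddings()[0]  # one vector per added text
print(len(vector))
```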
Your first RAG setup¶
In your project, select the dataset that will be used as your corpus. It needs to have at least one column of text
Create a new embedding recipe
Give a name to your knowledge bank
Select the embedding model to use
In the settings of the Embedding recipe, select the column of text
Optionally, select one or several metadata columns. These columns will be displayed in the Sources section of the answer
Run the embedding recipe
Open the Knowledge Bank
You will now define a Retrieval-Augmented LLM
Select the underlying LLM that will be queried
Optionally, tune the advanced settings for the search in the vector store
Click the Test in Prompt Studio button for your new Retrieval-Augmented LLM
This will automatically open Prompt Studio and create a new prompt for you, with your Retrieval-Augmented LLM pre-selected
Ask your question
You will now receive an answer that draws on information from your corpus dataset, with Sources indicating how this answer was generated
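If you prefer to run the same query from code rather than from the Prompt Studio, a minimal sketch using the LLM Mesh Python API could look like the following. It assumes you are running inside DSS, and the LLM id is a placeholder to replace with the id of your own Retrieval-Augmented LLM.

```python
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Placeholder id: use the id of your Retrieval-Augmented LLM
llm = project.get_llm("retrieval-augmented:my_rag_llm")

completion = llm.new_completion()
completion.with_message("What is our policy for invoice payment deadlines?")
resp = completion.execute()

if resp.success:
    print(resp.text)  # context-aware answer synthesized from the retrieved documents
```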
Vector store types¶
Out of the box, Knowledge Banks are created with a Vector Store called Chroma. This does not require any setup, and provides good performance even for quite large corpora.
As an alternative, other no-setup Vector Stores are available: Qdrant and FAISS.
For more advanced use cases, you may wish to use a dedicated Vector Store. Dataiku supports several third-party vector stores that require you to set up a dedicated connection beforehand:
Azure AI search
ElasticSearch
OpenSearch
Pinecone
Vertex Vector Search (based on a Google Cloud Storage Connection)
When creating the Embedding recipe, select the desired vector store type, then select your connection. You can also change the vector store type later, by editing the settings of the Knowledge Bank.
For Azure AI Search, Vertex Vector Search, ElasticSearch and OpenSearch, we provide a default index name that you can update if needed. For Pinecone, make sure to provide an existing index name.
Note
When setting up an ElasticSearch, an OpenSearch or a Google Cloud Storage connection, you must allow the connection to be used with Knowledge Banks. There is a setting in the connection panel to allow this.
Limitations¶
Rebuilding a Pinecone-based Knowledge Bank may require that you manually delete and recreate the Pinecone index.
You need an ElasticSearch version >=7 to store a Knowledge Bank.
ElasticSearch versions >=8.0.0 and <8.8.0 only support embeddings of size smaller than 1024. Embedding models generating larger embedding vectors will not work.
Only private key authentication is supported for Google Cloud Storage connections used with Knowledge Banks.
Smart update methods (Smart sync and Upsert) are currently not supported on Azure AI Search or Pinecone vector store types.
Note that, after running the embedding recipe, remote vector stores might take some time to update their indexing data in their respective user interfaces.
Update methods¶
There are four different methods that you can choose for updating your vector store. You can select the update method in the embedding recipe settings.
| Method | Description |
|---|---|
| Smart sync | Synchronizes the vector store to match the input dataset, smartly deciding which rows to add/remove/update. |
| Upsert | Adds and updates the rows from the input dataset into the vector store. Smartly avoids adding duplicate rows. Does not delete any existing records that are not present in the input dataset. |
| Overwrite | Deletes the existing vector store, and recreates it from scratch, using the input dataset. |
| Append | Adds the rows from the input dataset into the vector store, without deleting existing records. Can result in duplicate records in the vector store. |
The two smart update methods, Smart sync and Upsert, require a Document unique ID parameter. This is the ID of the document before any chunking has been applied. This ID is used to avoid adding duplicate records and to smartly compute the minimum number of add/remove/update operations needed.
Tip
If your dataset changes frequently, and you need to frequently re-run your embedding recipe, choosing one of the smart update methods, Smart sync or Upsert, will be much more efficient than Overwrite or Append.
The smart update methods make fewer calls to the embedding model, and thus lower the cost of running the embedding recipe repeatedly.
Warning
When using one of the smart update methods, Smart sync or Upsert, all write operations on the vector store must be performed through DSS. This also means that you cannot provide a vector store that already contains data, when using one of the smart update methods.
Learn more¶
For more information on RAG, see also the following articles in the Knowledge Base: