Embedding & searching datasets
Your first RAG using a dataset
In your project, select the dataset that will be used as your corpus. It needs to have at least one column of text
Create a new embed dataset recipe
Give a name to your knowledge bank
Select the embedding model to use
In the settings of the embed dataset recipe, select the column of text
Optionally, select one or several metadata columns. These columns will be displayed in the Sources section of the answer
Run the embed dataset recipe
Open the Knowledge Bank
You will now define a Retrieval-Augmented LLM
Select the underlying LLM that will be queried
Optionally, tune the advanced settings for the search in the vector store
Click the Test in Prompt Studio button for your new Retrieval-Augmented LLM
This will automatically open Prompt Studio and create a new prompt for you, with your Retrieval-Augmented LLM pre-selected
Ask your question
You will now receive an answer that draws on information retrieved from your corpus dataset, with Sources indicating which records were used to generate it
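To make the flow above more concrete, here is a minimal, self-contained sketch of what happens conceptually: rows are embedded, the closest rows are retrieved for a question, and they are passed to the LLM as context together with their metadata. This is not the DSS implementation or its API. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the corpus, column names, and function names are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical corpus: one text column plus one metadata column ("source").
corpus = [
    {"text": "Refunds are processed within 14 days of the return.", "source": "refund_policy"},
    {"text": "Standard shipping takes 3 to 5 business days.", "source": "shipping_faq"},
    {"text": "Support is available by email on weekdays.", "source": "support_page"},
]

# "Embed" step: keep one vector per row alongside its text and metadata.
knowledge_bank = [{"vector": embed(row["text"]), **row} for row in corpus]

def retrieve(question, k=2):
    """Return the k rows whose vectors are closest to the question."""
    q = embed(question)
    ranked = sorted(knowledge_bank, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, k=2):
    """Build the augmented prompt and the list of sources that fed it."""
    hits = retrieve(question, k)
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return prompt, [h["source"] for h in hits]

prompt, sources = build_prompt("How long do refunds take?", k=1)
print(prompt)
print("Sources:", sources)
```

In this sketch the metadata column plays the role of the Sources section: it travels with each retrieved row so that the answer can be traced back to the records that informed it.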
“Embed dataset” Update methods
There are four different methods that you can choose for updating your vector store.
You can select the update method in the embed dataset recipe settings.
Method | Description |
---|---|
Smart sync | Synchronizes the vector store to match the input dataset, smartly deciding which rows to add/remove/update. |
Upsert | Adds and updates the rows from the input dataset into the vector store. Smartly avoids adding duplicate rows. Does not delete any existing records that are not present in the input dataset. |
Overwrite | Deletes the existing vector store, and recreates it from scratch, using the input dataset. |
Append | Adds the rows from the input dataset into the vector store, without deleting existing records. Can result in duplicate records in the vector store. |
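The difference between the four methods is easiest to see in the final contents of the vector store. The sketch below models the store as a plain Python list of records and shows what each method leaves behind. It is only an illustration of the semantics described in the table: the function names, the `id` and `text` fields, and the one-record-per-document simplification are assumptions, not DSS internals.

```python
def overwrite(store, rows):
    """Drop everything and rebuild the store from the input rows."""
    return list(rows)

def append(store, rows):
    """Add every input row, even if the same document is already stored."""
    return store + list(rows)

def upsert(store, rows):
    """Add new documents, replace changed ones, keep everything else."""
    incoming = {r["id"]: r for r in rows}
    kept = [r for r in store if r["id"] not in incoming]
    return kept + list(incoming.values())

def smart_sync(store, rows):
    """End state mirrors the input exactly (same end state as overwrite,
    but a real implementation would reach it without re-embedding everything)."""
    return list({r["id"]: r for r in rows}.values())

def ids(store):
    return sorted(r["id"] for r in store)

# Hypothetical starting store and input dataset.
store = [{"id": "a", "text": "old A"}, {"id": "b", "text": "B"}]
rows = [{"id": "a", "text": "new A"}, {"id": "c", "text": "C"}]

print("Overwrite :", ids(overwrite(store, rows)))
print("Append    :", ids(append(store, rows)))
print("Upsert    :", ids(upsert(store, rows)))
print("Smart sync:", ids(smart_sync(store, rows)))
```

Note that Append ends up with two records for document `a`, while Upsert keeps document `b` even though it no longer appears in the input.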
The two smart update methods, Smart sync and Upsert, require a Document unique ID parameter. This is the ID of the document before any chunking has been applied. This ID is used to avoid adding duplicate records and to smartly compute the minimum number of add/remove/update operations needed.
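As a rough sketch of why a stable document ID matters, the code below compares the documents already indexed with the incoming rows and derives a list of add/update/remove operations. How DSS actually detects changed documents is not described here; comparing a content hash per document ID is an assumption made for illustration, and `plan_sync`, `fingerprint`, and the `delete_missing` flag are hypothetical names.

```python
import hashlib

def fingerprint(text):
    """Content hash used to detect whether a document changed (assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(indexed, rows, delete_missing=True):
    """Compute the operations needed to bring the store in line with the input.

    indexed: dict mapping document ID -> fingerprint of the stored version
    rows:    dict mapping document ID -> current text in the input dataset
    delete_missing: True behaves like Smart sync, False like Upsert
    """
    plan = {"add": [], "update": [], "remove": [], "unchanged": []}
    for doc_id, text in rows.items():
        if doc_id not in indexed:
            plan["add"].append(doc_id)
        elif indexed[doc_id] != fingerprint(text):
            plan["update"].append(doc_id)
        else:
            plan["unchanged"].append(doc_id)
    if delete_missing:
        plan["remove"] = [d for d in indexed if d not in rows]
    return plan

# Hypothetical state: three documents indexed, then the dataset changes.
indexed = {"a": fingerprint("A v1"), "b": fingerprint("B"), "c": fingerprint("C")}
rows = {"a": "A v2", "b": "B", "d": "D"}

sync_plan = plan_sync(indexed, rows, delete_missing=True)
upsert_plan = plan_sync(indexed, rows, delete_missing=False)

# Only added and updated documents need to go through the embedding model.
embedding_calls = len(sync_plan["add"]) + len(sync_plan["update"])
print(sync_plan)
print("Documents to embed:", embedding_calls, "out of", len(rows))
```

Without a document-level ID there would be no way to tell that `a` is a new version of an existing document rather than a new one, which is why Append can produce duplicates.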
Tip
If your dataset changes often and you need to re-run your embed dataset recipe frequently, choosing one of the smart update methods, Smart sync or Upsert, will be much more efficient than Overwrite or Append.
The smart update methods make fewer calls to the embedding model, which lowers the cost of running the embed dataset recipe repeatedly.
Warning
When using one of the smart update methods, Smart sync or Upsert, all write operations on the vector store must be performed through DSS. This also means that you cannot provide a vector store that already contains data when using one of the smart update methods.