Embedding & searching datasets

Your first RAG using a dataset:

  • In your project, select the dataset that will be used as your corpus. It needs to have at least one column of text

  • Create a new embed dataset recipe

  • Give a name to your knowledge bank

  • Select the embedding model to use

  • In the settings of the embed dataset recipe, select the text column to embed

  • Optionally, select one or several metadata columns. These columns will be displayed in the Sources section of the answer

  • Run the embed dataset recipe

  • Open the Knowledge Bank

  • You will now define a Retrieval-Augmented LLM

    • Select the underlying LLM that will be queried

    • Optionally, tune the advanced settings for the search in the vector store

  • Click the Test in Prompt Studio button for your new Retrieval-Augmented LLM

    • This will automatically open Prompt Studio and create a new prompt for you, with your Retrieval-Augmented LLM pre-selected

  • Ask your question

  • You will now receive an answer that draws on information retrieved from your corpus dataset, with Sources indicating which records were used to generate it
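
To make the steps above more concrete, here is a minimal conceptual sketch of what happens behind them: each row's text column is embedded, the question is embedded the same way, the closest rows are retrieved, and they are passed to the LLM as context along with their metadata as sources. This is not DSS code. The bag-of-words `embed` function stands in for a real embedding model, and all names (`build_knowledge_bank`, `retrieve`, `build_prompt`, the sample rows) are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real embedding model returns dense vectors."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_knowledge_bank(rows, text_column, metadata_columns=()):
    """Embed the text column of every row and keep the chosen metadata columns."""
    return [
        {
            "vector": embed(row[text_column]),
            "text": row[text_column],
            "metadata": {c: row[c] for c in metadata_columns},
        }
        for row in rows
    ]


def retrieve(bank, question, k=2):
    """Return the k entries whose vectors are most similar to the question."""
    q = embed(question)
    return sorted(bank, key=lambda e: cosine(q, e["vector"]), reverse=True)[:k]


def build_prompt(question, hits):
    """Assemble the augmented prompt that would be sent to the underlying LLM."""
    context = "\n".join(f"- {h['text']}" for h in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical corpus: one text column and one metadata column.
rows = [
    {"text": "Refunds are issued within 14 days of purchase.", "title": "Refund policy"},
    {"text": "Support is available by email on weekdays.", "title": "Support hours"},
    {"text": "Shipping takes three to five business days.", "title": "Shipping"},
]

bank = build_knowledge_bank(rows, "text", ["title"])
question = "How many days do refunds take?"
hits = retrieve(bank, question, k=1)
prompt = build_prompt(question, hits)
sources = [h["metadata"]["title"] for h in hits]
```

In this sketch, `sources` plays the role of the Sources section: it lists the metadata of the rows that were retrieved to build the prompt.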

“Embed dataset” Update methods

You can choose from four methods for updating your vector store.

You can select the update method in the embed dataset recipe settings.

  • Smart sync: Synchronizes the vector store to match the input dataset, smartly deciding which rows to add/remove/update.

  • Upsert: Adds and updates the rows from the input dataset into the vector store. Smartly avoids adding duplicate rows. Does not delete any existing records that are not present in the input dataset.

  • Overwrite: Deletes the existing vector store, and recreates it from scratch, using the input dataset.

  • Append: Adds the rows from the input dataset into the vector store, without deleting existing records. Can result in duplicate records in the vector store.

The two smart update methods, Smart sync and Upsert, require a Document unique ID parameter. This is the ID of the document before any chunking has been applied. This ID is used to avoid adding duplicate records and to smartly compute the minimum number of add/remove/update operations needed.
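
The difference between the four methods is easiest to see on a small example. The sketch below models a vector store as a plain list of records keyed by a document ID and shows the end state each method produces. It only illustrates the semantics described above and is not how DSS implements them; the function names and the sample data are hypothetical, and chunking and embedding are left out.

```python
def to_records(dataset):
    """One record per input row; a real recipe would chunk and embed each document."""
    return [{"doc_id": doc_id, "text": text} for doc_id, text in dataset]


def overwrite(store, dataset):
    """Discard the existing store and rebuild it from the input dataset."""
    return to_records(dataset)


def append(store, dataset):
    """Add every input row without checking what is already stored."""
    return store + to_records(dataset)


def upsert(store, dataset):
    """Add new documents and replace existing ones; never delete the others."""
    incoming = {doc_id for doc_id, _ in dataset}
    kept = [r for r in store if r["doc_id"] not in incoming]
    return kept + to_records(dataset)


def smart_sync(store, dataset):
    """End state matches the input dataset: absent documents are removed."""
    return to_records(dataset)


def plan_sync(store, dataset):
    """Operations a sync needs, computed from the document unique IDs."""
    current = {r["doc_id"]: r["text"] for r in store}
    target = dict(dataset)
    shared = target.keys() & current.keys()
    return {
        "add": sorted(target.keys() - current.keys()),
        "remove": sorted(current.keys() - target.keys()),
        "update": sorted(d for d in shared if target[d] != current[d]),
    }


def ids(store):
    return sorted(r["doc_id"] for r in store)


# Existing store holds documents a and b; the input dataset now holds a (edited) and c.
store = to_records([("a", "alpha"), ("b", "beta")])
dataset = [("a", "alpha v2"), ("c", "gamma")]

print(ids(smart_sync(store, dataset)))  # ['a', 'c']
print(ids(upsert(store, dataset)))      # ['a', 'b', 'c']
print(ids(overwrite(store, dataset)))   # ['a', 'c']
print(ids(append(store, dataset)))      # ['a', 'a', 'b', 'c']
print(plan_sync(store, dataset))        # {'add': ['c'], 'remove': ['b'], 'update': ['a']}
```

Smart sync and Overwrite reach the same end state here; the difference is that a sync can get there with three targeted operations instead of rebuilding everything. Append leaves two records for document `a`, which is the duplicate case mentioned above.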

Tip

If your dataset changes often and you need to re-run your embed dataset recipe frequently, choosing one of the smart update methods, Smart sync or Upsert, will be much more efficient than Overwrite or Append.

The smart update methods make fewer calls to the embedding model, thus lowering the cost of running the embed dataset recipe repeatedly.
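
As a rough illustration of where the savings come from, the sketch below counts embedding calls for a full rebuild versus an incremental run that skips documents whose content has not changed. The change detection used here (a SHA-256 fingerprint stored per document ID) is an assumption made for the example, not a description of DSS internals, and `CountingEmbedder`, `overwrite_run` and `smart_sync_run` are hypothetical names.

```python
import hashlib


class CountingEmbedder:
    """Stand-in for an embedding model that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def embed(self, text):
        self.calls += 1
        return [float(len(text))]  # placeholder vector, not a real embedding


def fingerprint(text):
    """Content hash used to detect whether a document changed (assumed mechanism)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def overwrite_run(dataset, embedder):
    """Rebuild from scratch: every document is embedded again."""
    return {
        doc_id: (fingerprint(text), embedder.embed(text))
        for doc_id, text in dataset.items()
    }


def smart_sync_run(store, dataset, embedder):
    """Reuse stored vectors for unchanged documents; embed only new or edited ones."""
    new_store = {}
    for doc_id, text in dataset.items():
        fp = fingerprint(text)
        previous = store.get(doc_id)
        if previous is not None and previous[0] == fp:
            new_store[doc_id] = previous
        else:
            new_store[doc_id] = (fp, embedder.embed(text))
    return new_store  # documents missing from the dataset are dropped


# First run: 1000 documents are embedded once.
v1 = {f"doc-{i}": f"text {i}" for i in range(1000)}
store = overwrite_run(v1, CountingEmbedder())

# Second run: 3 documents edited, 2 documents added.
v2 = dict(v1)
for i in range(3):
    v2[f"doc-{i}"] = f"edited {i}"
v2["doc-1000"] = "new document"
v2["doc-1001"] = "another new document"

full = CountingEmbedder()
overwrite_run(v2, full)

incremental = CountingEmbedder()
synced = smart_sync_run(store, v2, incremental)

print(full.calls)         # 1002
print(incremental.calls)  # 5
```

With 1000 documents and only five changes, the incremental run makes 5 embedding calls where a full rebuild makes 1002.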

Warning

When using one of the smart update methods, Smart sync or Upsert, all write operations on the vector store must be performed through DSS. This also means that you cannot provide a vector store that already contains data when using one of the smart update methods.