Working with vector stores

Vector store types

Out of the box, Knowledge Banks are created with a vector store called Chroma. It requires no setup and provides good performance even for fairly large corpora.

As alternatives, other no-setup vector stores are available: Qdrant and FAISS.

For more advanced use cases, you may wish to use a dedicated Vector Store. Dataiku supports several third-party vector stores that require you to set up a dedicated connection beforehand:

  • Azure AI Search

  • Elasticsearch

  • OpenSearch, including Amazon OpenSearch Service (both managed cluster and serverless)

  • Pinecone

  • Vertex Vector Search (based on a Google Cloud Storage connection)

When creating the Embedding recipe, select the desired vector store type, then select your connection. You can also change the vector store type later by editing the settings of the Knowledge Bank.
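
Vector store settings can also be read programmatically. Below is a minimal sketch assuming a recent DSS version whose public Python API exposes Knowledge Banks; the method and field names (get_knowledge_bank, get_settings, vectorStoreType) are assumptions to verify against your DSS API reference:

    # Hypothetical sketch: inspecting a Knowledge Bank's vector store settings.
    # Method and field names are assumptions, not a documented contract.
    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    kb = project.get_knowledge_bank("YOUR_KB_ID")  # placeholder Knowledge Bank ID
    settings = kb.get_settings()
    print(settings.get_raw().get("vectorStoreType"))  # field name is an assumption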

For Azure AI Search, Vertex Vector Search, Elasticsearch and OpenSearch, we provide a default index name that you can update if needed. For Pinecone, you must provide the name of an existing index.
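
Because Pinecone requires the index to exist beforehand, you may want to create it ahead of time with the Pinecone Python client. A minimal sketch follows; the index name, cloud, and region are placeholder assumptions, and the dimension must match the output size of your embedding model:

    # Sketch: pre-creating the Pinecone index that the Embedding recipe will use.
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder credentials

    if "my-knowledge-bank" not in pc.list_indexes().names():
        pc.create_index(
            name="my-knowledge-bank",  # must match the index name set in DSS
            dimension=1536,            # must match your embedding model's vector size
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )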

Note

When setting up an Elasticsearch, an OpenSearch or a Google Cloud Storage connection, you must allow the connection to be used with Knowledge Banks. There is a setting in the connection panel to allow this.

Limitations

  • Rebuilding a Pinecone-based Knowledge Bank may require you to manually delete and recreate the Pinecone index.

  • You need an Elasticsearch version >=7 to store a Knowledge Bank.

  • Elasticsearch versions >=8.0.0 and <8.8.0 only support embeddings with fewer than 1024 dimensions; embedding models that generate larger vectors will not work.

  • Only private key authentication is supported for Google Cloud Storage connections used with Knowledge Banks.

  • Smart update methods (Smart sync and Upsert) are not supported on the following vector store types: Azure AI Search, Pinecone, AWS OpenSearch serverless. Additionally, they are not supported for Vertex AI when using a Python 3.8 code environment for RAG purposes.

  • After running the Embedding recipe, remote vector stores may take some time to reflect the updated indexing data in their own user interfaces.

Update methods

There are four methods for updating your vector store. You can select the update method in the Embedding recipe settings:

  • Smart sync: Synchronizes the vector store to match the input dataset, smartly deciding which rows to add, remove, or update.

  • Upsert: Adds and updates rows from the input dataset into the vector store, smartly avoiding duplicate rows. Does not delete existing records that are absent from the input dataset.

  • Overwrite: Deletes the existing vector store and recreates it from scratch from the input dataset.

  • Append: Adds rows from the input dataset into the vector store without deleting existing records. Can result in duplicate records in the vector store.

The two smart update methods, Smart sync and Upsert, require a Document unique ID parameter. This is the ID of the document before any chunking has been applied. This ID is used to avoid adding duplicate records and to smartly compute the minimum number of add/remove/update operations needed.
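
To make this concrete, here is an illustrative Python sketch (not DSS internals; all names are invented for illustration) of how a smart update method can combine the document unique ID with a content hash to compute the minimal add/update/remove plan, so that only new or changed documents are re-embedded:

    import hashlib

    def content_hash(text: str) -> str:
        """Fingerprint a document so unchanged rows can be skipped."""
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def plan_smart_sync(input_docs, stored_hashes):
        """input_docs: document unique ID -> text, from the input dataset.
        stored_hashes: document unique ID -> hash, as recorded in the vector store."""
        to_add = [i for i in input_docs if i not in stored_hashes]
        to_update = [i for i in input_docs
                     if i in stored_hashes
                     and stored_hashes[i] != content_hash(input_docs[i])]
        # Smart sync also removes records absent from the input dataset;
        # Upsert would skip this step and leave them in place.
        to_remove = [i for i in stored_hashes if i not in input_docs]
        return to_add, to_update, to_remove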

Tip

If your dataset changes frequently, and you need to frequently re-run your embedding recipe, choosing one of the smart update methods, Smart sync or Upsert, will be much more efficient than Overwrite or Append.

The smart update methods make fewer calls to the embedding model, and thus lower the cost of running the embedding recipe repeatedly.

Warning

When using one of the smart update methods, Smart sync or Upsert, all write operations on the vector store must be performed through DSS. This also means that you cannot point a smart update method at a vector store that already contains data.