Working with Vector stores
Vector store types
Out of the box, Knowledge Banks are created with a Vector Store called Chroma. This does not require any setup, and provides good performance even for quite a large corpus.
As an alternative, other no-setup Vector Stores are available: Milvus (local), Qdrant, and FAISS.
For more advanced use cases, you may wish to use a dedicated Vector Store. Dataiku supports several third-party vector stores that require you to set up a dedicated connection beforehand:
Azure AI Search
Elasticsearch
OpenSearch, including AWS OpenSearch services (both managed clusters and serverless)
Milvus (remote)
Pinecone
Pgvector (based on a PostgreSQL Connection)
Vertex Vector Search (based on a Google Cloud Storage Connection)
When creating the Embedding recipe, select the desired vector store type, then select your connection. You can also change the vector store type later, by editing the settings of the Knowledge Bank.
DSS provides a default name for the index (which may be called a “table”, “collection”, or “service”, depending on the type of vector store). You can change this name if needed.
Note
When setting up an Elasticsearch, OpenSearch, Google Cloud Storage, or PostgreSQL connection, you must allow the connection to be used with Knowledge Banks. A setting in the connection panel controls this.
Limitations
You need an Elasticsearch version >=7.14.0 to store a Knowledge Bank.
Elasticsearch versions >=8.0.0 and <8.8.0 only support embeddings of size smaller than 1024. Embedding models that generate larger embedding vectors will not work.
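As a rough illustration, the two Elasticsearch constraints above can be written as a pre-flight check. This is only a sketch: `check_elasticsearch` and `parse_version` are hypothetical helpers, not part of Dataiku or the Elasticsearch client, and the thresholds are copied literally from the limitations stated here.

```python
def parse_version(version: str) -> tuple:
    """Turn '8.7.1' into (8, 7, 1); missing parts are padded with zeros."""
    parts = [int(part) for part in version.split(".")[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_elasticsearch(version: str, embedding_size: int) -> list:
    """Return the documented limits that are hit (hypothetical helper).

    An empty list means neither limitation applies.
    """
    v = parse_version(version)
    problems = []
    # Limitation 1: a minimum server version is required.
    if v < (7, 14, 0):
        problems.append("Elasticsearch >=7.14.0 is required to store a Knowledge Bank")
    # Limitation 2: 8.0.0 <= version < 8.8.0 needs embeddings smaller than 1024.
    if (8, 0, 0) <= v < (8, 8, 0) and embedding_size >= 1024:
        problems.append("Elasticsearch >=8.0.0 and <8.8.0 needs embeddings smaller than 1024")
    return problems
```

For example, `check_elasticsearch("8.5.0", 1536)` reports the embedding size limit, while the same embedding size on version 8.8.0 returns an empty list.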
Milvus (local) does not support empty values in metadata. Dataiku fills empty values with a default that depends on the type: NaN for numbers, False for booleans, and an empty string for other types.
Milvus (local and remote) does not support adding new metadata columns to already-built Knowledge Banks created from an Embed Documents recipe. New metadata columns are ignored until the Knowledge Bank is cleared and rebuilt.
With Milvus (local and remote), switching the update method from Smart sync to Append is only possible after clearing the Knowledge Bank.
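To make the default-filling rule for Milvus (local) concrete, the sketch below mimics it on a single metadata row. It is an illustration only, not Dataiku's implementation: `fill_empty_metadata` and the type labels `"number"` and `"boolean"` are assumptions made for this example, and empty values are represented as `None`.

```python
import math


def fill_empty_metadata(row: dict, column_types: dict) -> dict:
    """Replace empty (None) metadata values with a type-dependent default.

    Hypothetical helper: numbers become NaN, booleans become False,
    and every other type becomes an empty string.
    """
    filled = {}
    for name, kind in column_types.items():
        value = row.get(name)
        if value is None:
            if kind == "number":
                value = math.nan
            elif kind == "boolean":
                value = False
            else:
                value = ""
        filled[name] = value
    return filled


row = {"title": None, "pages": None, "public": None, "author": "A. Writer"}
types = {"title": "string", "pages": "number", "public": "boolean", "author": "string"}
filled = fill_empty_metadata(row, types)
```

One consequence worth keeping in mind is that a filled value cannot be told apart from a genuine `False` or empty string once it is stored.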
Only private key authentication is supported for Google Cloud Storage connections used with Knowledge Banks.
Smart update methods (Smart sync and Upsert) are not supported on the following vector store types: Pinecone, AWS OpenSearch serverless.
Note that, after running the embedding recipe, remote vector stores might take some time to update their indexing data in their respective user interfaces.
Pgvector supports filtering on metadata columns with names containing only alphanumeric characters and underscore _.
In pgvector, metadata columns with names containing only lowercase letters and underscore _ are stored in individual columns, while other columns (with names containing uppercase letters or other symbols) are nested into a JSON column called metadata.
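The two pgvector naming rules can be summarized with a small classifier. This is a sketch based on a literal reading of the rules above, not on Dataiku's code: `describe_metadata_column` is a hypothetical helper, and treating digits as "other symbols" for storage purposes is an assumption.

```python
import re

# Names usable in filters: alphanumeric characters and underscore only.
FILTERABLE = re.compile(r"[A-Za-z0-9_]+")
# Names stored in their own column: lowercase letters and underscore only
# (assumption: digits count as "other symbols" here).
OWN_COLUMN = re.compile(r"[a-z_]+")


def describe_metadata_column(name: str) -> dict:
    """Report how a metadata column name would be treated (hypothetical helper)."""
    return {
        "filterable": FILTERABLE.fullmatch(name) is not None,
        "storage": "own column" if OWN_COLUMN.fullmatch(name) else "metadata json",
    }
```

Under these assumptions, `source_file` is filterable and gets its own column, `PageNumber` is filterable but nested in the metadata JSON column, and `doc-title` is nested and cannot be used in filters.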